What is the reason for networking problems in ADF "managed vnet"?

David Beavon 976 Reputation points
2023-10-13T19:30:45.29+00:00

I have used "managed vnets" in a few different scenarios. On one hand, there is a managed vnet in Power BI, used by the "Power BI Managed VNET Gateway". They say their technology originated from within ADF, which uses managed vnets as well. Within ADF they are used for integration runtimes.

Another new place I have used the "managed vnet" technology is in Synapse Analytics. In this context the technology is used to host the VMs that participate in a Spark cluster.

In all three scenarios where I've been exposed to Microsoft "managed vnets", they have been extremely buggy and unreliable. For example, the bugs in Power BI Managed VNET Gateway cause refresh operations to fail on a recurring basis (about a 50% chance of failure each day). The related support ticket has been open for two years, and they are definitely blaming the gateway failures on the "network team". Similarly, in ADF these network bugs in the VNET force us to enable retries, repeating operations over and over until they succeed (thereby incurring a large cost to our subscription).

In Synapse Analytics the buggy VNET causes failures as well. The failures take place in my own custom code (Spark executors) and Microsoft components. The Microsoft Livy Jobs Service (a component of Synapse itself) is very unreliable. Below is a screenshot of my Spark jobs from two days ago. Notice the failures labeled #1 and #2.

[Screenshot: list of Spark jobs, with two groups of failures labeled #1 and #2]

The section labeled #1 shows failures in my own custom code. The failures are network exceptions, consisting of disconnections with a message such as "connection reset by peer". They might be related to disconnections from a remote HTTP service, SQL database, or identity server.
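A common stopgap for transient resets of this kind is a client-side retry with exponential backoff and jitter. The following is only a minimal Python sketch under stated assumptions: `call_with_retries` and all of its parameters are hypothetical names, not part of any Synapse or ADF API, and the wrapped operation is assumed to be idempotent. It masks the symptom and does not address whatever is causing the disconnections.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); retry transient network errors with backoff and jitter.

    Hypothetical helper for illustration only. ConnectionResetError
    ("connection reset by peer") is a subclass of ConnectionError, so it is
    covered by the default retry_on tuple.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff capped at max_delay, with full jitter
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

A call site would wrap only the network call itself, for example `call_with_retries(lambda: fetch_page(url))`, where `fetch_page` stands in for whatever HTTP or SQL call is failing. Each retry still costs executor time, so this has the same cost implication as enabling retries in ADF.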

The section labeled #2 shows failures of Synapse to submit new jobs. These are self-contained failures that have nothing to do with my custom code. Basically, the Spark PG (product group) relies on an "LSR" operation that is serviced by the ADF PG. It is a problem where Microsoft components are failing because of the unreliable "managed vnet".

Can someone please help me understand this "managed vnet" technology? What is the reason for all the reliability problems in the three different places where I have seen it being used? Is it failing under high load? Are there software components involved that might be starved for CPU?

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. David Beavon 976 Reputation points
    2023-11-15T23:33:23.17+00:00

    CAVEAT: This should not be treated as highly reliable information, and I cannot give a source for now. Anyone finding this information should keep in mind that the context is the Power BI managed VNET gateway. I have no certainty that this information extends to other platforms that use "managed vnets" (e.g. ADF and Synapse).

    On the Power BI Gateway side of things, I have confirmation of two changes coming in 2024 that might bring an improvement. See below. These changes are for the "Azure Network stack".

    • Feature ending with x509 : Support map space deletion in VFP without breaking flows.

    (VFP is the Virtual Filtering Platform, i.e. the virtual switch?)

    The publicized ETA is late 2024.

    • Bug ending with x810 : PE-NC Flow mix up

    The network team promised to roll it out in early 2024.

    ... Please note that the Power BI Managed VNET Gateway is still in preview, so it is reasonable that there are ongoing changes that would improve reliability. (The only unreasonable part is waiting 3 or 4 years for the GA.)

    In contrast to Power BI, I am not having as much luck with some other PG teams. I'm still waiting on a couple of other PG teams (e.g. ADF and Synapse) to explain their managed vnet problems as well. This may take a while. I have made zero progress in the past month, despite spending dozens of hours with the ADF/Synapse organizations at CSS. Any information that I am able to gather by way of CSS is very hard to come by! CSS is quite different from unified support, to put it lightly. Your Azure Account Manager will be very eager to point you to one of these, but not the other.