Self Hosted Integration Runtime consumes all memory

Anonymous
2023-07-04T11:08:08.0366667+00:00

We use Azure Synapse to load JSON files from our Gen2 data lake into a dedicated SQL pool. The section of the JSON files containing the data we are interested in is copied into an NVARCHAR(MAX) column as-is after which we use SQL stored procedures to normalize the data into tables.

We have 3 specific integration pipelines where the above process fails when the data volumes get relatively large. In this case we have received an initial load for a specific object containing about 15000 JSON files averaging about 1mb each. The total data size is about 16.5GB. The copy activity we use to copy the files to the dedicated pool seems to start fine but after some time the throughput starts to drop and eventually no files are processed anymore. Eventually the process will fail, usually with some kind of timeout of out of memory error.

User's image

We initially ran this on an autoresolve integration runtime and we have already looked at this issue previously with consultants from Microsoft. Reducing the number of parallel copies and expanding the max available memory helped the last time, but this time round I cannot seem to get the data processed.

Thinking a self-hosted IR might work, I spun up a VM with 8 cores and 32Gb of memory and installed the IR on this thing. When I run the pipeline I can see the memory usage increase more and more, until it eventually hits 99%. CPU usage remains fairly low. When this happens no files are processed anymore and eventually I get the same time out. Once the pipeline fails, the memory usage remains at 99% and the only way to release it is to stop the IR service and restart it. It seems to me that with every batch of files additional memory is claimed, but this is not released when the files are copied to the dedicated pool, or when the process stops for that matter.

User's image

I'm at my wits end. I need to get this data into our database but nothing I try works. This seems to be a bug in the IR software. Any help on this would be much appreciated.

I'm at my wits end. I need to get this data into our database but nothing I try works. This seems to be a bug in the IR software. Any help on this would be much appreciated.

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
5,138 questions
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
11,134 questions
{count} votes

1 answer

Sort by: Most helpful
  1. KranthiPakala-MSFT 46,602 Reputation points Microsoft Employee
    2023-07-05T23:06:38.56+00:00

    @Anonymous Welcome to Microsoft Q&A forum and thanks for reaching out here.

    From you explanation you have try multiple alternatives to overcome the issue but the issue still remains.

    From the below message, my understanding is that you are doing a bulk insert load into your Synapse dedicated SQL pool using copy activity which might be resulting in low throughput. Please correct me if I'm wrong.

    User's image

    If that is the case, then I would recommend trying PolyBase or COPY Statement in your copy activity for tuning up the performance. Using PolyBase is one of the efficient way to load a large amount of data into Synapse Deducated Pool with high throughput. You'll see a large gain in the throughput by using PolyBase instead of the default BULKINSERT mechanism. The copy activity details image you have shared also suggest to use performance tuning tips to increase the throughput.

    For more information about it please refer to this document: Use PolyBase to load data into Azure Synapse Analytics
    User's image

    User's image

    Here is another couple of helpful documents related to performance tuning, please refer to them and see if that helps to resolve your issue:

    1. Load 1 TB into Azure Synapse Analytics under 15 minutes with Data Factory
    2. Copy activity Performance and tuning guide: Copy activity performance and scalability guide

    Hope this info helps. Let us know how it goes.

    Thank you

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.