How to fix an "out of memory" exception while processing a pipeline of around 43 GB of data using copy activity?

Lakshmi Moulya Nerella 0 Reputation points
2024-06-24T10:25:41.07+00:00

I am processing a pipeline of around 43 GB of data using copy activity, and I am getting this error:

ErrorCode=SystemErrorOutOfMemory,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=A task failed with out of memory.,Source=,''Type=System.OutOfMemoryException,Message=Array dimensions exceeded supported range.,Source=System.Core,'

Failure Type is "SystemError"

Any workaround for this would be helpful.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

2 answers

  1. 梁友泽 75 Reputation points
    2024-06-24T10:33:50.2533333+00:00

    When processing large datasets like 43 GB using a copy activity, encountering an "out of memory" exception is a common issue. This typically occurs because the system is attempting to load too much data into memory simultaneously. Here are several strategies to mitigate this issue:

    Increase Resource Allocation:

    • Ensure that the machine running the pipeline has enough memory and CPU. Sometimes increasing the available resources is enough to solve the problem.

    Use Parallelism:

    • Enable parallel copy in the activity settings so the transfer is split across multiple threads. Adjust the degree of parallelism to find a setting your system can handle; more concurrent threads can also raise memory use, so a higher value is not always better.

    Use Staging:

    • If you're copying data between different services (e.g., from on-premises to cloud), consider using a staging store such as Azure Blob Storage as an intermediate step. This breaks the transfer into smaller, more manageable steps.

    Batch Processing:

    • Break the data into smaller chunks or batches and process each batch separately, so the entire dataset is never loaded into memory at once.

    Data Compression:

    • If the data is not already compressed, consider compressing it before the transfer to reduce the volume of data that has to be moved.

    Optimize Source and Sink Configuration:

    • Ensure that the source and sink are configured for large data transfers. This includes setting appropriate timeouts, increasing buffer sizes, and using efficient data formats.

    Monitoring and Scaling:

    • Monitor memory usage while the pipeline runs and scale your resources up or down based on what you observe.

    Error Handling and Retries:

    • Implement error handling and retry logic. Transient errors can sometimes cause memory issues, and a retry strategy can help the transfer complete.
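The parallelism, staging, and retry suggestions above correspond to properties on the copy activity definition. The following is a minimal sketch that builds such a definition as a Python dict and prints it as JSON. The activity name, the source and sink types, the linked service name, the staging path, and every numeric value are placeholders chosen for illustration, not recommendations; dataset references (`inputs`/`outputs`) are left out for brevity.

```python
import json


def build_copy_activity(parallel_copies=4, use_staging=True, retries=2):
    """Build a copy activity definition as a plain dict (illustrative only)."""
    type_properties = {
        # Placeholder source/sink types; use the ones that match your datasets.
        "source": {"type": "DelimitedTextSource"},
        "sink": {"type": "DelimitedTextSink"},
        # Degree of parallelism; tune it rather than assuming higher is better.
        "parallelCopies": parallel_copies,
    }
    if use_staging:
        type_properties["enableStaging"] = True
        type_properties["stagingSettings"] = {
            "linkedServiceName": {
                "referenceName": "MyStagingBlobStorage",  # hypothetical name
                "type": "LinkedServiceReference",
            },
            "path": "staging-container/copy-temp",  # hypothetical path
        }
    return {
        "name": "CopyLargeDataset",  # hypothetical name
        "type": "Copy",
        # Retry policy for transient failures; values are assumptions.
        "policy": {
            "retry": retries,
            "retryIntervalInSeconds": 60,
            "timeout": "0.12:00:00",
        },
        "typeProperties": type_properties,
    }


activity = build_copy_activity()
print(json.dumps(activity, indent=2))
```

Keeping the settings in one function makes it easy to try a few `parallel_copies` values and compare memory usage between runs.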

  2. Bhargava-MSFT 28,946 Reputation points Microsoft Employee
    2024-06-24T17:53:32.0033333+00:00

    Hello Lakshmi Moulya Nerella,

    Are you using a self-hosted integration runtime (SHIR) here?

    Here are a few suggestions:

    1. Register a self-hosted IR on a more powerful machine (high CPU/memory), bring it online, and use it to read the big file through the copy activity.
    2. Use a memory-optimized, large cluster (for example, 48 cores...) to read the big file through a data flow activity.

    3. Split the big file into smaller ones, then use a copy or data flow activity to read the folder.
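One way to do the split in suggestion 3 is to stream the file line by line and write numbered part files into a folder. The sketch below assumes the source is delimited text with one record per line and an optional header row; the question does not state the file format, so treat that as an assumption. The function name, part-file naming, and default chunk size are hypothetical.

```python
import os


def split_file_by_lines(src_path, out_dir, lines_per_part=1_000_000, has_header=True):
    """Stream src_path into numbered part files without loading it all into memory.

    Assumes one record per line (no quoted fields containing line breaks).
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    out = None
    try:
        # newline="" keeps the original line endings untouched.
        with open(src_path, "r", encoding="utf-8", newline="") as src:
            header = src.readline() if has_header else ""
            for index, line in enumerate(src):
                if index % lines_per_part == 0:
                    # Start a new part file and repeat the header in it.
                    if out is not None:
                        out.close()
                    part_path = os.path.join(out_dir, f"part-{len(part_paths):05d}.csv")
                    out = open(part_path, "w", encoding="utf-8", newline="")
                    out.write(header)
                    part_paths.append(part_path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return part_paths
```

For example, `split_file_by_lines("big.csv", "parts", lines_per_part=500_000)` would write `parts/part-00000.csv`, `parts/part-00001.csv`, and so on, and the copy or data flow activity can then point at the `parts` folder. Only one line is held in memory at a time, so the size of the original file does not matter.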
