Issue copying a large Excel file (over 200 MB) from SFTP to the data lake in a Synapse pipeline

sumedhRanadhir 21 Reputation points
2022-11-22T07:46:13.79+00:00

While copying a large Excel file (over 200 MB) from SFTP to the data lake, the Synapse pipeline gives a timeout error after running for more than 12 hours.

  1. I am using a Copy activity.
  2. I am able to copy smaller files successfully; however, for the actual (larger) files it gives an error.
  3. I cannot use a Data Flow activity because my linked service uses a self-hosted integration runtime (SHIR).
  4. The source is an Excel file on SFTP and the destination is a Parquet file on Data Lake Storage Gen2.

I am open to any solution that works with a Synapse pipeline.


Accepted answer
  MartinJaffer-MSFT 26,056 Reputation points
  2022-11-22T21:43:53.91+00:00

    Hello @sumedhRanadhir,
    Thanks for the question and for using the MS Q&A platform.

    So as I understand it, you want to take two actions: copy from SFTP to Data Lake Storage Gen2, and transform from Excel to Parquet. Currently you are doing both at once, the file is over 200 MB, and the activity is taking too long.

    Excel files are handled a little differently than other formats such as delimited text, or even some simple database operations. When an Excel file is opened, the entire thing gets loaded into memory, which can be very, very expensive. The other file types can be processed a piece at a time; not so for Excel.

    To make things worse, a self-hosted IR (SHIR) often has less capacity than an Azure IR, because the SHIR is tied to its host machine, whereas the Azure IR can scale out.

    I suggest you break this into two operations.
    First, do the copy using a Binary dataset instead of an Excel dataset, to move the file from on-premises to the data lake (a rough sketch of that copy activity is below).
    Second, do another copy using Excel and Parquet datasets to change the format (see the second sketch further down).
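
    Here is a minimal sketch of what that first copy activity could look like in the pipeline's JSON code view. The names (StageExcelFromSftp, SftpBinarySource, LakeBinaryStaging) are placeholders for your own activity and datasets; the key point is that both source and sink use the Binary type, so the SHIR just streams bytes and never parses the file.

        {
            "name": "StageExcelFromSftp",
            "type": "Copy",
            "inputs": [ { "referenceName": "SftpBinarySource", "type": "DatasetReference" } ],
            "outputs": [ { "referenceName": "LakeBinaryStaging", "type": "DatasetReference" } ],
            "typeProperties": {
                "source": { "type": "BinarySource", "storeSettings": { "type": "SftpReadSettings" } },
                "sink": { "type": "BinarySink", "storeSettings": { "type": "AzureBlobFSWriteSettings" } }
            }
        }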

    The benefit of this is that the movement to Azure is simple and fast. The Binary dataset type can handle any file; it doesn't care about the format because it never opens the file for parsing, so there is no massive load into memory.
    Then you can take advantage of the Azure IR to do the transform. This way you get the benefit of more power.

    The first copy should be very fast, as it is just moving data; less than 10 minutes is my guess.
    The transform should be much faster than now, but will still take a significant amount of time, since large Excel files can get messy. You also have the option of using a Data Flow for this step if you want, as the data is already in Azure.
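
    For the second copy, a rough sketch follows (again with placeholder names such as LakeExcelStaged and LakeParquetOutput): the source is an Excel dataset pointing at the staged file (the dataset is where you set sheetName and firstRowAsHeader) and the sink is a Parquet dataset. Since both sides now sit in the data lake, the linked services can use the Azure IR.

        {
            "name": "ConvertExcelToParquet",
            "type": "Copy",
            "inputs": [ { "referenceName": "LakeExcelStaged", "type": "DatasetReference" } ],
            "outputs": [ { "referenceName": "LakeParquetOutput", "type": "DatasetReference" } ],
            "typeProperties": {
                "source": { "type": "ExcelSource", "storeSettings": { "type": "AzureBlobFSReadSettings" } },
                "sink": { "type": "ParquetSink", "storeSettings": { "type": "AzureBlobFSWriteSettings" } }
            }
        }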

    As you have it now, the SHIR needs much more than 200 MB of RAM, especially if the Excel file contains more than just numbers. I suspect it got into a high cache-miss situation. Having the SHIR only copy the file, not parse the Excel, will definitely help.

    Please do let me know if you have any queries.

    Thanks
    Martin



0 additional answers
