Help tuning an Azure Synapse ETL pattern

Jeff Born (J&CLT-ATL) 106 Reputation points
2023-08-16T21:41:20.32+00:00

I'm in the process of bringing our first Azure Synapse ETL process online, and the first volume test failed miserably. I'm looking to the community to help me understand whether this is just a configuration issue or whether our approach is all wrong. We have a new Order Microservice that creates or updates an order. This triggers a message to the Service Bus. Another service pulls the message off the Service Bus and creates a file per create or update. This file is written to a container in our ADLS Gen 2 storage account. All of this is working at volume.

The problem comes in on the Synapse side.

  1. We have a storage event trigger that fires for each file placed in a specific folder of the container whose name ends with order.parquet. The trigger wasn't firing unless I set "Ignore empty blobs" to "No".
  2. The trigger is tied to a two-step pipeline:
    1. The first step figures out the name of the environment and sets the storage account name.
    2. The second step calls a Synapse Notebook to process the file, with:
      • 2 base parameters: SourceFileName and StorageAccountName (set in the previous step)
      • Executor size: Small
      • Dynamically allocate executors: enabled
      • Min & max executors: blank/not set
      • Driver size: Small
      • Concurrency: the default of 1 (this might be important later?)
  3. The Synapse Notebook is in PySpark and has 6 code blocks (a rough sketch of the whole flow follows this list):
    1. A parameters block
    2. Read the source file
    3. Read the destination table/file (the destination is a partitioned parquet file in a folder on the same storage account as the source file)
    4. A spark.sql command to read the source
    5. A spark.sql command to read the destination
    6. A final step that:
      • unions the source and destination
      • writes the resulting dataframe to a temp folder on the same storage account as above
      • reads this write back into an output dataframe
      • writes the output dataframe to the destination
      • removes the temp parquet folder
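
Roughly, the notebook looks like this; the container, folder, partition column, and parameter values below are placeholders, not our real names:

```python
# Sketch of the current per-file notebook (placeholder names throughout).
# `spark` is the SparkSession provided by the Synapse notebook session.
from notebookutils import mssparkutils  # available by default in Synapse notebooks

# 1. parameters block, populated by the pipeline's base parameters
SourceFileName = "12345-order.parquet"   # placeholder
StorageAccountName = "mystorageaccount"  # placeholder, set by the pipeline's first step

base = f"abfss://orders@{StorageAccountName}.dfs.core.windows.net"

# 2./4. read the single source file and expose it to spark.sql
spark.read.parquet(f"{base}/incoming/{SourceFileName}").createOrReplaceTempView("source")

# 3./5. read the partitioned destination and expose it to spark.sql
spark.read.parquet(f"{base}/curated/orders").createOrReplaceTempView("destination")

# 6. union, stage to a temp folder (so we never overwrite a path still being read),
#    rewrite the destination, then clean up
merged = spark.sql("SELECT * FROM source UNION ALL SELECT * FROM destination")
merged.write.mode("overwrite").parquet(f"{base}/tmp/orders_staging")

out_df = spark.read.parquet(f"{base}/tmp/orders_staging")
out_df.write.mode("overwrite").partitionBy("order_date").parquet(f"{base}/curated/orders")  # partition column is a placeholder

mssparkutils.fs.rm(f"{base}/tmp/orders_staging", True)  # True = recursive delete
```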

I'm getting this error frequently: "Execution has been throttled as concurrency value is set and maximum number of waiting runs have been queued. Please wait for current runs to complete or cancel stuck runs." It's as if I can only process the first 100 files before this error starts popping up.

Digging around, I also see an Apache Spark application spin up about every 90 seconds and run for just over a minute.


I have multiple questions at this point:

  • Is processing single files one after the other like this even viable? Nothing here is doing parallel processing of multiple files at the same time.
  • The Synapse pipeline is taking a little over a minute once it actually starts to run.
    • I was expecting each pipeline run to take milliseconds; what am I missing here?
  • I'm seeing the storage event trigger fail on 0-byte files, with over 1,000 failures like this. Do I need to have a .success file written after the real file is completely written, and trigger my processing of the real file off that instead?
  • I'm thinking that if multiple files come in at the same time they should be batched/partitioned somehow, and once the batch has been written I could process multiple files at once, saving the spin-up time of the pipeline and Synapse notebook. If so, what does that look like / how do I do that?
  • What am I doing wrong/what configurations did I miss?
Tags: Azure Data Lake Storage, Azure Synapse Analytics, Azure Data Factory

Accepted answer
Amira Bedhiafi 31,396 Reputation points
2023-08-17T15:08:58.13+00:00

Without more details about your environment, I am assuming the following (I will answer point by point):

Processing single files one after the other is possible but can be inefficient in a distributed environment like Azure Synapse. Spark is designed to handle parallel processing of data. Processing individual files one at a time might lead to a lot of overhead, particularly in your case, where the pipeline execution time seems quite significant.

A minute per run may seem high, but that's not unusual for Spark. The overhead comes from provisioning resources, initializing the job, and reading and writing data. If you're looking for millisecond-level processing, Spark may not be the best choice; Spark excels at processing large batches of data in parallel.

Coming back to the error message you mentioned: it indicates that you are hitting the limit on the number of concurrent pipeline runs (the pipeline's concurrency setting). You can either increase this limit or adjust your approach so that each run processes more files.

If you're seeing 0-byte files, it could be that the trigger is firing before the write is complete. Using a success file, as you mentioned, can be a way to ensure that the trigger only fires after the write is complete.
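
For example, the writer service could publish a zero-byte marker only after the data file is fully written, and the storage event trigger could then match names ending in .success instead of order.parquet. A rough sketch of the writer side, with made-up names and using the Python SDK purely for illustration (your writer may well be in another language):

```python
# Sketch: write the data file first, then a zero-byte ".success" marker.
# The Synapse storage event trigger would match "*.success" instead of "*order.parquet".
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("orders")  # placeholder container

parquet_bytes = b"..."  # the serialized parquet content produced by your order service

data_file = fs.get_file_client("incoming/12345-order.parquet")
data_file.upload_data(parquet_bytes, overwrite=True)

# Only after the data file is completely written, publish the marker the trigger listens for.
marker = fs.get_file_client("incoming/12345-order.parquet.success")
marker.upload_data(b"", overwrite=True)
```

The notebook would then derive the data file path by stripping the .success suffix from the file name the trigger passes in.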

It is recommended to process multiple files in parallel. You could consider triggering the pipeline at regular intervals (for example every 10 minutes) and then processing all available files in that batch. This would allow you to take advantage of Spark's parallel processing capabilities, reduce the overhead of spinning up individual jobs, and likely improve performance.
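
A minimal sketch of what that batched notebook could look like, assuming the same folder layout as your per-file version; every path, container, and column name below is a placeholder:

```python
# Sketch: one scheduled run picks up every order file that has arrived since the
# last run, instead of one pipeline/notebook run per file.
# `spark` is the SparkSession provided by the Synapse notebook session.
from notebookutils import mssparkutils  # available by default in Synapse notebooks

StorageAccountName = "mystorageaccount"  # placeholder pipeline parameter
base = f"abfss://orders@{StorageAccountName}.dfs.core.windows.net"
incoming = f"{base}/incoming"
dst_path = f"{base}/curated/orders"
staging = f"{base}/tmp/orders_staging"

# Read every pending order file in one go; Spark parallelizes across the files.
src_df = spark.read.parquet(f"{incoming}/*order.parquet")
dst_df = spark.read.parquet(dst_path)

# Same union-and-rewrite pattern as before, but amortized over the whole batch.
# Staging first avoids overwriting a path that is still being read.
merged = src_df.unionByName(dst_df)
merged.write.mode("overwrite").parquet(staging)

spark.read.parquet(staging) \
    .write.mode("overwrite").partitionBy("order_date").parquet(dst_path)  # partition column is a placeholder

# Move the processed files out of the trigger folder so the next run only sees new ones.
for f in mssparkutils.fs.ls(incoming):
    if f.name.endswith("order.parquet"):
        mssparkutils.fs.mv(f.path, f"{base}/processed/{f.name}", True)  # True = create parent dirs

mssparkutils.fs.rm(staging, True)  # True = recursive delete
```

With a schedule or tumbling window trigger there is only one run per interval, so the concurrency limit of 1 should also stop being a bottleneck.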

