Help tuning an Azure Synapse ETL pattern

Jeff Born (J&CLT-ATL) 106 Reputation points
2023-08-16T21:41:20.32+00:00

I'm in the process of bringing our first Azure Synapse ETL process online, and the first volume test failed miserably. I'm looking to the community to help me understand whether this is just a configuration issue or whether our approach is all wrong. We have a new Order Microservice that creates or updates an order. This triggers a message to the Service Bus. Another service pulls the message off the Service Bus and creates a file per create or update. This file is written to a container in our ADLS Gen 2 storage account. All of this is working at volume.

The problem comes in on the Synapse side.

  1. We have a storage event trigger that fires for each file placed in a specific folder of the container whose name ends with order.parquet. The trigger wasn't firing unless I set "Ignore empty blobs" to "No".
  2. The trigger is tied to a two-step pipeline:
    1. The first step figures out the name of the environment and sets the storage account name.
    2. The second step calls a Synapse Notebook to process the file, with:
      • 2 base parameters: SourceFileName and StorageAccountName (set in the previous step)
      • Executor size: Small
      • Dynamically allocate executors: enabled
      • Min & max executors: blank/not set
      • Driver size: Small
      • Concurrency: the default of 1 (this might be important later?)
  3. The Synapse Notebook is in PySpark and has 6 code blocks (a rough sketch of the whole flow follows this list):
    1. A parameters block
    2. Read the source file
    3. Read the destination table/file (the destination is a partitioned parquet file in a folder on the same storage account as the source file)
    4. A spark.sql command to read the source
    5. A spark.sql command to read the destination
    6. A final step that:
      • unions the source and destination
      • writes the resulting dataframe to a temp folder on the same storage account as above
      • reads this write back into an output dataframe
      • writes the output dataframe to the destination
      • removes the temp parquet folder
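
Roughly, the notebook looks like this; the container, folder, partition column, and parameter values below are placeholders, not our real names:

```python
# Sketch of the current per-file notebook (placeholder names throughout).
# `spark` is the SparkSession provided by the Synapse notebook session.
from notebookutils import mssparkutils  # available by default in Synapse notebooks

# 1. parameters block, populated by the pipeline's base parameters
SourceFileName = "12345-order.parquet"   # placeholder
StorageAccountName = "mystorageaccount"  # placeholder, set by the pipeline's first step

base = f"abfss://orders@{StorageAccountName}.dfs.core.windows.net"

# 2./4. read the single source file and expose it to spark.sql
spark.read.parquet(f"{base}/incoming/{SourceFileName}").createOrReplaceTempView("source")

# 3./5. read the partitioned destination and expose it to spark.sql
spark.read.parquet(f"{base}/curated/orders").createOrReplaceTempView("destination")

# 6. union, stage to a temp folder (so we never overwrite a path still being read),
#    rewrite the destination, then clean up
merged = spark.sql("SELECT * FROM source UNION ALL SELECT * FROM destination")
merged.write.mode("overwrite").parquet(f"{base}/tmp/orders_staging")

out_df = spark.read.parquet(f"{base}/tmp/orders_staging")
out_df.write.mode("overwrite").partitionBy("order_date").parquet(f"{base}/curated/orders")  # partition column is a placeholder

mssparkutils.fs.rm(f"{base}/tmp/orders_staging", True)  # True = recursive delete
```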

I'm getting this error frequently: "Execution has been throttled as concurrency value is set and maximum number of waiting runs have been queued. Please wait for current runs to complete or cancel stuck runs." It's as if I can only process the first 100 files before this error starts popping up.

Digging around, I also see an Apache Spark application spin up about every 90 seconds and run for just over a minute.


I have multiple questions at this point:

  • Is processing single files one after the other like this even viable? Nothing here is doing parallel processing of multiple files at the same time.
  • The Synapse pipeline is taking a little over a minute once it actually starts to run.
    • I was expecting each pipeline run to take milliseconds; what am I missing here?
  • I'm seeing the storage event trigger fail on 0-byte files, with over 1,000 failures like this. Do I need to have a .success file written after the real file is completely written, and trigger my processing of the real file off that instead?
  • I'm thinking that if multiple files come in at the same time they should be batched/partitioned somehow, and once the batch has been written I could process multiple files at once, saving the spin-up time of the pipeline and Synapse notebook. If so, what does that look like / how do I do that?
  • What am I doing wrong/what configurations did I miss?
Tags: Azure Data Lake Storage, Azure Synapse Analytics, Azure Data Factory

Accepted answer
Amira Bedhiafi 31,396 Reputation points
2023-08-17T15:08:58.13+00:00

Without more details about your environment, I am assuming the following (I will answer point by point):

Processing single files one after the other is possible but can be inefficient in a distributed environment like Azure Synapse. Spark is designed to handle parallel processing of data. Processing individual files one at a time might lead to a lot of overhead, particularly in your case, where the pipeline execution time seems quite significant.

A minute per run may seem high, but that's not unusual for Spark. The overhead comes from provisioning resources, initializing the job, and reading and writing data. If you're looking for millisecond-level processing, Spark may not be the best choice; Spark excels at processing large batches of data in parallel.

Coming back to the error message you mentioned: it indicates that you are hitting the limit on the number of concurrent pipeline runs (the pipeline's concurrency setting). You can either increase this limit or adjust your approach so that each run processes more files.

If you're seeing 0-byte files, it could be that the trigger is firing before the write is complete. Using a success file, as you mentioned, can be a way to ensure that the trigger only fires after the write is complete.
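
For example, the writer service could publish a zero-byte marker only after the data file is fully written, and the storage event trigger could then match names ending in .success instead of order.parquet. A rough sketch of the writer side, with made-up names and using the Python SDK purely for illustration (your writer may well be in another language):

```python
# Sketch: write the data file first, then a zero-byte ".success" marker.
# The Synapse storage event trigger would match "*.success" instead of "*order.parquet".
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("orders")  # placeholder container

parquet_bytes = b"..."  # the serialized parquet content produced by your order service

data_file = fs.get_file_client("incoming/12345-order.parquet")
data_file.upload_data(parquet_bytes, overwrite=True)

# Only after the data file is completely written, publish the marker the trigger listens for.
marker = fs.get_file_client("incoming/12345-order.parquet.success")
marker.upload_data(b"", overwrite=True)
```

The notebook would then derive the data file path by stripping the .success suffix from the file name the trigger passes in.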

It is recommended to process multiple files in parallel. You could consider triggering the pipeline at regular intervals (for example every 10 minutes) and then processing all available files in that batch. This would allow you to take advantage of Spark's parallel processing capabilities, reduce the overhead of spinning up individual jobs, and likely improve performance.
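
A minimal sketch of what that batched notebook could look like, assuming the same folder layout as your per-file version; every path, container, and column name below is a placeholder:

```python
# Sketch: one scheduled run picks up every order file that has arrived since the
# last run, instead of one pipeline/notebook run per file.
# `spark` is the SparkSession provided by the Synapse notebook session.
from notebookutils import mssparkutils  # available by default in Synapse notebooks

StorageAccountName = "mystorageaccount"  # placeholder pipeline parameter
base = f"abfss://orders@{StorageAccountName}.dfs.core.windows.net"
incoming = f"{base}/incoming"
dst_path = f"{base}/curated/orders"
staging = f"{base}/tmp/orders_staging"

# Read every pending order file in one go; Spark parallelizes across the files.
src_df = spark.read.parquet(f"{incoming}/*order.parquet")
dst_df = spark.read.parquet(dst_path)

# Same union-and-rewrite pattern as before, but amortized over the whole batch.
# Staging first avoids overwriting a path that is still being read.
merged = src_df.unionByName(dst_df)
merged.write.mode("overwrite").parquet(staging)

spark.read.parquet(staging) \
    .write.mode("overwrite").partitionBy("order_date").parquet(dst_path)  # partition column is a placeholder

# Move the processed files out of the trigger folder so the next run only sees new ones.
for f in mssparkutils.fs.ls(incoming):
    if f.name.endswith("order.parquet"):
        mssparkutils.fs.mv(f.path, f"{base}/processed/{f.name}", True)  # True = create parent dirs

mssparkutils.fs.rm(staging, True)  # True = recursive delete
```

With a schedule or tumbling window trigger there is only one run per interval, so the concurrency limit of 1 should also stop being a bottleneck.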

