Re-use a data flow's active cluster

Junior Steve KAMDEM DJOKO 0 Reputation points
2025-01-08T10:02:50.15+00:00

I have the following scenario: users can drop CSV files into the data lake at any time.

I need to check each CSV file in the data lake in real time. Part of the verification is done with data flows in Azure Synapse Analytics. The configuration is the following:

  • The Synapse workspace is not in a managed virtual network
  • Every data flow runs on a unique Azure Integration Runtime (not the default Azure IR)
  • The TTL (Time To Live) for that runtime is set to 1 hour, meaning the Spark cluster (for data flows) remains active for one hour

This means that if a first file is dropped, I'd like a second file dropped, say, 10 minutes later to use the same active data flow cluster that processed the first file.

However, this is not always the case. Sometimes a new file is processed by the same cluster, but sometimes a new cluster is started even though there is already an active one.

How can I ensure that, as long as a cluster is active, all newly arrived CSV files are processed by it?


2 answers

  1. Vinodh247 27,281 Reputation points MVP
    2025-01-09T00:36:41.8066667+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    There are several factors and techniques that can increase the likelihood that newly arrived CSV files will reuse an already active Spark cluster in Synapse data flows. However, be aware that, given Synapse's orchestration logic, there is no absolute guarantee that the same cluster will always be reused.

    • Batch or sequentially handle new CSV files rather than firing up completely separate pipelines per file drop.
    • Restrict concurrency (trigger and pipeline) so that you do not spin up multiple Data Flows in parallel.
    • Point all Data Flow executions to the same Azure IR with an appropriate TTL.
    • Reduce parallel runs so that the same cluster can be reused.
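    The second bullet (restricting pipeline concurrency) can be set in the pipeline definition itself. A minimal sketch, assuming the standard Data Factory/Synapse pipeline JSON schema, with the activity's `typeProperties` elided; the pipeline name is hypothetical:

    ```json
    {
      "name": "ProcessCsvPipeline",
      "properties": {
        "concurrency": 1,
        "activities": [
          {
            "name": "VerifyCsvDataFlow",
            "type": "ExecuteDataFlow"
          }
        ]
      }
    }
    ```

    With `"concurrency": 1`, overlapping pipeline runs are queued rather than executed in parallel, so a second file drop waits for the first run instead of requesting a second cluster.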

    With these steps, you will significantly increase the chance that new CSV files are processed by the already active Spark cluster, taking advantage of the 1-hour TTL whenever possible.

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.


  2. phemanth 13,145 Reputation points Microsoft Vendor
    2025-01-10T06:28:47.4766667+00:00

    @Junior Steve KAMDEM DJOKO

    Thanks for using Microsoft Q&A forum and posting your query.

    To ensure that all new CSV files are processed by the same active dataflow cluster in Azure Synapse Analytics, you can follow these steps:

    1. Set the Time To Live (TTL) for the Integration Runtime (IR): You've already set the TTL to 1 hour, which is good. This keeps the cluster alive for a specified period after an execution completes, allowing new jobs to reuse the existing cluster if they start within the TTL period.
    2. Enable the Quick Re-use Option: In the Azure Integration Runtime, under Data Flow Properties, set the "Quick re-use" option to true. This prevents the service from tearing down the existing cluster after each job, keeping the compute environment alive for the TTL period.
    3. Sequential Execution: Ensure that your data flows are executed sequentially rather than in parallel. Only one job can run on a single cluster at a time; if multiple data flows start simultaneously, a new cluster will be spun up for each.
    4. Monitor Cluster Usage: Regularly monitor cluster usage and adjust the TTL settings if necessary. This helps optimize the cluster's availability and resource utilization.
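    Points 1 and 2 together correspond to the data flow properties on the Azure IR definition. As a sketch, assuming the standard Data Factory/Synapse integration runtime JSON schema (the IR name is hypothetical; `timeToLive` is in minutes, and `"cleanup": false` is the setting behind "Quick re-use"):

    ```json
    {
      "name": "DataFlowAzureIR",
      "properties": {
        "type": "Managed",
        "typeProperties": {
          "computeProperties": {
            "location": "AutoResolve",
            "dataFlowProperties": {
              "computeType": "General",
              "coreCount": 8,
              "timeToLive": 60,
              "cleanup": false
            }
          }
        }
      }
    }
    ```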

    By following these steps, you can improve the chances of reusing the same active cluster for processing new CSV files as they arrive.
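    The sequential-execution advice can also be enforced outside of Synapse: if file-drop events are funneled through a single queue with one worker, data flow runs never overlap, so the warm cluster can be reused for each file. A minimal Python sketch, where `run_dataflow` is a hypothetical placeholder for whatever call actually triggers your pipeline (e.g. via the Synapse REST API):

    ```python
    import queue
    import threading

    def run_dataflow(file_name: str) -> str:
        # Hypothetical placeholder for the real call that triggers the
        # Synapse pipeline containing the data flow for one file.
        return f"processed {file_name}"

    def serialize_runs(files):
        """Process dropped files one at a time through a single queue,
        so at most one data flow run is active and the warm cluster
        (within its TTL) can be reused for each file."""
        work = queue.Queue()
        results = []

        def worker():
            while True:
                name = work.get()
                if name is None:  # sentinel: no more files
                    break
                results.append(run_dataflow(name))
                work.task_done()

        t = threading.Thread(target=worker)
        t.start()
        for f in files:
            work.put(f)  # in production, files arrive at any time
        work.put(None)
        t.join()
        return results
    ```

    Because a single worker drains the queue, two files dropped 10 minutes apart are processed back to back rather than in parallel, which is the precondition for cluster reuse.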

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

