Hi @Priya Jha,
Thanks for using Microsoft Q&A forum and posting your query.
By default, every data flow activity spins up a new Spark cluster based on the Azure Integration Runtime (IR) configuration. A cold cluster start-up takes a few minutes. If your pipelines contain multiple sequential data flows, you can enable a time-to-live (TTL) value, which keeps a cluster alive for a certain period of time after its execution completes. If a new job starts using the IR during the TTL window, it will reuse the existing cluster and the start-up time will be greatly reduced.
After the second job completes, the cluster will again stay alive for the TTL time.
You can further minimize the start-up time of warm clusters by enabling the "Quick re-use" option in the Azure Integration Runtime under Data Flow Properties. Setting this to true tells the service not to tear down the existing cluster after each job and instead re-use it, essentially keeping the compute environment you've configured in your Azure IR alive for up to the period specified in your TTL. This option gives the shortest start-up time for your data flow activities when they execute from a pipeline.
However, if most of your data flows/pipelines execute in parallel, it is not recommended that you enable TTL for the IR that you use for those activities. Only one job can run on a single cluster at a time. If there is an available cluster, but two data flows start, only one will use the live cluster. The second job will spin up its own isolated cluster.
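If you prefer to set the TTL and quick re-use programmatically rather than through the portal UI, here is a minimal sketch using the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, and IR names are placeholders, and the exact model and parameter names may differ slightly between SDK versions; the intent is only to show where timeToLive and the cleanup flag (cleanup=False corresponds to "Quick re-use") live in the IR definition.

```python
# Sketch: create/update an Azure IR with TTL and quick re-use for data flows.
# Placeholders (<...>) must be replaced; model names assume a recent
# azure-mgmt-datafactory version and may vary slightly between releases.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

ir_definition = ManagedIntegrationRuntime(
    compute_properties=IntegrationRuntimeComputeProperties(
        location="AutoResolve",
        data_flow_properties=IntegrationRuntimeDataFlowProperties(
            compute_type="General",
            core_count=8,
            time_to_live=10,   # minutes the warm cluster stays alive after a run
            cleanup=False,     # False = quick re-use: don't tear down between jobs
        ),
    )
)

client.integration_runtimes.create_or_update(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    integration_runtime_name="DataFlowIR-TTL",
    integration_runtime=IntegrationRuntimeResource(properties=ir_definition),
)
```

Any data flow activity that references this IR will then pick up the warm cluster during the TTL window, subject to the one-job-per-cluster limitation described above.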
Is it possible that I create a dedicated cluster and use it in all my data flows, instead of each data flow creating its own cluster?
- No, that is not possible. Attaching your own cluster is supported in Azure Databricks, but not in ADF, because data flow clusters are fully managed by the ADF service.
If the above scenario is not possible, then which offering can be used to run multiple Python scripts on a dedicated cluster, apart from Databricks and Azure Batch, since we don't have permission to use those two offerings in our project?
- Other than Databricks and Azure Batch, you could explore Azure Synapse Analytics notebooks to execute your Python code. Synapse notebooks (which are Apache Spark notebooks) support four Apache Spark languages:
a) PySpark (Python)
b) Spark (Scala)
c) Spark SQL
d) .NET Spark (C#)
To explore more about Synapse notebooks, please refer here - Create, develop, and maintain Synapse notebooks in Azure Synapse Analytics
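As a quick illustration of the PySpark option, here is a minimal Synapse notebook cell. The storage account, container, file paths, and column names below are placeholders for your own data; the `spark` session is provided automatically inside a Synapse notebook.

```python
# Example Synapse notebook cell (PySpark). Replace the <...> placeholders
# and column names with values from your own workspace and data.
df = spark.read.csv(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/input/sales.csv",
    header=True,
    inferSchema=True,
)

# A simple transformation: total amount per region (hypothetical columns).
summary = df.groupBy("region").sum("amount")
summary.show()

# Persist the result back to the lake as Parquet.
summary.write.mode("overwrite").parquet(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/output/sales_summary"
)
```

The notebook runs on a Synapse Apache Spark pool, which you can size and configure to act as your shared compute environment, and it can be orchestrated from a Synapse pipeline if you need scheduling.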
Hope this info helps.
----------
- Please don't forget to click on the Accept Answer and Upvote buttons whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
- Want a reminder to come back and check responses? Here is how to subscribe to a notification
- If you are interested in joining the VM program and helping shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators