Optimizing Spark streaming applications

Question

I am investigating the process of submitting and executing Spark streaming applications within Synapse Spark pools.

In Synapse pipelines, the 'timeout' parameter for the 'spark job definition' activity specifies the maximum duration an activity can run. By default, this duration is set to 12 hours, with a maximum limit of 7 days. However, since streaming applications operate continuously, the 7-day limit is not suitable.

What are the recommended best practices for effectively running Spark streaming applications within Synapse, considering the continuous nature of these applications and the limitations imposed by the timeout parameter?

Accepted Answer

Hi @vikranth-0706

Thank you for reaching out to the community forum with your query.

When you start a Spark job from a pipeline, the activity times out after 12 hours by default, and the timeout can be raised to a maximum of seven days. This is because pipelines are designed for processing data in batches, not for continuous streaming, so running a pipeline indefinitely isn't recommended.

For batch processing, it's a good idea to break down the pipeline into smaller jobs.

If a streaming application needs to run for more than seven days, automate the restart process using Azure Functions or Logic Apps. These tools offer more flexibility in scheduling the jobs than the pipeline timeout allows.
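
As a rough illustration of the restart logic such a scheduled function could run, here is a minimal Python sketch. It only decides whether the job should be resubmitted; how the status is read and how the job is actually submitted are left out, because they depend on your setup. The function name `needs_restart`, the status strings, and the one-hour margin are assumptions made for this example, not part of any Synapse API.

```python
from datetime import datetime, timedelta

# From the thread: the pipeline activity timeout can be at most 7 days.
ACTIVITY_TIMEOUT = timedelta(days=7)
# Hypothetical safety margin: restart shortly before the timeout is reached.
RESTART_MARGIN = timedelta(hours=1)


def needs_restart(status: str, started_at: datetime, now: datetime) -> bool:
    """Return True when the streaming job should be (re)submitted.

    `status` is a hypothetical string reported by whatever monitoring
    you use; only the value "running" counts as healthy here.
    """
    if status != "running":
        # The job has stopped, failed, or was cancelled: submit it again.
        return True
    # The job is running but is about to hit the activity timeout.
    return now - started_at >= ACTIVITY_TIMEOUT - RESTART_MARGIN


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    print(needs_restart("running", start, start + timedelta(days=1)))
    print(needs_restart("running", start, start + timedelta(days=6, hours=23)))
```

A timer-triggered function could call a check like this every few minutes and resubmit the job when it returns True. If the streaming query writes to a checkpoint location, the resubmitted job should be able to resume from where the previous run stopped.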

Hope this helps. Do let us know if you have any further queries.

