ADF JSON Source and MySQL Destination Dataflow Performance Issue

LiamG 1 Reputation point
2022-07-05T13:54:49.913+00:00

Hi,
I've set up an Azure Data Factory data flow that uses a JSON file from Azure Blob Storage. The JSON file has 23,000 items (rows) and 4 array attributes. I've added 4 Flatten Formatters and 5 destinations that are different tables in Azure Database for MySQL.
The problem I have is that it's incredibly slow to process. It's been running for 4 hours now. Is this type of processing time to be expected or have I done something incorrectly?
Thanks,
Liam

Azure Data Factory

1 answer

  1. AnnuKumari-MSFT 34,566 Reputation points Microsoft Employee Moderator
    2022-07-06T04:56:01.357+00:00

    Hi @LiamG,

    Welcome to Microsoft Q&A platform and thanks for posting your question.

    It seems you are facing a performance issue in the mapping data flow execution. Here are a few things you can consider:

    To view detailed monitoring information of a data flow, click on the eyeglasses icon in the activity run output of a pipeline. For more information, see Monitoring mapping data flows.

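    If you want to pull the same run details programmatically instead of (or in addition to) the portal view, one option is the Python management SDK. The sketch below assumes the azure-identity and azure-mgmt-datafactory packages; the subscription, resource group, factory name, and pipeline run ID are placeholders you would substitute with your own values.

    ```python
    # Minimal sketch: query the activity runs of a pipeline run and print the
    # Data Flow activity's output, which carries the per-stage timing details
    # shown in the portal's monitoring view.
    from datetime import datetime, timedelta, timezone

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import RunFilterParameters

    subscription_id = "<subscription-id>"    # placeholder
    resource_group = "<resource-group>"      # placeholder
    factory_name = "<data-factory-name>"     # placeholder
    pipeline_run_id = "<pipeline-run-id>"    # copy from the pipeline monitoring view

    client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # Query all activity runs that belong to this pipeline run.
    now = datetime.now(timezone.utc)
    filters = RunFilterParameters(
        last_updated_after=now - timedelta(days=1),
        last_updated_before=now,
    )
    activity_runs = client.activity_runs.query_by_pipeline_run(
        resource_group, factory_name, pipeline_run_id, filters
    )

    for run in activity_runs.value:
        if run.activity_type == "ExecuteDataFlow":
            print(run.activity_name, run.duration_in_ms, "ms")
            print(run.output)  # monitoring payload for the data flow run
    ```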

    When monitoring data flow performance, there are four possible bottlenecks to look out for:

    • Cluster start-up time
    • Reading from a source
    • Transformation time
    • Writing to a sink


    Cluster start-up time is the time it takes to spin up an Apache Spark cluster. This value is located in the top-right corner of the monitoring screen. Data flows run on a just-in-time model where each job uses an isolated cluster. This start-up time generally takes 3-5 minutes. For sequential jobs, this can be reduced by enabling a time to live value. For more information, refer to the Time to live section in Integration Runtime performance.
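    If you decide to enable a time to live, one way is to create (or update) an Azure integration runtime with data flow compute properties through the same Python SDK. The sketch below is illustrative only: the runtime name, compute size, and TTL are example values, not recommendations.

    ```python
    # Minimal sketch: define a Managed (Azure) integration runtime whose Spark
    # cluster stays warm between sequential data flow runs.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        IntegrationRuntimeComputeProperties,
        IntegrationRuntimeDataFlowProperties,
        IntegrationRuntimeResource,
        ManagedIntegrationRuntime,
    )

    client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    ir_definition = IntegrationRuntimeResource(
        properties=ManagedIntegrationRuntime(
            compute_properties=IntegrationRuntimeComputeProperties(
                location="AutoResolve",
                data_flow_properties=IntegrationRuntimeDataFlowProperties(
                    compute_type="General",
                    core_count=8,
                    time_to_live=15,  # minutes the cluster stays warm after a run
                ),
            )
        )
    )

    client.integration_runtimes.create_or_update(
        "<resource-group>", "<data-factory-name>", "DataFlowIRwithTTL", ir_definition
    )
    ```

    Remember to point the Data Flow activity at this integration runtime; the time to live only helps when subsequent runs reuse the same runtime.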

    Data flows utilize a Spark optimizer that reorders and runs your business logic in 'stages' to perform as quickly as possible. For each sink that your data flow writes to, the monitoring output lists the duration of each transformation stage along with the time it takes to write data into the sink. The largest of these times is likely the bottleneck of your data flow. If the transformation stage that takes the longest contains a source, you may want to look at further optimizing your read time. If a transformation is taking a long time, you may need to repartition your data or increase the size of your integration runtime. If the sink processing time is large, you may need to scale up your database or verify that you are not outputting to a single file.
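    As a rough helper for reading that monitoring output outside the portal, you could scan the Data Flow activity's output (for example, the run.output value from the earlier sketch) for per-sink timings. Note that the runStatus/metrics field names below are an assumption about the shape of the monitoring payload rather than a documented contract, so adjust them to whatever your activity output actually contains.

    ```python
    # Rough sketch: print duration- and row-count-looking fields per sink from a
    # Data Flow activity's monitoring output. The "runStatus"/"metrics" keys are
    # assumptions about the payload layout; verify against your own output JSON.
    from typing import Any, Dict


    def summarize_dataflow_output(output: Dict[str, Any]) -> None:
        metrics = output.get("runStatus", {}).get("metrics", {})
        for sink_name, sink_metrics in metrics.items():
            print(f"Sink: {sink_name}")
            for key, value in sink_metrics.items():
                # Keep anything numeric that looks like a duration or row count.
                if isinstance(value, (int, float)) and ("Time" in key or key.startswith("rows")):
                    print(f"  {key}: {value}")


    # Example usage with the activity run fetched earlier:
    # summarize_dataflow_output(run.output)
    ```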

    Once you have identified the bottleneck of your data flow, use the optimization strategies to improve performance.

    For more details, check the Mapping data flows performance and tuning guide.

    Hope this helps. Please let us know if you have any further queries.

    ------------------------------

    • Please don't forget to click on "Accept Answer" or the upvote button whenever the information provided helps you.
      Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
    • If you are interested in joining the VM program and helping shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators
    1 person found this answer helpful.
