ADF Debug pipeline: Use activity runtime

Veena 0 Reputation points
2024-07-11T10:27:54.8133333+00:00

What is the difference between 'Use data flow debug session' and 'Use activity runtime' in debug in Azure Data Factory?

I have a pipeline with a Lookup activity followed by a ForEach activity.

A lookup file is used to store parameters, and the Lookup activity's output is passed to the ForEach activity.

The ForEach activity invokes another pipeline that contains a dataflow. The compute size and core count are passed as parameters from the lookup file; the core count is defined as an integer value.
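
Roughly, the wiring looks like this (activity and pipeline names are simplified here, and this is not the exact pipeline JSON):

```json
{
    "name": "ForEachParameterSet",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@activity('LookupParams').output.value",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "RunInnerPipeline",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": { "referenceName": "InnerPipeline", "type": "PipelineReference" },
                    "parameters": {
                        "p_compute_size": "@item().computeSize",
                        "p_compute_type": "@item().computeType",
                        "p_core_count": "@item().coreCount"
                    }
                }
            }
        ]
    }
}
```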

  • On running the debug with the 'Use data flow debug session' option, the pipeline runs successfully. These are the compute parameter values passed as input to the data flow:

```json
"compute": {
    "coreCount": "16",
    "computeType": "'General'"
},
```

  • Input to the inner pipeline [which contains the dataflow]:

```json
"p_compute_size": "'custom'",
"p_compute_type": "'General'",
"p_core_count": "16"
```

However, the other option fails.

  • On running the debug with 'Use activity runtime', the ForEach activity fails with the error 'Failure type - User configuration issue'. The dataflow fails with the error: The request failed with status code '"BadRequest"'. The parameters passed are the same:

```json
"compute": {
    "coreCount": "16",
    "computeType": "'General'"
},
```

  • Input to the inner pipeline [which contains the dataflow]:

```json
"p_compute_size": "'custom'",
"p_compute_type": "'General'",
"p_core_count": "16"
```

This is how it is defined in the lookup file:

```json
"computeSize": "custom",
"computeType": "General",
"coreCount": 16
```

I don't understand why one option runs but the other fails, even with the exact same parameter values.


1 answer

  1. AnnuKumari-MSFT 33,476 Reputation points Microsoft Employee
    2024-07-11T18:36:23.5533333+00:00

    Hi Veena ,

    Welcome to Microsoft Q&A platform and thanks for posting your query here.

    The difference between these two options lies in which integration runtime (IR) cluster is used when the data flow runs during debug.

    'Use data flow debug session':

    This option lets you debug a data flow in a separate debug session, on a debug cluster that is separate from the cluster configured to run the data flow in the pipeline. The debug cluster is typically smaller and less powerful, which makes it more cost-effective.

    It uses the default AutoResolve IR with a small compute size, as can be seen when you switch on the data flow debug option.

    I tried to reproduce your scenario by parameterizing the compute type and core count, fetched by the Lookup activity and passed down inside the ForEach. However, unlike what you mentioned, compute size doesn't have an 'Add dynamic content' option; you need to manually select 'custom' for compute size. It can't be passed dynamically via a parameter.
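
    When the core count and compute type are parameterized this way, the data flow activity's compute block ends up looking roughly like the snippet below (the parameter names mirror the ones in your pipeline; treat the exact JSON shape as an approximation):

```json
"compute": {
    "coreCount": {
        "value": "@pipeline().parameters.p_core_count",
        "type": "Expression"
    },
    "computeType": {
        "value": "@pipeline().parameters.p_compute_type",
        "type": "Expression"
    }
}
```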


    You can see it took 1m 38s (for my workload) of processing time when using the data flow debug session. It takes less time to spin up the small cluster.

    'Use activity runtime':

    When you use the 'Use activity runtime' option, ADF runs the data flow as part of the pipeline activity run, using the cluster configured on the data flow activity's integration runtime. This cluster is typically more powerful, and more expensive, than the debug cluster.

    Here, we can optimize the integration runtime on the actual pipeline based on the time taken to run the data flow. If the data flow is taking too long to run, we may need to adjust the integration runtime to use a more powerful cluster or to optimize the data flow itself.
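
    For reference, the data flow compute settings of an Azure integration runtime live under its dataFlowProperties; a rough sketch of such an IR definition (names and values here are only illustrative):

```json
{
    "name": "DataFlowAzureIR",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "General",
                    "coreCount": 16,
                    "timeToLive": 10
                }
            }
        }
    }
}
```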

    You can see the processing time is 3m 22s when using the activity runtime. It takes longer because spinning up the actual compute takes more time.


    You can refer to this video demonstration covering both of these options.

    Additionally, coming to the error you are facing, 'The request failed with status code "BadRequest"', it usually occurs when there is a syntactical error in the pipeline. Kindly share whether there are any configuration differences between the two approaches so that we can troubleshoot and help better. Thank you.
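
    One thing worth checking, going only by the values you posted (this is a guess, not a confirmed diagnosis): your lookup file defines coreCount as the integer 16, but the data flow receives it as the string "16", and computeType arrives wrapped in extra single quotes ('General'). Those extra quotes typically appear when the dynamic content is written as a quoted string interpolation rather than a bare expression, and the string core count can be converted back to an integer where it is consumed, for example:

```
"p_compute_type": "'@{item().computeType}'"                  evaluates to 'General' (extra quotes)
"p_compute_type": "@item().computeType"                      evaluates to General
"coreCount": "@int(pipeline().parameters.p_core_count)"      converts the string back to an integer
```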

    Hope it helps. Kindly accept the answer if it is helpful.

