Azure Data Factory - Divide a list of ids (JSON format file) into packets of 100 ids
Hello,
I'm using the Copy data activity in a pipeline, which allows me to retrieve a list of IDs.
I'd like to split this list of IDs into batches of 100, and have each batch of 100 IDs trigger a new activity.
Is this possible?
Azure Data Factory
-
Suba Balaji 8,466 Reputation points
2023-09-05T13:22:57.3833333+00:00 Could you please confirm this -
Suppose you have 950 records coming from the source. Do you want to split/partition them in such a way that you end up with 9 files of 100 records each and 1 file with 50 records (i.e., the split sketched below)?
Also, what do you mean by "each pack of 100 IDs triggers a new activity"? Which activity do you want to trigger after partitioning the source file?
Please clarify so we can assist better.
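Purely for illustration, that split looks like this in a minimal Python sketch (950 is just the example record count from above):

```python
def chunk(ids, size=100):
    """Split a list into consecutive batches of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

batches = chunk(list(range(950)))
print(len(batches), len(batches[0]), len(batches[-1]))  # -> 10 100 50
```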
-
BONIER Charline 0 Reputation points
2023-09-05T14:31:06.95+00:00 Hi,
To be more precise, here are the steps I'm trying to implement via a Synapse pipeline:
For the first step, I used the Copy data activity and stored the list of IDs in blob storage.
And I'm stuck for the next steps.
-
KranthiPakala-MSFT 43,432 Reputation points • Microsoft Employee
2023-09-06T22:52:33.12+00:00 @BONIER Charline Welcome to Microsoft Q&A forum and thanks for reaching out here.
As per the information you have provided, I assume you are following this document - Power BI Metadata Scanning - Perform a full scan
A solution with an ADF pipeline would be more cumbersome, as it involves multiple API calls in batches, and there is also a limit of at most 16 simultaneous calls at a time. Considering these restrictions, I recommend using Synapse Spark notebooks or Azure Databricks notebooks and writing your own code for this requirement.
The advantage of implementing this solution in Synapse Spark or Databricks notebooks is that the API calls will be fast, and you will see a big difference in performance as well as in the time taken to complete the whole process. If you implement the same solution in ADF pipelines, it will definitely take longer, as you have to design the activities so that they split the IDs into batches of 100 and also honor the scanStatus API limitation of 16 simultaneous calls. If there had been no simultaneous-call limitation, it would have been straightforward: a Web activity to get the workspace ID list, then pass the whole array to a ForEach activity with batch settings, as shown below.
But regardless, the performance will be slow compared to Azure Synapse Spark notebooks or Azure Databricks notebooks.
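If you go the notebook route, here is a minimal Python sketch of the whole flow. Assumptions: you have already acquired an AAD access token with the Power BI admin scope, the endpoint paths follow the scanner API document linked above, and batches run sequentially, which trivially stays under the 16-simultaneous-call limit:

```python
import time
import requests

BASE = "https://api.powerbi.com/v1.0/myorg/admin/workspaces"

# Assumption: you have already acquired an AAD access token with the
# required Power BI admin scope (e.g. via MSAL or a managed identity).
access_token = "<your-AAD-access-token>"
headers = {"Authorization": f"Bearer {access_token}"}

# 1. Get the complete list of workspace IDs in the tenant.
resp = requests.get(f"{BASE}/modified", headers=headers)
resp.raise_for_status()
workspace_ids = [w["id"] for w in resp.json()]

# 2. Split the list into batches of at most 100 (the getInfo limit).
batches = [workspace_ids[i:i + 100]
           for i in range(0, len(workspace_ids), 100)]

# 3. Trigger a scan per batch and poll scanStatus until it succeeds.
#    Running the batches sequentially keeps us far below the limit of
#    16 simultaneous scanStatus calls.
for batch in batches:
    scan = requests.post(
        f"{BASE}/getInfo",
        headers=headers,
        params={"lineage": "True", "datasourceDetails": "True"},
        json={"workspaces": batch},
    )
    scan.raise_for_status()
    scan_id = scan.json()["id"]

    while True:
        status = requests.get(f"{BASE}/scanStatus/{scan_id}", headers=headers)
        status.raise_for_status()
        if status.json()["status"] == "Succeeded":
            break
        time.sleep(5)  # poll politely

    # 4. Fetch the scanned metadata for this batch.
    result = requests.get(f"{BASE}/scanResult/{scan_id}", headers=headers)
    result.raise_for_status()
    # ... persist result.json() to ADLS / blob storage as needed
```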
Hope this info helps.
Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, as this can be beneficial to other community members.
-
BONIER Charline 0 Reputation points
2023-09-07T06:47:26.7466667+00:00 Thanks for all the information,
Yes, I am trying to implement the steps in this document (Power BI Metadata Scanning - Perform a full scan)
But I really want to use a Synapse pipeline. Could you give me an example of the whole process? I'm new to Synapse and I don't understand how to set up all the steps.
-
KranthiPakala-MSFT 43,432 Reputation points • Microsoft Employee
2023-09-11T19:36:57.6833333+00:00 @BONIER Charline Could you please share your implementation details and where you have got stuck, so that I can assist accordingly with the next steps?
But at a high level, you can follow the steps below to get started:
1. Add a Web activity1 to your pipeline to call the workspaces/modified API to get the complete list of workspace IDs in the tenant.
2. Add a ForEach activity to your pipeline, configure it to use the output of Web activity1 (step 1) as its input, and set its batch (100) settings to divide the list of workspace IDs into chunks of at most 100.
3. Inside the ForEach activity, add a Web activity2 to call the workspaces/getInfo API to trigger a scan call for each batch of 100 workspaces (see the body sketch after this list).
4. Use the URI from the Location header in the response of Web activity2 (step 3) to poll workspaces/scanStatus/{scan_id} until the returned status is "Succeeded". To do so, add an Until activity to your pipeline that loops until the status is "Succeeded", configured to use the output of Web activity2 (step 3) as its input.
5. Inside the Until activity, add a Web activity3 to call the workspaces/scanStatus/{scan_id} API to get the status of the scan.
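For step 3, the body Web activity2 POSTs to workspaces/getInfo is a JSON object with a workspaces array of at most 100 IDs. A sketch of the shape, written here as a Python dict with placeholder GUIDs (in the pipeline you would build it with an expression over the current ForEach item):

```python
# Shape of the body Web activity2 sends to workspaces/getInfo
# (placeholder GUIDs; at most 100 IDs per call):
getinfo_body = {
    "workspaces": [
        "11111111-1111-1111-1111-111111111111",
        "22222222-2222-2222-2222-222222222222",
    ]
}
```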
-
BONIER Charline 0 Reputation points
2023-09-12T06:50:36.3266667+00:00 Hi @KranthiPakala-MSFT,
Thanks for the clarification,
Is my ForEach set up correctly?
And do you know what I should put in the Web activity 2 body?
-
KranthiPakala-MSFT 43,432 Reputation points • Microsoft Employee
2023-09-13T22:56:45.4766667+00:00 @BONIER Charline Batch count setup seems correct, but for the items configuration I will have to look at your Web activity's response JSON payload. Could you please share a sample payload so that I can confirm?
-
BONIER Charline 0 Reputation points
2023-09-14T06:55:40.35+00:00 [screenshot of the Web activity response JSON payload]
-
KranthiPakala-MSFT 43,432 Reputation points • Microsoft Employee
2023-09-25T17:38:37.1733333+00:00 Are you receiving the backslashes from the API response itself? If that is the case, have you tried using the @json() function to load the response as JSON into the output variable?
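Purely as an illustration of what @json() does, here is the same idea in Python with a made-up payload: ADF's @json() plays the role of json.loads, turning a string that contains escaped JSON back into an object you can index into.

```python
import json

# A Web activity can surface the response body as a string, in which
# case the JSON shows up with escaped quotes (backslashes).
raw = '{"value": [{"id": "aaa-111"}, {"id": "bbb-222"}]}'

parsed = json.loads(raw)               # the equivalent of @json() in ADF
ids = [w["id"] for w in parsed["value"]]
print(ids)                             # -> ['aaa-111', 'bbb-222']
```
-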
BONIER Charline 0 Reputation points
2023-09-25T18:06:33.3766667+00:00 Hi
I get an error when I run the pipeline.
Where should I use the @json() function?