
ShaunBailey-5325 asked:

For Each For Multiple Files

Ok, I have something that seems straightforward, but I'm stuck on how to implement...

First, I have a large CSV file that I need to split into smaller JSON files.

Next, for EACH of those files, I need to extract EACH line of JSON and push that into a stored procedure.

Where I am getting stuck is on multiple files. I'm able to easily push a single file into a for loop, set my variable needed for the stored procedure and run the stored procedure, but I can't figure out how to do this for multiple files. Any help would be much appreciated!

azure-data-factory

1 Answer

MartinJaffer-MSFT answered:

Hello @ShaunBailey-5325 and welcome to Microsoft Q&A.

As I understand it, the goal is to push each record of an enormous file, one by one, into a stored procedure. However, due to the size, you wish to first break that large JSON file up into multiple smaller CSVs.

Wait, did I get the JSON and CSV swapped?

Anyway, there are several limitations to note.

As you have discovered, you are not allowed to put one loop inside another loop. However, you can put an Execute Pipeline activity inside a loop, and then put another loop in the child pipeline.
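In pipeline JSON terms, that means the parent pipeline's ForEach contains only an Execute Pipeline activity, and the inner loop lives in the child pipeline. A minimal sketch (pipeline, parameter, and activity names here are made up for illustration):

```json
{
    "name": "OuterPipeline",
    "properties": {
        "parameters": { "fileList": { "type": "Array" } },
        "activities": [
            {
                "name": "ForEachFile",
                "type": "ForEach",
                "typeProperties": {
                    "items": {
                        "value": "@pipeline().parameters.fileList",
                        "type": "Expression"
                    },
                    "activities": [
                        {
                            "name": "ProcessOneFile",
                            "type": "ExecutePipeline",
                            "typeProperties": {
                                "pipeline": {
                                    "referenceName": "InnerPipeline",
                                    "type": "PipelineReference"
                                },
                                "parameters": { "fileName": "@item()" },
                                "waitOnCompletion": true
                            }
                        }
                    ]
                }
            }
        ]
    }
}
```

InnerPipeline would then contain its own ForEach over the rows of the file it was handed.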

What you might not know is that putting a Set Variable inside a ForEach loop can be dangerous. Variables are scoped to the pipeline, not to the loop iteration, so when multiple iterations of the loop run at the same time, each Set Variable writes to the same place. This causes confusion when one iteration later reads the variable, only to find what a different iteration wrote! This is why you should minimize Set Variable inside a loop, unless the loop runs sequentially, not in parallel.
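If you do need a Set Variable inside the loop, the safe pattern is to force the loop to run one iteration at a time via the ForEach's isSequential property (sketch trimmed to the relevant properties; the activities array would hold your Set Variable and stored procedure call):

```json
{
    "name": "ForEachRow",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": true,
        "items": {
            "value": "@pipeline().parameters.rows",
            "type": "Expression"
        },
        "activities": []
    }
}
```

The trade-off is obvious: sequential execution is safe for variables but much slower than parallel iterations.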

There are 3 activities which can run a stored procedure: the Copy activity, the Lookup activity, and the Stored Procedure activity. Which one you use depends upon whether you expect data to be returned to the Factory in response to the procedure running.
If you are expecting return data, use the Lookup activity.
If the data/parameters you want to send are in a variable, use the Stored Procedure activity. This also works when there are no parameters to send and you just want to run the proc, like truncating a table.
If the data you want to send is in a file, use the Copy activity.
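For the middle case, a Stored Procedure activity call looks roughly like the following (the linked service, procedure, parameter, and variable names are placeholders, not anything from your factory):

```json
{
    "name": "RunProcForRow",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
        "referenceName": "AzureSqlDatabaseLS",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "[dbo].[usp_LoadRow]",
        "storedProcedureParameters": {
            "Json": {
                "value": "@variables('currentRowJson')",
                "type": "String"
            }
        }
    }
}
```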

You did specify sending 1 row per execution of the stored procedure, as opposed to a batch of rows at a time. Normally the Copy activity writes in batches, however we can control the batch size! So, if we set the write batch size to 1, we end up sending 1 row at a time!

205609-image.png
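In the Copy activity's sink JSON, that setting is writeBatchSize (the sink type name depends on your connector; AzureSqlSink below is an assumption):

```json
"sink": {
    "type": "AzureSqlSink",
    "writeBatchSize": 1
}
```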

The reason I am leaning towards Copy Activity is because you would otherwise need to fetch all the records with a Lookup activity so you could feed them to Stored Proc or Lookup activities. The Lookup activity has a limitation on how many rows / bytes are returned. It is not always possible to work around this limitation.

So, you might not need 2 loops for this. One activity to divide up the file into smaller parts. One loop to go over each smaller part. Inside the loop one copy activity.

This leaves the implementation of the division.

When the sink is delimited text AND the target is a folder, you can choose the maximum number of rows per file. This handles breaking the big file up into smaller CSVs. This feature is not available when the sink is JSON.
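On a delimited-text sink, that option appears as maxRowsPerFile in the write settings; the row count and file name prefix below are just example values:

```json
"sink": {
    "type": "DelimitedTextSink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings"
    },
    "formatSettings": {
        "type": "DelimitedTextWriteSettings",
        "fileExtension": ".csv",
        "fileNamePrefix": "part",
        "maxRowsPerFile": 1000
    }
}
```

With a prefix set, the output files are named part_00000.csv, part_00001.csv, and so on, which makes them easy to enumerate later.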

This is one reason I wanted to double-check which end was CSV and which end was JSON. The other reason is that most use cases transform hierarchical JSON into CSV for easier loading into SQL. Going the other direction seemed a little odd, unless you are intentionally encoding a flat record into a hierarchical form and stuffing that into a single cell.

Even if we are going from CSV to JSON, that is no big deal. We just need an extra copy activity.

Big CSV ----> many small CSV ---> many small JSON

That is a Copy activity to break up the big CSV into many smaller CSVs. Then a ForEach loop over the many small CSVs. Inside the loop, first a copy to change from small CSV to small JSON. Then another copy to push the JSON to SQL, using write batch size = 1. End loop.

The sink dataset of the big CSV -> little CSV copy can be the same as the source dataset of the little CSV -> little JSON copy.
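To feed the ForEach with the list of small files, one common pattern is a Get Metadata activity requesting childItems on the folder of small CSVs (dataset and activity names below are placeholders):

```json
{
    "name": "ListSmallFiles",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "SmallCsvFolder",
            "type": "DatasetReference"
        },
        "fieldList": [ "childItems" ]
    }
}
```

The ForEach's items expression would then be @activity('ListSmallFiles').output.childItems, and @item().name gives each file name inside the loop.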


Please do let me know if you have any queries.

Thanks
Martin



Wow! Thank you @MartinJaffer-MSFT for the very detailed answer! In kind, I'd like to provide you some more context so that you're well informed on exactly what I am wishing to accomplish.

I have a single CSV file that has roughly 200 columns of data that I am receiving daily. That file could have anywhere from 1 to 20,000 records in it at a time. Essentially I need to pass in each row of the CSV as a parameter (in JSON format) to a stored procedure that does all the heavy lifting to push the data into the respective tables.

So, if I can simply use the Copy Data, then perhaps I can simplify my design substantially. The question now would be how do I pass in the @Json parameter needed for the stored proc?
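One way the Copy activity can hand rows to a proc: the SQL sink can invoke a stored procedure per write batch via sqlWriterStoredProcedureName. Note, though, that it passes the rows through a SQL table type, so the proc would need a table-valued parameter rather than a plain @Json NVARCHAR; the procedure, table type, and parameter names below are placeholders:

```json
"sink": {
    "type": "AzureSqlSink",
    "writeBatchSize": 1,
    "sqlWriterStoredProcedureName": "[dbo].[usp_LoadRow]",
    "sqlWriterTableType": "dbo.RowType",
    "storedProcedureTableTypeParameterName": "Rows"
}
```

Inside the proc, the JSON string could then be built from the table-valued parameter (for example with FOR JSON) before the existing loading logic runs.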

The follow-up question would be: is it possible to have multiple parallel instances of the stored procedure running? Since the file can vary in the number of records, I want to make sure I can process all of them in a timely manner, to avoid the job running over 24 hours and not finishing in time for the next daily file to start.

Source:
205575-config1.png

Sink:
205610-config2.png

