Add column to CSV File from another CSV File (Azure Data Factory)

Question


Mvdit0 1

For example:

Persons.csv

name, last_name
-----------------------
jack, jack_lastName
luc, luc_lastname

FileExample.csv

id
243
123

Result:

name, last_name, exampleId
-------------------------------
jack, jack_lastName, 243
luc, luc_lastname, 123

I want to append any number of columns from another data source and then write the final result to a file or a database table.

I have been trying many ways but I can't do it.

3 answers


Answer 1

MarkKromer-MSFT 5,231 Microsoft Employee Moderator

What is your join condition?

Mvdit0 1 Reputation point

2022-04-05T05:08:32.947+00:00

I don't need to compare columns; there's no join condition.
MarkKromer-MSFT 5,231 Reputation points Microsoft Employee Moderator

2022-04-05T05:10:59.403+00:00

In your example, you are joining your data. How do you know that Jack is ID 243? Are you simply concatenating row 1 from file 1 with row 1 from file 2?
Mvdit0 1 Reputation point

2022-04-05T05:13:45.52+00:00

That's right, that's all I need at the moment, is it possible to do that?

Answer 2

MarkKromer-MSFT 5,231 Microsoft Employee Moderator

Here is one way to solve it:

Create a new data flow
Add 2 sources: 1 for Persons.csv and 1 for FileExample.csv
Add a Surrogate Key transformation after each source, naming the keys sk1 and sk2 respectively
Add a Join transformation and join on sk1 == sk2
After the Join, add a Select transformation and remove the sk1 and sk2 columns
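
Outside of ADF, the same row-by-row pairing can be sketched in plain Python. This is only an illustration of the idea behind the steps above, not what the data flow runs internally: the helper names add_surrogate_key and positional_join are made up for this example, and the CSV contents are inlined (without the spaces after the commas) so the snippet is self-contained.

```python
import csv
import io


def add_surrogate_key(rows, key_name):
    """Attach a 1-based row number to each row, playing the role of a surrogate key."""
    return [{**row, key_name: i} for i, row in enumerate(rows, start=1)]


def positional_join(left_rows, right_rows):
    """Inner-join two row lists on their row numbers (sk1 == sk2), then drop the keys."""
    left = add_surrogate_key(left_rows, "sk1")
    right_by_key = {r["sk2"]: r for r in add_surrogate_key(right_rows, "sk2")}
    joined = []
    for row in left:
        match = right_by_key.get(row["sk1"])
        if match is None:
            continue  # no row at this position in the second file
        merged = {**row, **match}
        # Equivalent of the Select step: remove the helper key columns.
        merged.pop("sk1")
        merged.pop("sk2")
        joined.append(merged)
    return joined


# Inlined stand-ins for Persons.csv and FileExample.csv.
persons_csv = "name,last_name\njack,jack_lastName\nluc,luc_lastname\n"
example_csv = "id\n243\n123\n"

persons = list(csv.DictReader(io.StringIO(persons_csv)))
# Rename "id" to "exampleId" to match the desired result layout.
examples = [{"exampleId": r["id"]} for r in csv.DictReader(io.StringIO(example_csv))]

result = positional_join(persons, examples)
for row in result:
    print(row)
```

Note that this pairing depends entirely on row order, so it only makes sense when both files are guaranteed to list their rows in matching order.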

Mvdit0 1 Reputation point

2022-04-05T20:35:29.053+00:00

I got it working with these instructions. The primary key of the third table is auto-incremented, and I need to get the ID of each inserted row. How can I get them? After the "sink", it doesn't allow me to add another activity or function.

I need those IDs to save them as evidence in a CSV file.

Sorry for my bad English.
MarkKromer-MSFT 5,231 Reputation points Microsoft Employee Moderator

2022-04-05T20:54:24.523+00:00

If you are trying to store the IDs of each row in a CSV file, then you can add a New Branch just before your Sink transformation. In that new branch, add another sink and write the data to a CSV.
Mvdit0 1 Reputation point

2022-04-05T21:10:34.407+00:00

Yes, but the IDs I need are from the "Result" table, not the exampleId, and they are auto-incremented.
MarkKromer-MSFT 5,231 Reputation points Microsoft Employee Moderator

2022-04-05T21:48:28.517+00:00

Sounds like you want to capture the resulting auto-incrementing ID that SQL Server uses when writing the rows, is that right? If that's it, this should work: Add a 2nd source that reads from the table you are writing to. In the data flow sink order (under settings), make sure the sink that writes the data is set first in order. This way, the read of the table with the auto-increment ID should begin after you've written the rows.
Mvdit0 1 Reputation point

2022-04-05T23:51:27.77+00:00
I don't understand why I can't upload images but this is the flow I have:

source People (CSV file) -> surrogate key -> join (with surrogate key of ExampleFile) -> Sink (Sql table, Result table).

source ExampleFile(CSV file) -> surrogate key

source (Result Table)

When I run the pipeline, the result doesn't contain the IDs of the newly inserted rows (that's all I want to get).

Do I have to do anything else? This does not work for me.

P.S. First of all, thank you for your patience and help so far.
MarkKromer-MSFT 5,231 Reputation points Microsoft Employee Moderator

2022-04-06T00:05:25.953+00:00

Let me put together an example for you. I'll try to get to it today or tomorrow.

Answer 3

MarkKromer-MSFT 5,231 Microsoft Employee Moderator

Your pattern will look something like this:

2 Delimited Text sources that you join on the surrogate keys and then write to the SQLSink. A 3rd source is the same SQL table that you write to in the sink. Notice I've set the sink ordering to ensure that I write the data first (SQLSink), then read back the auto-incremented IDs after the table write has been committed. The query I'm using in the ReadFromSQL just reads the data from that table so that I can write the IDs to my OuputIDs CSV file.
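
As a rough illustration of the "write first, then read back" ordering described above, here is a minimal Python sketch. It uses an in-memory SQLite database as a stand-in for the SQL Server table, which is an assumption made purely so the example can run anywhere; the table layout and column names are hypothetical, and in ADF the ordering is controlled by the sink ordering setting rather than by code like this.

```python
import csv
import io
import sqlite3

# SQLite stands in for the SQL table behind the sink; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Result ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "name TEXT, last_name TEXT, exampleId INTEGER)"
)

# Step 1 (first sink): write the joined rows; the database assigns the IDs.
rows = [("jack", "jack_lastName", 243), ("luc", "luc_lastname", 123)]
conn.executemany(
    "INSERT INTO Result (name, last_name, exampleId) VALUES (?, ?, ?)", rows
)
conn.commit()  # the read below must only happen after the write is committed

# Step 2 (third source): read the auto-incremented IDs back from the same table.
ids = [r[0] for r in conn.execute("SELECT id FROM Result ORDER BY id")]

# Step 3 (second sink): write the IDs to a CSV; a string buffer replaces the file.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["id"])
writer.writerows([i] for i in ids)
print(out.getvalue())
```

Because the read selects every row in the table, it returns previously existing IDs as well as the new ones whenever the table was not empty before the write.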

Mvdit0 1 Reputation point

2022-04-06T14:18:23.32+00:00

Sorry for my bad English. It works correctly, however, it captures all the IDs in the table and not just the ones inserted at the time. I also tried selecting the "checkbox" and it does not work. :(
MarkKromer-MSFT 5,231 Reputation points Microsoft Employee Moderator

2022-04-06T17:04:07.407+00:00

To only get the latest inserted row, you can use the "incremental" checkbox on the ReadFromSQL source in my example. However, you must include a date/time column that ADF can use to determine which rows were inserted.
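
The idea of using a date/time column to pick out only the new rows can be sketched like this. It is a conceptual analogy, not a description of how ADF implements the incremental option: SQLite again stands in for the SQL table, and the inserted_at column, the fixed timestamps, and the watermark variable are all assumptions made for the example.

```python
import sqlite3

# SQLite stands in for the SQL table; inserted_at is a hypothetical timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Result ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, inserted_at TEXT)"
)

# A row that already existed before this run.
conn.execute(
    "INSERT INTO Result (name, inserted_at) VALUES (?, ?)",
    ("old_row", "2022-04-05T10:00:00"),
)
conn.commit()

# Remember the latest timestamp seen before the new write (the "watermark").
watermark = conn.execute("SELECT MAX(inserted_at) FROM Result").fetchone()[0]

# This run's inserts, with later timestamps (fixed values keep the example repeatable).
conn.executemany(
    "INSERT INTO Result (name, inserted_at) VALUES (?, ?)",
    [("jack", "2022-04-06T09:00:00"), ("luc", "2022-04-06T09:00:01")],
)
conn.commit()

# Only rows newer than the watermark are read back. ISO 8601 strings in the
# same format sort chronologically, so a plain string comparison is enough here.
new_ids = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM Result WHERE inserted_at > ? ORDER BY id", (watermark,)
    )
]
print(new_ids)
```

Without such a column there is nothing to tell the new rows apart from the old ones, which is why reading the table back returns every ID.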
