Azure Data Factory - How to handle CRLF in my input file?

Vivek Komarla Bhaskar 956 Reputation points
2023-10-18T14:48:56.3433333+00:00

Hello,

I'm processing a text file in ADF using a data flow, and the row delimiter for all of the entries is CRLF. The problem is that for a few entries the comments column, which is the last column, is spread across multiple lines.

Please give me some advice on how to deal with this scenario in ADF Data flow.

Example:

"site_name","container_uuid","action","comments"

"google.com","d4eb1580-3fa5-439d-8a54-66c0fc445290","created","That's what I meant Dana but my comment was - we to speak."

"google.com","d4eb1580-3fa5-439d-8a54-66c0fc445290","liked","Viva Israel.

Don’t forget the UN.

He and the UN want, and free."

"google.com","d4eb1580-3fa5-439d-8a54-66c0fc445290","liked","What is wrong with the biased broadcasting.

Every reputable organisation in the world - literally."

"google.com","03f0abf1-8294-4315-b0c0-0f02c8e3ab86","visible","I thought of going back to voting labour, having stopped after the war, and having the odious as my local MP. Despite the I thought I'd give them a chance."

Azure Data Factory

1 answer

  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2023-10-24T10:16:07.55+00:00

    I checked the file you provided and I tried to read the entire row as a single column.

    This means you will not split the data into multiple columns based on commas.

    Since CRLF (\r\n) as the row delimiter cannot, on its own, tell a real record boundary from a line break inside the comments field, I would go for splitting the rows into columns manually by applying a split function on the comma.

    While splitting, you need to be careful with the quotes and commas as they can be part of the data as well.

    You will have a challenge with multi-line fields, so you need to create a rule or condition to identify such cases.

    One way to identify them is by counting the number of quotes. An uneven number of quotes can be an indicator of a multi-line field.
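
    As a rough illustration (assuming the whole line is read into a single column, called column1 below), the quote count and its parity can be computed in a derived column like this:

    length(column1) - length(replace(column1, '"', ''))           // number of double quotes in the row
    mod(length(column1) - length(replace(column1, '"', '')), 2)   // 1 = uneven quote count, a likely multi-line field boundary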

    If a multi-line field is identified, you can merge it with the next row. This step might be iterative until all fields are properly aligned.
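
    One hedged way to do the merge inside the data flow (a sketch only, assuming you can carry a row-order column, here a hypothetical lineNumber, plus the 'start'/'end'/'complete' flag described further down) is a Window transformation sorted on lineNumber that exposes the following row, and a derived column that glues it on:

    lead(column1, 1, '')                                             // Window column "nextLine": text of the following row

    iif(flag == 'start', concat(column1, ' ', nextLine), column1)   // Derived column: append the next row to a 'start' row

    Rows that were consumed as continuations can then be filtered out, and the pass repeated if a comment spans more than two lines.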

    Now, once you have a structured row, you might want to clean it up, for example by removing extra quotes or any additional special characters that might have been introduced during the process.
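
    For example (again assuming column1 holds the full row text), a derived column along these lines strips the wrapping quotes from one of the split values:

    replace(trim(split(column1, ',')[3]), '"', '')   // the "action" value without its surrounding quotes (indexes are 1-based)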

    Here is my logic:

    • In your source settings, set the row delimiter to CRLF.
    • In a derived column, use expressions to apply the logic that merges rows and splits them into columns.

    For example, if you read each row as a single string, you might do something like this:

    split(column1, ',')[1]   // To get the value of "site_name" (array indexes are 1-based in data flow expressions)

    split(column1, ',')[2]   // To get the value of "container_uuid"

    Here is an example of a derived column that flags each row based on its quote count:

    iif(equals(mod(length(column1) - length(replace(column1, '"', '')), 2), 1),
        iif(endsWith(column1, '"'), 'end', 'start'),
        'complete')
    
    • In a Conditional Split you can route data rows to different outputs based on 3 conditions (see the example expressions below):
    • If a row is likely starting a multi-line field (it has an uneven number of quotes), you might flag it as 'start'.
    • If a row is likely ending a multi-line field, you might flag it as 'end'.
    • Rows that are complete and don’t require further processing can be directly sent to the output.
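
    As a sketch only (assuming you store the result of the derived-column expression above in a column called flag, a name chosen here for illustration), the split conditions could be as simple as:

    flag == 'start'   // stream 1: rows that open a multi-line comment
    flag == 'end'     // stream 2: rows that close one

    Everything else falls through to the default stream as 'complete' rows, ready to be written to the sink.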
