Azure Data Factory - How to handle CRLF in my input file?

Vivek Komarla Bhaskar 956 Reputation points
2023-10-18T14:48:56.3433333+00:00

Hello,

I'm processing a text file in ADF using a data flow, and the row delimiter for all of the entries is CRLF. The problem is that for a few entries the comments column, which is the last column, is spread across multiple lines.

Please give me some advice on how to deal with this scenario in ADF Data flow.

Example:

"site_name","container_uuid","action","comments"

"google.com","d4eb1580-3fa5-439d-8a54-66c0fc445290","created","That's what I meant Dana but my comment was - we to speak."

"google.com","d4eb1580-3fa5-439d-8a54-66c0fc445290","liked","Viva Israel.

Don’t forget the UN.

He and the UN want, and free."

"google.com","d4eb1580-3fa5-439d-8a54-66c0fc445290","liked","What is wrong with the biased broadcasting.

Every reputable organisation in the world - literally."

"google.com","03f0abf1-8294-4315-b0c0-0f02c8e3ab86","visible","I thought of going back to voting labour, having stopped after the war, and having the odious as my local MP. Despite the I thought I'd give them a chance."

Azure Data Factory

1 answer

  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2023-10-24T10:16:07.55+00:00

    I checked the file you provided and I tried to read the entire row as a single column.

    This means you will not split the data into multiple columns based on commas.

    Since CRLF (\r\n) as the row delimiter cannot, on its own, tell a real record boundary from a line break inside the comments field, I would go for splitting the rows into columns manually by applying a split function on the comma.

    While splitting, you need to be careful with the quotes and commas as they can be part of the data as well.

    You will have a challenge with multi-line fields, so you need to create a rule or condition to identify such cases.

    One way to identify them is by counting the number of quotes. An uneven number of quotes can be an indicator of a multi-line field.
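
    As a rough illustration (assuming the whole line is read into a single column, called column1 below), the quote count and its parity can be computed in a derived column like this:

    length(column1) - length(replace(column1, '"', ''))           // number of double quotes in the row
    mod(length(column1) - length(replace(column1, '"', '')), 2)   // 1 = uneven quote count, a likely multi-line field boundary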

    If a multi-line field is identified, you can merge it with the next row. This step might be iterative until all fields are properly aligned.
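
    One hedged way to do the merge inside the data flow (a sketch only, assuming you can carry a row-order column, here a hypothetical lineNumber, plus the 'start'/'end'/'complete' flag described further down) is a Window transformation sorted on lineNumber that exposes the following row, and a derived column that glues it on:

    lead(column1, 1, '')                                             // Window column "nextLine": text of the following row

    iif(flag == 'start', concat(column1, ' ', nextLine), column1)   // Derived column: append the next row to a 'start' row

    Rows that were consumed as continuations can then be filtered out, and the pass repeated if a comment spans more than two lines.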

    Now, once you have a structured row, you might want to clean it up, for example by removing extra quotes or any additional special characters that might have been introduced during the process.
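
    For example (again assuming column1 holds the full row text), a derived column along these lines strips the wrapping quotes from one of the split values:

    replace(trim(split(column1, ',')[3]), '"', '')   // the "action" value without its surrounding quotes (indexes are 1-based)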

    Here is my logic:

    • In your source settings, set the row delimiter to CRLF.
    • In a derived column, use expressions to apply the logic that merges rows and splits them into columns.

    For example, if you read each row as a single string, you might do something like this:

    split(column1, ',')[1]   // To get the value of "site_name" (array indexes are 1-based in data flow expressions)

    split(column1, ',')[2]   // To get the value of "container_uuid"

    Here is an example of a derived column that flags each row based on its quote count:

    iif(equals(mod(length(column1) - length(replace(column1, '"', '')), 2), 1),
        iif(endsWith(column1, '"'), 'end', 'start'),
        'complete')
    
    • In a Conditional Split you can route data rows to different outputs based on 3 conditions (see the example expressions below):
    • If a row is likely starting a multi-line field (it has an uneven number of quotes), you might flag it as 'start'.
    • If a row is likely ending a multi-line field, you might flag it as 'end'.
    • Rows that are complete and don’t require further processing can be directly sent to the output.
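
    As a sketch only (assuming you store the result of the derived-column expression above in a column called flag, a name chosen here for illustration), the split conditions could be as simple as:

    flag == 'start'   // stream 1: rows that open a multi-line comment
    flag == 'end'     // stream 2: rows that close one

    Everything else falls through to the default stream as 'complete' rows, ready to be written to the sink.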
