Remove newlines inside a text field of dynamic csv columns

AMJ 6 Reputation points
2022-04-28T04:13:22.803+00:00

As mentioned in the title there are newlines ({CRLF}) embedded in text fields of csv files such as

,,,,,"this is text field {CRLF} which finishes here",,,,,

Normally you will use a function like: regexReplace(<Column name>, ,"([\r\n|\r|\n]+)",, ' ') https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expression-functions to replace {CRLF} with a space. The issue is the <Column name> is dynamic. Here is the simplified extract of a csv file:

197119-image.png

As one can see the text field denoted by a pair of double quotes has span multiple lines as the result of the embedded {CRLF} within the text field. The dynamic <Column name> means you don't know which column in a csv file is a text field. Also there would be multiple csv files through a single pipeline. The schema in each csv file is different. Please chip in your thoughts.

Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,272 questions
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
11,031 questions
{count} votes

3 answers

Sort by: Most helpful
  1. ShaikMaheer-MSFT 38,486 Reputation points Microsoft Employee
    2022-05-02T06:38:02.973+00:00

    Hi @AMJ ,

    Thank you for posting query in Microsoft Q&A Platform.

    If possible try to see if you can apply your replace logic on response itself and then storing it as csv file. That way it will reduce lot of efforts.

    If you would like to play around the logic on csv file after loading data from API then we should consider using data flows. Firstly, try to read entire file data as single column. For the you should have your dataset settings as below.
    198161-image.png

    Once you get entire file data as single column then apply your replace logic using derived column. and then consider breaking your single column single data as multiple items using split() function on space. But if your any column has space in there data then that will also get splitted. Hence we should consider omit splitting of data which is wrapped insider quotes.

    Once we splitted data as array then flatten that data as multiple rows and load as file. and then use this newly created file as source for further processes.
    198097-image.png

    All above mentioned logic handling may be difficult and sometimes dataflows not much flexible to have this logic. So best way is having your custom code written in Azure functions to read your file data and do the replace() accordingly.

    Hope this helps. Please let us know how it goes. Thank you.

    --------
    Please consider hitting Accept Answer. Accepted answers help community as well.

    1 person found this answer helpful.
    0 comments No comments

  2. Pratik Somaiya 4,206 Reputation points
    2022-04-28T05:27:20.527+00:00

    @AMJ

    Can you apply the Replace function in your API query body itself? Are you using a Web Activity for API call or any via other way?


  3. AMJ 6 Reputation points
    2022-05-02T07:52:42.487+00:00

    @ShaikMaheer-MSFT Thank you so much for your thoughts. Unfortunately it won't work. I modified your sample data to closely imitate the real situation as below:

    198165-image.png

    As you could see there are 2 issues applying your method to a read data:

    • There are spaces within a column which is not surrounded by a pair of double quotes. They are legitimate in one column;
    • There exist double quotes within the double quotes of the text delimiter
    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.