Remove newlines inside a text field of dynamic csv columns

Question

As mentioned in the title there are newlines ({CRLF}) embedded in text fields of csv files such as

,,,,,"this is text field {CRLF} which finishes here",,,,,

Normally you will use a function like: regexReplace(, ,"([ | | ]+)",, ' ') https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expression-functions to replace {CRLF} with a space. The issue is the is dynamic. Here is the simplified extract of a csv file:

As one can see the text field denoted by a pair of double quotes has span multiple lines as the result of the embedded {CRLF} within the text field. The dynamic means you don't know which column in a csv file is a text field. Also there would be multiple csv files through a single pipeline. The schema in each csv file is different. Please chip in your thoughts.

Answer

Hi @AMJ ,

Thank you for posting query in Microsoft Q&A Platform.

If possible try to see if you can apply your replace logic on response itself and then storing it as csv file. That way it will reduce lot of efforts.

If you would like to play around the logic on csv file after loading data from API then we should consider using data flows. Firstly, try to read entire file data as single column. For the you should have your dataset settings as below.

Once you get entire file data as single column then apply your replace logic using derived column. and then consider breaking your single column single data as multiple items using split() function on space. But if your any column has space in there data then that will also get splitted. Hence we should consider omit splitting of data which is wrapped insider quotes.

Once we splitted data as array then flatten that data as multiple rows and load as file. and then use this newly created file as source for further processes.

All above mentioned logic handling may be difficult and sometimes dataflows not much flexible to have this logic. So best way is having your custom code written in Azure functions to read your file data and do the replace() accordingly.

Hope this helps. Please let us know how it goes. Thank you.

--------
Please consider hitting Accept Answer. Accepted answers help community as well.

Answer

@AMJ

Can you apply the Replace function in your API query body itself? Are you using a Web Activity for API call or any via other way?

Answer

@ShaikMaheer-MSFT Thank you so much for your thoughts. Unfortunately it won't work. I modified your sample data to closely imitate the real situation as below:

As you could see there are 2 issues applying your method to a read data:

There are spaces within a column which is not surrounded by a pair of double quotes. They are legitimate in one column;
There exist double quotes within the double quotes of the text delimiter

Remove newlines inside a text field of dynamic csv columns

3 answers