Azure Data Factory Search and Replace tokens in CSV file

Question

Andrew Macinnes 6

Hi,

I have a CSV file that has over 40 columns. Any of these fields can contain, but does not always contain, one of 10+ tokens that I want to replace.

What is the best way to search through the whole file and replace those tokens with another value? Each token has a different value that I want to replace it with.

I know I can use the replace function in a derived column schema modifier to search and replace a string on a column-by-column basis (see below). However, this only lets me search for one token at a time, when I need to search for 10+ tokens and replace them with 10+ values.

I also don't want to have to create over 40 entries in the derived column schema modifier, and I imagine there is a better way to achieve what I am looking for?

MartinJaffer-MSFT 26,236 Reputation points

2022-05-27T16:37:59.843+00:00
Hello @Andrew Macinnes and welcome to Microsoft Q&A.

So I understand you want to do a replace across many columns, without entering each column as its own expression. To better narrow down the options, could you clarify / verify the points below?

Can the same token appear in multiple columns, or is the token specific to a column?

Can multiple tokens appear in the same cell (column of a row), or is it maximum of one token per cell? Can it be duplicate token, or always different?

Is the token replacement 1-to-1? That is, is there exactly one possible replacement value for each original token value?

Can a token appear in unexpected location and not be intended for replacement?

Are the tokens the sole value in the cell, or do we need to do substring search and preserve surrounding content?

Are there groups of tokens which have the same replacement value?

Roughly how many unique tokens are there?

Roughly what is the size range of the file, how many rows (hundred, thousand, million, more)?

The aim of these questions is to determine whether we can use the same replacement code on all columns, or whether we need to track any dependencies.
Some replacement techniques only replace the first occurrence before moving on to the next item.

Some of the tools available are RegexReplace, map, and rlike.
The possible approaches include operating on each row as a string, on each row as an array of strings, on each column on its own, or using column patterns.

Andrew Macinnes 6 Reputation points

2022-06-01T13:29:49.087+00:00
Thanks for getting back to me @MartinJaffer-MSFT

To answer your questions:

Can the same token appear in multiple columns, or is the token specific to a column?

Yes, the same token can appear in multiple columns.

Can multiple tokens appear in the same cell (column of a row), or is it maximum of one token per cell? Can it be duplicate token, or always different?

Yes, multiple tokens can appear in the same cell, and a cell could contain multiple instances of the same token.

Is the token replacement 1-to-1? That is, is there exactly one possible replacement value for each original token value?

The token replacement is 1-to-1

Can a token appear in unexpected location and not be intended for replacement?

All tokens must be replaced with their replacement value

Are the tokens the sole value in the cell, or do we need to do substring search and preserve surrounding content?

The tokens are not the sole value in the cell. We do need to do substring search and preserve the surrounding content

Are there groups of tokens which have the same replacement value?

Some tokens may share the same replacement value

Roughly how many unique tokens are there?

Roughly 25-30.

Roughly what is the size range of the file, how many rows (hundred, thousand, million, more)?

Files range from a few hundred rows to a maximum of around 50,000 rows.

Answer 1

Andrew Macinnes 6

To give a bit of context, I am looking to find and replace tokens representing special characters such as carriage returns and line feeds. An example is [<000010>], which represents a line feed.

What I have done, and it may not be the most efficient method, is to process each column using multiple derived columns to do multiple passes of replacing strings. This allows me to replace multiple different tokens appearing in individual columns/cells.
I use a column pattern to process all columns of type string and do a string replace.

It won't let me upload any images, but my solution is similar to the answer in the link below.

https://stackoverflow.com/questions/72393036/azure-data-factory-search-and-replace-tokens-in-csv-file

The difference is that my case statement is something like:

case( like($$, '%[<000013>]%'),replace($$,'[<000013>]','\r'),
like($$, '%[<000010>]%'),replace($$,'[<000010>]','\n'),
like($$, '%[<000034>]%'),replace($$,'[<000034>]','BBBBBB'),
.
.
like($$, '%[<000039>]%'),replace($$,'[<000039>]','CCCCCC'), $$)
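A case expression applies only its first matching branch, which is why several passes are needed when a cell holds more than one kind of token. The following is a minimal Python sketch of that behavior, not data flow expression code; the function names and the sample cell are hypothetical, and only the two token/value pairs are taken from the expression above.

```python
# Token table: the two pairs below come from the case expression in the post;
# everything else in this sketch is a hypothetical illustration.
TOKENS = [("[<000013>]", "\r"), ("[<000010>]", "\n")]


def case_style_pass(cell: str) -> str:
    """Mimic a case(): apply only the first branch whose token is present."""
    for token, value in TOKENS:
        if token in cell:
            # Every occurrence of this one token is replaced, then we stop.
            return cell.replace(token, value)
    return cell  # default branch: no token found


def replace_until_stable(cell: str):
    """Repeat the single-branch pass until the cell stops changing."""
    passes = 0
    while True:
        updated = case_style_pass(cell)
        if updated == cell:
            return cell, passes
        cell, passes = updated, passes + 1


cell = "a[<000013>][<000010>]b[<000010>]"
one_pass = case_style_pass(cell)           # only the [<000013>] token is handled
final, passes = replace_until_stable(cell)  # a second pass handles [<000010>]
```

With two distinct tokens in the cell, one pass leaves the line-feed tokens in place, and the second pass clears them. In the worst case the number of passes equals the number of distinct tokens in a single cell.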

MartinJaffer-MSFT 26,236 Reputation points

2022-06-08T17:36:07.077+00:00

My apologies for the delay, this got a little lost in my shuffle. Are you satisfied with the solution you found?

It does not seem like the most efficient of solutions, but it will work.

Instead, I would recommend building up a mapping, then using the map function on each column: split the string into token-sized components and do everything in one sweep.

To be clear, the mapping in "building up a mapping" is different from the map function.

In the first case it means a collection of key-value pairs intended for swapping values.
[
'[<000010>]' => '\n',
'[<000034>]' => 'BBBBBB',
]

The map function is intended to apply a user-defined transformation on each element of an array. The array in this case would be the string column, broken up somehow (split function). The transformation in question would be to look up each element in our mapping and, if a match is found, do a substitution; otherwise keep the original value.

The advantage here is that you only need to build up the mapping once, which makes maintenance easier and means less code. The map function hits each element, so the derived column only needs to be applied once. It is also reusable for each column and lets you do rule-based column selection.
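The split, look up, and substitute idea can be sketched outside Data Factory as follows. This is an illustrative Python version of the logic only, not data flow expression code: the names MAPPING, TOKEN_RE, replace_tokens, and replace_in_row are hypothetical, the sample row is made up, and the mapping entries are the ones quoted in this thread.

```python
import re

# Token -> replacement mapping, built once. The '\n' and 'BBBBBB' entries are
# from the answer above; '\r' is from the earlier case expression.
MAPPING = {
    "[<000013>]": "\r",
    "[<000010>]": "\n",
    "[<000034>]": "BBBBBB",
}

# One pattern that matches any token. The capturing group makes re.split
# keep the tokens as their own elements instead of discarding them.
TOKEN_RE = re.compile("(" + "|".join(re.escape(t) for t in MAPPING) + ")")


def replace_tokens(cell: str) -> str:
    """Split the cell into token and non-token pieces, then map each piece."""
    pieces = TOKEN_RE.split(cell)
    # Look each piece up in the mapping; keep it unchanged when there is no match.
    return "".join(MAPPING.get(piece, piece) for piece in pieces)


def replace_in_row(row):
    """Apply the same replacement to every string value in a row."""
    return [replace_tokens(v) if isinstance(v, str) else v for v in row]


row = ["x[<000010>]y[<000010>]", "[<000034>] and [<000013>]", 42]
cleaned = replace_in_row(row)
```

Each cell is processed once regardless of how many different tokens it contains, and adding a token only means adding an entry to the mapping. Non-string values are passed through untouched, which plays the role of the rule-based column selection mentioned above.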
