Split single-column data into multiple columns in Data Factory

Vaibhav 105 Reputation points
2024-05-03T18:07:32.02+00:00

Issue:
I have some CSV files in an SFTP location.
I was planning to use a Copy activity to load the files from the SFTP location into the data lake, both for archiving and for upserting into Delta Lake. However, I am getting an error while reading from the source, because some files contain the delimiter (|) inside a value and there is no escape character. (I cannot change the source data.)

Solution:
I thought of reading the file without any delimiter so that all the values land in a single column, and loading it in that form into the data lake. Then a data flow reads from the data lake source, uses replace() to change the stray delimiter (|) to a hyphen, and splits the values back into multiple columns so that the output matches the source CSV.

  1. Use the "no delimiter" option in the dataset so that all the column values come in as a single column.
  2. Replace the delimiter that appears inside the value with a hyphen, e.g. replace('aaaa|bbb','aaaa-bbb'); I know the exact value that has the issue.
  3. Split the column back into its original columns (steps 2 and 3 are sketched as Derived Column expressions just after this list).
  4. Sink the data in CSV format.
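In Derived Column terms, steps 2 and 3 look roughly like this (a sketch only: column name on the left, expression on the right, the single incoming column called col as in my expressions below, and the placeholder values from step 2):

```
// Derived column 1 – step 2: fix the known bad value
col  : replace(col, 'aaaa|bbb', 'aaaa-bbb')

// Derived column 2 – step 3: split the single column back into its original columns
// (data flow array indexes are 1-based)
col1 : split(col, '|')[1]
col2 : split(col, '|')[2]
col3 : split(col, '|')[3]
```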

Questions:

  1. Is there a better way to solve this issue in ADF without using additional services?
  2. With the above solution, in step 3 I have to provide the column names with expressions such as split(col,'|')[1], split(col,'|')[2], and so on. Can we avoid hard-coding the column names and do it dynamically, so that if another file has a similar issue I can reuse this data flow with a few parameters?
  3. In the sink location, multiple files are being generated even though I have used "Use current partitioning". I want the number of files in the sink to match the SFTP (source) files: if 5 files are ingested from SFTP, the sink should also contain 5 files. One purpose of the pipeline is to archive the SFTP files in the data lake.
  4. I will also use the same pipeline to upsert the data into Delta Lake, but I think that once the above issues are resolved, the upsert should not be a problem. Kindly suggest.
Azure Data Factory

1 answer

  1. Smaran Thoomu 24,185 Reputation points Microsoft External Staff Moderator
    2024-05-06T06:40:51.2666667+00:00

    Hi @Vaibhav

    Thank you for reaching out to us with your query related to loading CSV files from an SFTP location to a data lake using Azure Data Factory (ADF). I understand that you are facing issues with the delimiter in the source data and have come up with a solution to read the file without any delimiter, load it into the data lake, use data flow to replace the delimiter with a hyphen, split the values into multiple columns, and sink the data in CSV format. I'll be happy to assist you with your questions related to this solution.

    Regarding your first question, there is no direct way to handle this in ADF without additional services. However, you can try the "Column delimiter" setting on the delimited-text dataset: specifying a character that never appears in the data makes each line load as a single column, much like the "no delimiter" approach you are already using, so the rows land intact in the data lake. You can also look at the "Escape character" and "Quote character" settings, although those only help if the source files actually escape or quote the embedded delimiter.
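    For reference, these settings live on the DelimitedText dataset definition. A minimal sketch of the format-related properties, with \u0001 standing in for "a delimiter that never occurs in the data" and the escape/quote settings shown for completeness (the values are illustrative only):

```json
{
    "type": "DelimitedText",
    "typeProperties": {
        "columnDelimiter": "\u0001",
        "quoteChar": "\"",
        "escapeChar": "\\",
        "firstRowAsHeader": false
    }
}
```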

    For your second question, you can use data flow parameters to avoid hard-coding file-specific values: pass the delimiter (and the value to replace) in as parameters and reference them in the Derived Column expressions, which makes the data flow reusable for other files. Keep in mind that a Derived Column still defines a fixed set of output columns, so the list of target columns has to be known at design time; the parameters make the delimiter and replacement configurable rather than the column count. Rule-based (pattern) mapping in a Select transformation can also help rename or project the resulting columns without listing them one by one.
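    As a sketch, assuming string parameters named $delimiter, $badValue and $fixedValue are defined on the data flow (the names are hypothetical), the Derived Column expressions from your step 3 become:

```
col1 : split(replace(col, $badValue, $fixedValue), $delimiter)[1]
col2 : split(replace(col, $badValue, $fixedValue), $delimiter)[2]
col3 : split(replace(col, $badValue, $fixedValue), $delimiter)[3]
```

    The pipeline then passes different parameter values per file, and the expressions themselves never change.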

    Regarding your third question, the number of output files is driven by the sink partitioning: with "Use current partitioning" you get one file per underlying Spark partition, so several files can appear even for a single input file. To keep a one-to-one mapping with the SFTP files, capture the source file name into a column (the data flow source settings have an option to store the file name in a column) and use the sink's file name option to name the output files from that column; each input file then produces a matching output file in the archive.
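    For example, if the source settings store the file path in a (hypothetical) column named sourceFile, a Derived Column can keep just the file name, and the sink's file-name-from-column option then writes one output file per input file:

```
// keep only the last path segment (data flow array indexes are 1-based)
fileName : split(sourceFile, '/')[size(split(sourceFile, '/'))]
```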

    I hope this information helps you. If you have any further questions or concerns, please feel free to ask. We are always here to assist you.

