Split single-column data into multiple columns in Data Factory

Vaibhav 105 Reputation points
2024-05-03T18:07:32.02+00:00

Issue:
I have some CSV files in an SFTP location.
I was planning to use a Copy activity to load the files from the SFTP location into the data lake, both for archiving and for upserting into Delta Lake. However, I am getting an error while reading from the source, because some files contain the delimiter (|) inside a value and there is no escape character. (I cannot change the source data.)

Solution:
I thought of reading the file without any delimiter so that all the values land in a single column, and loading it in that form into the data lake. Then a data flow reads from the data lake source, uses replace() to change the stray delimiter (|) to a hyphen, and splits the values back into multiple columns so that the output matches the source CSV.

  1. Use the "no delimiter" option in the dataset so that all the column values come in as a single column.
  2. Replace the delimiter that appears inside the value with a hyphen, e.g. replace('aaaa|bbb','aaaa-bbb'); I know the exact value that has the issue.
  3. Split the column back into its original columns (steps 2 and 3 are sketched as Derived Column expressions just after this list).
  4. Sink the data in CSV format.
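In Derived Column terms, steps 2 and 3 look roughly like this (a sketch only: column name on the left, expression on the right, the single incoming column called col as in my expressions below, and the placeholder values from step 2):

```
// Derived column 1 – step 2: fix the known bad value
col  : replace(col, 'aaaa|bbb', 'aaaa-bbb')

// Derived column 2 – step 3: split the single column back into its original columns
// (data flow array indexes are 1-based)
col1 : split(col, '|')[1]
col2 : split(col, '|')[2]
col3 : split(col, '|')[3]
```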

Questions:

  1. Is there a better way to solve this issue in ADF without using additional services?
  2. With the above solution, in step 3 I have to provide the column names with expressions such as split(col,'|')[1], split(col,'|')[2], and so on. Can we avoid hard-coding the column names and do it dynamically, so that if another file has a similar issue I can reuse this data flow with a few parameters?
  3. In the sink location, multiple files are being generated even though I have used "Use current partitioning". I want the number of files in the sink to match the SFTP (source) files: if 5 files are ingested from SFTP, the sink should also contain 5 files. One purpose of the pipeline is to archive the SFTP files in the data lake.
  4. I will also use the same pipeline to upsert the data into Delta Lake, but I think that once the above issues are resolved, the upsert should not be a problem. Kindly suggest.
Azure Data Factory

1 answer

  1. Smaran Thoomu 24,185 Reputation points Microsoft External Staff Moderator
    2024-05-06T06:40:51.2666667+00:00

    Hi @Vaibhav

    Thank you for reaching out to us with your query related to loading CSV files from an SFTP location to a data lake using Azure Data Factory (ADF). I understand that you are facing issues with the delimiter in the source data and have come up with a solution to read the file without any delimiter, load it into the data lake, use data flow to replace the delimiter with a hyphen, split the values into multiple columns, and sink the data in CSV format. I'll be happy to assist you with your questions related to this solution.

    Regarding your first question, there is no direct way to handle this in ADF without additional services. However, you can try the "Column delimiter" setting on the delimited-text dataset: specifying a character that never appears in the data makes each line load as a single column, much like the "no delimiter" approach you are already using, so the rows land intact in the data lake. You can also look at the "Escape character" and "Quote character" settings, although those only help if the source files actually escape or quote the embedded delimiter.
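    For reference, these settings live on the DelimitedText dataset definition. A minimal sketch of the format-related properties, with \u0001 standing in for "a delimiter that never occurs in the data" and the escape/quote settings shown for completeness (the values are illustrative only):

```json
{
    "type": "DelimitedText",
    "typeProperties": {
        "columnDelimiter": "\u0001",
        "quoteChar": "\"",
        "escapeChar": "\\",
        "firstRowAsHeader": false
    }
}
```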

    For your second question, you can use data flow parameters to avoid hard-coding file-specific values: pass the delimiter (and the value to replace) in as parameters and reference them in the Derived Column expressions, which makes the data flow reusable for other files. Keep in mind that a Derived Column still defines a fixed set of output columns, so the list of target columns has to be known at design time; the parameters make the delimiter and replacement configurable rather than the column count. Rule-based (pattern) mapping in a Select transformation can also help rename or project the resulting columns without listing them one by one.
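    As a sketch, assuming string parameters named $delimiter, $badValue and $fixedValue are defined on the data flow (the names are hypothetical), the Derived Column expressions from your step 3 become:

```
col1 : split(replace(col, $badValue, $fixedValue), $delimiter)[1]
col2 : split(replace(col, $badValue, $fixedValue), $delimiter)[2]
col3 : split(replace(col, $badValue, $fixedValue), $delimiter)[3]
```

    The pipeline then passes different parameter values per file, and the expressions themselves never change.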

    Regarding your third question, the number of output files is driven by the sink partitioning: with "Use current partitioning" you get one file per underlying Spark partition, so several files can appear even for a single input file. To keep a one-to-one mapping with the SFTP files, capture the source file name into a column (the data flow source settings have an option to store the file name in a column) and use the sink's file name option to name the output files from that column; each input file then produces a matching output file in the archive.
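    For example, if the source settings store the file path in a (hypothetical) column named sourceFile, a Derived Column can keep just the file name, and the sink's file-name-from-column option then writes one output file per input file:

```
// keep only the last path segment (data flow array indexes are 1-based)
fileName : split(sourceFile, '/')[size(split(sourceFile, '/'))]
```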

    I hope this information helps you. If you have any further questions or concerns, please feel free to ask. We are always here to assist you.

