Cleanse CSV data in data factory pipeline

Dave Gray 586 Reputation points
2023-08-03T08:40:42.9766667+00:00

Hello

I have an issue where non-printable characters in a column of a CSV file are breaking my ADF pipeline's copy activity. I have approximately 40 columns, and when these characters are encountered the pipeline thinks there are more columns than it has mapped and bombs out.

The data I'm ingesting comes from a third-party source (ServiceNow), and the rogue column is a free-text field where users can enter text, including any number of non-printable characters. This is an example error, which occurred 258 rows into the input:

Operation on target Copy IncidentText failed: ErrorCode=DelimitedTextMoreColumnsThanDefined,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error found when processing 'Csv/Tsv Format Text' source 'incidentTEXT.csv' with row number 258: found more columns than expected column count 2.,Source=Microsoft.DataTransfer.Common,'

I looked at the data in a hex viewer, and it shows the unprintable characters in the offending row.

[Screenshot: hex viewer showing the non-printable characters in the offending row]

Is there a way I can cleanse this column before the copy task within ADF, or will I need to write an Azure Function that runs beforehand to remove TAB, CR, LF, etc.?

Many thanks

Dave

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator
    2023-08-03T21:59:54.5566667+00:00

    @Dave Gray Welcome to the Microsoft Q&A forum, and thanks for reaching out here.

    I would recommend using the Skip incompatible rows feature in the copy activity to log the bad records to a separate file and continue processing all the good records. Once the good records are processed, a custom application or an Azure Function can cleanse the rows that contain non-printable characters so that particular set of records can be reprocessed. If you build an Azure Function, you could add an Azure Function activity after the copy activity to clean up the bad records logged by the copy activity's fault tolerance and then reprocess them.
    [Screenshot: the Copy Data activity's Settings tab with the Fault tolerance configuration highlighted]

    Here is the documentation for the fault tolerance feature of the copy activity: Supported Scenarios - Fault tolerance of copy activity in Azure Data Factory and Synapse Analytics pipelines

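    If you build that Azure Function, the core of the cleanup could look something like the sketch below. This is a minimal illustration, not a tested implementation: it assumes a comma delimiter, that you know the expected column count, that the delimiter never legitimately appears inside a field, and that the rogue characters are ASCII control characters such as TAB, CR, and LF. The function name and file names are made up for the example.

    ```python
    import re

    # ASCII control characters (TAB, CR, LF, etc.). If your delimiter is
    # TAB, exclude \x09 from this class so real delimiters survive.
    CONTROL = re.compile(r"[\x00-\x1f\x7f]")

    def clean_rows(in_path: str, out_path: str,
                   expected_cols: int, delimiter: str = ",") -> None:
        """Re-assemble records split by stray CR/LF, then replace any
        remaining control characters with spaces."""
        with open(in_path, encoding="utf-8") as src, \
             open(out_path, "w", encoding="utf-8") as dst:
            buffer = ""
            for line in src:
                part = line.rstrip("\r\n")
                buffer = part if not buffer else buffer + " " + part
                # A complete record has expected_cols - 1 delimiters;
                # fewer means a rogue line break split the record.
                if buffer.count(delimiter) >= expected_cols - 1:
                    dst.write(CONTROL.sub(" ", buffer) + "\n")
                    buffer = ""
            if buffer:  # trailing partial record, cleansed as-is
                dst.write(CONTROL.sub(" ", buffer) + "\n")

    # Hypothetical file names, matching the ~40-column scenario above.
    clean_rows("incidentTEXT_badrows.csv", "incidentTEXT_clean.csv",
               expected_cols=40)
    ```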

    Hope this helps.


    Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.


1 additional answer

  1. KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator
    2023-08-07T19:03:01.13+00:00

    @Dave Gray Thanks for your response. Unfortunately, the copy activity in ADF doesn't have an out-of-the-box feature to skip a column automatically; the only option is to manually exclude the column from the explicit mapping, if we know which particular columns could cause the issue. Hence the answer is no.

    If you have no control over the application that generates the source file and cannot alter these columns or their data, then it would be best to use an Azure Function or another custom application to get rid of those columns, as this is a data issue beyond ADF's out-of-the-box features.

    Problem: The error message "more columns than expected" means that the values in the columns containing non-printable characters also contain your column delimiter, which is why the error is thrown: if a column delimiter is part of a column value, that row will have more columns than expected. To avoid the issue, either the source data needs to be fixed, or each column value needs to be enclosed in a quote character that does not appear in any of the column values, including the non-printable values in the problematic column.
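
    To make the failure mode concrete, here is a quick illustration using Python's csv module (the incident text is invented): an unquoted delimiter inside a free-text value produces an extra column, while the same value enclosed in quotes parses as intended.

    ```python
    import csv
    import io

    # Unquoted free text containing the delimiter: parses as 3 columns.
    raw = "INC0001,restarted server, then rebooted\n"
    print(next(csv.reader(io.StringIO(raw))))
    # -> ['INC0001', 'restarted server', ' then rebooted']

    # The same value enclosed in quotes: parses as the intended 2 columns.
    quoted = 'INC0001,"restarted server, then rebooted"\n'
    print(next(csv.reader(io.StringIO(quoted))))
    # -> ['INC0001', 'restarted server, then rebooted']
    ```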

    In that case, the source data needs to be cleaned/fixed upfront, before processing it through the ADF copy activity.
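
    If the upstream export does quote its fields (so a standard CSV parser can read the file) but the embedded control characters still break the pipeline, the upfront cleanse can be a small script along these lines: parse each row, replace control characters in every field, and rewrite the file with all fields quoted so the column count stays stable. This is a sketch under those assumptions, and the file names are placeholders.

    ```python
    import csv
    import re

    CONTROL = re.compile(r"[\x00-\x1f\x7f]")  # TAB, CR, LF and friends

    def cleanse(in_path: str, out_path: str) -> None:
        """Rewrite a CSV with control characters replaced by spaces and
        every field quoted, so the copy activity sees a stable column count."""
        with open(in_path, newline="", encoding="utf-8") as src, \
             open(out_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
            for row in csv.reader(src):
                writer.writerow([CONTROL.sub(" ", field) for field in row])

    cleanse("incidentTEXT.csv", "incidentTEXT_clean.csv")  # names illustrative
    ```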

    This issue is similar to the ones discussed in the threads below; please review them for additional context:

    Error code: DelimitedTextMoreColumnsThanDefined

    CSV Delimited Text More Columns Than Defined - Data Factory

    Parsing CSV failing from sftp to blob

    Hope this information helps. Please let me know if you have any further questions.

    We have also received your feedback that the answer provided on this thread was not helpful. I have now posted a detailed answer that explains the problem and how to handle it.

    Kindly let us know what we could have done better to improve the answer and your engagement experience. We are here to help and greatly value your feedback.

    If you wish, you may consider re-rating the engagement you received on this thread.

    Looking forward to your reply!



