How to split a CSV file based on columns

Gorakh Nath Sharma 0 Reputation points
2024-05-22T10:50:06.97+00:00

Hi Team,

There is a scenario in which we have to split the columns of a CSV file into separate files in Blob storage. For example, suppose I need to create each file from the combination of (TimeStart + TimeEnd + A), i.e. three columns per file, with the filename taken from the header of column A. The second file would be (TimeStart + TimeEnd + B), again three columns, with the filename taken from the header of column B. So for the file below, 9 files need to be created. I want to do this dynamically, because I don't know how many columns each file has; one file may have 10 columns and another may have 6. The TimeStart and TimeEnd columns, however, are always present.

I can only use ADF for this, so please guide me on how I can do this through ADF only.

User's image

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Amira Bedhiafi 41,121 Reputation points Volunteer Moderator
    2024-05-22T20:06:36.63+00:00

    You can use a Lookup activity to read the header row of the CSV file and dynamically identify its columns (set the First row only option to true).

    Then, add a ForEach activity to iterate over the columns identified by the Lookup activity. Inside the ForEach, add a Data Flow activity that handles the creation of an individual file for each column.

    You can set the Items property to an expression over the Lookup output, e.g. @activity('LookupHeaders').output.value. Note, however, that with First row only enabled the Lookup returns a single object under output.firstRow rather than an array, so the header row first has to be turned into an array of column names.
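    The ForEach needs an array of column names to iterate over. One way to build it (an assumption about your setup, not the only option) is to point the Lookup at a dataset with "First row as header" disabled and a column delimiter that never occurs in the data, so the whole header line lands in the single default column Prop_0, and then split it:

    @split(activity('LookupHeaders').output.firstRow.Prop_0, ',')

    Each item of the resulting array is then one column name, so inside the ForEach, item() evaluates to 'TimeStart', 'TimeEnd', 'A', 'B', and so on.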

    In the Data Flow, you read the CSV file and select the necessary columns (TimeStart, TimeEnd, and the current column from the ForEach activity).

    The Select transformation:

    @concat('TimeStart, TimeEnd, ', item().ColumnName)
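    Note that item() belongs to the pipeline expression language and is not available inside a mapping data flow, so the column name is usually passed in as a data flow parameter (here assumed to be named $columnName) and the selection done with a rule-based mapping. A rough data flow script sketch, not a definitive implementation:

    src select(mapColumn(
        each(match(name == 'TimeStart' || name == 'TimeEnd' || name == $columnName))
    )) ~> SelectColumns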
    

    The Sink transformation:

    @concat(item().ColumnName, '.csv')
    
    

    Then, you write the data to a new CSV file in Azure Blob Storage, naming the file after the current column.
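    For this to work, the current column name has to reach the data flow. One way (sketched here; the parameter name columnName is an assumption) is to declare a string parameter on the data flow and pass item() to it from the Data Flow activity:

    "dataflow": {
      "referenceName": "SplitCSVDataFlow",
      "type": "DataFlowReference",
      "parameters": {
        "columnName": {
          "value": "'@{item()}'",
          "type": "Expression"
        }
      }
    }

    The extra single quotes wrap the interpolated value so it reaches the data flow expression language as a string literal.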

    Here is an example of how your ADF pipeline JSON might look:

    
    {
      "name": "SplitCSVPipeline",
      "properties": {
        "activities": [
          {
            "name": "LookupHeaders",
            "type": "Lookup",
            "typeProperties": {
              "source": {
                "type": "DelimitedTextSource"
              },
              "dataset": {
                "referenceName": "YourCSVFileDataset",
                "type": "DatasetReference"
              },
              "firstRowOnly": true
            }
          },
          {
            "name": "ForEachColumn",
            "type": "ForEach",
            "dependsOn": [
              {
                "activity": "LookupHeaders",
                "dependencyConditions": [ "Succeeded" ]
              }
            ],
            "typeProperties": {
              "items": {
                "value": "@split(activity('LookupHeaders').output.firstRow.Prop_0, ',')",
                "type": "Expression"
              },
              "activities": [
                {
                  "name": "DataFlowActivity",
                  "type": "ExecuteDataFlow",
                  "typeProperties": {
                    "dataflow": {
                      "referenceName": "SplitCSVDataFlow",
                      "type": "DataFlowReference"
                    }
                  }
                }
              ]
            }
          }
        ],
        "variables": []
      }
    }
    
    
    
