How to store the source file name as a new column

Charlie 0 Reputation points
2023-10-04T03:10:45.1733333+00:00

Seeking help on a Data Factory setup. I'd appreciate any help here.

Currently I have a file named Sales_20230810_020054 (pattern: Sales_<yyyymmdd>_<hhmmss>). It is generated on a daily basis and stored in an Azure storage account.

In the storage account, we have this folder hierarchy

Sales

---Data_Source

------20230810

---------Sales_20230810_020054

---Data_Output

------20230810

You can expect a new sales file to be put in the folder each day.

I have the following questions about setting up the Azure pipeline for this situation:

  1. How can I dynamically locate files matching the pattern Sales_<yyyymmdd>_<hhmmss> in the dataset settings? Can I set up just one dataset for that?
  2. How can I extract the filename (e.g. Sales_20231004_020054) and append it to the daily sales file as a new column?
  3. Can I set up one pipeline to achieve the above two goals, or do I need a separate pipeline?
Azure Data Factory

2 answers

  1. Nandan Hegde 31,511 Reputation points MVP
    2023-10-04T03:33:59.2033333+00:00

    Hey,

    To answer your queries:

    1. You can create a parameterized dataset wherein the file path is dynamic, and you can pass the variable values at run time.

       You can use a Set Variable activity at the start of the pipeline to set the file name and file path that you expect for that iteration (see the sketch after this list).

    2. As you yourself would be declaring the filename via the Set Variable activity, you can use that value and leverage the additional columns functionality within the copy activity:
       https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview#add-additional-columns-during-copy
    3. Yes, the same pipeline can have end-to-end integration.
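
    To make this concrete, here is a minimal sketch of points 1 and 2 in pipeline JSON. The names Set sourceFolder, sourceFolder, sourceFileName and SourceFileName are hypothetical, and sourceFileName is assumed to have been set earlier in the pipeline (for example by another Set Variable or a Get Metadata activity).

    A Set Variable activity that builds the dated folder path, which you can then pass into the parameterized dataset's folder-path parameter:

    {
        "name": "Set sourceFolder",
        "type": "SetVariable",
        "typeProperties": {
            "variableName": "sourceFolder",
            "value": {
                "value": "@concat('Sales/Data_Source/', formatDateTime(utcNow(), 'yyyyMMdd'))",
                "type": "Expression"
            }
        }
    }

    And the copy activity source using the additional columns feature to stamp the file name onto every copied row:

    "source": {
        "type": "DelimitedTextSource",
        "additionalColumns": [
            {
                "name": "SourceFileName",
                "value": {
                    "value": "@variables('sourceFileName')",
                    "type": "Expression"
                }
            }
        ]
    }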

  2. Smaran Thoomu 12,620 Reputation points Microsoft Vendor
    2023-10-04T13:49:48.1233333+00:00

    Hi @Charlie, Welcome to the Microsoft Q&A platform, and thanks for posting your question here.

    I understand that you're looking for help setting up an Azure pipeline to dynamically locate a file pattern in a dataset, extract the filename, and append it to the daily sales file as a new column. I'll be happy to assist you with that.

    • To answer your first question, you can dynamically locate the file pattern with a single dataset by pointing the dataset at the Data_Source folder and applying a wildcard file path on the copy activity's source. A wildcard is a character that can represent one or more characters in a file or folder name. In your case, you can use the asterisk (*) wildcard to represent the variable parts of the file name, such as the date and time.
      Example of how to set up the dataset:
    {
        "name": "SalesDataset",
        "properties": {
            "type": "DelimitedText",
            "linkedServiceName": {
                "referenceName": "AzureBlobStorageLinkedService",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "sales",
                    "folderPath": "Data_Source"
                },
                "columnDelimiter": ",",
                "firstRowAsHeader": true
            }
        }
    }
    

    In this example, the dataset points at the Data_Source folder (assuming Sales is your blob container). The wildcard file name Sales_*_* is then applied on the copy activity's source (shown in the next example), and it matches any file name that starts with "Sales" followed by two wildcard segments separated by underscores: the date part (yyyymmdd) and the time part (hhmmss).

    • To answer your second question, you can extract the filename and append it to the daily sales file as a new column by using the additional columns feature of the copy activity. It lets you add a new column to the copied data and set its value to the reserved variable $$FILEPATH, which holds the relative path (including the file name) of each source file. (If you are already using mapping data flows, a derived column transformation in the data flow can achieve the same thing.)

    Example of how you can add a new column with the filename:

    {
        "name": "SalesPipeline",
        "properties": {
            "activities": [
                {
                    "name": "SalesCopyActivity",
                    "type": "Copy",
                    "inputs": [
                        {
                            "referenceName": "SalesDataset",
                            "type": "DatasetReference"
                        }
                    ],
                    "outputs": [
                        {
                            "referenceName": "SalesOutputDataset",
                            "type": "DatasetReference"
                        }
                    ],
                    "typeProperties": {
                        "source": {
                            "type": "DelimitedTextSource",
                            "storeSettings": {
                                "type": "AzureBlobStorageReadSettings",
                                "recursive": true,
                                "wildcardFolderPath": "Data_Source/*",
                                "wildcardFileName": "Sales_*_*"
                            },
                            "additionalColumns": [
                                {
                                    "name": "Filename",
                                    "value": "$$FILEPATH"
                                }
                            ]
                        },
                        "sink": {
                            "type": "DelimitedTextSink",
                            "storeSettings": {
                                "type": "AzureBlobStorageWriteSettings"
                            },
                            "formatSettings": {
                                "type": "DelimitedTextWriteSettings",
                                "fileExtension": ".csv"
                            }
                        }
                    }
                }
            ]
        }
    }
    

    In this example, the pipeline contains a copy activity that reads every file matching Data_Source/*/Sales_*_* through the SalesDataset connection and writes to the SalesOutputDataset. The additionalColumns setting adds a new column called "Filename" to the copied data and fills it with $$FILEPATH, the relative path (including the file name) of the source file each row came from. If you only want the bare file name without the folder part, you can trim it downstream, for example with a derived column in a mapping data flow.

    • To answer your third question, yes, you can set up one pipeline to achieve both goals. The example pipeline above uses the single wildcard dataset to locate the daily file and the copy activity to copy the data and add the filename column.
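
    If you also want the output to land in the dated Data_Output folder from your hierarchy, you can parameterize the sink dataset and pass the dated path from the copy activity. A minimal sketch of the outputs section, assuming SalesOutputDataset defines a string parameter named outputFolder (a hypothetical name):

    "outputs": [
        {
            "referenceName": "SalesOutputDataset",
            "type": "DatasetReference",
            "parameters": {
                "outputFolder": {
                    "value": "@concat('Data_Output/', formatDateTime(utcNow(), 'yyyyMMdd'))",
                    "type": "Expression"
                }
            }
        }
    ]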

    References: Add additional columns during copy, Derived column transformation in mapping data flow
    I hope this information helps you set up your Azure pipeline. Let me know if you have any further questions or need additional assistance.
