Copy Multiple files into ADL Gen2

Ramya Harinarthini_MSFT 5,306 Reputation points Microsoft Employee
2020-05-11T06:31:21.043+00:00

I have a Data Factory pipeline that currently copies files daily from a Google Cloud Storage account down to an Azure Blob Storage account with ADLS Gen2 enabled.

The source has several different files (File1, File2, File3, etc.), all with a date range in the file name, e.g. File1_20200101_20200102.csv.gzip; they are .csv files, gzip-compressed.

I was able to connect using a Binary source and Binary sink and just grab all files that were created/modified yesterday. As part of the sink, I also decompress the files so they land as plain .csv.
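The source side of that copy activity looks roughly like this (a simplified sketch, not my exact definition; the modified-datetime window is computed per run):

{
    "type": "BinarySource",
    "storeSettings": {
        "type": "GoogleCloudStorageReadSettings",
        "recursive": true,
        "wildcardFileName": "*.csv.gzip",
        "modifiedDatetimeStart": "2020-01-01T00:00:00Z",
        "modifiedDatetimeEnd": "2020-01-02T00:00:00Z"
    }
}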

I want to make sure I'm setting up the folder structure in blob storage correctly for it to function as a data lake. Today the files land at:

BlobContainer1/RAW/GoogleSource/File1_20200101_20200102.csv.gzip

From what I'm reading, I should probably have BlobContainer1/RAW/GoogleSource/File1/{year}/{month}/{day}/File1_20200101_20200102.csv.gzip instead. Is that correct?

If so, is it possible to dynamically determine the folder path from each file name as it is pulled in, or do I have to create a separate copy pipeline for each file being copied over?

[Note: As we migrate from MSDN, this question has been posted by an Azure Cloud Engineer as a frequently asked question]

MSDN Source: Copy Multiple files into ADL Gen2


Accepted answer
  1. ChiragMishra-MSFT 951 Reputation points
    2020-05-11T06:34:39.613+00:00

    Welcome to the Microsoft Q&A (Preview) platform.

    Happy to answer your query.

    It sounds like you want your data partitioned similarly to how Hadoop or Synapse saves data. To do this, I recommend using Mapping Data Flows, which exposes the partitioning options used by distributed-computing frameworks.
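
    For context, "partitioned the way Hadoop saves data" usually means Hive-style key=value folders, which Spark (the engine underneath Mapping Data Flows) writes through its sink partitioning options. For File1 that layout would look like this (the part-file names are chosen by the engine):

        BlobContainer1/RAW/GoogleSource/File1/year=2020/month=01/day=01/part-00000.csv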

    MSDN Source: Copy Multiple files into ADL Gen2


1 additional answer

  1. Utkarsh Sharma 41 Reputation points
    2020-05-30T18:44:02.26+00:00

    Hi,

    Since you have ADLS Gen2 enabled, I would recommend using Azure Data Factory to create the folders in your storage account. You can use a Copy activity, extract the "Year", "Month" and "Day" parts from your source file name, and create a hierarchy at the destination, i.e. ADLS Gen2.
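
    For example (a minimal sketch, not from the original thread: it assumes a Get Metadata activity feeding a ForEach, so that item().name is the source file name, and all names here are illustrative), the sink folder path can be derived from the file name with pipeline expressions:

        @concat('RAW/GoogleSource/',
                split(item().name, '_')[0], '/',
                substring(split(item().name, '_')[1], 0, 4), '/',
                substring(split(item().name, '_')[1], 4, 2), '/',
                substring(split(item().name, '_')[1], 6, 2), '/')

    For File1_20200101_20200102.csv.gzip this evaluates to RAW/GoogleSource/File1/2020/01/01/, so a single parameterized pipeline can handle every file rather than one pipeline per file.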

    You may also refer to the dataset definition below to create time-based partitions:

    {
        "name": "AzureOutput",
        "properties": {
            "type": "AzureBlob",
            "linkedServiceName": "ADLSLinkedService",
            "typeProperties": {
                "folderPath": "BlobContainer1/RAW/GoogleSource/File1/yearno={Year}/monthno={Month}/dayno={Day}/",
                "partitionedBy": [
                    {
                        "name": "Year",
                        "value": {
                            "type": "DateTime",
                            "date": "SliceStart",
                            "format": "yyyy"
                        }
                    },
                    {
                        "name": "Month",
                        "value": {
                            "type": "DateTime",
                            "date": "SliceStart",
                            "format": "%M"
                        }
                    },
                    {
                        "name": "Day",
                        "value": {
                            "type": "DateTime",
                            "date": "SliceStart",
                            "format": "%d"
                        }
                    }
                ]
            }
        }
    }
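
    Note that "partitionedBy" and "SliceStart" come from the Data Factory v1 dataset model. In current (v2) Data Factory, the same layout is typically achieved with a parameterized dataset whose folder path is built from an expression, along these lines (a sketch; windowStart is an assumed pipeline parameter):

        @concat('RAW/GoogleSource/File1/',
                formatDateTime(pipeline().parameters.windowStart, 'yyyy/MM/dd'), '/')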
