Access all files from folder and subfolders in blob storage by pipeline/data-flow

Gerald Rupp 130 Reputation points
2023-02-20T16:51:57.5066667+00:00

Hi everybody,

I have a folder in blob storage that contains many subfolders with json files. Each of these json files has the attribute "Type". The Type attribute can be "A", "B" or "C". I want to filter the json files by these types and store them in the blob storage folders "Type_A", "Type_B" or "Type_C".

My problem is accessing all the json files, from the folder itself and(!) from the subfolders, and running a data flow to filter them.

I tried to use Get Metadata, but my subfolders have several layers: year - month - day - json_files.

I tried to implement ForEach, but I cannot nest a ForEach activity inside another ForEach activity.

Does anybody have another idea?
Thanks a lot for your help.

Kind regards,
Gerald

Accepted answer
  1. KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator
    2023-02-22T06:47:10.87+00:00

    Hi @Gerald Rupp ,

    Welcome to Microsoft Q&A forum and thanks for reaching out here.

    As per my understanding, you have json files located in your storage account in the below folder structure, and a sample file looks like below. Please correct me if I'm wrong.

    Folder Structure:

    Container
        Root
            Year
                Month
                    Date
                        File1.json
                        ....
                        ....
                        FileN.json

    Assuming your sample file looks like below:

    [{
        "Type": "A",
        "Data": {
            "Attribute11" : "Value11",
            "Attribute21" : "Value21"
        }
    }, 
    {
        "Type": "B",
        "Data": {
            "Attribute31" : "Value31",
            "Attribute41" : "Value41"
        }
    }, 
    {
        "Type": "C",
        "Data": {
            "Attribute51" : "Value51",
            "Attribute61" : "Value61"
        }
    }
    ]
    

    And you would like to separate the data based on the Type attribute (which could be A, B or C), create a folder per type, and save the matching data in that folder. Please correct me if I'm wrong anywhere.

    To achieve the above requirement, the best way is to use a Mapping data flow, as it reduces the complexity and also lets you transform the data as per your custom requirements.

    Steps to be followed: In the dataset configuration, just provide the container name and leave the directory and file name empty, as you will configure them in the data flow using a wildcard path.

    (Screenshot: source dataset configuration with only the container name provided)

    Then, in the Mapping data flow source settings, configure the source options as below: use wildcard paths, provide the Partition root path, and select the document form of your source.

    (Screenshot: data flow source options with wildcard paths, partition root path and document form)
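    For reference, the data flow script behind such a source (visible via the Script button on the data flow canvas) looks roughly like the sketch below. The wildcard pattern Root/*/*/*/*.json, the partition root path Root and the stream name jsonSource are placeholders for your own layout, and the exact property names can vary with your dataset type, so please compare with the script ADF generates for you:

    source(allowSchemaDrift: true,
        validateSchema: false,
        documentForm: 'arrayOfDocuments',
        wildcardPaths:['Root/*/*/*/*.json'],
        partitionRootPath: 'Root') ~> jsonSource
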

    Next, add a conditional split transformation, which will split the data from all your source files based on the Type attribute so that it can be written to the respective folders in your storage account.

    (Screenshot: conditional split transformation with one output stream per Type value)
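    In data flow script terms, a conditional split on the Type attribute could look like the sketch below (the stream names TypeA, TypeB and TypeC are illustrative; rows with Type = 'C' fall through to the last, default stream):

    jsonSource split(Type == 'A',
        Type == 'B',
        disjoint: false) ~> splitByType@(TypeA, TypeB, TypeC)
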

    Then add a Select transformation on each stream to keep only the columns that you would like to copy to each sink folder, as shown above.
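    A matching Select and sink for one of the branches could look roughly like below (repeat per branch). This is only a sketch: the folders Type_A, Type_B and Type_C are configured on the respective sink datasets (or in the sink settings if you use inline datasets), and the column list here simply reflects the two columns in the sample file above:

    TypeA select(mapColumn(
            Type,
            Data
        ),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> selectTypeA
    selectTypeA sink(allowSchemaDrift: true,
        validateSchema: false,
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> sinkTypeA
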

    Hope this helps. Kindly let us know if your requirement is different from my understanding. If that's the case, please share a few additional details about the requirement and we would be happy to assist accordingly.


    Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.

