ADF/Synapse: Loop through JSON files with different structures

WZIAFP 157 Reputation points
2022-02-04T11:18:49.497+00:00

I have a large number of JSON files within my ADLS storage account that need to be processed into a Synapse SQL pool. Each JSON file represents a dataset that is destined for its own SQL table.
The core JSON format is shown below; however, the fields within the results node differ from file to file. Here are examples from two different files.

File 1

    {
        "results": [
            {
                "COL1": 0.0,
                "COL2": "",
                "COL3": "",
                "COL4": 123,
                "COL5": ""
            }
        ]
    }

File 2

    {
        "results": [
            {
                "COL21": 125820.201,
                "COL22": "",
                "COL23": "",
                "COL24": 0.5,
                "COL25": "",
                "COL26": "Free Text",
                "COL27": "",
                "COL28": "Description",
                "COL29": "Provider",
                "COL31": "Test Account",
                "COL32": 2342345,
                "COL33": "GBP"
            }
        ]
    }
    

I am comfortable with creating an individual Data Flow activity to flatten the data and insert it into a table. However, due to the number of JSON files that need to be processed, I would end up with over 75 pipelines, which would become unmanageable.

What steps would I need to take to make this dynamic?
Is ADF the best tool to achieve this outcome? I have very limited experience with Python, but I believe another possible solution would be to use Python to convert the JSON to Parquet and then ingest it into Synapse.
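
If the Python route were pursued, a minimal sketch of that JSON-to-Parquet conversion could look like the following (pandas and pyarrow are assumed to be installed; the file paths are purely illustrative):

    import json
    import pandas as pd  # pandas plus pyarrow for the Parquet output

    def json_to_parquet(json_path, parquet_path):
        # Load the raw JSON document
        with open(json_path, encoding="utf-8") as f:
            doc = json.load(f)

        # Flatten the "results" array into a tabular DataFrame;
        # whichever columns the file happens to contain are picked up automatically
        df = pd.json_normalize(doc["results"])

        # Write out as Parquet, ready to be loaded into the Synapse SQL pool
        df.to_parquet(parquet_path, index=False)

    # Illustrative paths only
    json_to_parquet("file1.json", "file1.parquet")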


1 answer

  1. MartinJaffer-MSFT 26,086 Reputation points
    2022-02-08T01:15:04.353+00:00

    I think I may have to eat my words on this. @Anonymous

    After much fiddling, I found about 80% success. It was only partial success because I needed to define part of the schema. For Flatten to work, the thing you unroll by must be defined in the schema, and for results to be recognized as a complex object, it must have at least one property. This means that if there is one property common to all the files, such as "ID", we can put just that in the schema and leverage schema drift to handle everything else.

    If you want to see what I mean, import the schema, then go to the script and delete some of the members under 'results'. Then enable 'Allow schema drift' and maybe 'Detect drifted types', and check the source preview.
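
    As a quick sanity check before committing to such a shared key, one could verify that it exists in every file. This is only an illustrative sketch; the key name "ID" and the folder path are assumptions:

        import json
        from pathlib import Path

        COMMON_KEY = "ID"  # assumed shared property
        for json_file in Path("landing").glob("*.json"):  # assumed local folder of JSON files
            with open(json_file, encoding="utf-8") as f:
                rows = json.load(f).get("results", [])
            # Flag any file whose results rows are missing the shared key
            if not all(COMMON_KEY in row for row in rows):
                print(f"{json_file.name}: missing '{COMMON_KEY}' in at least one row")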

    To catch all the unknown columns while flattening, I had to use a rule-based mapping with deep traversal. I set the match condition to true() so everything would be caught, and preserved the column names by using $$ for 'Name as'. With deep traversal enabled, I set the hierarchy level to results. (A rough Python analogue of this catch-all mapping is sketched at the end of this answer.)

    [screenshot: Flatten transformation settings showing the rule-based mapping with deep traversal]

    In the sink mapping, I used the auto-mapping option.

    I also tried this with the Copy activity, but that didn't pan out: defining the cross-apply signals that the mapping is defined, which turns off automatic mapping.
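
    For reference, here is a rough Python analogue of that catch-all behaviour, looping over many files and keeping whatever columns each one contains. It is a sketch of the same idea outside ADF, not the data flow itself; the folder names are assumptions, and pandas plus pyarrow are required:

        import json
        from pathlib import Path
        import pandas as pd

        source_dir = Path("landing")   # assumed local copy of the ADLS files
        target_dir = Path("staging")
        target_dir.mkdir(exist_ok=True)

        for json_file in source_dir.glob("*.json"):
            with open(json_file, encoding="utf-8") as f:
                doc = json.load(f)

            # In the same spirit as match(true()) with $$ for the name:
            # every column found under "results" is kept under its original name.
            df = pd.json_normalize(doc.get("results", []))

            # One Parquet file per source file, i.e. one per destination table.
            df.to_parquet(target_dir / (json_file.stem + ".parquet"), index=False)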

