How to fetch the schema of a file stored in Azure Blob using ADF

Madan M R 0 Reputation points
2025-04-03T07:47:02.5266667+00:00

Hi,

I'm sending data from Azure Blob to AWS S3 (since AWS S3 isn't supported as a sink dataset in the Copy Data activity, I'm using Mapping Data Flows).

I also need to write the schema to the sink, so my pipeline has to write the data to one directory and a schema file to another directory.

The Get Metadata activity doesn't provide the schema for an Azure Blob source.

How can I achieve this in ADF?

I appreciate any help on this.

Detailed scenario and info:

Source dataset: Azure Blob, JSON format, document form: Document per line.

Sink dataset: AWS S3, Parquet format, compression type: snappy.

Thanks

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Venkat Reddy Navari 2,885 Reputation points Microsoft External Staff Moderator
    2025-04-04T13:58:36.8+00:00

    Hi @Madan M R
    The Derived Column transformation you tried is meant for creating or modifying data columns, not for generating a separate schema file dynamically.

    Two Ways to Solve This

    If Your Schema Doesn’t Change Often (Static Approach)

    If your JSON file’s structure is pretty consistent, you can manually define the schema in your Data Flow:

    1. In your Derived Column transformation (derivedColumn1), create a new column (e.g., SchemaOutput) to hold the schema as a JSON string. You can write an expression like this:
         '{"columns":[{"name":"Sch_ClientBrowser","dataType":"NVARCHAR","primaryKey":true},{"name":"Sch_ClientCity","dataType":"NVARCHAR","primaryKey":true}]}'
      
      This hardcodes the schema you want (like Sch_ClientBrowser and Sch_ClientCity from your screenshot); a small helper for generating this string is sketched after this list.
    2. Add a second Sink in your Data Flow: Branch off a new stream (use the “+” to add a new branch after the Derived Column). In this new Sink, write the SchemaOutput column to AWS S3 in a separate directory (e.g., s3://your-bucket/schema/Schema.json). Set the Sink format to JSON or delimited text, and make sure it writes just one file (you can set the file name to Schema.json in the Sink settings).
    3. Your existing Sink (SinkS3) can keep writing the actual data to its own directory as Parquet.
      This approach works if your schema is fixed, but if your JSON file’s structure changes often, it’s not ideal because you’d have to keep updating the hardcoded schema.
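
    If you'd rather not hand-escape the quotes in that expression, you can build the string once outside ADF and paste the printed result into the Derived Column expression. Below is a minimal local sketch in plain Python (standard library only); the helper itself is hypothetical and not part of ADF, and the column names are just the examples from above:

        # Hypothetical local helper: build the schema JSON string once, then
        # paste the printed value into the Derived Column expression.
        import json

        columns = [
            {"name": "Sch_ClientBrowser", "dataType": "NVARCHAR", "primaryKey": True},
            {"name": "Sch_ClientCity", "dataType": "NVARCHAR", "primaryKey": True},
        ]

        schema_string = json.dumps({"columns": columns})
        print(schema_string)  # wrap the output in single quotes inside the expression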

    If Your Schema Changes (Dynamic Approach)

    If your JSON file’s schema might change, we need a more flexible solution. Mapping Data Flows alone can’t dynamically extract the schema as a JSON object, but we can use an extra step in your ADF pipeline:

    Add an Azure Function to Extract the Schema

    • Create a small Azure Function (or ask a developer to help) that reads your JSON file from Azure Blob Storage and generates the schema in the format you want. For example, it can list all columns and their data types (a rough sketch is shown after these bullets).
    • In your ADF pipeline, add an Azure Function activity before your Mapping Data Flow. This activity calls the function, gets the schema as a JSON string, and stores it in a pipeline variable.
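
    As a rough illustration, here is what such a function could look like in Python (HTTP trigger). The container/blob parameters, the STORAGE_CONNECTION_STRING app setting, and the type mapping are all assumptions you'd adapt to your own storage account and file layout:

        # Hypothetical HTTP-triggered Azure Function that reads the first line of a
        # "document per line" JSON blob and returns an inferred schema as JSON.
        import json
        import os

        import azure.functions as func
        from azure.storage.blob import BlobClient


        def infer_type(value):
            # Very rough JSON -> SQL-ish type mapping; extend as needed.
            if isinstance(value, bool):
                return "BIT"
            if isinstance(value, int):
                return "BIGINT"
            if isinstance(value, float):
                return "FLOAT"
            return "NVARCHAR"


        def main(req: func.HttpRequest) -> func.HttpResponse:
            container = req.params.get("container", "source")
            blob_path = req.params.get("blob", "data/input.json")

            blob = BlobClient.from_connection_string(
                os.environ["STORAGE_CONNECTION_STRING"], container, blob_path
            )
            # Document-per-line JSON: one record per line, so the first line is enough.
            first_line = blob.download_blob().readall().decode("utf-8").splitlines()[0]
            record = json.loads(first_line)

            schema = {
                "columns": [
                    {"name": name, "dataType": infer_type(value), "primaryKey": False}
                    for name, value in record.items()
                ]
            }
            return func.HttpResponse(json.dumps(schema), mimetype="application/json")

    The Azure Function activity expects the function to return a JSON object, so returning the schema object directly (as above) keeps the pipeline side simple.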

    Pass the Schema to Your Data Flow

    • In your Mapping Data Flow, add a parameter (e.g., schemaJson) to hold the schema string from the Azure Function (the pipeline-side expression for this is noted after these bullets).
    • Use a Derived Column transformation to create a new column (e.g., SchemaOutput) and set its value to the schemaJson parameter.
    • Add a second Sink (like in Option 1) to write this SchemaOutput column to AWS S3 as Schema.json.
    • Your main data stream can still write to AWS S3 as Parquet, just like you’re doing now.
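
    On the pipeline side, the wiring is just an expression on the Data Flow activity's schemaJson parameter. Assuming the Azure Function activity is named "Get Schema" (a hypothetical name), something like @string(activity('Get Schema').output) can be assigned to the parameter; the exact property path depends on how your function shapes its response, so check the activity's output in a debug run first.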

    What You’ve Already Done

    In your screenshot, I see you used the Derived Column transformation to create columns like Sch_ClientBrowser and Sch_ClientCity. That’s perfect for transforming your data, but for the schema file, we need a separate stream to write the schema JSON, as I described above.

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    1 person found this answer helpful.
