How to fetch the schema of a file stored in Azure Blob using ADF

Madan M R 0 Reputation points
2025-04-03T07:47:02.5266667+00:00

Hi,

I'm sending data from Azure Blob to AWS S3 (since AWS S3 isn't supported as a sink dataset in the Copy Data activity, I'm using Mapping Data Flows).

I also need to write the schema to the sink, so my pipeline has to write the data to one directory and a schema file to another directory.

The Get Metadata activity doesn't provide the schema for an Azure Blob source.

How can I achieve this in ADF?

I appreciate any help on this.

Detailed scenario and info:

Source dataset: Azure Blob, JSON format, document form: Document per line.

Sink dataset: AWS S3, Parquet format, compression type: snappy.

Thanks

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Venkat Reddy Navari 2,885 Reputation points Microsoft External Staff Moderator
    2025-04-04T13:58:36.8+00:00

    Hi @Madan M R
    The Derived Column transformation you tried is meant for creating or modifying data columns, not for generating a separate schema file dynamically.

    Two Ways to Solve This

    If Your Schema Doesn’t Change Often (Static Approach)

    If your JSON file’s structure is pretty consistent, you can manually define the schema in your Data Flow:

    1. In your Derived Column transformation (derivedColumn1), create a new column (e.g., SchemaOutput) to hold the schema as a JSON string. You can write an expression like this:
         '{"columns":[{"name":"Sch_ClientBrowser","dataType":"NVARCHAR","primaryKey":true},{"name":"Sch_ClientCity","dataType":"NVARCHAR","primaryKey":true}]}'
      
      This hardcodes the schema you want (like Sch_ClientBrowser and Sch_ClientCity from your screenshot); a small helper for generating this string is sketched after this list.
    2. Add a second Sink in your Data Flow: Branch off a new stream (use the “+” to add a new branch after the Derived Column). In this new Sink, write the SchemaOutput column to AWS S3 in a separate directory (e.g., s3://your-bucket/schema/Schema.json). Set the Sink format to JSON or delimited text, and make sure it writes just one file (you can set the file name to Schema.json in the Sink settings).
    3. Your existing Sink (SinkS3) can keep writing the actual data to its own directory as Parquet.
      This approach works if your schema is fixed, but if your JSON file’s structure changes often, it’s not ideal because you’d have to keep updating the hardcoded schema.
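
    If you'd rather not hand-escape the quotes in that expression, you can build the string once outside ADF and paste the printed result into the Derived Column expression. Below is a minimal local sketch in plain Python (standard library only); the helper itself is hypothetical and not part of ADF, and the column names are just the examples from above:

        # Hypothetical local helper: build the schema JSON string once, then
        # paste the printed value into the Derived Column expression.
        import json

        columns = [
            {"name": "Sch_ClientBrowser", "dataType": "NVARCHAR", "primaryKey": True},
            {"name": "Sch_ClientCity", "dataType": "NVARCHAR", "primaryKey": True},
        ]

        schema_string = json.dumps({"columns": columns})
        print(schema_string)  # wrap the output in single quotes inside the expression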

    If Your Schema Changes (Dynamic Approach)

    If your JSON file’s schema might change, we need a more flexible solution. Mapping Data Flows alone can’t dynamically extract the schema as a JSON object, but we can use an extra step in your ADF pipeline:

    Add an Azure Function to Extract the Schema

    • Create a small Azure Function (or ask a developer to help) that reads your JSON file from Azure Blob Storage and generates the schema in the format you want. For example, it can list all columns and their data types (a rough sketch is shown after these bullets).
    • In your ADF pipeline, add an Azure Function activity before your Mapping Data Flow. This activity calls the function, gets the schema as a JSON string, and stores it in a pipeline variable.
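
    As a rough illustration, here is what such a function could look like in Python (HTTP trigger). The container/blob parameters, the STORAGE_CONNECTION_STRING app setting, and the type mapping are all assumptions you'd adapt to your own storage account and file layout:

        # Hypothetical HTTP-triggered Azure Function that reads the first line of a
        # "document per line" JSON blob and returns an inferred schema as JSON.
        import json
        import os

        import azure.functions as func
        from azure.storage.blob import BlobClient


        def infer_type(value):
            # Very rough JSON -> SQL-ish type mapping; extend as needed.
            if isinstance(value, bool):
                return "BIT"
            if isinstance(value, int):
                return "BIGINT"
            if isinstance(value, float):
                return "FLOAT"
            return "NVARCHAR"


        def main(req: func.HttpRequest) -> func.HttpResponse:
            container = req.params.get("container", "source")
            blob_path = req.params.get("blob", "data/input.json")

            blob = BlobClient.from_connection_string(
                os.environ["STORAGE_CONNECTION_STRING"], container, blob_path
            )
            # Document-per-line JSON: one record per line, so the first line is enough.
            first_line = blob.download_blob().readall().decode("utf-8").splitlines()[0]
            record = json.loads(first_line)

            schema = {
                "columns": [
                    {"name": name, "dataType": infer_type(value), "primaryKey": False}
                    for name, value in record.items()
                ]
            }
            return func.HttpResponse(json.dumps(schema), mimetype="application/json")

    The Azure Function activity expects the function to return a JSON object, so returning the schema object directly (as above) keeps the pipeline side simple.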

    Pass the Schema to Your Data Flow

    • In your Mapping Data Flow, add a parameter (e.g., schemaJson) to hold the schema string from the Azure Function (the pipeline-side expression for this is noted after these bullets).
    • Use a Derived Column transformation to create a new column (e.g., SchemaOutput) and set its value to the schemaJson parameter.
    • Add a second Sink (like in Option 1) to write this SchemaOutput column to AWS S3 as Schema.json.
    • Your main data stream can still write to AWS S3 as Parquet, just like you’re doing now.
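
    On the pipeline side, the wiring is just an expression on the Data Flow activity's schemaJson parameter. Assuming the Azure Function activity is named "Get Schema" (a hypothetical name), something like @string(activity('Get Schema').output) can be assigned to the parameter; the exact property path depends on how your function shapes its response, so check the activity's output in a debug run first.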

    What You’ve Already Done

    In your screenshot, I see you used the Derived Column transformation to create columns like Sch_ClientBrowser and Sch_ClientCity. That’s perfect for transforming your data, but for the schema file, we need a separate stream to write the schema JSON, as I described above.

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    1 person found this answer helpful.
