How to handle dynamic schema for separate header and data files

Priya Jha 851 Reputation points


I am receiving 2 separate files in my ADLS: a JSON header file and a data file. I want to read these 2 files using a single query in an Azure Synapse serverless pool.

Also, if the schema of the files changes, how can we union all of the header and data files and read them with a single query in Azure Synapse serverless when the schema is dynamic?

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

1 answer

Sort by: Most helpful
  1. Amira Bedhiafi 4,081 Reputation points

    Step 1:

    Ensure that your files follow a consistent naming or storage pattern. For example, header files can follow a pattern like header_YYYYMMDD.json, with the corresponding data files named data_YYYYMMDD.csv.
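    As a minimal sketch of that convention, the snippet below pairs header and data files that share the same date suffix. The `pair_files` helper and the exact regex patterns are illustrative assumptions, not part of any Azure SDK:

    ```python
    import re

    def pair_files(paths):
        """Pair header_YYYYMMDD.json files with their data_YYYYMMDD.csv
        counterparts based on the shared date suffix (assumed convention)."""
        headers, datas = {}, {}
        for p in paths:
            m = re.search(r"header_(\d{8})\.json$", p)
            if m:
                headers[m.group(1)] = p
                continue
            m = re.search(r"data_(\d{8})\.csv$", p)
            if m:
                datas[m.group(1)] = p
        # only dates that have both files form a complete pair
        return {d: (headers[d], datas[d]) for d in headers.keys() & datas.keys()}
    ```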

    You can use the OPENROWSET function to read files in serverless SQL. Here's an example of how you might read the header and data files (substitute your own storage account and path):

    -- Read the header file. Serverless OPENROWSET has no JSON format,
    -- so the file is read as a single text column (to be parsed with OPENJSON).
    SELECT headers.jsonContent
    FROM OPENROWSET(
            BULK 'https://<storage-account>.dfs.core.windows.net/adftutorial/salessmall/header_YYYYMMDD.json',
            FORMAT = 'CSV',
            FIELDTERMINATOR = '0x0b',
            FIELDQUOTE = '0x0b'
    ) WITH (jsonContent NVARCHAR(MAX)) AS headers;

    -- Read the data file
    SELECT *
    FROM OPENROWSET(
            BULK 'https://<storage-account>.dfs.core.windows.net/adftutorial/salessmall/data_YYYYMMDD.csv',
            FORMAT = 'CSV',
            PARSER_VERSION = '2.0',
            FIELDTERMINATOR = ',',
            ROWTERMINATOR = '\n',
            FIRSTROW = 2
    ) AS data;

    To deal with the dynamic schema, consider the following:

    1. Header File as the Schema Source: Treat the header file as a source of truth for your schema. Before querying the data file, parse the header file to determine the structure of the data file.
    2. Query Generation: Once you have the schema from the header, you can generate the required query dynamically to fetch data based on the schema.
    3. Schema Evolution: When unioning data files with different schemas, you'll need to identify missing columns in each dataset and then add them with default or null values. This way, you can generate a union-compatible schema for all files.
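    The first two points above can be sketched outside of Synapse. This example assumes a hypothetical header layout of `{"columns": [{"name": ..., "type": ...}, ...]}`; the `build_select` helper and the header structure are illustrative assumptions, not a fixed API:

    ```python
    import json

    def build_select(header_json, data_url):
        """Generate an OPENROWSET query whose WITH clause mirrors the
        schema described in the header file (assumed layout:
        {"columns": [{"name": ..., "type": ...}, ...]})."""
        cols = json.loads(header_json)["columns"]
        with_clause = ",\n    ".join(f'[{c["name"]}] {c["type"]}' for c in cols)
        return (
            "SELECT *\n"
            "FROM OPENROWSET(\n"
            f"    BULK '{data_url}',\n"
            "    FORMAT = 'CSV',\n"
            "    PARSER_VERSION = '2.0',\n"
            "    FIRSTROW = 2\n"
            f") WITH (\n    {with_clause}\n) AS data"
        )
    ```

    The generated string can then be submitted to the serverless pool as a regular query, so a schema change in the header automatically changes the generated WITH clause.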

    Step 2:

    To union files, especially when the schema might change, you can perform a union operation with the assumption that the columns might not be the same across all files. You'd likely need a dynamic SQL generation mechanism, perhaps outside of Synapse, to adjust the SELECT statement columns according to the schema present in each header.

    For instance:

    1. Parse each header file to get a list of columns.
    2. Use that column list to build a dynamic SQL query for each data file.
    3. Introduce null values for missing columns.
    4. Use UNION ALL to combine all datasets.
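    Steps 2-4 above can be sketched as follows. The `build_union` helper is a hypothetical name; it takes the column lists already parsed from the headers and pads missing columns with NULL so every branch of the UNION ALL has the same shape:

    ```python
    def build_union(file_schemas):
        """file_schemas: list of (data_url, [column names]) pairs.
        Emits one SELECT per file, padding missing columns with NULL,
        joined with UNION ALL into a single union-compatible query."""
        # the unioned schema is the ordered superset of all columns seen
        all_cols = []
        for _, cols in file_schemas:
            for c in cols:
                if c not in all_cols:
                    all_cols.append(c)
        selects = []
        for url, cols in file_schemas:
            exprs = ", ".join(
                f"[{c}]" if c in cols else f"NULL AS [{c}]" for c in all_cols
            )
            selects.append(
                f"SELECT {exprs} FROM OPENROWSET(BULK '{url}', "
                "FORMAT = 'CSV', PARSER_VERSION = '2.0', FIRSTROW = 2) AS r"
            )
        return "\nUNION ALL\n".join(selects)
    ```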

    Keep in mind that:

    • Dealing with dynamic schemas can introduce complexity, especially if the changes in the schema are frequent or vast.
    • Repeatedly generating dynamic SQL, parsing headers for each query, and other such operations can add performance overhead.
    • Depending on the complexity and frequency of schema changes, you might want to consider third-party ETL or data integration tools that can better handle schema drift without manual intervention.