How to handle dynamic structure of files getting read in azure synapse serverless

Priya Jha 871 Reputation points
2023-05-15T06:46:13.0466667+00:00

Hi All,

I have a requirement wherein we are using azure synapse serverless tables to read appending csv files present in azure data lake.

The schema of the files is not constant and columns can be added or deleted from the newly created files.

How can we handle this scenario wherein we would be able to read old as well new files present in data lake in azure synapse serverless tables?

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,654 questions
{count} votes

Accepted answer
  1. Vedant Desai 651 Reputation points
    2023-05-15T11:32:27.3633333+00:00

    Hey Priya!
    To handle the dynamic structure of files in Azure Synapse serverless, you can follow these steps:

    1. External Table: Create an external table in Azure Synapse serverless SQL pool that references the CSV files stored in Azure Data Lake. This external table acts as a logical representation of the data in the files without physically importing them into the SQL pool.
    2. Schema Drift: Since the schema of the files can change with new columns being added or existing columns being removed, you need to handle the schema drift in your external table. By default, the external table in Azure Synapse assumes a fixed schema. However, you can use the OPENROWSET function with the WITH (FORMAT='CSV') option to handle schema drift. This allows the external table to read files with varying column structures.
    3. Column Projection: While querying the external table, you can use column projection to filter out columns that may not exist in all the files. This helps to handle scenarios where certain columns are present in some files but not in others.
    4. Flexible Querying: As the schema can change, it's essential to have flexible querying in place. You can use dynamic SQL or dynamic scripting techniques to generate and execute queries based on the available columns in each file. This allows you to dynamically adapt to the varying schema of the files.

    ETL Processes: If you require a more structured approach and need to transform the data before loading it into Azure Synapse, you can implement ETL (Extract, Transform, Load) processes using tools like Azure Data Factory or Azure Databricks. These tools provide more flexibility to handle schema drift and perform transformations on the data before loading it into the serverless SQL pool.

    By following these steps, you can handle the dynamic structure of files in Azure Synapse serverless and read both old and new files present in Azure Data Lake. Remember to plan for schema drift, use flexible querying techniques, and consider implementing ETL processes if needed.


1 additional answer

Sort by: Most helpful
  1. Nandan Hegde 30,951 Reputation points MVP
    2023-05-19T06:19:41.1333333+00:00

    Hey
    in addition to what Vedant has described, below blog explains it in detail w.r.t all scenarios :

    https://www.serverlesssql.com/schema-versioning-in-serverless-sql-pools/

    3 people found this answer helpful.
    0 comments No comments