Synapse Analytics: Data Flow Activity: why does the last column get dropped during source schema import?

Rub Ire Pos 0 Reputation points
2023-02-21T17:05:54.4966667+00:00

Hi

I'm working on a data ingestion task in Synapse Analytics.

I have an issue with schema import while reading a parquet file using a Data Flow Activity. 

I can see all the columns when choosing 'View Schema'; the last column is 'YYYYMM'.


But when I choose 'Import schema', the last column is dropped and I only get the columns above it. The total column count is 63 including the last column, yet after importing the schema I get only 62.

From the source step through to the sink step, I only get 62 columns after importing the schema.

I get a notification when I import the schema in the sink, but no such notification in the source.

I also tried turning off 'Auto mapping', but then the 'YYYYMM' column does not appear in the column-name drop-down to choose from. The column type is Integer.

I can read the column properly when I read using a Synapse notebook via PySpark.
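To pin down which column goes missing, the 'View Schema' list can be diffed against the imported projection. A minimal stdlib Python sketch, using a hypothetical five-column stand-in for the real 63 columns:

```python
# Columns shown by 'View Schema' (hypothetical stand-in for the real 63).
viewed = ["id", "name", "amount", "region", "YYYYMM"]

# Columns actually produced by 'Import schema' (the last one is missing).
imported = ["id", "name", "amount", "region"]

# Any column seen in the data but absent from the projection is "drifted".
drifted = [c for c in viewed if c not in imported]
print(drifted)  # -> ['YYYYMM']
```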

Any suggestions please?

Tags: Azure Synapse Analytics, Azure Data Factory

1 answer

  1. Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator
    2023-02-27T20:47:08.79+00:00

    Hello @Rub Ire Pos,

    Welcome to the MS Q&A platform.

    You seem to be facing an issue with the schema import while reading a parquet file using a Data Flow Activity in Azure Synapse Analytics.

    As per the documentation, columns coming into your data flow from your source that are not present in your source projection are defined as "drifted".

    You can view your source projection from the projection tab in the source transformation. When you select a dataset for your source, the service will automatically take the schema from the dataset and create a projection from that dataset schema definition.

    In a source transformation, schema drift is defined as reading columns that aren't defined in your dataset schema. To enable schema drift, check Allow schema drift in your source transformation.

    In a sink transformation, schema drift is when you write additional columns on top of what is defined in the sink data schema.

    To enable schema drift, check Allow schema drift in your sink transformation.

    If schema drift is enabled, ensure the Auto-mapping slider in the Mapping tab is turned on.

    With this slider on, all incoming columns are written to your destination. Otherwise, you must use rule-based mapping to write drifted columns.
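What the Auto-mapping slider controls can be illustrated outside the data flow itself. A minimal Python sketch, with hypothetical column names (only 'YYYYMM' is drifted here):

```python
# Hypothetical incoming columns; 'YYYYMM' is drifted (not in the sink schema).
incoming = ["id", "name", "amount", "region", "YYYYMM"]

# Explicit source-to-sink mapping, which omits the drifted column.
explicit_mapping = {"id": "id", "name": "name", "amount": "amount", "region": "region"}

def columns_written(incoming, auto_mapping):
    # With Auto-mapping on, every incoming column (drifted or not) is written.
    if auto_mapping:
        return list(incoming)
    # With it off, only explicitly mapped columns reach the sink.
    return [c for c in incoming if c in explicit_mapping]

print(columns_written(incoming, auto_mapping=True))   # includes 'YYYYMM'
print(columns_written(incoming, auto_mapping=False))  # 'YYYYMM' is dropped
```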

    You can access drifted columns in your transformations with the following methods:

    • Use the byName and byPosition expressions to explicitly reference a column by name or by position number.
    • Add a column pattern in a Derived Column or Aggregate transformation to match any combination of name, stream, position, origin, or type.
    • Add rule-based mapping in a Select or Sink transformation to match drifted columns to column aliases via a pattern.
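The rule-based mapping idea can be sketched outside the data flow expression language. A minimal Python stand-in, with hypothetical column names and a hypothetical alias format (the real rules are written in the Mapping tab, not in Python):

```python
import re

# Hypothetical incoming columns, including the drifted 'YYYYMM' column.
incoming = ["id", "name", "amount", "region", "YYYYMM"]

# Columns already defined in the sink schema (the drifted one is not).
sink_defined = {"id", "name", "amount", "region"}

def rule_based_mapping(columns, pattern, alias_fmt):
    """Map any drifted column whose name matches `pattern` to an alias
    built from `alias_fmt` (mimics a rule in a Select/Sink transformation)."""
    mapping = {}
    for col in columns:
        if col not in sink_defined and re.fullmatch(pattern, col):
            mapping[col] = alias_fmt.format(name=col.lower())
    return mapping

# A rule like: name matches 'YYYYMM' -> write it under the alias 'period_yyyymm'.
print(rule_based_mapping(incoming, r"YYYY(MM)?", "period_{name}"))
# -> {'YYYYMM': 'period_yyyymm'}
```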

    I hope this helps. Please let me know if you have any further questions.

    Reference document: https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-schema-drift

