Using Delta Tables in Azure Synapse - use across Notebooks and Data Flows

Stephen Connell 21 Reputation points
2022-11-11T16:50:53.787+00:00

Hi, I am experiencing some issues with Delta tables in an Azure Synapse workspace. I have used Notebooks to create Delta tables which I then try to use in Data Flow transformations, and I have used Data Flows to create Delta tables which I then try to modify or describe in Notebooks. In both scenarios the two approaches do not appear to be compatible with one another.

When I create a Delta table in a Notebook it causes the following issues within Data Flows.
I have tried this with both a Spark 3.2 / Delta 1.2 Apache Spark pool and a Spark 3.1 / Delta 1.0 pool.

1) Spark SQL CREATE TABLE method.

CREATE TABLE [DATABASE].[TABLE] (
  ID INT NOT NULL
  …
)
USING DELTA;

This table can then be read in a Data Flow source using an inline Delta dataset.


When this statement was run on the Spark 3.2 / Delta 1.2 pool, attempting to write to the table from a Data Flow generates the following error:

Delta protocol version is too new for this version of the Databricks Runtime. Please upgrade to a newer release..   

Running the CREATE TABLE statement with the prior library (Spark 3.1 / Delta 1.0) does allow the write to work.

Adding a check constraint to the table then prevents Data Flow writes into the table. This appears to change the minWriterVersion from 2 to 3 in the transaction log.

ALTER TABLE [DATABASE].[TABLE] ADD CONSTRAINT [CONSTRAINTNAME] CHECK ([CONDITION]);   

We then get the same error as we previously had for the CREATE TABLE run on the pool with the Delta 1.2 library.
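
For reference, the protocol versions the table is asking for can be checked from a Notebook. This is a minimal sketch using DESCRIBE DETAIL with the same placeholder names as above (as far as I can tell the detail output includes minReaderVersion and minWriterVersion columns):

# Sketch: inspect the Delta protocol versions the table currently requires.
# [DATABASE].[TABLE] is the same placeholder as in the statements above.
detail = spark.sql("DESCRIBE DETAIL [DATABASE].[TABLE]")
detail.select("minReaderVersion", "minWriterVersion").show()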

2) PySpark DataFrame.write.format("delta") method.

e.g.

df.write.format("delta").save(delta_table_path)  

When the table is written with the most recent Delta library (1.2) we get the above failures, but with the older library (1.0) this method of creating tables works.

I have also encountered a bug with the earlier library which renders any transaction prior to a checkpoint invalid for time travel, so it is not a good solution for creating Delta tables.
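
For context, this is the kind of time-travel read the bug affects; a minimal sketch using the same delta_table_path placeholder as above:

# Sketch: read an earlier version of the table (time travel).
# The bug mentioned above makes versions before a checkpoint unreadable.
old_df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
old_df.show()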

Conversely, if I use a Data Flow to create a Delta table and then attempt to work with it in a Notebook, I get errors.

%%sql  
CREATE TABLE IF NOT EXISTS GreenTaxi.Trips  
USING DELTA   
LOCATION 'taxi/delta/Green/Simple/';  

I get an error:

Error: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:null)

If I try to load the table into Spark, say to read its history:

from delta.tables import *
from pyspark.sql.functions import *
delta_table = DeltaTable.forPath(spark, "taxi/delta/Green/Simple/")
delta_table.history().show(20, 1000, False)

I get the message:

AnalysisException: taxi/delta/Green/Simple/ is not a Delta table.
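
For completeness, the path above is relative; as I understand it, relative paths are resolved against the pool's default file system, so an equivalent fully qualified ADLS path would look like the sketch below (container and storage account names are placeholders):

# Sketch: the same history call with a fully qualified path.
# [CONTAINER] and [ACCOUNT] are placeholders.
full_path = "abfss://[CONTAINER]@[ACCOUNT].dfs.core.windows.net/taxi/delta/Green/Simple/"
delta_table = DeltaTable.forPath(spark, full_path)
delta_table.history().show(20, 1000, False)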

I have a few questions:

  1. Is there some mechanism to set the library used within an Azure Integration Runtime to ensure it is compatible with the version of Delta being used by Spark Pools?
  2. Are there any suggestions on the best way to use Delta within Synapse, and what are the best practices?
  3. What are the plans for upgrading the Delta library for Azure Integration Runtimes?

Happy to add more detail if required.
Kind regards.
Stephen.

Azure Synapse Analytics
1 answer

  1. Stephen Connell 21 Reputation points
    2023-10-09T09:32:26.6166667+00:00

    Hi, I thought that I would follow up.
    Mapping Data Flow inline sources and sinks for Delta can now read and write Delta tables which have:

    Key                      Value
    delta.minReaderVersion   1
    delta.minWriterVersion   3
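
    For context, those are the versions a table ends up on after adding a check constraint. As far as I know the protocol can also be raised explicitly through table properties; a minimal sketch with the same placeholder naming as the question (noting that protocol upgrades are one-way):

    # Sketch: raise the Delta protocol versions explicitly (placeholder table name).
    # A protocol upgrade cannot be rolled back.
    spark.sql("""
        ALTER TABLE [DATABASE].[TABLE]
        SET TBLPROPERTIES ('delta.minReaderVersion' = '1', 'delta.minWriterVersion' = '3')
    """)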

    Reading and writing these tables was not possible in mid-August, so it is great to see that finally resolved. It means Delta tables with check constraints can now be used from Data Flows. However, change data feed is still not supported; writing to a table with delta.enableChangeDataFeed set fails with:

    Operation on target Data flow1 failed: {"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Sink 'sink1': Cannot write to table with delta.enableChangeDataFeed set. Change data feed from Delta is not yet available.","Details":"org.apache.spark.sql.AnalysisException: Cannot write to table with delta.enableChangeDataFeed set. Change data feed from Delta is not yet available.\n\tat org.apache.spark.sql.delta.DeltaErrors$.cdcWriteNotAllowedInThisVersion(DeltaErrors.scala:407)\n\tat org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:156)\n\tat org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:150)\n\tat org.apache.spark.sql.delta.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:84)\n\tat org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:143)\n\tat org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:142)\n\tat org.apache.spark.sql.delta.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:84)\n\tat org.apache.spark.sql.delta.commands.WriteIntoDelta.write(WriteIntoDelta.scala:107)\n\tat org.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1(WriteIntoDelta.scala:66)\n\tat org.apac"}

    In further testing, I see that enabling CDC on a table, which raises the minWriterVersion to 4, does not yet work. However, there is definite progress.
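
    For reference, a minimal sketch (placeholder table name) of the property change that triggers this, since enabling the change data feed is what raises the required writer version to 4:

    # Sketch: enable change data feed on an existing table (placeholder name).
    # Setting this property raises the table's minWriterVersion to 4.
    spark.sql("""
        ALTER TABLE [DATABASE].[TABLE]
        SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
    """)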

