Spark Job Definition: Use delta library

JLopez 61 Reputation points
2023-05-16T16:02:54.4+00:00

Hi all, I am trying to use the Delta library in a PySpark Spark job definition on a Spark 3.3 cluster, but I cannot make it work. My code is:

spark = SparkSession.builder.appName("DataLoad") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog").getOrCreate()
import delta
deltaTable = DeltaTable.forName(spark,LoadZone+'.'+TableName)
dfLogs=deltaTable.history()

The error is:

  File "/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/filecache/12/delta.py", line 120, in log_execution
    deltaTable = DeltaTable.forName(spark,LoadZone+'.'+TableName)
NameError: name 'DeltaTable' is not defined

So it looks like it cannot import the library. How can I use the Delta library inside my Spark job definition?
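
For reference, DeltaTable is defined in the delta.tables module, so a bare "import delta" never binds the name DeltaTable, which matches the NameError above. (The traceback also shows the script file itself is named delta.py, which can shadow the delta package on the import path.) A minimal sketch of the usual import pattern, assuming the delta-spark package is available on the cluster; LoadZone and TableName below are hypothetical placeholders for the values used in the question:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable  # import the class itself; "import delta" alone does not expose it

spark = (
    SparkSession.builder.appName("DataLoad")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

LoadZone = "bronze"     # hypothetical placeholder
TableName = "my_table"  # hypothetical placeholder

deltaTable = DeltaTable.forName(spark, LoadZone + "." + TableName)
dfLogs = deltaTable.history()  # DataFrame with the table's commit history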

Azure Synapse Analytics

1 answer

  1. JLopez 61 Reputation points
    2023-05-23T10:32:16.36+00:00

    Hi @ShaikMaheer-MSFT

    I did a test creating a simple Spark job definition configured with the library, and the Delta commands worked. Then I tried my old Spark job definition configured with the same library, and it did not work, so the issue persists even though Delta works in another definition. I then created a new Spark job definition with my code and it worked, so I suppose the issue lies in the old job definition, though I cannot understand why. I will do more tests.
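
    For reference, a smoke test of that kind might look like the sketch below; the table path is a hypothetical placeholder, and DeltaTable.isDeltaTable simply returns whether the path holds a Delta table:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (
        SparkSession.builder.appName("DeltaSmokeTest")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # "/tmp/delta-smoke-test" is a hypothetical path; prints True only if it holds a Delta table.
    print(DeltaTable.isDeltaTable(spark, "/tmp/delta-smoke-test"))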

    Thanks for your help!!
