Using XML in a Synapse Spark pool

Ljubo Jurkovic 66 Reputation points
2022-07-27T14:10:39.577+00:00

Hi,
Can anybody help with using XML in a Synapse Spark pool with PySpark? I found some articles suggesting that code like this would load the XML into a DataFrame, but I get an error when trying it:
df = spark.read.format("com.databricks.spark.xml").option("rootTag", "Catalog").option("rowTag", "book").load("books.xml")
The error is this:
"java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml"
Apparently, the package with the com.databricks.spark.xml format can be used in Synapse Analytics, but I don't know what I should list in the requirements.txt file in the Spark configuration to get it loaded.
If somebody can provide the steps it would be greatly appreciated.

LJ


Accepted answer
  1. Martin Cairney 2,241 Reputation points
    2022-07-28T01:20:51.17+00:00

    An alternative to the requirements.txt approach is to upload the JAR file as a workspace package.

    You can download the JAR from here (the com.databricks:spark-xml artifact on Maven Central).

    Once you have that, follow the instructions here (the Synapse documentation on managing workspace packages). Note specifically the step "You can also select additional workspace packages to add Jar or Wheel files to your pool.", which allows you to upload the JAR.
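
    Once the pool has finished applying the package, the original read should resolve the data source. A minimal sketch, assuming the JAR is attached to the pool as a workspace package; the ABFSS path is a placeholder for wherever books.xml actually lives:

        # Minimal sketch: assumes the spark-xml JAR is installed on the pool
        # as a workspace package. The storage path is a placeholder.
        df = (
            spark.read.format("com.databricks.spark.xml")
            .option("rootTag", "Catalog")  # outermost element of the document
            .option("rowTag", "book")      # each <book> element becomes a row
            .load("abfss://<container>@<account>.dfs.core.windows.net/books.xml")
        )
        df.printSchema()
        df.show(5, truncate=False)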


1 additional answer

  1. Ljubo Jurkovic 66 Reputation points
    2022-07-28T15:12:08.967+00:00

    Thanks Martin, this worked.
    I wonder why it doesn't work with requirements.txt. It could be that I didn't list the library name correctly, or it could be that requirements.txt only installs Python packages with pip, while spark-xml is a JVM library published to Maven, so pip has nowhere to install it from.
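
    For a session-scoped alternative that avoids the pool-level package entirely, a sketch, assuming the pool has outbound access to Maven Central (the version coordinate below is an example, not verified against this workspace):

        %%configure -f
        {
            "conf": {
                "spark.jars.packages": "com.databricks:spark-xml_2.12:0.15.0"
            }
        }

    Run this in the first cell, before the Spark session starts; Spark then pulls the Maven artifact itself rather than relying on pip.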
