Reading Avro format in Synapse Analytics

Alex Le Blanc 1 Reputation point
2020-07-31T22:52:58.567+00:00

I am trying to read and process Avro files from ADLS using a Spark pool notebook in Azure Synapse Analytics. The Avro files are capture files produced by Event Hubs.

When I run df = spark.read.format("avro").load(<file path>) as I would in Databricks, I get the following error:
"
AnalysisException : 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 166, in load
return self._df(self._jreader.load(path))
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
answer, self.gateway_client, self.target_id, self.name)
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
"

I have also tried creating a "dataset" with a linked service, but had no luck with that either.
I have also tried adding spark-avro_2.12 as a package, but I can't seem to install it; I can only install Python packages to my Spark pool.

Is there currently a way to read Avro files within Synapse Analytics? If not, are there plans to add built-in Avro read capabilities in the near future? What other methods can I use to read Avro in the meantime?

Any and all help is much appreciated, thank you!


3 answers

  1. HimanshuSinha-msft 19,476 Reputation points Microsoft Employee
    2020-08-03T20:55:08.363+00:00

    Hello Alex,
    Thanks for the question, and for using the forum.

    While I reach out to the Synapse team, I wanted to let you know that you can use Azure Data Factory (ADF) to read the Avro files. The doc below should give you an idea of the implementation; let me know if you have any questions.

    https://learn.microsoft.com/en-us/azure/data-factory/format-avro

    Thanks, Himanshu

    Please consider clicking "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members.


  2. Alex Le Blanc 1 Reputation point
    2020-08-04T16:37:12.013+00:00

    Thank you @HimanshuSinha-msft for your reply.

    If I understand correctly, at this time it is not possible to do a simple spark.read.format("avro") (as I would in Databricks), correct? But the feature may be available in the future?

    And to clarify, by using Azure Data Factory, do you mean the separate ADF service, or the one integrated into Synapse? (Note: we are currently exploring Synapse and its viability for our solution. If we are to use it, we would want to use only the built-in data factory, not the external version.) I have already tried creating a dataset within Synapse; however, I get this error message:
    "Column: SystemProperties,Location: Source,Format: Avro,The data type 'System.Collections.Generic.Dictionary`2[System.String,System.Object]' is currently not supported by Avro format."
    Any idea as to why I might be getting this error message? Is Avro not supported for ADF datasets in Synapse either? If not, will it be in the future?

    When searching this error, I found these forum threads, which seem to suggest that complex types in the Avro format are not supported. Has this been addressed since then?
    17373472-support-more-complex-types-in-avro-format-like-di
    error-importing-avro-file-generated-by-event-hubs-archive-using-copy-data-tool-i

    Thank you,

    Alex


  3. Eduard Zvenigorodsky 1 Reputation point
    2021-05-28T04:12:30.573+00:00

    Azure Databricks reads Avro files easily:

    %python
    # Read the Event Hubs capture files with the built-in Avro data source.
    df = spark.read.format("avro").load("<path to your event hub>/0/2021/05/*/*/*/*.avro")
    # The event payload is in the Body column; cast it to string and parse it as JSON.
    js = df.select(df.Body.cast("string")).rdd.map(lambda x: x[0])
    data = spark.read.json(js)
    display(data)
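
    In a Synapse Spark pool where the spark-avro jar cannot be added and only Python packages are available, a rough fallback is to read the capture blobs as raw bytes and decode them with a pure-Python Avro library. The sketch below is illustrative only: it assumes fastavro has been installed on the pool, that the placeholder path points at the Event Hubs capture files, and that the event bodies are UTF-8 JSON.

    import io
    import fastavro  # assumption: installed on the pool as a Python package

    # Read each capture file as a (path, bytes) pair; no Avro data source is needed for this step.
    raw = sc.binaryFiles("abfss://<container>@<account>.dfs.core.windows.net/<capture path>/*.avro")

    def extract_bodies(content):
        # Each capture file is an Avro container whose records carry the event payload in the Body field.
        with io.BytesIO(content) as buf:
            for record in fastavro.reader(buf):
                yield record["Body"].decode("utf-8")

    bodies = raw.values().flatMap(extract_bodies)
    data = spark.read.json(bodies)
    display(data)

    Whether this scales depends on the size of the capture files, since each blob is deserialized in memory on a single executor; for large volumes, getting the spark-avro package onto the pool remains the cleaner route.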

