Reading Avro format in Synapse Analytics

Alex Le Blanc 1 Reputation point
2020-07-31T22:52:58.567+00:00

I am trying to read and process Avro files from ADLS using a Spark pool notebook in Azure Synapse Analytics. The Avro files are capture files produced by Event Hubs.

When I run df = spark.read.format("avro").load(<file path>) as I would in Databricks, I get the following error:
"
AnalysisException : 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 166, in load
return self._df(self._jreader.load(path))
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
answer, self.gateway_client, self.target_id, self.name)
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
"

I have also tried creating a "dataset" with a linked service, but had no luck with that either.
I have also tried adding spark-avro_2.12 as a package, but I can't seem to install it; I can only install Python packages to my Spark pool.

Is there currently a way to read Avro files within Synapse Analytics? If not, are there plans to add built-in Avro read capabilities in the near future? What other methods can I use to read Avro in the meantime?

Any and all help is much appreciated, thank you!


3 answers

  1. HimanshuSinha-msft 19,476 Reputation points Microsoft Employee
    2020-08-03T20:55:08.363+00:00

    Hello Alex,
    Thanks for the question, and for using the forum.

    While I reach out to the Synapse team, I wanted to let you know that you can use Azure Data Factory (ADF) to read the Avro files. The doc below should give you an idea of the implementation; let me know if you have any questions.

    https://learn.microsoft.com/en-us/azure/data-factory/format-avro

    Thanks, Himanshu

    Please consider clicking "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members.


  2. Alex Le Blanc 1 Reputation point
    2020-08-04T16:37:12.013+00:00

    Thank you @HimanshuSinha-msft for your reply.

    If I understand correctly, at this time it is not possible to do a simple spark.read.format("avro") (as I would in Databricks), correct? But the feature may be available in the future?

    And to clarify, by using Azure Data Factory, do you mean the separate ADF service, or the one integrated into Synapse? (Note: we are currently exploring Synapse and its viability for our solution. If we are to use it, we would want to use only the built-in data factory, not the external version.) I have already tried creating a dataset within Synapse; however, I get this error message:
    "Column: SystemProperties,Location: Source,Format: Avro,The data type 'System.Collections.Generic.Dictionary`2[System.String,System.Object]' is currently not supported by Avro format."
    Any idea as to why I might be getting this error message? Is Avro not supported for ADF datasets in Synapse either? If not, will it be in the future?

    When searching this error, I found these forum threads, which seem to suggest that complex types in the Avro format are not supported. Has this been addressed since then?
    17373472-support-more-complex-types-in-avro-format-like-di
    error-importing-avro-file-generated-by-event-hubs-archive-using-copy-data-tool-i

    Thank you,

    Alex


  3. Eduard Zvenigorodsky 1 Reputation point
    2021-05-28T04:12:30.573+00:00

    Azure Databricks reads Avro files easily:

    %python
    # Read the Event Hubs capture files with the built-in Avro data source.
    df = spark.read.format("avro").load("<path to your event hub>/0/2021/05/*/*/*/*.avro")
    # The event payload is in the Body column; cast it to string and parse it as JSON.
    js = df.select(df.Body.cast("string")).rdd.map(lambda x: x[0])
    data = spark.read.json(js)
    display(data)
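
    In a Synapse Spark pool where the spark-avro jar cannot be added and only Python packages are available, a rough fallback is to read the capture blobs as raw bytes and decode them with a pure-Python Avro library. The sketch below is illustrative only: it assumes fastavro has been installed on the pool, that the placeholder path points at the Event Hubs capture files, and that the event bodies are UTF-8 JSON.

    import io
    import fastavro  # assumption: installed on the pool as a Python package

    # Read each capture file as a (path, bytes) pair; no Avro data source is needed for this step.
    raw = sc.binaryFiles("abfss://<container>@<account>.dfs.core.windows.net/<capture path>/*.avro")

    def extract_bodies(content):
        # Each capture file is an Avro container whose records carry the event payload in the Body field.
        with io.BytesIO(content) as buf:
            for record in fastavro.reader(buf):
                yield record["Body"].decode("utf-8")

    bodies = raw.values().flatMap(extract_bodies)
    data = spark.read.json(bodies)
    display(data)

    Whether this scales depends on the size of the capture files, since each blob is deserialized in memory on a single executor; for large volumes, getting the spark-avro package onto the pool remains the cleaner route.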

