More convenient service to read avro files from Azure Data Lake Gen2

Question

Hi,

I have to read lots of Avro files created by Event Hubs Capture in a Data Lake Gen2 account. The data must be filtered, processed, and then used to train a machine learning model. I'm considering Azure Databricks and the Azure Machine Learning service itself for this ETL.

What is the best option for taking advantage of the hierarchical namespace of the files in the lake? Is it definitely Databricks, because of its Hadoop-compatible access to data? What about working with datastores and the Python SDK in the AML service? Would the data access efficiency be comparable?

One critical requirement is the data filtering step, i.e. reading from the lake only the captured Avro files that contain specific data (which cannot be inferred from the file path). Does spark-avro in Databricks offer any advantage in this regard, for example compared with the azure.storage.filedatalake Python package, which doesn't offer Avro-specific functions?

Thanks!

Answer

Hello @acedola

Welcome to the Q&A.

You didn't mention how the data is structured on the HNS (hierarchical namespace), and that is the key point. You can read about it here: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace#the-benefits-of-a-hierarchical-namespace

On the data filtering part, I think ADB (Azure Databricks) will do just fine. Moreover, if the data is well structured on the HNS, it will be more performant.
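As a rough sketch of what the filtering could look like on Databricks (not a definitive implementation): the helper below builds one path glob per day, assuming the capture uses the default Event Hubs Capture layout `{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.avro`, so Spark only lists the folders for the days of interest. The second function then filters on the content of the `Body` column. The function names, the account/container/namespace values, and the plain substring match on `Body` are all illustrative assumptions; a custom capture path format or a non-text payload would need a different approach.

```python
from datetime import date, timedelta


def capture_day_globs(account, container, namespace, event_hub, start, end):
    """Return one abfss:// glob per day in [start, end].

    Assumes the default Event Hubs Capture folder layout:
    {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.avro
    """
    root = (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{namespace}/{event_hub}"
    )
    n_days = (end - start).days + 1
    days = (start + timedelta(days=i) for i in range(n_days))
    # Wildcards stand for: partition id / hour / minute / <second>.avro
    return [f"{root}/*/{d:%Y/%m/%d}/*/*/*.avro" for d in days]


def load_matching_events(spark, paths, needle):
    """Read the capture files and keep rows whose payload contains `needle`.

    `spark` is an existing SparkSession (Databricks notebooks provide one).
    Assumption: the payload in the binary `Body` column is UTF-8 text.
    """
    df = spark.read.format("avro").load(paths)
    return df.filter(df["Body"].cast("string").contains(needle))
```

In a notebook this would be called along the lines of `load_matching_events(spark, capture_day_globs("myaccount", "mycontainer", "mynamespace", "myhub", date(2021, 1, 31), date(2021, 2, 1)), "sensor-42")`, where every argument value is a placeholder. Note that the path globs only narrow the read by time; since the filter criterion is not encoded in the path, each selected file still has to be read, and the gain from Spark is that this read is distributed across the cluster.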

Thanks Himanshu

Please consider clicking "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members.
