More convenient service to read avro files from Azure Data Lake Gen2

Ariel Cedola 21 Reputation points
2020-06-08T22:37:56.897+00:00

Hi,

I have to read lots of avro files created by an Event Hub Capture in a Data Lake Gen2. Data must be filtered, processed and then applied to train a machine learning model. I'm considering Azure Databricks and the Azure Machine Learning service itself for this ETL.

Which is the best option for taking advantage of the hierarchical namespace of files in the lake? Is it definitely Databricks, due to its Hadoop-compatible access to data? What about working with datastores and the Python SDK in the AML service? Would the data access efficiency be comparable?

One critical requirement I have is the data filtering step, i.e. reading from the lake only the captured avro files that contain specific data (which cannot be inferred from the file path, though). Does spark-avro in Databricks give some advantage in this regard? For example, compared with the azure.storage.filedatalake Python package, which doesn't offer avro-specific functions.

Thanks!

Azure Data Lake Storage
Azure Machine Learning
Azure Event Hubs
Azure Databricks

1 answer

  1. HimanshuSinha-msft 19,386 Reputation points Microsoft Employee
    2020-06-10T23:48:13.263+00:00

    Hello @acedola

    Welcome to the Q&A.

    You did not mention how the data is structured in the hierarchical namespace (HNS), and that is the key point. You can read about it here: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace#the-benefits-of-a-hierarchical-namespace

    On the data filtering part, I think Azure Databricks (ADB) will do just fine. Moreover, if the data is well structured on the HNS, it will be more performant.

    Thanks Himanshu

    Please consider clicking "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members.
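
To make the filtering point in the answer a bit more concrete, here is a minimal sketch of how this could look in a Databricks notebook. It is not a definitive implementation and rests on several assumptions: the capture uses the default Event Hubs Capture naming format ({Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}), the event body is UTF-8 JSON, and access to the storage account is already configured on the cluster. The helper names `capture_glob` and `load_and_filter`, and every account, container, and field name, are hypothetical. Since the filter depends on file contents, Spark will most likely still have to open every file the path matches, so narrowing the glob through the folder hierarchy (by date or partition) is what reduces I/O.

```python
from datetime import date
from typing import Optional


def capture_glob(account: str, container: str, namespace: str, event_hub: str,
                 day: Optional[date] = None, partition: str = "*") -> str:
    """Build an abfss:// glob for Event Hubs Capture files.

    Assumes the default capture layout:
    {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.avro
    """
    root = f"abfss://{container}@{account}.dfs.core.windows.net"
    if day is None:
        date_part = "*/*/*"  # every year/month/day
    else:
        date_part = f"{day.year:04d}/{day.month:02d}/{day.day:02d}"
    # Trailing */*/*.avro matches hour/minute/second.avro
    return f"{root}/{namespace}/{event_hub}/{partition}/{date_part}/*/*/*.avro"


def load_and_filter(spark, path_glob: str, field: str, value: str):
    """Read captured avro files and keep events whose JSON body has field == value.

    Assumption: the Body column (bytes in the capture schema) holds UTF-8 JSON.
    """
    # Imported lazily so capture_glob can be used without PySpark installed.
    from pyspark.sql import functions as F

    df = spark.read.format("avro").load(path_glob)
    body_json = F.col("Body").cast("string")  # decode bytes to a JSON string
    return (df.withColumn("body_json", body_json)
              .filter(F.get_json_object(F.col("body_json"), f"$.{field}") == value))
```

In a notebook, where `spark` is predefined, a call might look like `load_and_filter(spark, capture_glob("myaccount", "capture", "myns", "myhub", day=date(2020, 6, 8)), "deviceType", "thermostat")`, with all of those values being placeholders.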