Creating datasets in Azure Machine Learning service from more than 100 paths

Ariel Cedola 21 Reputation points
2020-06-09T16:08:40.387+00:00

Hi,

I need to create a dataset in Azure Machine Learning service from an Azure Data Lake Gen2 registered as a Datastore. The data in the lake are thousands of Avro files written by Event Hub Capture following the pattern [EventHub]/[Partition]/[YYYY]/[MM]/[DD]/[HH]/[mm]/[ss], so there is one path per file.

According to the datasets documentation, it is recommended "... creating dataset referencing less than 100 paths in datastores for optimal performance."

What would be the recommended alternative approach for my application? Streaming data is continuously captured by the Event Hub.

Thanks


1 answer

  1. May Hu 1 Reputation point
    2020-06-11T04:48:23.87+00:00

    Hi,

    You can create the dataset with a globbing pattern, for example:

    ds = Dataset.File.from_files((datastore, '[EventHub]/[Partition]/**'))

    The mount time should be less than 1 min.
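
    For reference, a minimal sketch of the full flow (workspace lookup, datastore lookup, glob-based dataset, registration). The datastore name "adls_gen2_store" and the path "myeventhub/0/**/*.avro" are placeholders, not values from the question, so adjust them to your setup:

        from azureml.core import Workspace, Datastore, Dataset

        # Connect to the workspace (assumes a local config.json)
        ws = Workspace.from_config()

        # Look up the registered ADLS Gen2 datastore by name
        datastore = Datastore.get(ws, "adls_gen2_store")

        # One recursive glob covers every time-partitioned capture folder,
        # so the dataset references a single path instead of thousands
        ds = Dataset.File.from_files(path=(datastore, "myeventhub/0/**/*.avro"))

        # Optionally register the dataset so it can be reused and versioned
        ds = ds.register(workspace=ws,
                         name="eventhub-capture-avro",
                         create_new_version=True)

    Newly captured files keep matching the glob, so re-creating or re-versioning the dataset picks them up without listing each path individually.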