Creating datasets in Azure Machine Learning service from more than 100 paths

Ariel Cedola 21 Reputation points


I need to create a dataset in the Azure Machine Learning service from an Azure Data Lake Storage Gen2 account registered as a datastore. The data in the lake consist of thousands of Avro files written by Event Hub Capture following the pattern [EventHub]/[Partition]/[YYYY]/[MM]/[DD]/[HH]/[mm]/[ss], so there is one path per file.

According to the datasets documentation it is recommended "... creating dataset referencing less than 100 paths in datastores for optimal performance."

What would be the recommended alternative for my application? Streaming data are continuously captured by the Event Hub.


Azure Data Lake Storage
Azure Machine Learning
1 answer

Sort by: Most helpful
  1. May Hu 1 Reputation point


    You can create a dataset with a globbing pattern:
    ds = Dataset.File.from_files((datastore, '[EventHub]/[Partition]/**'))

    Mounting the dataset should then take less than a minute.
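    To make the idea concrete, here is a minimal sketch. The Event Hub name ('myhub') and datastore name are illustrative, not from the original post; the azureml-core call is shown in comments, and a local fnmatch check illustrates how a single recursive glob covers the whole time-partitioned folder hierarchy instead of thousands of explicit paths.

    ```python
    from fnmatch import fnmatch

    # Event Hub Capture writes one Avro file per partition and time window:
    #   {EventHub}/{Partition}/{YYYY}/{MM}/{DD}/{HH}/{mm}/{ss}/
    # One recursive glob replaces the per-second paths with a single pattern.
    GLOB = "myhub/*/**/*.avro"  # 'myhub' is a hypothetical Event Hub name

    # With the azureml-core SDK, assuming 'datastore' is the registered
    # ADLS Gen2 datastore, the dataset would be created as:
    #
    #   from azureml.core import Dataset
    #   ds = Dataset.File.from_files(path=(datastore, GLOB))

    # Local sanity check that a typical Capture path matches the pattern
    # (fnmatch's '*' also crosses '/', close enough for illustration).
    sample = "myhub/0/2021/06/01/12/30/00/capture.avro"
    print(fnmatch(sample, GLOB))
    ```

    Because Capture keeps appending new folders, the same glob keeps matching newly arriving files, so the dataset definition does not need to be rebuilt as data grows.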