Creating datasets in Azure Machine Learning service from more than 100 paths

Question

Hi,

I need to create a dataset in Azure Machine Learning service from an Azure Data Lake Gen2 account registered as a Datastore. The lake holds thousands of Avro files written by Event Hub Capture following the pattern [EventHub]/[Partition]/[YYYY]/[MM]/[DD]/[HH]/[mm]/[ss], so there is one path for each file.

According to the datasets documentation, it is recommended "... creating dataset referencing less than 100 paths in datastores for optimal performance."

What would be the alternative/recommended approach in my application? Streaming data are continuously captured by the Event Hub.

Thanks

Answer

Hi,

You can create the dataset with a globbing pattern, so that a single path covers all files under a partition:
from azureml.core import Dataset
ds = Dataset.File.from_files((datastore, '[EventHub]/[Partition]/**'))

The mount time should be less than 1 min.
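
If there are several partitions, one glob path per partition still keeps the dataset well under 100 paths. Below is a minimal sketch of that idea, not a definitive implementation. The helper names capture_glob and create_capture_dataset and the optional date_prefix argument are hypothetical and only illustrate the folder layout from the question. It assumes the azureml-core SDK (v1) is installed and that datastore is an already registered Datastore object.

```python
def capture_glob(event_hub, partition, date_prefix=""):
    """Build one glob path that covers every capture file under a partition.

    date_prefix is an optional 'YYYY' or 'YYYY/MM' style prefix (a hypothetical
    convenience that follows the folder layout described in the question).
    """
    parts = [str(event_hub), str(partition)]
    if date_prefix:
        parts.append(date_prefix.strip("/"))
    # '**' matches all files in all subfolders below the prefix.
    return "/".join(parts) + "/**"


def create_capture_dataset(datastore, event_hub, partitions, date_prefix=""):
    """Create a FileDataset from one glob path per partition (sketch)."""
    # Imported lazily so capture_glob can be used without the SDK installed.
    from azureml.core import Dataset

    paths = [(datastore, capture_glob(event_hub, p, date_prefix)) for p in partitions]
    return Dataset.File.from_files(path=paths)
```

For example, create_capture_dataset(datastore, 'myhub', range(4)) would reference four paths in total, regardless of how many files Event Hub Capture has written. Passing a date_prefix such as '2020/05' would narrow the dataset to one month, assuming the capture folders follow the pattern above.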
