Hello
Thanks for reaching out to us. One option you may want to consider is MLTable: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-mltable?view=azureml-api-2&tabs=cli
Azure Machine Learning supports a Table type (`mltable`). It lets you create a blueprint that defines how to load data files into memory as a Pandas or Spark data frame.
This is very similar to the scenario you described.
Azure Machine Learning Tables (`mltable`) allow you to define how you want to load your data files into memory, as a Pandas and/or Spark data frame. Tables have two key features:

- An MLTable file: a YAML-based file that defines the data loading blueprint. In the MLTable file, you can specify:
  - The storage location(s) of the data - local, in the cloud, or on a public http(s) server.
  - Globbing patterns over cloud storage. These locations can specify sets of filenames with wildcard characters (`*`).
  - Read transformations - for example, the file format type (delimited text, Parquet, Delta, JSON), delimiters, headers, etc.
  - Column type conversions (to enforce the schema).
  - New column creation using folder structure information - for example, creation of year and month columns using a `{year}/{month}` folder structure in the path.
  - Subsets of data to load - for example, filter rows, keep/drop columns, take random samples.
- A fast and efficient engine to load the data into a Pandas or Spark dataframe, according to the blueprint defined in the MLTable file. The engine relies on Rust for high speed and memory efficiency.
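To make this a bit more concrete, here is a minimal sketch using the `mltable` Python package that builds that kind of blueprint in code and loads it into Pandas. The storage pattern, partition format, and column names below are placeholders for your own data, not something specific to your scenario:

```python
# Minimal sketch: build an MLTable data-loading blueprint with the `mltable`
# Python package and materialize it as a Pandas dataframe.
# The storage pattern, partition format, and column names are placeholders.
import mltable

# Glob over delimited files in cloud storage (placeholder account/container/path).
paths = [{"pattern": "wasbs://data@mystorageaccount.blob.core.windows.net/sales/*/*/sales.csv"}]

# Read transformation: treat the matched files as delimited text.
tbl = mltable.from_delimited_files(paths)

# New columns derived from the folder structure, e.g. a {year}/{month} layout in the path.
tbl = tbl.extract_columns_from_partition_format("{year}/{month}/sales.csv")

# Subset the data: keep a few columns and take a random sample of rows.
tbl = tbl.keep_columns(["year", "month", "amount"])
tbl = tbl.take_random_sample(probability=0.1, seed=7)

# Load into memory as a Pandas dataframe; the blueprint can also be saved to an
# MLTable YAML file with tbl.save("<folder>") and loaded later with mltable.load().
df = tbl.to_pandas_dataframe()
print(df.head())
```

The same blueprint can instead be authored directly as an MLTable YAML file, as described in the documentation linked above; the Python calls here are just one way to express it.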
Please take a look at the above and let me know if this is what you are looking for. Thanks.
Regards,
Yutong
Please accept the answer and vote yes if you found it helpful, to support the community. Thanks a lot.