mltable Module
Contains functionality to create and interact with MLTable objects.
Classes
DataType |
Helper class for handling the proper manipulation of supported column types (int, bool, string, etc.). Currently used with MLTable.convert_column_types(...) and from_delimited_files(...) to specify which types to convert columns to. Different types are selected with the DataType.from_*(...) methods. |
MLTable |
Represents an MLTable. An MLTable defines a series of lazily-evaluated, immutable operations to load data from the data source. Data is not loaded from the source until the MLTable is asked to deliver it. The MLTable constructor is not intended to be invoked directly; create an MLTable using load. |
Metadata |
Class that maps to the metadata section of the MLTable. Supports getting and adding arbitrary metadata properties. |
Traits |
Class that maps to the traits section of the MLTable. Currently supported traits: timestamp_column and index_columns. |
Enums
MLTableFileEncoding |
Defines options for how encodings are processed when reading data from files to create an MLTable. These enumeration values are used in the MLTable class. |
MLTableHeaders |
Defines options for how column headers are processed when reading data from files to create an MLTable. These enumeration values are used in the MLTable class. |
MLTablePartitionSize |
Helper enum representing the memory allocated when reading partitions of select file formats, expressed in different memory units. Currently used when reading delimited or JSON Lines files. Supports bytes, kilobytes, megabytes, and gigabytes as memory units, in binary. |
MLTableSaveOverwriteOption |
Defines options for how to handle file conflicts in MLTable.save(). Either raise an error if a conflict occurs, overwrite the existing file with the new file, or leave the existing file as is. |
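The "in binary" note for MLTablePartitionSize above means each unit is a power of 1024, not 1000. A minimal sketch of those conversions (the helper and unit names below are illustrative, not the actual enum members):

```python
# Binary (IEC) memory units, as used by MLTablePartitionSize:
# 1 kilobyte = 1024 bytes, 1 megabyte = 1024 kilobytes, and so on.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def to_bytes(value: float, unit: str) -> int:
    """Convert a partition size in the given binary unit to bytes."""
    factors = {"byte": 1, "kilobyte": KIB, "megabyte": MIB, "gigabyte": GIB}
    return int(value * factors[unit])

print(to_bytes(4, "megabyte"))  # 4194304
```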
Functions
from_delimited_files
Creates an MLTable from the given list of delimited files.
from_delimited_files(paths, header='all_files_same_headers', delimiter=',', support_multi_line=False, empty_as_string=False, encoding='utf8', include_path_column=False, infer_column_types=True)
Parameters
Name | Description |
---|---|
paths
Required
|
paths supports files or folders, with local or cloud paths. Relative local file paths are resolved against the current working directory; if a relative path should be resolved against a different directory, pass it as an absolute path instead. |
header
Required
|
How column headers are handled when reading from files. Options specified using the enum MLTableHeaders. Supported headers are 'no_header', 'from_first_file', 'all_files_different_headers', and 'all_files_same_headers'. |
delimiter
Required
|
Separator used to split columns. |
support_multi_line
Required
|
If False, all line breaks, including those in quoted field values, are interpreted as record breaks. Reading data this way is faster and more optimized for parallel execution on multiple CPU cores. However, it may silently produce more records with misaligned field values. Set this to True when the delimited files are known to contain quoted line breaks. For example, given this CSV file, the data is read differently depending on support_multi_line:
A,B,C
A1,B1,C1
A2,"B
2",C2
|
empty_as_string
Required
|
How empty fields should be handled. If True, empty fields are read as empty strings; otherwise they are read as nulls. Even when True, empty fields in datetime or numeric columns are still read as nulls. |
encoding
Required
|
Specifies the file encoding using the enum MLTableFileEncoding.
|
include_path_column
Required
|
Keep path information as a column in the MLTable. This is useful when reading multiple files and you want to know which file a particular record came from, or to keep useful information that may be stored in the file path. |
infer_column_types
Required
|
If True, automatically infers all column types. If False, leaves all columns as strings. If a dictionary, sets the listed columns to the given types and infers the types of all other columns. The dictionary may contain a key named 'sample_size' mapped to a positive integer, the number of rows to sample when inferring column types. It may also contain a key named 'column_type_overrides', whose value is a dictionary where each key is either a string (a column name) or a tuple of strings (a group of column names), and each value is either a string (one of 'boolean', 'string', 'float', or 'int') or a DataType. mltable.DataType.to_stream() is not supported. If an empty dictionary is given, it is treated as True. Defaults to True. |
|
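As a sketch of the infer_column_types dictionary format described above (the column names "Age", "City", "Country", and "IsActive" are hypothetical examples, not part of the API):

```python
# A sketch of the infer_column_types dictionary described above.
# Column names ("Age", "City", ...) are hypothetical examples.
infer_column_types = {
    # Number of rows to sample when inferring the remaining columns.
    "sample_size": 200,
    "column_type_overrides": {
        # A single column mapped to a type string.
        "Age": "int",
        # A tuple key applies one type to a group of columns.
        ("City", "Country"): "string",
        "IsActive": "boolean",
    },
}

print(sorted(infer_column_types))  # ['column_type_overrides', 'sample_size']
```

A dictionary shaped like this would then be passed as the infer_column_types argument of from_delimited_files.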
Returns
Type | Description |
---|---|
MLTable |
Remarks
There must be a valid list of path dictionaries.
# load mltable from local delimited file
from mltable import from_delimited_files
paths = [{"file": "./samples/mltable_sample/sample_data.csv"}]
mltable = from_delimited_files(paths)
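The support_multi_line behavior described above can be illustrated with Python's standard csv module (not mltable itself): a quote-aware parser, analogous to support_multi_line=True, keeps a quoted line break inside one record, while splitting on every line break, analogous to support_multi_line=False, misaligns the rows.

```python
import csv
import io

# The example file from the parameter description above: the third
# record has a line break inside a quoted field ("B\n2").
data = 'A,B,C\nA1,B1,C1\nA2,"B\n2",C2\n'

# Quote-aware parsing yields exactly three records; the quoted
# newline stays inside one field.
rows = list(csv.reader(io.StringIO(data)))
print(rows)
# [['A', 'B', 'C'], ['A1', 'B1', 'C1'], ['A2', 'B\n2', 'C2']]

# Naively splitting on every line break produces four misaligned records.
naive = [line.split(",") for line in data.splitlines()]
print(len(naive))  # 4
```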
from_delta_lake
Creates an MLTable object to read Parquet files from a Delta Lake table.
from_delta_lake(delta_table_uri, timestamp_as_of=None, version_as_of=None, include_path_column=False)
Parameters
Name | Description |
---|---|
delta_table_uri
Required
|
URI pointing to the Delta Lake table directory containing the Parquet files to read. Supported URI types are: local path URI, storage URI, long-form datastore URI, or data asset URI. |
timestamp_as_of
Required
|
Datetime string in RFC-3339/ISO-8601 format specifying the point in time from which to read matching Parquet files, e.g. "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00", "2022-10-01T01:30:00-08:00". |
version_as_of
Required
|
Integer version specifying which version of the Parquet files to read. |
include_path_column
Required
|
Keep path information as a column, useful when reading multiple files and you want to know which file a particular record came from, or to keep useful information that may be stored in a file path. |
Returns
Type | Description |
---|---|
MLTable instance |
Remarks
from_delta_lake creates an MLTable object which defines the operations to load data from a Delta Lake folder into a tabular representation.
For the data to be accessible by Azure Machine Learning, the path must point to the Delta Lake table directory, and the referenced Delta Lake files must be accessible to AzureML services or behind public web URLs.
from_delta_lake supports reading Delta Lake data from a URI pointing to: a local path, Blob, ADLS Gen1, or ADLS Gen2.
The data can be read in and materialized by calling to_pandas_dataframe() on the returned MLTable.
# create an MLTable object from a delta lake using timestamp versioning and materialize the data
from mltable import from_delta_lake
mltable_ts = from_delta_lake(delta_table_uri="./data/delta-01", timestamp_as_of="2021-05-24T00:00:00Z")
df = mltable_ts.to_pandas_dataframe()
# create an MLTable object from a delta lake using integer versioning and materialize the data
from mltable import from_delta_lake
mltable_version = from_delta_lake(delta_table_uri="./data/delta-02", version_as_of=1)
df = mltable_version.to_pandas_dataframe()
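The timestamp_as_of values shown above are plain RFC-3339 strings; as a quick sanity check using only the standard library (not mltable), they can be parsed with datetime:

```python
from datetime import datetime

# The three example timestamps from the parameter description above.
examples = [
    "2022-10-01T00:00:00Z",
    "2022-10-01T00:00:00+08:00",
    "2022-10-01T01:30:00-08:00",
]

# datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+;
# normalize it to an explicit UTC offset for older interpreters.
parsed = [datetime.fromisoformat(ts.replace("Z", "+00:00")) for ts in examples]

# Each parses to a timezone-aware datetime.
print([p.tzinfo is not None for p in parsed])  # [True, True, True]
```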
from_json_lines_files
Creates an MLTable from the given list of JSON Lines file paths.
from_json_lines_files(paths, invalid_lines='error', encoding='utf8', include_path_column=False)
Parameters
Name | Description |
---|---|
paths
Required
|
paths supports files or folders, with local or cloud paths. Relative local file paths are resolved against the current working directory; if a relative path should be resolved against a different directory, pass it as an absolute path instead. |
invalid_lines
Required
|
How to handle lines that are invalid JSON; either 'drop' or 'error'. If 'drop', invalid lines are dropped; otherwise an error is raised. |
encoding
Required
|
Specifies the file encoding using the enum MLTableFileEncoding.
|
include_path_column
Required
|
Keep path information as a column, useful when reading multiple files and you want to know which file a particular record came from, or to keep useful information that may be stored in a file path. |
Returns
Type | Description |
---|---|
MLTable |
Remarks
There must be a valid list of path dictionaries.
# load mltable from local JSON paths
from mltable import from_json_lines_files
paths = [{'file': './samples/mltable_sample/sample_data.jsonl'}]
mltable = from_json_lines_files(paths)
from_parquet_files
Creates an MLTable from the given list of Parquet files.
from_parquet_files(paths, include_path_column=False)
Parameters
Name | Description |
---|---|
paths
Required
|
paths supports files or folders, with local or cloud paths. Relative local file paths are resolved against the current working directory; if a relative path should be resolved against a different directory, pass it as an absolute path instead. |
include_path_column
Required
|
Keep path information as a column, useful when reading multiple files and you want to know which file a particular record came from, or to keep useful information that may be stored in a file path. |
Returns
Type | Description |
---|---|
MLTable instance |
Remarks
There must be a valid list of path dictionaries.
# load mltable from local parquet paths
from mltable import from_parquet_files
paths = [{'file': './samples/mltable_sample/sample.parquet'}]
mltable = from_parquet_files(paths)
from_paths
Creates an MLTable from the given paths.
from_paths(paths)
Parameters
Name | Description |
---|---|
paths
Required
|
paths supports files or folders, with local or cloud paths. Relative local file paths are resolved against the current working directory; if a relative path should be resolved against a different directory, pass it as an absolute path instead. |
Returns
Type | Description |
---|---|
MLTable instance |
Remarks
There must be a valid list of path dictionaries.
# load mltable from local paths
from mltable import from_paths
tbl = from_paths([{'file': "./samples/mltable_sample"}])
# load mltable from cloud paths
from mltable import from_paths
tbl = from_paths(
[{'file': "https://<blob-storage-name>.blob.core.windows.net/<path>/sample_file"}])
load
Loads the MLTable file (YAML) present at the given uri.
storage_options supports the keys 'subscription', 'resource_group', 'workspace', and 'location'. Together these must identify an Azure Machine Learning workspace.
load(uri, storage_options: dict = None, ml_client=None)
Parameters
Name | Description |
---|---|
uri
Required
|
uri supports a long-form datastore URI, storage URI, local path, data asset URI, or data asset short URI. |
storage_options
Required
|
Azure Machine Learning workspace information, used when the URI refers to an AzureML asset. |
ml_client
Required
|
MLClient instance. To learn more, see https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.mlclient?view=azure-python |
Returns
Type | Description |
---|---|
MLTable |
Remarks
There must be a valid MLTable YAML file named 'MLTable' present at the given uri.
# load mltable from local folder
from mltable import load
tbl = load('./samples/mltable_sample')
# load mltable from azureml datastore uri
from mltable import load
tbl = load(
    'azureml://subscriptions/<subscription-id>/'
    'resourcegroups/<resourcegroup-name>/workspaces/<workspace-name>/'
    'datastores/<datastore-name>/paths/<mltable-path-on-datastore>/')
# load mltable from azureml data asset uri
from mltable import load
tbl = load(
    'azureml://subscriptions/<subscription-id>/'
    'resourcegroups/<resourcegroup-name>/providers/Microsoft.MachineLearningServices/'
    'workspaces/<workspace-name>/data/<data-asset-name>/versions/<data-asset-version>/')
# load mltable from azureml data asset short uri
from mltable import load
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
ml_client = MLClient(credential, '<subscription-id>', '<resourcegroup-name>', '<workspace-name>')
tbl = load('azureml:<data-asset-name>:<version>', ml_client=ml_client)
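As a sketch of the storage_options dictionary described above, its keys are exactly those documented for load ('subscription', 'resource_group', 'workspace', 'location'); the values below are placeholders for your workspace details:

```python
# Placeholder workspace values; the keys are those documented for
# the storage_options parameter of load.
storage_options = {
    "subscription": "<subscription-id>",
    "resource_group": "<resourcegroup-name>",
    "workspace": "<workspace-name>",
    "location": "<workspace-region>",
}

# With real values filled in, an AzureML-asset URI could then be loaded as:
# from mltable import load
# tbl = load(uri, storage_options=storage_options)

print(sorted(storage_options))
# ['location', 'resource_group', 'subscription', 'workspace']
```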