mltable Package

Contains functionality for interacting with existing MLTable files and creating new ones.

With the mltable package you can load, transform, and analyze data in any Python environment, including Jupyter Notebooks or your favorite Python IDE.

Packages

tests

Modules

mltable

Contains functionality to create and interact with MLTable objects.

Classes

DataType

Helper class for handling the proper manipulation of supported column types (int, bool, string, etc.). Currently used with MLTable.convert_column_types(...) and from_delimited_files(...) to specify which types columns should be converted to. Different types are selected with the DataType.to_*(...) methods (for example, DataType.to_int()).
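
A minimal sketch of converting column types with DataType; the file path and column names below are placeholders:


   from mltable import load, DataType

   # load a table lazily, then declare column type conversions
   tbl = load('./samples/mltable_sample')
   tbl = tbl.convert_column_types({
       'colA': DataType.to_bool(),
       ('colB', 'colC'): DataType.to_float()
   })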

MLTable

Represents an MLTable.

An MLTable defines a series of lazily-evaluated, immutable operations to load data from the data source. Data is not loaded from the source until the MLTable is asked to deliver data.

Initialize a new MLTable.

This constructor is not supposed to be invoked directly. An MLTable is intended to be created using the load function.
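
A sketch of this lazy pattern (the path is a placeholder, and take is assumed here as one of the available transformation methods):


   from mltable import load

   tbl = load('./samples/mltable_sample')   # no data is read yet
   tbl = tbl.take(10)                       # the operation is only recorded
   df = tbl.to_pandas_dataframe()           # data is loaded at this point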

Enums

MLTableFileEncoding

Defines options for how encodings are processed when reading data from files to create an MLTable.

These enumeration values are used in the MLTable class.

MLTableHeaders

Defines options for how column headers are processed when reading data from files to create an MLTable.

These enumeration values are used in the MLTable class.
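
A sketch of passing both enums to from_delimited_files, assuming they are importable from the mltable package; the path is a placeholder:


   from mltable import from_delimited_files, MLTableHeaders, MLTableFileEncoding

   paths = [{'file': './samples/mltable_sample/sample_data.csv'}]
   tbl = from_delimited_files(
       paths,
       header=MLTableHeaders.all_files_same_headers,
       encoding=MLTableFileEncoding.utf8)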

Functions

from_delimited_files

Creates an MLTable from the given list of delimited files.

from_delimited_files(paths, header='all_files_same_headers', delimiter=',', support_multi_line=False, empty_as_string=False, encoding='utf8', include_path_column=False, infer_column_types=True)

Parameters

Name Description
paths
Required

Paths can point to files or folders, using local or cloud paths. Relative local file paths are assumed to be relative to the current working directory. If a local file path is relative to a directory other than the current working directory, pass it as an absolute file path instead.

header
Required

How column headers are handled when reading from files. Options are specified using the MLTableHeaders enum. Supported headers are 'no_header', 'from_first_file', 'all_files_different_headers', and 'all_files_same_headers'.

delimiter
Required
str

Separator used to split columns.

support_multi_line
Required

If False, all line breaks, including those in quoted field values, will be interpreted as a record break. Reading data this way is faster and more optimized for parallel execution on multiple CPU cores. However, it may result in silently producing more records with misaligned field values. This should be set to True when the delimited files are known to contain quoted line breaks.

Given this csv file as an example, the data will be read differently based on support_multi_line.


   A,B,C
   A1,B1,C1
   A2,"B
   2",C2


   from mltable import from_delimited_files

   # 'paths' is a hypothetical pointer to the csv file shown above
   paths = [{'file': './sample.csv'}]

   # default behavior: support_multi_line=False
   mltable = from_delimited_files(paths)
   print(mltable.to_pandas_dataframe())
   #      A   B     C
   #  0  A1  B1    C1
   #  1  A2   B  None
   #  2  2"  C2  None

   # to handle quoted line breaks
   mltable = from_delimited_files(paths, support_multi_line=True)
   print(mltable.to_pandas_dataframe())
   #      A       B   C
   #  0  A1      B1  C1
   #  1  A2  B\r\n2  C2
empty_as_string
Required

How empty fields should be handled. If True, empty fields are read as empty strings; otherwise they are read as nulls. Even if True, columns containing datetime or numeric data still read empty fields as nulls.

encoding
Required

Specifies the file encoding using the enum MLTableFileEncoding. Supported encodings are:

  • utf8 as "utf8", "utf-8", or "utf-8 bom"
  • iso88591 as "iso88591" or "iso-8859-1"
  • latin1 as "latin1" or "latin-1"
  • utf16 as "utf16" or "utf-16"
  • windows1252 as "windows1252" or "windows-1252"
include_path_column
Required

Keep path information as a column in the MLTable. This is useful when reading multiple files and you want to know which file a particular record came from, or to keep useful information that may be stored in the file path.

infer_column_types
Required

If True, automatically infers all column types. If False, leaves columns as strings. If a dictionary, represents columns whose types are to be set to the given types, with all other columns inferred. The dictionary may contain a key named sample_size mapped to a positive integer, representing the number of rows to use for inferring column types. The dictionary may also contain a key named 'column_type_overrides', whose value is itself a dictionary: each key is either a string representing a column name or a tuple of strings representing a group of column names, and each value is either a string (one of 'boolean', 'string', 'float', or 'int') or a DataType. mltable.DataType.to_stream() is not supported. If an empty dictionary is given, it is treated as True. Defaults to True.

An example of how to format infer_column_types.


   from mltable import from_delimited_files, DataType

   # override the types of selected columns, inferring the rest from 100 sampled rows
   mltable = from_delimited_files(paths, infer_column_types={
       'sample_size': 100,
       'column_type_overrides': {
           'colA': 'boolean',
           ('colB', 'colC'): DataType.to_int()
       }
   })

Returns

Type Description

MLTable

Remarks

There must be a valid list of path dictionaries.


   # load mltable from local delimited file
   from mltable import from_delimited_files
   paths = [{"file": "./samples/mltable_sample/sample_data.csv"}]
   mltable = from_delimited_files(paths)

from_delta_lake

Creates an MLTable object to read in Parquet files from a delta lake table.

from_delta_lake(delta_table_uri, timestamp_as_of=None, version_as_of=None, include_path_column=False)

Parameters

Name Description
delta_table_uri
Required
str

URI pointing to the delta table directory containing the delta lake parquet files to read. Supported URI types are: local path URI, storage URI, long-form datastore URI, or data asset URI.

timestamp_as_of
Required

Datetime string in RFC-3339/ISO-8601 format to use to read in matching parquet files from a specific point in time, e.g. "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00", "2022-10-01T01:30:00-08:00".

version_as_of
Required
int

Integer version to use to read in a specific version of the parquet files.

include_path_column
Required

Keep path information as a column. This is useful when reading multiple files and you want to know which file a particular record came from, or to keep useful information that may be stored in the file path.

Returns

Type Description

MLTable instance

Remarks

from_delta_lake creates an MLTable object which defines the operations to load data from delta lake folder into tabular representation.

For the data to be accessible by Azure Machine Learning, delta_table_uri must point to the delta table directory, and the referenced delta lake files must be accessible by AzureML services or behind public web urls.

from_delta_lake supports reading delta lake data from a uri pointing to: local path, Blob, ADLS Gen1, and ADLS Gen2.

Users are able to read in and materialize the data by calling to_pandas_dataframe() on the returned MLTable.


   # create an MLTable object from a delta lake using timestamp versioning and materialize the data
   from mltable import from_delta_lake
   mltable_ts = from_delta_lake(delta_table_uri="./data/delta-01", timestamp_as_of="2021-05-24T00:00:00Z")
   pd = mltable_ts.to_pandas_dataframe()

   # create an MLTable object from a delta lake using integer versioning and materialize the data
   from mltable import from_delta_lake
   mltable_version = from_delta_lake(delta_table_uri="./data/delta-02", version_as_of=1)
   pd = mltable_version.to_pandas_dataframe()

from_json_lines_files

Creates an MLTable from the given list of JSON Lines file paths.

from_json_lines_files(paths, invalid_lines='error', encoding='utf8', include_path_column=False)

Parameters

Name Description
paths
Required

Paths can point to files or folders, using local or cloud paths. Relative local file paths are assumed to be relative to the current working directory. If a local file path is relative to a directory other than the current working directory, pass it as an absolute file path instead.

invalid_lines
Required
str

How to handle lines that are invalid JSON; can be 'drop' or 'error'. If 'drop', invalid lines are dropped; otherwise an error is raised.

encoding
Required

Specifies the file encoding using the enum MLTableFileEncoding. Supported file encodings:

  • utf8 as "utf8", "utf-8", or "utf-8 bom"
  • iso88591 as "iso88591" or "iso-8859-1"
  • latin1 as "latin1" or "latin-1"
  • utf16 as "utf16" or "utf-16"
  • windows1252 as "windows1252" or "windows-1252"
include_path_column
Required

Keep path information as a column. This is useful when reading multiple files and you want to know which file a particular record came from, or to keep useful information that may be stored in the file path.

Returns

Type Description

MLTable

Remarks

There must be a valid list of path dictionaries.


   # load mltable from local JSON paths
   from mltable import from_json_lines_files
   paths = [{'file': './samples/mltable_sample/sample_data.jsonl'}]
   mltable = from_json_lines_files(paths)

from_parquet_files

Creates an MLTable from the given list of parquet files.

from_parquet_files(paths, include_path_column=False)

Parameters

Name Description
paths
Required

Paths can point to files or folders, using local or cloud paths. Relative local file paths are assumed to be relative to the current working directory. If a local file path is relative to a directory other than the current working directory, pass it as an absolute file path instead.

include_path_column
Required

Keep path information as a column. This is useful when reading multiple files and you want to know which file a particular record came from, or to keep useful information that may be stored in the file path.

Returns

Type Description

MLTable instance

Remarks

There must be a valid list of path dictionaries.


   # load mltable from local parquet paths
   from mltable import from_parquet_files
   paths = [{'file': './samples/mltable_sample/sample.parquet'}]
   mltable = from_parquet_files(paths)

from_paths

Creates an MLTable from the given paths.

from_paths(paths)

Parameters

Name Description
paths
Required

Paths can point to files or folders, using local or cloud paths. Relative local file paths are assumed to be relative to the current working directory. If a local file path is relative to a directory other than the current working directory, pass it as an absolute file path instead.

Returns

Type Description

MLTable instance

Remarks

There must be a valid list of path dictionaries.


   # load mltable from local paths
   from mltable import from_paths
   tbl = from_paths([{'file': "./samples/mltable_sample"}])

   # load mltable from cloud paths
   from mltable import from_paths
   tbl = from_paths(
       [{'file': "https://<blob-storage-name>.blob.core.windows.net/<path>/sample_file"}])

load

Loads the MLTable file (YAML) present at the given uri.

storage_options supports the keys 'subscription', 'resource_group', 'workspace', and 'location'. Together, these must locate an Azure Machine Learning workspace.

load(uri, storage_options: dict = None, ml_client=None)

Parameters

Name Description
uri
Required
str

uri supports a long-form datastore uri, storage uri, local path, data asset uri, or data asset short uri.

storage_options
Required

AML workspace info when URI is an AML asset

ml_client
Required

The MLClient object used to resolve a data asset short uri, as in the final example below.

Returns

Type Description

MLTable

Remarks

There must be a valid MLTable YAML file named 'MLTable' present at the given uri.


   # load mltable from local folder
   from mltable import load
   tbl = load('./samples/mltable_sample')

   # load mltable from azureml datastore uri
   from mltable import load
   tbl = load(
       'azureml://subscriptions/<subscription-id>/'
       'resourcegroups/<resourcegroup-name>/workspaces/<workspace-name>/'
       'datastores/<datastore-name>/paths/<mltable-path-on-datastore>/')

   # load mltable from azureml data asset uri
   from mltable import load
   tbl = load(
         'azureml://subscriptions/<subscription-id>/'
         'resourcegroups/<resourcegroup-name>/providers/Microsoft.MachineLearningServices/'
         'workspaces/<workspace-name>/data/<data-asset-name>/versions/<data-asset-version>/')

   # load mltable from azureml data asset short uri
   from mltable import load
   from azure.ai.ml import MLClient
   from azure.identity import DefaultAzureCredential
   credential = DefaultAzureCredential()
   ml_client = MLClient(credential, '<subscription_id>', '<resourcegroup-name>', '<workspace-name>')
   tbl = load('azureml:<data-asset-name>:<version>', ml_client=ml_client)
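
A sketch of the storage_options described above, assuming it is supplied alongside an azureml uri; every value below is a placeholder:


   # load mltable by supplying workspace info through storage_options
   from mltable import load
   tbl = load(
       'azureml://subscriptions/<subscription-id>/'
       'resourcegroups/<resourcegroup-name>/workspaces/<workspace-name>/'
       'datastores/<datastore-name>/paths/<mltable-path-on-datastore>/',
       storage_options={
           'subscription': '<subscription-id>',
           'resource_group': '<resourcegroup-name>',
           'workspace': '<workspace-name>',
           'location': '<workspace-region>'
       })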