mltable Package

Contains functionality for interacting with existing MLTable files and creating new ones.

With the mltable package you can load, transform, and analyze data in any Python environment, including Jupyter Notebooks or your favorite Python IDE.

Packages

tests

Modules

mltable

Contains functionality to create and interact with MLTable objects.

Classes

DataType

Helper class for handling the proper manipulation of supported column types (int, bool, string, etc.). Currently used with MLTable.convert_column_types(...) and from_delimited_files(...) to specify which types columns should be converted to. Different types are selected with the DataType.to_*(...) methods (for example, DataType.to_int()).
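
A minimal sketch of converting column types with DataType; the file path and column names below are placeholders:


   from mltable import load, DataType

   # load a table lazily, then declare column type conversions
   tbl = load('./samples/mltable_sample')
   tbl = tbl.convert_column_types({
       'colA': DataType.to_bool(),
       ('colB', 'colC'): DataType.to_float()
   })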

MLTable

Represents an MLTable.

An MLTable defines a series of lazily-evaluated, immutable operations to load data from the data source. Data is not loaded from the source until the MLTable is asked to deliver data.

Initialize a new MLTable.

This constructor is not supposed to be invoked directly. An MLTable is intended to be created using the load function.
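
A sketch of this lazy pattern (the path is a placeholder, and take is assumed here as one of the available transformation methods):


   from mltable import load

   tbl = load('./samples/mltable_sample')   # no data is read yet
   tbl = tbl.take(10)                       # the operation is only recorded
   df = tbl.to_pandas_dataframe()           # data is loaded at this point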

Enums

MLTableFileEncoding

Defines options for how encodings are processed when reading data from files to create an MLTable.

These enumeration values are used in the MLTable class.

MLTableHeaders

Defines options for how column headers are processed when reading data from files to create an MLTable.

These enumeration values are used in the MLTable class.
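
A sketch of passing both enums to from_delimited_files, assuming they are importable from the mltable package; the path is a placeholder:


   from mltable import from_delimited_files, MLTableHeaders, MLTableFileEncoding

   paths = [{'file': './samples/mltable_sample/sample_data.csv'}]
   tbl = from_delimited_files(
       paths,
       header=MLTableHeaders.all_files_same_headers,
       encoding=MLTableFileEncoding.utf8)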

Functions

from_delimited_files

Creates an MLTable from the given list of delimited files.

from_delimited_files(paths, header='all_files_same_headers', delimiter=',', support_multi_line=False, empty_as_string=False, encoding='utf8', include_path_column=False, infer_column_types=True)

Parameters

Name Description
paths
Required

Paths can point to files or folders, using local or cloud paths. Relative local file paths are assumed to be relative to the current working directory. If a local file path is relative to a directory other than the current working directory, pass it as an absolute file path instead.

header
Required

How column headers are handled when reading from files. Options are specified using the MLTableHeaders enum. Supported headers are 'no_header', 'from_first_file', 'all_files_different_headers', and 'all_files_same_headers'.

delimiter
Required
str

Separator used to split columns.

support_multi_line
Required

If False, all line breaks, including those in quoted field values, will be interpreted as a record break. Reading data this way is faster and more optimized for parallel execution on multiple CPU cores. However, it may result in silently producing more records with misaligned field values. This should be set to True when the delimited files are known to contain quoted line breaks.

Given this csv file as an example, the data will be read differently based on support_multi_line.


   A,B,C
   A1,B1,C1
   A2,"B
   2",C2


   from mltable import from_delimited_files

   # 'paths' is a hypothetical pointer to the csv file shown above
   paths = [{'file': './sample.csv'}]

   # default behavior: support_multi_line=False
   mltable = from_delimited_files(paths)
   print(mltable.to_pandas_dataframe())
   #      A   B     C
   #  0  A1  B1    C1
   #  1  A2   B  None
   #  2  2"  C2  None

   # to handle quoted line breaks
   mltable = from_delimited_files(paths, support_multi_line=True)
   print(mltable.to_pandas_dataframe())
   #      A       B   C
   #  0  A1      B1  C1
   #  1  A2  B\r\n2  C2
empty_as_string
Required

How empty fields should be handled. If True, empty fields are read as empty strings; otherwise they are read as nulls. Even if True, columns containing datetime or numeric data still read empty fields as nulls.

encoding
Required

Specifies the file encoding using the enum MLTableFileEncoding. Supported encodings are:

  • utf8 as "utf8", "utf-8", or "utf-8 bom"
  • iso88591 as "iso88591" or "iso-8859-1"
  • latin1 as "latin1" or "latin-1"
  • utf16 as "utf16" or "utf-16"
  • windows1252 as "windows1252" or "windows-1252"
include_path_column
Required

Keep path information as a column in the MLTable. This is useful when reading multiple files and you want to know which file a particular record came from, or to keep useful information that may be stored in the file path.

infer_column_types
Required

If True, automatically infers all column types. If False, leaves columns as strings. If a dictionary, represents columns whose types are to be set to the given types, with all other columns inferred. The dictionary may contain a key named sample_size mapped to a positive integer, representing the number of rows to use for inferring column types. The dictionary may also contain a key named 'column_type_overrides', whose value is itself a dictionary: each key is either a string representing a column name or a tuple of strings representing a group of column names, and each value is either a string (one of 'boolean', 'string', 'float', or 'int') or a DataType. mltable.DataType.to_stream() is not supported. If an empty dictionary is given, it is treated as True. Defaults to True.

An example of how to format infer_column_types.


   from mltable import from_delimited_files, DataType

   # override the types of selected columns, inferring the rest from 100 sampled rows
   mltable = from_delimited_files(paths, infer_column_types={
       'sample_size': 100,
       'column_type_overrides': {
           'colA': 'boolean',
           ('colB', 'colC'): DataType.to_int()
       }
   })

Returns

Type Description

MLTable

Remarks

There must be a valid list of path dictionaries.


   # load mltable from local delimited file
   from mltable import from_delimited_files
   paths = [{"file": "./samples/mltable_sample/sample_data.csv"}]
   mltable = from_delimited_files(paths)

from_delta_lake

Creates an MLTable object to read in Parquet files from a delta lake table.

from_delta_lake(delta_table_uri, timestamp_as_of=None, version_as_of=None, include_path_column=False)

Parameters

Name Description
delta_table_uri
Required
str

URI pointing to the delta table directory containing the delta lake parquet files to read. Supported URI types are: local path URI, storage URI, long-form datastore URI, or data asset URI.

timestamp_as_of
Required

Datetime string in RFC-3339/ISO-8601 format to use to read in matching parquet files from a specific point in time, e.g. "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00", "2022-10-01T01:30:00-08:00".

version_as_of
Required
int

Integer version to use to read in a specific version of the parquet files.

include_path_column
Required

Keep path information as a column. This is useful when reading multiple files and you want to know which file a particular record came from, or to keep useful information that may be stored in the file path.

Returns

Type Description

MLTable instance

Remarks

from_delta_lake creates an MLTable object which defines the operations to load data from delta lake folder into tabular representation.

For the data to be accessible by Azure Machine Learning, delta_table_uri must point to the delta table directory, and the referenced delta lake files must be accessible by AzureML services or behind public web urls.

from_delta_lake supports reading delta lake data from a uri pointing to: local path, Blob, ADLS Gen1, and ADLS Gen2.

Users are able to read in and materialize the data by calling to_pandas_dataframe() on the returned MLTable.


   # create an MLTable object from a delta lake using timestamp versioning and materialize the data
   from mltable import from_delta_lake
   mltable_ts = from_delta_lake(delta_table_uri="./data/delta-01", timestamp_as_of="2021-05-24T00:00:00Z")
   pd = mltable_ts.to_pandas_dataframe()

   # create an MLTable object from a delta lake using integer versioning and materialize the data
   from mltable import from_delta_lake
   mltable_version = from_delta_lake(delta_table_uri="./data/delta-02", version_as_of=1)
   pd = mltable_version.to_pandas_dataframe()

from_json_lines_files

Creates an MLTable from the given list of JSON Lines file paths.

from_json_lines_files(paths, invalid_lines='error', encoding='utf8', include_path_column=False)

Parameters

Name Description
paths
Required

Paths can point to files or folders, using local or cloud paths. Relative local file paths are assumed to be relative to the current working directory. If a local file path is relative to a directory other than the current working directory, pass it as an absolute file path instead.

invalid_lines
Required
str

How to handle lines that are invalid JSON; can be 'drop' or 'error'. If 'drop', invalid lines are dropped; otherwise an error is raised.

encoding
Required

Specifies the file encoding using the enum MLTableFileEncoding. Supported file encodings:

  • utf8 as "utf8", "utf-8", or "utf-8 bom"
  • iso88591 as "iso88591" or "iso-8859-1"
  • latin1 as "latin1" or "latin-1"
  • utf16 as "utf16" or "utf-16"
  • windows1252 as "windows1252" or "windows-1252"
include_path_column
Required

Keep path information as a column. This is useful when reading multiple files and you want to know which file a particular record came from, or to keep useful information that may be stored in the file path.

Returns

Type Description

MLTable

Remarks

There must be a valid list of path dictionaries.


   # load mltable from local JSON paths
   from mltable import from_json_lines_files
   paths = [{'file': './samples/mltable_sample/sample_data.jsonl'}]
   mltable = from_json_lines_files(paths)

from_parquet_files

Creates an MLTable from the given list of parquet files.

from_parquet_files(paths, include_path_column=False)

Parameters

Name Description
paths
Required

Paths can point to files or folders, using local or cloud paths. Relative local file paths are assumed to be relative to the current working directory. If a local file path is relative to a directory other than the current working directory, pass it as an absolute file path instead.

include_path_column
Required

Keep path information as a column. This is useful when reading multiple files and you want to know which file a particular record came from, or to keep useful information that may be stored in the file path.

Returns

Type Description

MLTable instance

Remarks

There must be a valid list of path dictionaries.


   # load mltable from local parquet paths
   from mltable import from_parquet_files
   paths = [{'file': './samples/mltable_sample/sample.parquet'}]
   mltable = from_parquet_files(paths)

from_paths

Creates an MLTable from the given paths.

from_paths(paths)

Parameters

Name Description
paths
Required

Paths can point to files or folders, using local or cloud paths. Relative local file paths are assumed to be relative to the current working directory. If a local file path is relative to a directory other than the current working directory, pass it as an absolute file path instead.

Returns

Type Description

MLTable instance

Remarks

There must be a valid list of path dictionaries.


   # load mltable from local paths
   from mltable import from_paths
   tbl = from_paths([{'file': "./samples/mltable_sample"}])

   # load mltable from cloud paths
   from mltable import from_paths
   tbl = from_paths(
       [{'file': "https://<blob-storage-name>.blob.core.windows.net/<path>/sample_file"}])

load

Loads the MLTable file (YAML) present at the given uri.

storage_options supports the keys 'subscription', 'resource_group', 'workspace', and 'location'. Together, these must locate an Azure Machine Learning workspace.

load(uri, storage_options: dict = None, ml_client=None)

Parameters

Name Description
uri
Required
str

uri supports a long-form datastore uri, storage uri, local path, data asset uri, or data asset short uri.

storage_options
Required

AML workspace info when URI is an AML asset

ml_client
Required

The MLClient object used to resolve a data asset short uri, as in the final example below.

Returns

Type Description

MLTable

Remarks

There must be a valid MLTable YAML file named 'MLTable' present at the given uri.


   # load mltable from local folder
   from mltable import load
   tbl = load('./samples/mltable_sample')

   # load mltable from azureml datastore uri
   from mltable import load
   tbl = load(
       'azureml://subscriptions/<subscription-id>/'
       'resourcegroups/<resourcegroup-name>/workspaces/<workspace-name>/'
       'datastores/<datastore-name>/paths/<mltable-path-on-datastore>/')

   # load mltable from azureml data asset uri
   from mltable import load
   tbl = load(
         'azureml://subscriptions/<subscription-id>/'
         'resourcegroups/<resourcegroup-name>/providers/Microsoft.MachineLearningServices/'
         'workspaces/<workspace-name>/data/<data-asset-name>/versions/<data-asset-version>/')

   # load mltable from azureml data asset short uri
   from mltable import load
   from azure.ai.ml import MLClient
   from azure.identity import DefaultAzureCredential
   credential = DefaultAzureCredential()
   ml_client = MLClient(credential, '<subscription_id>', '<resourcegroup-name>', '<workspace-name>')
   tbl = load('azureml:<data-asset-name>:<version>', ml_client=ml_client)
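
A sketch of the storage_options described above, assuming it is supplied alongside an azureml uri; every value below is a placeholder:


   # load mltable by supplying workspace info through storage_options
   from mltable import load
   tbl = load(
       'azureml://subscriptions/<subscription-id>/'
       'resourcegroups/<resourcegroup-name>/workspaces/<workspace-name>/'
       'datastores/<datastore-name>/paths/<mltable-path-on-datastore>/',
       storage_options={
           'subscription': '<subscription-id>',
           'resource_group': '<resourcegroup-name>',
           'workspace': '<workspace-name>',
           'location': '<workspace-region>'
       })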