MLTable Class

Represents an MLTable.

An MLTable defines a series of lazily evaluated, immutable operations to load data from the data source. Data is not loaded from the source until the MLTable is asked to deliver data.

Initialize a new MLTable.

This constructor is not supposed to be invoked directly. MLTable is intended to be created using load.
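The lazy, immutable behaviour described above can be pictured with a small plain-Python sketch. `LazyTable`, `with_step`, and `rows` are hypothetical names used only for illustration; this is not how MLTable is implemented, just one way to model "record steps now, read data later".

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple

Row = dict
Step = Callable[[Iterator[Row]], Iterator[Row]]


@dataclass(frozen=True)
class LazyTable:
    """Hypothetical stand-in for illustration; not the MLTable implementation."""
    source: Callable[[], Iterable[Row]]   # called only when data is requested
    steps: Tuple[Step, ...] = ()

    def with_step(self, step: Step) -> "LazyTable":
        # Return a new object; the original pipeline is left unchanged.
        return LazyTable(self.source, self.steps + (step,))

    def rows(self) -> list:
        # The source is read here, and only here.
        stream = iter(self.source())
        for step in self.steps:
            stream = step(stream)
        return list(stream)


reads = []  # records each time the source is actually touched


def read_source():
    reads.append(1)
    return [{"x": 1}, {"x": 2}, {"x": 3}]


base = LazyTable(read_source)
filtered = base.with_step(lambda rows: (r for r in rows if r["x"] > 1))
```

Adding the step does not touch the source; only `filtered.rows()` does, and `base` keeps its empty step list.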

Inheritance
builtins.object
MLTable

Constructor

MLTable()

Methods

convert_column_types

Adds a transformation step to convert the specified columns into their respective specified new types.


   from mltable import DataType

   data_types = {
       'ID': DataType.to_string(),
       'Date': DataType.to_datetime('%d/%m/%Y %I:%M:%S %p'),
       'Count': DataType.to_int(),
       'Latitude': DataType.to_float(),
       'Found': DataType.to_bool(),
       'Stream': DataType.to_stream()
   }

drop_columns

Adds a transformation step to drop the given columns from the dataset. If an empty list, tuple, or set is given nothing is dropped. Duplicate columns will raise a UserErrorException.

Attempting to drop a column that is MLTable.traits.timestamp_column or in MLTable.traits.index_columns will raise a UserErrorException.

extract_columns_from_partition_format

Adds a transformation step to use the partition information of each path and extract them into columns based on the specified partition format.

Format part '{column_name}' creates string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type.

The format should start from the position of first partition key until the end of file path. For example, given the path '/Accounts/2019/01/01/data.csv' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.

filter

Filter the data, leaving only the records that match the specified expression.

get_partition_count

Returns the number of data partitions underlying the data associated with this MLTable.

keep_columns

Adds a transformation step to keep the specified columns and drop all others from the dataset. If an empty list, tuple, or set is given nothing is dropped. Duplicate columns will raise a UserErrorException.

If the column in MLTable.traits.timestamp_column or any column in MLTable.traits.index_columns is not explicitly kept, a UserErrorException is raised.

random_split

Randomly splits this MLTable into two MLTables, one having approximately "percent"% of the original MLTable's data and the other having the remainder (1-"percent"%).

save

Save this MLTable as an MLTable YAML file and its associated paths to the given directory path.

If path is not given, defaults to the current working directory. If path does not exist, it is created. If path is remote, the underlying data store must already exist. If path is a local directory & is not absolute, it is made absolute.

If path points to a file, a UserErrorException is raised. If path is a directory that already contains one or more of the files being saved (including the MLTable YAML file) and overwrite is set to False or 'fail', a UserErrorException is raised. If path is remote, any local file paths not given as a colocated path (file path relative to the directory that MLTable was loaded from) will raise a UserErrorException.

colocated controls how associated paths are saved to path. If True, files are copied to path alongside the MLTable YAML file as relative file paths. Otherwise, associated files are not copied, remote paths remain as given, and local file paths are made relative, with path redirection if needed. Note that False may result in non-colocated MLTable YAML files, which is not recommended. Furthermore, if path is remote, False results in a UserErrorException because relative path redirection is not supported for remote URIs.

Note that if the MLTable is created programmatically with methods like from_paths() or from_read_delimited_files() with local relative paths, the MLTable directory path is assumed to be the current working directory.

When saving a new MLTable and its associated data files to a directory that already holds an MLTable file and data files, be aware that the directory is not cleared of existing files before the new files are saved. Existing data files can persist after the save, especially if their names do not match any of the new data files. If the new MLTable contains a pattern designator under its paths, this may unintentionally alter the MLTable by associating existing data files with the new MLTable.

If a file path in this MLTable points to a file that already exists under path but has a different URI, and overwrite is 'fail' or 'skip', the existing file is not overwritten (i.e. it is skipped).

select_partitions

Adds a transformation step to select the partition.

show

Retrieves the first count rows of this MLTable as a Pandas Dataframe.

skip

Adds a transformation step to skip the first count rows of this MLTable.

take

Adds a transformation step to select the first count rows of this MLTable.

take_random_sample

Adds a transformation step to randomly select each row of this MLTable with probability chance. Probability must be in range [0, 1]. May optionally set a random seed.

to_pandas_dataframe

Load all records from the paths specified in the MLTable file into a Pandas DataFrame.

validate

Validates if this MLTable's data can be loaded, requires the MLTable's data source(s) to be accessible from the current compute.

convert_column_types

Adds a transformation step to convert the specified columns into their respective specified new types.


   from mltable import DataType

   data_types = {
       'ID': DataType.to_string(),
       'Date': DataType.to_datetime('%d/%m/%Y %I:%M:%S %p'),
       'Count': DataType.to_int(),
       'Latitude': DataType.to_float(),
       'Found': DataType.to_bool(),
       'Stream': DataType.to_stream()
   }

convert_column_types(column_types)

Parameters

column_types
dict[Union[Tuple[str], str], DataType]
Required

Dictionary of column: types the user desires to convert
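As a usage sketch, the dictionary can be passed straight to convert_column_types on a loaded table. The function name `convert_types`, the directory argument, and the column names are hypothetical; the tuple key reflects the `Tuple[str]` key type listed above and is assumed to apply one type to several columns.

```python
def convert_types(mltable_dir):
    """Load an MLTable and queue a type-conversion step (no data is read yet)."""
    # Imported inside the function so the module has no import-time dependency.
    from mltable import DataType, load

    tbl = load(mltable_dir)
    column_types = {
        "ID": DataType.to_string(),
        # Assumption: a tuple key applies the same type to each listed column.
        ("Latitude", "Longitude"): DataType.to_float(),
    }
    # Returns a new MLTable with the added transformation step.
    return tbl.convert_column_types(column_types)
```

Calling `convert_types('./samples/mltable_sample')` would return an MLTable whose conversion is applied only when data is requested, for example through to_pandas_dataframe.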

Returns

MLTable with added transformation step

Return type

MLTable

drop_columns

Adds a transformation step to drop the given columns from the dataset. If an empty list, tuple, or set is given nothing is dropped. Duplicate columns will raise a UserErrorException.

Attempting to drop a column that is MLTable.traits.timestamp_column or in MLTable.traits.index_columns will raise a UserErrorException.
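The rules above can be summarised as a small validation routine. This is a plain-Python sketch of the documented behaviour, not mltable's code: `check_drop` is a hypothetical helper, and `UserErrorException` here is a local stand-in for the exception named in the text.

```python
class UserErrorException(Exception):
    """Local stand-in for the exception type named in the text."""


def check_drop(columns, timestamp_column=None, index_columns=()):
    """Return the list of columns to drop, or raise if a rule is broken."""
    if isinstance(columns, str):
        columns = [columns]              # a single name is treated as one column
    columns = list(columns)
    if len(set(columns)) != len(columns):
        raise UserErrorException("duplicate columns given")
    protected = set(index_columns)
    if timestamp_column is not None:
        protected.add(timestamp_column)
    blocked = protected.intersection(columns)
    if blocked:
        raise UserErrorException(f"cannot drop trait columns: {sorted(blocked)}")
    return columns                       # an empty input drops nothing
```

For example, `check_drop(['a', 'a'])` raises, while `check_drop([])` returns an empty list and nothing is dropped.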

drop_columns(columns: str | List[str] | Tuple[str] | Set[str])

Parameters

columns
Union[str, list[str], tuple[str], set[str]]
Required

column(s) to drop from this MLTable

Returns

MLTable with added transformation step

Return type

MLTable

extract_columns_from_partition_format

Adds a transformation step to use the partition information of each path and extract them into columns based on the specified partition format.

Format part '{column_name}' creates string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type.

The format should start from the position of first partition key until the end of file path. For example, given the path '/Accounts/2019/01/01/data.csv' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.
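To make the format rules concrete, here is a minimal plain-Python sketch that parses one path with a partition format. It only illustrates the documented behaviour; `parse_partition` is a hypothetical helper, and the regular-expression approach is an assumption, not mltable's implementation.

```python
import re
from datetime import datetime

# Date tokens named in the text, mapped to (datetime field, digit count).
_TOKENS = {"yyyy": ("year", 4), "MM": ("month", 2), "dd": ("day", 2),
           "HH": ("hour", 2), "mm": ("minute", 2), "ss": ("second", 2)}
_TOKEN_RE = re.compile("|".join(_TOKENS))
_FIELD_RE = re.compile(r"\{(\w+)(?::([^}]*))?\}")


def parse_partition(path, partition_format):
    """Extract partition columns from one path, or return None if no match."""
    pattern, captures, pos = "", [], 0
    for field in _FIELD_RE.finditer(partition_format):
        pattern += re.escape(partition_format[pos:field.start()])
        name, date_fmt = field.group(1), field.group(2)
        if date_fmt is None:
            pattern += "([^/]+)"                 # '{column_name}': string column
            captures.append((name, None))
        else:
            last = 0                             # '{column_name:...}': datetime
            for tok in _TOKEN_RE.finditer(date_fmt):
                pattern += re.escape(date_fmt[last:tok.start()])
                part, width = _TOKENS[tok.group()]
                pattern += r"(\d{%d})" % width
                captures.append((name, part))
                last = tok.end()
            pattern += re.escape(date_fmt[last:])
        pos = field.end()
    # The format runs to the end of the file path, so anchor the match there.
    pattern += re.escape(partition_format[pos:]) + "$"
    match = re.search(pattern, path)
    if match is None:
        return None
    row, dates = {}, {}
    for (name, part), text in zip(captures, match.groups()):
        if part is None:
            row[name] = text
        else:
            dates.setdefault(name, {"month": 1, "day": 1})[part] = int(text)
    for name, fields in dates.items():
        row[name] = datetime(**fields)
    return row


row = parse_partition("/Accounts/2019/01/01/data.csv",
                      "/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv")
```

Here `row` holds the string 'Accounts' under 'Department' and a datetime for 2019-01-01 under 'PartitionDate', matching the example in the text.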

extract_columns_from_partition_format(partition_format)

Parameters

partition_format
str
Required

Partition format to use to extract data into columns

Returns

MLTable whose partition format is set to given format

Return type

MLTable

filter

Filter the data, leaving only the records that match the specified expression.

filter(expression)

Parameters

expression
string
Required

The expression to evaluate.

Returns

MLTable after filter

Return type

MLTable

Remarks

Expressions are started by indexing the mltable with the name of a column. They support a variety of functions and operators and can be combined using logical operators. The resulting expression will be lazily evaluated for each record when a data pull occurs and not where it is defined.


   filtered_mltable = mltable.filter('feature_1 == "5" and target > "0.5"')
   filtered_mltable = mltable.filter('col("FBI Code") == "11"')

get_partition_count

Returns the number of data partitions underlying the data associated with this MLTable.

get_partition_count() -> int

Returns

data partitions in this MLTable

Return type

int

keep_columns

Adds a transformation step to keep the specified columns and drop all others from the dataset. If an empty list, tuple, or set is given nothing is dropped. Duplicate columns will raise a UserErrorException.

If the column in MLTable.traits.timestamp_column or any column in MLTable.traits.index_columns is not explicitly kept, a UserErrorException is raised.

keep_columns(columns: str | List[str] | Tuple[str] | Set[str])

Parameters

columns
Union[str, list[str], tuple[str], set[str]]
Required

column(s) in this MLTable to keep

Returns

MLTable with added transformation step

Return type

MLTable

random_split

Randomly splits this MLTable into two MLTables, one having approximately "percent"% of the original MLTable's data and the other having the remainder (1-"percent"%).

random_split(percent=0.5, seed=None)

Parameters

percent
Union[int, float]
Default value: 0.5

percent of the MLTable to split between

seed
Optional[int]
Default value: None

optional random seed

Returns

two MLTables with this MLTable's data split between them by "percent"

Return type

Tuple[MLTable, MLTable]
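One way to picture why the split sizes are only approximate is a per-row random assignment, sketched below in plain Python. This is an assumption about the behaviour, not mltable's implementation: the helper works on an in-memory list and treats percent as a fraction, as the default of 0.5 suggests.

```python
import random


def random_split(rows, percent=0.5, seed=None):
    """Assign each row to the first part with probability `percent`."""
    rng = random.Random(seed)            # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < percent else second).append(row)
    return first, second


train, test = random_split(list(range(1000)), percent=0.8, seed=7)
```

Every row lands in exactly one of the two parts, and the first part holds roughly, not exactly, 80% of the rows.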

save

Save this MLTable as an MLTable YAML file and its associated paths to the given directory path.

If path is not given, defaults to the current working directory. If path does not exist, it is created. If path is remote, the underlying data store must already exist. If path is a local directory & is not absolute, it is made absolute.

If path points to a file, a UserErrorException is raised. If path is a directory that already contains one or more of the files being saved (including the MLTable YAML file) and overwrite is set to False or 'fail', a UserErrorException is raised. If path is remote, any local file paths not given as a colocated path (file path relative to the directory that MLTable was loaded from) will raise a UserErrorException.

colocated controls how associated paths are saved to path. If True, files are copied to path alongside the MLTable YAML file as relative file paths. Otherwise, associated files are not copied, remote paths remain as given, and local file paths are made relative, with path redirection if needed. Note that False may result in non-colocated MLTable YAML files, which is not recommended. Furthermore, if path is remote, False results in a UserErrorException because relative path redirection is not supported for remote URIs.

Note that if the MLTable is created programmatically with methods like from_paths() or from_read_delimited_files() with local relative paths, the MLTable directory path is assumed to be the current working directory.

When saving a new MLTable and its associated data files to a directory that already holds an MLTable file and data files, be aware that the directory is not cleared of existing files before the new files are saved. Existing data files can persist after the save, especially if their names do not match any of the new data files. If the new MLTable contains a pattern designator under its paths, this may unintentionally alter the MLTable by associating existing data files with the new MLTable.

If a file path in this MLTable points to a file that already exists under path but has a different URI, and overwrite is 'fail' or 'skip', the existing file is not overwritten (i.e. it is skipped).
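The three overwrite behaviours can be illustrated with a simplified, local-files-only sketch. `normalize_overwrite` and `write_file` are hypothetical helpers; the sketch raises `FileExistsError` where save is documented to raise a UserErrorException, and it ignores remote paths and the URI comparison described above.

```python
import tempfile
from pathlib import Path


def normalize_overwrite(overwrite):
    """Map True/False onto the string options 'overwrite' and 'fail'."""
    if overwrite is True:
        return "overwrite"
    if overwrite is False:
        return "fail"
    if overwrite in ("overwrite", "fail", "skip"):
        return overwrite
    raise ValueError(f"unsupported overwrite option: {overwrite!r}")


def write_file(target, data, overwrite=True):
    """Write one file; return True if written, False if skipped."""
    mode = normalize_overwrite(overwrite)
    if target.exists():
        if mode == "fail":
            raise FileExistsError(target)    # stand-in for UserErrorException
        if mode == "skip":
            return False                     # leave the existing file as is
    target.write_text(data)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "MLTable"
    first = write_file(target, "v1")                      # file is created
    second = write_file(target, "v2", overwrite="skip")   # existing file kept
    kept = target.read_text()
```

After the two calls the file still contains the first version, because 'skip' leaves existing files untouched.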

save(path=None, overwrite=True, colocated=False, show_progress=False, if_err_remove_files=True)

Parameters

path
str
Default value: None

directory path to save to, defaults to the current working directory

colocated
bool
Default value: False

If True, saves copies of local & remote file paths in this MLTable under path as relative paths. Otherwise no file copying occurs and remote file paths are saved as given to the saved MLTable YAML file and local file paths as relative file paths with path redirection. If path is remote & this MLTable contains local file paths, a UserErrorException will be raised.

overwrite
Union[bool, str, mltable.MLTableSaveOverwriteOptions]
Default value: True

How an existing MLTable YAML file and associated files under path are handled. Options are 'overwrite' (or True) to replace any existing files, 'fail' (or False) to raise an error if a file already exists, or 'skip' to leave existing files as is. May also be set with mltable.MLTableSaveOverwriteOptions.

show_progress
bool
Default value: False

displays copying progress to stdout

if_err_remove_files
bool
Default value: True

if any error occurs during saving, remove any successfully saved files to make the operation atomic

Returns

this MLTable instance

Return type

MLTable

select_partitions

Adds a transformation step to select the partition.

select_partitions(partition_index_list)

Parameters

partition_index_list
list of int
Required

list of partition index

Returns

MLTable with partition size updated

Return type

MLTable

Remarks

The following code snippet shows how to use the select_partitions API to select partitions from the provided MLTable.


   partition_index_list = [1, 2]
   mltable = mltable.select_partitions(partition_index_list)

show

Retrieves the first count rows of this MLTable as a Pandas Dataframe.

show(count=20)

Parameters

count
int
Default value: 20

number of rows from top of table to select

Returns

first count rows of the MLTable

Return type

pandas.DataFrame

skip

Adds a transformation step to skip the first count rows of this MLTable.

skip(count)

Parameters

count
int
Required

number of rows to skip

Returns

MLTable with added transformation step

take

Adds a transformation step to select the first count rows of this MLTable.
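Chained together, skip and take behave like an offset and a limit. The sketch below shows that idea on a plain Python iterator with `itertools.islice`; the two helper functions are illustrative stand-ins, not the MLTable methods themselves.

```python
from itertools import islice


def skip(rows, count):
    """Lazily drop the first `count` rows."""
    return islice(rows, count, None)


def take(rows, count=20):
    """Lazily keep only the first `count` rows."""
    return islice(rows, count)


rows = ({"id": i} for i in range(100))
# Nothing is consumed until list() pulls the rows through both steps.
page = list(take(skip(rows, 10), 5))
```

Here `page` contains the rows with ids 10 through 14: the first ten are skipped, then the next five are kept.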

take(count=20)

Parameters

count
int
Default value: 20

number of rows from top of table to select

Returns

MLTable with added "take" transformation step

Return type

MLTable

take_random_sample

Adds a transformation step to randomly select each row of this MLTable with probability chance. Probability must be in range [0, 1]. May optionally set a random seed.
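The sampling rule can be sketched in plain Python as an independent draw per row. This illustrates the documented semantics on an in-memory list and is not mltable's code; raising `ValueError` for an out-of-range probability is an assumption made for the sketch.

```python
import random


def take_random_sample(rows, probability, seed=None):
    """Keep each row independently with the given probability."""
    if not 0 <= probability <= 1:
        raise ValueError("probability must be in range [0, 1]")
    rng = random.Random(seed)            # same seed, same selection
    return [row for row in rows if rng.random() < probability]


sample = take_random_sample(list(range(1000)), probability=0.1, seed=42)
```

Because each row is drawn separately, the sample size varies around 10% of the input rather than being exactly 100 rows, and rows keep their original order.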

take_random_sample(probability, seed=None)

Parameters

probability
Required

chance that each row is selected

seed
Optional[int]
Default value: None

optional random seed

Returns

MLTable with added transformation step

Return type

MLTable

to_pandas_dataframe

Load all records from the paths specified in the MLTable file into a Pandas DataFrame.

to_pandas_dataframe()

Returns

Pandas Dataframe containing the records from paths in this MLTable

Return type

pandas.DataFrame

Remarks

The following code snippet shows how to use the to_pandas_dataframe API to obtain a pandas DataFrame corresponding to the provided MLTable.


   from mltable import load
   tbl = load('./samples/mltable_sample')
   pdf = tbl.to_pandas_dataframe()
   print(pdf.shape)

validate

Validates if this MLTable's data can be loaded, requires the MLTable's data source(s) to be accessible from the current compute.

validate()

Returns

None

Return type

None

Attributes

partition_keys

Return the partition keys.

Returns

the partition keys

Return type

paths

Returns a list of dictionaries containing the original paths given to this MLTable. Relative local file paths are assumed to be relative to the directory where the MLTable YAML file this MLTable instance was loaded from.
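The rule for relative local paths can be sketched as follows. `resolve_paths` is a hypothetical helper, and the dictionary keys 'file' and 'folder' are assumptions about the shape of each entry; remote URIs and absolute paths are left as given.

```python
from pathlib import Path


def resolve_paths(paths, mltable_dir):
    """Resolve relative local paths against the MLTable's directory."""
    base = Path(mltable_dir)
    resolved = []
    for entry in paths:
        out = {}
        for kind, value in entry.items():
            is_remote = "://" in value           # e.g. https://... URIs
            if is_remote or Path(value).is_absolute():
                out[kind] = value                # keep as given
            else:
                out[kind] = str(base / value)    # relative to the MLTable dir
        resolved.append(out)
    return resolved


entries = [{"file": "data/a.csv"}, {"folder": "https://example.com/data"}]
resolved = resolve_paths(entries, "tables/sample")
```

The first entry is joined onto the MLTable directory, while the remote entry is returned unchanged.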

Returns

list of dicts containing paths specified in the MLTable

Return type