MLTable Class
Represents an MLTable.
An MLTable defines a series of lazily evaluated, immutable operations that load data from the data source. Data is not loaded from the source until the MLTable is asked to deliver data.
Initialize a new MLTable.
This constructor is not supposed to be invoked directly. MLTable is intended to be created using load.
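The lazy, immutable behavior described above can be illustrated with a minimal sketch. `LazyTable` and `source` are hypothetical stand-ins written for this example and are not part of the mltable package. The sketch only shows the general pattern: each method returns a new object with one more recorded step, and the source is read only when data is requested.

```python
from itertools import islice


class LazyTable:
    """Hypothetical stand-in: records steps and runs them only on demand."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that yields rows
        self._steps = tuple(steps)   # immutable record of pending steps

    def _with_step(self, step):
        # Never mutate: return a new object carrying one extra step.
        return LazyTable(self._source, self._steps + (step,))

    def skip(self, count):
        return self._with_step(lambda rows: islice(rows, count, None))

    def take(self, count):
        return self._with_step(lambda rows: islice(rows, count))

    def to_list(self):
        # Only here is the source actually read.
        rows = iter(self._source())
        for step in self._steps:
            rows = step(rows)
        return list(rows)


reads = []


def source():
    reads.append("read")             # track when the source is touched
    return range(10)


base = LazyTable(source)
window = base.skip(2).take(3)        # nothing has been read yet
reads_before_pull = len(reads)
rows = window.to_list()              # the source is read exactly once here
```

Because `base` is never modified, it can be reused to build other pipelines after `window` has been created.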
Inheritance
builtins.object
MLTable
Constructor
MLTable()
Methods
convert_column_types |
Adds a transformation step to convert the specified columns into their respective specified new types. |
drop_columns |
Adds a transformation step to drop the given columns from the dataset. If an empty list, tuple, or set is given nothing is dropped. Duplicate columns will raise a UserErrorException. Attempting to drop a column that is MLTable.traits.timestamp_column or in MLTable.traits.index_columns will raise a UserErrorException. |
extract_columns_from_partition_format |
Adds a transformation step to use the partition information of each path and extract them into columns based on the specified partition format. Format part '{column_name}' creates string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start from the position of first partition key until the end of file path. For example, given the path '/Accounts/2019/01/01/data.csv' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'. |
filter |
Filter the data, leaving only the records that match the specified expression. |
get_partition_count |
Returns the number of data partitions underlying the data associated with this MLTable. |
keep_columns |
Adds a transformation step to keep the specified columns and drop all others from the dataset. If an empty list, tuple, or set is given nothing is dropped. Duplicate columns will raise a UserErrorException. If the column in MLTable.traits.timestamp_column or the columns in MLTable.traits.index_columns are not explicitly kept, a UserErrorException is raised. |
random_split |
Randomly splits this MLTable into two MLTables, one having approximately "percent"% of the original MLTable's data and the other having the remainder (1-"percent"%). |
save |
Save this MLTable as an MLTable YAML file and its associated paths to the given directory path. If path is not given, it defaults to the current working directory. If path does not exist, it is created. If path is remote, the underlying data store must already exist. If path is a local directory and is not absolute, it is made absolute. If path points to a file, a UserErrorException is raised. If path is a directory path that already contains one or more files being saved (including the MLTable YAML file) and overwrite is set to False or 'fail', a UserErrorException is raised. If path is remote, any local file paths not given as a colocated path (file path relative to the directory that the MLTable was loaded from) will raise a UserErrorException. colocated controls how associated paths are saved to path. If True, files are copied to path alongside the MLTable YAML file as relative file paths. Otherwise associated files are not copied, remote paths remain as given, and local file paths are made relative with path redirection if needed. Note that False may result in noncolocated MLTable YAML files, which is not recommended. Furthermore, if path is remote this will result in a UserErrorException, as relative path redirection is not supported for remote URIs. Note that if the MLTable is created programmatically with methods like from_paths() or from_read_delimited_files() with local relative paths, the MLTable directory path is assumed to be the current working directory. Be mindful when saving a new MLTable and associated data files to a directory with an existing MLTable file and associated data files that the directory is not cleared of existing files before saving the new files. It is possible for already existing data files to persist after saving the new files, especially if existing data files do not have names matching any new data files.
If the new MLTable contains a pattern designator under its paths, this may unintentionally alter the MLTable by associating existing data files with the new MLTable. If file paths in this MLTable point to an existing file in path but have a different URI, if overwrite is 'fail' or 'skip' the existing file will not be overwritten (i.e. skipped). |
select_partitions |
Adds a transformation step to select the partition. |
show |
Retrieves the first count rows of this MLTable as a Pandas Dataframe. |
skip |
Adds a transformation step to skip the first count rows of this MLTable. |
take |
Adds a transformation step to select the first count rows of this MLTable. |
take_random_sample |
Adds a transformation step to randomly select each row of this MLTable with probability chance. Probability must be in range [0, 1]. May optionally set a random seed. |
to_pandas_dataframe |
Load all records from the paths specified in the MLTable file into a Pandas DataFrame. |
validate |
Validates if this MLTable's data can be loaded, requires the MLTable's data source(s) to be accessible from the current compute. |
convert_column_types
Adds a transformation step to convert the specified columns into their respective specified new types.
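from mltable import DataType

# Map each column name to the type it should be converted to.
data_types = {
    'ID': DataType.to_string(),
    'Date': DataType.to_datetime('%d/%m/%Y %I:%M:%S %p'),
    'Count': DataType.to_int(),
    'Latitude': DataType.to_float(),
    'Found': DataType.to_bool(),
    'Stream': DataType.to_stream()
}

# mltable is assumed to be an MLTable that has already been loaded.
mltable = mltable.convert_column_types(data_types)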
convert_column_types(column_types)
Parameters
Name | Description |
---|---|
column_types (Required) | Dictionary mapping each column name to the type it should be converted to |
Returns
Type | Description |
---|---|
MLTable | MLTable with added transformation step |
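The format string passed to DataType.to_datetime in the example above looks like a set of Python strptime directives. The source does not state this, so the following is only a sketch under that assumption. It uses the standard library alone, and the sample value is made up for illustration.

```python
from datetime import datetime

# Format string copied from the DataType.to_datetime example.
fmt = '%d/%m/%Y %I:%M:%S %p'

# Hypothetical cell value as it might appear in a 'Date' column.
sample = '25/12/2020 03:30:00 PM'

# %d/%m/%Y reads day, month, year; %I with %p reads a 12-hour clock.
parsed = datetime.strptime(sample, fmt)
print(parsed.isoformat())
```

Testing a format string against a few raw values in this way can catch day/month mix-ups before the conversion step is added.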
drop_columns
Adds a transformation step to drop the given columns from the dataset. If an empty list, tuple, or set is given nothing is dropped. Duplicate columns will raise a UserErrorException.
Attempting to drop a column that is MLTable.traits.timestamp_column or in MLTable.traits.index_columns will raise a UserErrorException.
drop_columns(columns: str | List[str] | Tuple[str] | Set[str])
Parameters
Name | Description |
---|---|
columns (Required) | Column(s) to drop from this MLTable |
Returns
Type | Description |
---|---|
MLTable | MLTable with added transformation step |
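The rules stated for drop_columns can be summarized in a small sketch. Both `check_drop_columns` and the local `UserErrorException` class are hypothetical stand-ins written for this example. They mirror the documented rules and are not the library's implementation.

```python
class UserErrorException(Exception):
    """Local stand-in for the exception named in the text."""


def check_drop_columns(columns, timestamp_column=None, index_columns=()):
    """Return the columns to drop, or raise if the request breaks a rule."""
    if isinstance(columns, str):
        columns = [columns]              # a single name is also accepted
    columns = list(columns)
    # Rule: duplicate columns raise an error.
    if len(set(columns)) != len(columns):
        raise UserErrorException("duplicate columns given")
    # Rule: the timestamp column and index columns cannot be dropped.
    protected = set(index_columns)
    if timestamp_column is not None:
        protected.add(timestamp_column)
    blocked = protected.intersection(columns)
    if blocked:
        raise UserErrorException(f"cannot drop trait columns: {sorted(blocked)}")
    # Rule: an empty list, tuple, or set drops nothing.
    return columns


nothing = check_drop_columns([])
single = check_drop_columns("price")
several = check_drop_columns(("price", "city"), timestamp_column="ts")
```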
extract_columns_from_partition_format
Adds a transformation step to use the partition information of each path and extract them into columns based on the specified partition format.
Format part '{column_name}' creates string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type.
The format should start from the position of first partition key until the end of file path. For example, given the path '/Accounts/2019/01/01/data.csv' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.
extract_columns_from_partition_format(partition_format)
Parameters
Name | Description |
---|---|
partition_format (Required) | Partition format to use to extract data into columns |
Returns
Type | Description |
---|---|
MLTable | MLTable whose partition format is set to the given format |
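To make the partition format rules concrete, here is a standard-library sketch that applies them to the example path from the description. `extract_partition_columns` and `_datetime_part` are hypothetical helpers written for this illustration. They are not how mltable parses paths, and they assume the datetime tokens map onto the matching strptime directives.

```python
import re
from datetime import datetime

# Datetime tokens named in the text, mapped to strptime directives.
_DIRECTIVES = {"yyyy": "%Y", "MM": "%m", "dd": "%d",
               "HH": "%H", "mm": "%M", "ss": "%S"}


def _datetime_part(fmt):
    """Turn e.g. 'yyyy/MM/dd' into a regex fragment and a strptime format."""
    regex, strp, pos = "", "", 0
    for token in re.finditer(r"yyyy|MM|dd|HH|mm|ss", fmt):
        literal = fmt[pos:token.start()]
        # A token of length n matches exactly n digits.
        regex += re.escape(literal) + rf"\d{{{len(token.group())}}}"
        strp += literal.replace("%", "%%") + _DIRECTIVES[token.group()]
        pos = token.end()
    tail = fmt[pos:]
    return regex + re.escape(tail), strp + tail.replace("%", "%%")


def extract_partition_columns(path, partition_format):
    """Return {column: value} parsed from the end of path."""
    pattern, strp_formats, pos = "", {}, 0
    for field in re.finditer(r"\{(\w+)(?::([^}]+))?\}", partition_format):
        pattern += re.escape(partition_format[pos:field.start()])
        name, fmt = field.group(1), field.group(2)
        if fmt is None:
            pattern += f"(?P<{name}>[^/]+)"        # '{name}' is a string column
        else:
            regex, strp_formats[name] = _datetime_part(fmt)
            pattern += f"(?P<{name}>{regex})"      # '{name:...}' is a datetime column
        pos = field.end()
    # The format must match through to the end of the file path.
    pattern += re.escape(partition_format[pos:]) + "$"
    match = re.search(pattern, path)
    if match is None:
        raise ValueError("path does not match the partition format")
    return {
        name: datetime.strptime(value, strp_formats[name])
        if name in strp_formats else value
        for name, value in match.groupdict().items()
    }


columns = extract_partition_columns(
    "/Accounts/2019/01/01/data.csv",
    "/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv",
)
```

Here `columns` holds the string 'Accounts' under 'Department' and the date 2019-01-01 under 'PartitionDate', matching the example in the description.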
filter
Filter the data, leaving only the records that match the specified expression.
filter(expression)
Parameters
Name | Description |
---|---|
expression (Required) | The expression to evaluate |
Returns
Type | Description |
---|---|
MLTable | MLTable after the filter is applied |
Remarks
Expressions are started by indexing the mltable with the name of a column. They support a variety of functions and operators and can be combined using logical operators. The resulting expression will be lazily evaluated for each record when a data pull occurs and not where it is defined.
filtered_mltable = mltable.filter('feature_1 == "5" and target > "0.5"')
filtered_mltable = mltable.filter('col("FBI Code") == "11"')
get_partition_count
Returns the number of data partitions underlying the data associated with this MLTable.
get_partition_count() -> int
Returns
Type | Description |
---|---|
int | Number of data partitions in this MLTable |
keep_columns
Adds a transformation step to keep the specified columns and drop all others from the dataset. If an empty list, tuple, or set is given nothing is dropped. Duplicate columns will raise a UserErrorException.
If the column in MLTable.traits.timestamp_column or the columns in MLTable.traits.index_columns are not explicitly kept, a UserErrorException is raised.
keep_columns(columns: str | List[str] | Tuple[str] | Set[str])
Parameters
Name | Description |
---|---|
columns (Required) | Column(s) in this MLTable to keep |
Returns
Type | Description |
---|---|
MLTable | MLTable with added transformation step |
random_split
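Randomly splits this MLTable into two MLTables, one having approximately the share of the original MLTable's data given by percent and the other having the remainder (1 - percent). The default of 0.5 suggests that percent is expressed as a fraction rather than as a value out of 100.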
random_split(percent=0.5, seed=None)
Parameters
Name | Description |
---|---|
percent (Default value: 0.5) | Share of the data that goes to one of the two MLTables; the other receives the remainder |
seed (Default value: None) | Optional random seed |
Returns
Type | Description |
---|---|
two MLTables with this MLTable's data split between them by "percent" |
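One way to see why the split sizes are only approximate is to assign each row independently, as in the sketch below. This is an assumption made for illustration: the source does not say how mltable performs the split, and this `random_split` is a hypothetical plain-Python function over a list, not the MLTable method.

```python
import random


def random_split(rows, percent=0.5, seed=None):
    """Send each row to the first list with probability `percent`."""
    rng = random.Random(seed)            # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < percent else second).append(row)
    return first, second


rows = list(range(1000))
train, test = random_split(rows, percent=0.8, seed=42)
train_again, test_again = random_split(rows, percent=0.8, seed=42)

# Every row lands in exactly one of the two lists, but the sizes only
# approximate the 80/20 target.
print(len(train), len(test))
```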
save
Save this MLTable as an MLTable YAML file and its associated paths to the given directory path.
If path is not given, it defaults to the current working directory. If path does not exist, it is created. If path is remote, the underlying data store must already exist. If path is a local directory and is not absolute, it is made absolute.
If path points to a file, a UserErrorException is raised. If path is a directory path that already contains one or more files being saved (including the MLTable YAML file) and overwrite is set to False or 'fail', a UserErrorException is raised. If path is remote, any local file paths not given as a colocated path (file path relative to the directory that the MLTable was loaded from) will raise a UserErrorException.
colocated controls how associated paths are saved to path. If True, files are copied to path alongside the MLTable YAML file as relative file paths. Otherwise associated files are not copied, remote paths remain as given, and local file paths are made relative with path redirection if needed. Note that False may result in noncolocated MLTable YAML files, which is not recommended. Furthermore, if path is remote this will result in a UserErrorException, as relative path redirection is not supported for remote URIs.
Note that if the MLTable is created programmatically with methods like from_paths() or from_read_delimited_files() with local relative paths, the MLTable directory path is assumed to be the current working directory.
Be mindful when saving a new MLTable and associated data files to a directory with an existing MLTable file and associated data files that the directory is not cleared of existing files before saving the new files. It is possible for already existing data files to persist after saving the new files, especially if existing data files do not have names matching any new data files. If the new MLTable contains a pattern designator under its paths, this may unintentionally alter the MLTable by associating existing data files with the new MLTable.
If a file path in this MLTable points to an existing file in path but has a different URI, and overwrite is 'fail' or 'skip', the existing file is not overwritten (that is, it is skipped).
save(path=None, overwrite=True, colocated=False, show_progress=False, if_err_remove_files=True)
Parameters
Name | Description |
---|---|
path (Default value: None) | Directory path to save to; defaults to the current working directory |
colocated (Default value: False) | If True, saves copies of local and remote file paths in this MLTable under path as relative paths. Otherwise no file copying occurs, remote file paths are saved as given to the saved MLTable YAML file, and local file paths are saved as relative file paths with path redirection. If path is remote and this MLTable contains local file paths, a UserErrorException will be raised. |
overwrite (Default value: True) | Union[bool, str, MLTableSaveOverwriteOptions]. How an existing MLTable YAML file and associated files that may already exist under path are handled. Options are 'overwrite' (or True) to replace any existing files, 'fail' (or False) to raise an error if a file already exists, or 'skip' to leave existing files as is. May also be set with MLTableSaveOverwriteOptions. |
show_progress (Default value: False) | Displays copying progress to stdout |
if_err_remove_files (Default value: True) | If any error occurs during saving, remove any successfully saved files to make the operation atomic |
Returns
Type | Description |
---|---|
MLTable | This MLTable instance |
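The three overwrite modes can be sketched as a small decision function. `normalize_overwrite` and `plan_save` are hypothetical helpers that only model the documented choices on sets of file names. The built-in FileExistsError stands in for the UserErrorException that the text says is raised, and no files are touched.

```python
def normalize_overwrite(overwrite):
    """Map the accepted values onto 'overwrite', 'fail', or 'skip'."""
    if overwrite is True:
        return "overwrite"
    if overwrite is False:
        return "fail"
    if overwrite in ("overwrite", "fail", "skip"):
        return overwrite
    raise ValueError(f"unsupported overwrite value: {overwrite!r}")


def plan_save(existing, incoming, overwrite=True):
    """Return the sorted file names that would be written under each mode."""
    mode = normalize_overwrite(overwrite)
    clashes = set(existing) & set(incoming)
    if mode == "fail" and clashes:
        # Stand-in for the UserErrorException described in the text.
        raise FileExistsError(f"already present: {sorted(clashes)}")
    if mode == "skip":
        return sorted(set(incoming) - clashes)   # leave existing files as is
    return sorted(incoming)                      # replace any existing files


existing = {"MLTable", "a.csv"}
incoming = {"MLTable", "b.csv"}

replaced = plan_save(existing, incoming, overwrite=True)
skipped = plan_save(existing, incoming, overwrite="skip")
```

Note that in both modes `a.csv` stays in the directory, which is the leftover-file situation the description warns about.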
select_partitions
Adds a transformation step to select the partition.
select_partitions(partition_index_list)
Parameters
Name | Description |
---|---|
partition_index_list (Required) | List of partition indices to select |
Returns
Type | Description |
---|---|
MLTable | MLTable with partition size updated |
Remarks
The following code snippet shows how to use the select_partitions API to select partitions from the provided MLTable.
partition_index_list = [1, 2]
mltable = mltable.select_partitions(partition_index_list)
show
Retrieves the first count rows of this MLTable as a Pandas Dataframe.
show(count=20)
Parameters
Name | Description |
---|---|
count (Default value: 20) | Number of rows from the top of the table to select |
Returns
Type | Description |
---|---|
pandas DataFrame | First count rows of the MLTable |
skip
Adds a transformation step to skip the first count rows of this MLTable.
skip(count)
Parameters
Name | Description |
---|---|
count (Required) | Number of rows to skip |
Returns
Type | Description |
---|---|
MLTable | MLTable with added transformation step |
take
Adds a transformation step to select the first count rows of this MLTable.
take(count=20)
Parameters
Name | Description |
---|---|
count (Default value: 20) | Number of rows from the top of the table to select |
Returns
Type | Description |
---|---|
MLTable | MLTable with added "take" transformation step |
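Because skip and take are each added as a transformation step, the order in which they are chained should affect the result. The sketch below shows this with plain iterators. The `skip` and `take` functions here are hypothetical helpers over a Python list, used only to illustrate the row arithmetic. They are not the MLTable methods.

```python
from itertools import islice


def skip(rows, count):
    """Drop the first `count` rows."""
    return islice(rows, count, None)


def take(rows, count):
    """Keep only the first `count` rows."""
    return islice(rows, count)


data = list(range(10))

# Skip two rows, then take three: a window over rows 2, 3 and 4.
skip_then_take = list(take(skip(data, 2), 3))

# Take three rows, then skip two: only row 2 is left.
take_then_skip = list(skip(take(data, 3), 2))
```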
take_random_sample
Adds a transformation step to randomly select each row of this MLTable with probability chance. Probability must be in range [0, 1]. May optionally set a random seed.
take_random_sample(probability, seed=None)
Parameters
Name | Description |
---|---|
probability (Required) | Chance that each row is selected; must be in the range [0, 1] |
seed (Default value: None) | Optional random seed |
Returns
Type | Description |
---|---|
MLTable | MLTable with added transformation step |
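Selecting each row with a fixed probability means the sample size is not fixed in advance. A minimal sketch of that idea follows. This `take_random_sample` is a hypothetical function over a Python list, not the MLTable method, and whether mltable seeds or draws in this way is an assumption.

```python
import random


def take_random_sample(rows, probability, seed=None):
    """Keep each row independently with the given probability."""
    if not 0 <= probability <= 1:
        raise ValueError("probability must be in the range [0, 1]")
    rng = random.Random(seed)            # same seed, same selection
    return [row for row in rows if rng.random() < probability]


rows = list(range(100))

none_kept = take_random_sample(rows, 0)            # probability 0 keeps no rows
all_kept = take_random_sample(rows, 1)             # probability 1 keeps every row
sample = take_random_sample(rows, 0.3, seed=7)
sample_again = take_random_sample(rows, 0.3, seed=7)
```

If an exact number of rows is needed, take is the documented way to select a fixed count.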
to_pandas_dataframe
Load all records from the paths specified in the MLTable file into a Pandas DataFrame.
to_pandas_dataframe()
Returns
Type | Description |
---|---|
pandas DataFrame | DataFrame containing the records from paths in this MLTable |
Remarks
The following code snippet shows how to use the to_pandas_dataframe api to obtain a pandas dataframe corresponding to the provided MLTable.
from mltable import load
tbl = load('./samples/mltable_sample')
pdf = tbl.to_pandas_dataframe()
print(pdf.shape)
validate
Validates if this MLTable's data can be loaded, requires the MLTable's data source(s) to be accessible from the current compute.
validate()
Returns
Type | Description |
---|---|
None |
Attributes
partition_keys
paths
Returns a list of dictionaries containing the original paths given to this MLTable. Relative local file paths are assumed to be relative to the directory containing the MLTable YAML file that this MLTable instance was loaded from.
Returns
Type | Description |
---|---|
list of dicts containing paths specified in the MLTable |