MLTable Class
Represents an MLTable.
An MLTable defines a series of lazily evaluated, immutable operations that load data from a data source. Data is not loaded from the source until the MLTable is asked to deliver data.
Initialize a new MLTable.
This constructor is not meant to be invoked directly. Create an MLTable by using load instead.
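The idea of lazily evaluated, immutable operations can be illustrated with a small pure-Python sketch. The LazyTable class, its source callable, and the to_list method are hypothetical stand-ins invented for this illustration; this is not how mltable is implemented.

```python
from itertools import islice


class LazyTable:
    """Hypothetical stand-in for the idea described above; not the mltable class."""

    def __init__(self, source, steps=()):
        self._source = source        # callable, invoked only when data is requested
        self._steps = tuple(steps)   # immutable record of pending operations

    def _with_step(self, step):
        # Every operation returns a new table; the original is left unchanged.
        return LazyTable(self._source, self._steps + (step,))

    def take(self, count):
        return self._with_step(lambda rows: islice(rows, count))

    def keep_columns(self, columns):
        cols = list(columns)
        return self._with_step(lambda rows: ({c: r[c] for c in cols} for r in rows))

    def to_list(self):
        rows = iter(self._source())  # the source is read here, not before
        for step in self._steps:
            rows = step(rows)
        return list(rows)


reads = []


def source():
    reads.append("read")             # record each time the source is touched
    return [{"id": 1, "x": 10}, {"id": 2, "x": 20}, {"id": 3, "x": 30}]


base = LazyTable(source)
small = base.take(2).keep_columns(["id"])
reads_before = len(reads)            # nothing has been loaded at this point
result = small.to_list()             # loading happens on this call
```

Chaining take and keep_columns only records steps. The source callable runs once to_list is called, and base still describes the full, untransformed data.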
Inheritance
builtins.object -> MLTable
Constructor
MLTable()
Methods
convert_column_types |
Adds a transformation step to convert the specified columns into their respective specified new types.
|
drop_columns |
Adds a transformation step to drop the given columns from the dataset. If an empty list, tuple, or set is given nothing is dropped. Duplicate columns will raise a UserErrorException. Attempting to drop a column that is MLTable.traits.timestamp_column or in MLTable.traits.index_columns will raise a UserErrorException. |
extract_columns_from_partition_format |
Adds a transformation step to use the partition information of each path and extract them into columns based on the specified partition format. Format part '{column_name}' creates string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start from the position of first partition key until the end of file path. For example, given the path '/Accounts/2019/01/01/data.csv' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'. |
filter |
Filter the data, leaving only the records that match the specified expression. |
get_partition_count |
Returns the number of data partitions underlying the data associated with this MLTable. |
keep_columns |
Adds a transformation step to keep the specified columns and drop all others from the dataset. If an empty list, tuple, or set is given nothing is dropped. Duplicate columns will raise a UserErrorException. If the column in MLTable.traits.timestamp_column or any column in MLTable.traits.index_columns is not explicitly kept, a UserErrorException is raised. |
random_split |
Randomly splits this MLTable into two MLTables, one holding approximately the fraction percent of the original MLTable's data and the other holding the remainder (1 - percent). |
save |
Save this MLTable as an MLTable YAML file and its associated paths to the given directory path. If path is not given, defaults to the current working directory. If path does not exist, it is created. If path is remote, the underlying data store must already exist. If path is a local directory and is not absolute, it is made absolute. If path points to a file, a UserErrorException is raised. If path is a directory path that already contains one or more files being saved (including the MLTable YAML file) and overwrite is set to False or 'fail', a UserErrorException is raised. If path is remote, any local file paths not given as a colocated path (file path relative to the directory that MLTable was loaded from) will raise a UserErrorException. colocated controls how associated paths are saved to path. If True, files are copied to path alongside the MLTable YAML file as relative file paths. Otherwise associated files are not copied, remote paths remain as given, and local file paths are made relative with path redirection if needed. Note that False may result in non-colocated MLTable YAML files, which is not recommended. Furthermore, if path is remote, this will result in a UserErrorException because relative path redirection is not supported for remote URIs. Note that if the MLTable is created programmatically with methods like from_paths() or from_read_delimited_files() with local relative paths, the MLTable directory path is assumed to be the current working directory. Be mindful when saving a new MLTable and associated data files to a directory with an existing MLTable file and associated data files that the directory is not cleared of existing files before saving the new files. It is possible for already existing data files to persist after saving the new files, especially if existing data files do not have names matching any new data files.
If the new MLTable contains a pattern designator under its paths, this may unintentionally alter the MLTable by associating existing data files with the new MLTable. If a file path in this MLTable points to an existing file in path but has a different URI, and overwrite is 'fail' or 'skip', the existing file is not overwritten (that is, it is skipped). |
select_partitions |
Adds a transformation step to select the partition. |
show |
Retrieves the first count rows of this MLTable as a Pandas DataFrame. |
skip |
Adds a transformation step to skip the first count rows of this MLTable. |
take |
Adds a transformation step to select the first count rows of this MLTable. |
take_random_sample |
Adds a transformation step to randomly select each row of this MLTable with the given probability. The probability must be in the range [0, 1]. A random seed may optionally be set. |
to_pandas_dataframe |
Load all records from the paths specified in the MLTable file into a Pandas DataFrame. |
validate |
Validates whether this MLTable's data can be loaded. Requires the MLTable's data source(s) to be accessible from the current compute. |
convert_column_types
Adds a transformation step to convert the specified columns into their respective specified new types.
from mltable import DataType
data_types = {
    'ID': DataType.to_string(),
    'Date': DataType.to_datetime('%d/%m/%Y %I:%M:%S %p'),
    'Count': DataType.to_int(),
    'Latitude': DataType.to_float(),
    'Found': DataType.to_bool(),
    'Stream': DataType.to_stream()
}
convert_column_types(column_types)
Parameters
- column_types
Dictionary of column: types the user desires to convert
Returns
MLTable with added transformation step
Return type
MLTable
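The datetime pattern in the example above looks like a Python strptime-style format string. Assuming DataType.to_datetime interprets it that way (an assumption, not something this page states), the standard library shows what such a pattern accepts. The sample value is made up for illustration.

```python
from datetime import datetime

# Same pattern as the 'Date' entry in the example above.
fmt = "%d/%m/%Y %I:%M:%S %p"

# Hypothetical sample value: day/month/year followed by a 12-hour clock time.
parsed = datetime.strptime("25/12/2019 03:30:00 PM", fmt)
print(parsed.isoformat())
```

Note that %I with %p denotes a 12-hour clock, so the afternoon time is parsed as hour 15.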
drop_columns
Adds a transformation step to drop the given columns from the dataset. If an empty list, tuple, or set is given nothing is dropped. Duplicate columns will raise a UserErrorException.
Attempting to drop a column that is MLTable.traits.timestamp_column or in MLTable.traits.index_columns will raise a UserErrorException.
drop_columns(columns: str | List[str] | Tuple[str] | Set[str])
Parameters
- columns
column(s) to drop from this MLTable
Returns
MLTable with added transformation step
Return type
MLTable
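The rules stated for drop_columns can be summarized as a small validation routine. The sketch below is a plain-Python restatement of those rules under stated assumptions: check_drop_columns and its keyword arguments are hypothetical names, and the UserErrorException class here is a local stand-in for the exception named in the text.

```python
class UserErrorException(Exception):
    """Local stand-in for the exception named in the text."""


def check_drop_columns(columns, timestamp_column=None, index_columns=()):
    """Return the column names to drop, or raise if the request breaks a rule."""
    # Accept a single name or a list, tuple, or set of names.
    names = [columns] if isinstance(columns, str) else list(columns)
    if not names:
        return []  # empty input: nothing is dropped
    if len(set(names)) != len(names):
        raise UserErrorException("duplicate columns given")
    # Columns referenced by traits may not be dropped.
    protected = set(index_columns)
    if timestamp_column is not None:
        protected.add(timestamp_column)
    blocked = protected.intersection(names)
    if blocked:
        raise UserErrorException(f"cannot drop trait columns: {sorted(blocked)}")
    return names
```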
extract_columns_from_partition_format
Adds a transformation step to use the partition information of each path and extract them into columns based on the specified partition format.
Format part '{column_name}' creates string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type.
The format should start from the position of first partition key until the end of file path. For example, given the path '/Accounts/2019/01/01/data.csv' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.
extract_columns_from_partition_format(partition_format)
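Parameters
- partition_format
partition format used to extract partition information from each path into columns
Returns
MLTable whose partition format is set to given format
Return type
MLTable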
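To make the format rules concrete, here is a minimal pure-Python sketch that applies a partition format to a single path. It is an illustration of the described behavior, not the mltable implementation: parse_partition_path is a hypothetical helper, and details such as zero-padded date parts and defaulting missing month or day to 1 are assumptions.

```python
import re
from datetime import datetime

# Datetime tokens named in the text, mapped to (regex, datetime field).
_TOKENS = {
    "yyyy": (r"\d{4}", "year"),
    "MM": (r"\d{2}", "month"),
    "dd": (r"\d{2}", "day"),
    "HH": (r"\d{2}", "hour"),
    "mm": (r"\d{2}", "minute"),
    "ss": (r"\d{2}", "second"),
}
_TOKEN_RE = re.compile("|".join(_TOKENS))


def parse_partition_path(path, partition_format):
    """Extract partition columns from one path (hypothetical helper)."""
    pattern = ""
    datetime_columns = []
    # Splitting on '{...}' alternates literal text (even) and placeholders (odd).
    parts = re.split(r"\{([^{}]+)\}", partition_format)
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)
        elif ":" in part:
            # '{name:yyyy/MM/dd}' becomes one named group per datetime token.
            name, fmt = part.split(":", 1)
            datetime_columns.append(name)
            pos = 0
            for m in _TOKEN_RE.finditer(fmt):
                pattern += re.escape(fmt[pos:m.start()])
                regex, field = _TOKENS[m.group()]
                pattern += f"(?P<{name}__{field}>{regex})"
                pos = m.end()
            pattern += re.escape(fmt[pos:])
        else:
            # '{name}' becomes a string column holding one path segment.
            pattern += f"(?P<{part}>[^/]+)"
    # The format runs from the first partition key to the end of the path.
    match = re.search(pattern + "$", path)
    if match is None:
        raise ValueError("path does not match the partition format")
    groups = match.groupdict()
    row = {k: v for k, v in groups.items() if "__" not in k}
    for name in datetime_columns:
        fields = {
            k.split("__")[1]: int(v)
            for k, v in groups.items()
            if k.startswith(name + "__")
        }
        row[name] = datetime(**{"month": 1, "day": 1, **fields})
    return row


row = parse_partition_path(
    "/Accounts/2019/01/01/data.csv",
    "/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv",
)
```

For the example path from the text, row holds the string 'Accounts' under 'Department' and the date 2019-01-01 under 'PartitionDate'.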
filter
Filter the data, leaving only the records that match the specified expression.
filter(expression)
Parameters
- expression
expression that a record must match to be kept
Returns
MLTable after filter
Return type
MLTable
Remarks
Expressions are started by indexing the mltable with the name of a column. They support a variety of functions and operators and can be combined using logical operators. The resulting expression will be lazily evaluated for each record when a data pull occurs and not where it is defined.
filtered_mltable = mltable.filter('feature_1 == "5" and target > "0.5"')
filtered_mltable = mltable.filter('col("FBI Code") == "11"')
get_partition_count
Returns the number of data partitions underlying the data associated with this MLTable.
get_partition_count() -> int
Returns
number of data partitions in this MLTable
Return type
int
keep_columns
Adds a transformation step to keep the specified columns and drop all others from the dataset. If an empty list, tuple, or set is given nothing is dropped. Duplicate columns will raise a UserErrorException.
If the column in MLTable.traits.timestamp_column or any column in MLTable.traits.index_columns is not explicitly kept, a UserErrorException is raised.
keep_columns(columns: str | List[str] | Tuple[str] | Set[str])
Parameters
- columns
column(s) in this MLTable to keep
Returns
MLTable with added transformation step
Return type
MLTable
random_split
Randomly splits this MLTable into two MLTables, one holding approximately the fraction percent of the original MLTable's data and the other holding the remainder (1 - percent).
random_split(percent=0.5, seed=None)
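Parameters
- percent
approximate fraction of the data placed in the first MLTable (default 0.5)
- seed
optional random seed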
Returns
two MLTables with this MLTable's data split between them by "percent"
Return type
save
Save this MLTable as an MLTable YAML file and its associated paths to the given directory path.
If path is not given, defaults to the current working directory. If path does not exist, it is created. If path is remote, the underlying data store must already exist. If path is a local directory and is not absolute, it is made absolute.
If path points to a file, a UserErrorException is raised. If path is a directory path that already contains one or more files being saved (including the MLTable YAML file) and overwrite is set to False or 'fail', a UserErrorException is raised. If path is remote, any local file paths not given as a colocated path (file path relative to the directory that MLTable was loaded from) will raise a UserErrorException.
colocated controls how associated paths are saved to path. If True, files are copied to path alongside the MLTable YAML file as relative file paths. Otherwise associated files are not copied, remote paths remain as given, and local file paths are made relative with path redirection if needed. Note that False may result in non-colocated MLTable YAML files, which is not recommended. Furthermore, if path is remote, this will result in a UserErrorException because relative path redirection is not supported for remote URIs.
Note that if the MLTable is created programmatically with methods like from_paths() or from_read_delimited_files() with local relative paths, the MLTable directory path is assumed to be the current working directory.
Be mindful when saving a new MLTable & associated data files to a directory with an existing MLTable file & associated data files that the directory is not cleared of existing files before saving the new files. It is possible for already existing data files to persist after saving the new files, especially if existing data files do not have names matching any new data files. If the new MLTable contains a pattern designator under its paths, this may unintentionally alter the MLTable by associating existing data files with the new MLTable.
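The stale-file caveat above can be reproduced with the standard library alone. This sketch does not call mltable; it only shows how a glob-style pattern picks up files left over from an earlier save when the target directory is not cleared first. The file names and the '*.csv' pattern are made up for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as target:
    # A data file left behind by an earlier save into the same directory.
    with open(os.path.join(target, "old_part.csv"), "w") as f:
        f.write("id\n1\n")

    # A later save writes new files without clearing the directory first.
    with open(os.path.join(target, "new_part.csv"), "w") as f:
        f.write("id\n2\n")

    # A pattern such as '*.csv' now matches the old file as well as the new one.
    matched = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(target, "*.csv"))
    )

print(matched)
```

Saving into an empty directory, or removing old data files beforehand, avoids the unintended match.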
If a file path in this MLTable points to an existing file in path but has a different URI, and overwrite is 'fail' or 'skip', the existing file is not overwritten (that is, it is skipped).
save(path=None, overwrite=True, colocated=False, show_progress=False, if_err_remove_files=True)
Parameters
- colocated
- bool
If True, saves copies of local & remote file paths in this MLTable under path as relative paths. Otherwise no file copying occurs and remote file paths are saved as given to the saved MLTable YAML file and local file paths as relative file paths with path redirection. If path is remote & this MLTable contains local file paths, a UserErrorException will be raised.
- overwrite
- Union[bool, str, <xref:mltable.MLTableSaveOverwriteOptions>]
How an existing MLTable YAML file and associated files that may already exist under path are handled. Options are 'overwrite' (or True) to replace any existing files, 'fail' (or False) to raise an error if a file already exists, or 'skip' to leave existing files as is. May also be set with <xref:mltable.MLTableSaveOverwriteOptions>.
- if_err_remove_files
- bool
If any error occurs during saving, remove any successfully saved files to make the operation atomic.
Returns
this MLTable instance
Return type
MLTable
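The three overwrite behaviors described above can be restated as a small decision routine. This is a sketch of the documented semantics only: normalize_overwrite and plan_save are hypothetical helpers, FileExistsError stands in for UserErrorException, and the MLTableSaveOverwriteOptions enum is not modeled.

```python
def normalize_overwrite(overwrite):
    """Map True/False and the string options onto one of three modes."""
    if overwrite is True:
        return "overwrite"
    if overwrite is False:
        return "fail"
    if overwrite in ("overwrite", "fail", "skip"):
        return overwrite
    raise ValueError(f"unsupported overwrite option: {overwrite!r}")


def plan_save(new_files, existing_files, overwrite=True):
    """Return the files that would be written, or raise on a clash in 'fail' mode."""
    mode = normalize_overwrite(overwrite)
    clashes = sorted(set(new_files) & set(existing_files))
    if mode == "fail" and clashes:
        # Stand-in for the UserErrorException described in the text.
        raise FileExistsError(f"files already exist: {clashes}")
    if mode == "skip":
        # Existing files are left as is; only missing files are written.
        return [f for f in new_files if f not in existing_files]
    return list(new_files)  # 'overwrite': everything is (re)written
```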
select_partitions
Adds a transformation step to select the partition.
select_partitions(partition_index_list)
Parameters
- partition_index_list
list of partition indices to select
Returns
MLTable with partition size updated
Return type
MLTable
Remarks
The following code snippet shows how to use the select_partitions API to select partitions from the provided MLTable.
partition_index_list = [1, 2]
mltable = mltable.select_partitions(partition_index_list)
show
Retrieves the first count rows of this MLTable as a Pandas DataFrame.
show(count=20)
Parameters
- count
number of rows to retrieve (default 20)
Returns
first count rows of the MLTable
Return type
pandas.DataFrame
skip
Adds a transformation step to skip the first count rows of this MLTable.
skip(count)
Parameters
- count
number of rows to skip
Returns
MLTable with added transformation step
take
Adds a transformation step to select the first count rows of this MLTable.
take(count=20)
Parameters
- count
number of rows to select (default 20)
Returns
MLTable with added "take" transformation step
Return type
MLTable
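Conceptually, skip and take behave like slicing a stream of rows, and chaining them selects a window. The sketch below shows that composition with itertools.islice over plain Python dictionaries; the free functions skip and take are hypothetical stand-ins, not the MLTable methods.

```python
from itertools import islice


def skip(rows, count):
    """Lazily drop the first `count` rows of an iterable."""
    return islice(rows, count, None)


def take(rows, count=20):
    """Lazily keep only the first `count` rows of an iterable."""
    return islice(rows, count)


# Ten made-up rows; skipping 3 and then taking 4 yields rows 3 through 6.
rows = ({"n": i} for i in range(10))
page = list(take(skip(rows, 3), 4))
```

Both helpers return iterators, so no row is consumed until the result is materialized with list.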
take_random_sample
Adds a transformation step to randomly select each row of this MLTable with the given probability. The probability must be in the range [0, 1]. A random seed may optionally be set.
take_random_sample(probability, seed=None)
Parameters
- probability
chance that each row is selected, in the range [0, 1]
- seed
optional random seed
Returns
MLTable with added transformation step
Return type
MLTable
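Selecting each row independently with a fixed probability means the sample size varies from run to run unless a seed is fixed. The following pure-Python sketch illustrates that behavior on a list; the free function take_random_sample below is a hypothetical stand-in, and how mltable generates its random numbers is not stated on this page.

```python
import random


def take_random_sample(rows, probability, seed=None):
    """Keep each row independently with the given probability."""
    if not 0 <= probability <= 1:
        raise ValueError("probability must be in the range [0, 1]")
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return [row for row in rows if rng.random() < probability]


data = list(range(1000))
sample_a = take_random_sample(data, 0.1, seed=42)
sample_b = take_random_sample(data, 0.1, seed=42)
```

With the same seed the two samples are identical. Their length is close to, but not necessarily exactly, 10% of the input.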
to_pandas_dataframe
Load all records from the paths specified in the MLTable file into a Pandas DataFrame.
to_pandas_dataframe()
Returns
Pandas DataFrame containing the records from paths in this MLTable
Return type
pandas.DataFrame
Remarks
The following code snippet shows how to use the to_pandas_dataframe API to obtain a Pandas DataFrame corresponding to the provided MLTable.
from mltable import load
tbl = load(r'.\samples\mltable_sample')  # raw string so the backslashes are not read as escapes
pdf = tbl.to_pandas_dataframe()
print(pdf.shape)
validate
Validates whether this MLTable's data can be loaded. Requires the MLTable's data source(s) to be accessible from the current compute.
validate()
Returns
None
Return type
None
Attributes
partition_keys
paths
Returns a list of dictionaries containing the original paths given to this MLTable. Relative local file paths are assumed to be relative to the directory containing the MLTable YAML file that this MLTable instance was loaded from.
Returns
list of dicts containing paths specified in the MLTable
Return type