CLI (v2) mltable YAML schema
APPLIES TO: Azure CLI ml extension v2 (current)
Find the source JSON schema at https://azuremlschemas.azureedge.net/latest/MLTable.schema.json.
Note
The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.
How to author MLTable files
This article covers the MLTable YAML schema only. For more information on MLTable, including MLTable file authoring, MLTable artifact creation, consuming tables in Pandas and Spark, and end-to-end examples, read Working with tables in Azure Machine Learning.
YAML syntax
| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| `$schema` | string | The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, you can invoke schema and resource completions by including `$schema` at the top of your file. | | |
| `type` | const | `mltable` abstracts the schema definition for tabular data, making it easier for data consumers to materialize the table into a Pandas, Dask, or Spark dataframe. | `mltable` | `mltable` |
| `paths` | array | Paths can be a `file` path, `folder` path, or `pattern` for paths. `pattern` supports globbing patterns that specify sets of filenames with wildcard characters (`*`, `?`, `[abc]`, `[a-z]`). Supported URI types: `azureml`, `https`, `wasbs`, `abfss`, and `adl`. See Core YAML syntax for more information about how to use the `azureml://` URI format. | `file`, `folder`, `pattern` | |
| `transformations` | array | A defined transformation sequence, applied to data loaded from the defined paths. Read Transformations for more information. | `read_delimited`, `read_parquet`, `read_json_lines`, `read_delta_lake`, `take`, `take_random_sample`, `drop_columns`, `keep_columns`, `convert_column_types`, `skip`, `filter`, `extract_columns_from_partition_format` | |
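For illustration, here's a minimal sketch showing the three kinds of `paths` entries; the file and folder names are hypothetical, and in practice you'd typically use just one kind per table:

```yaml
type: mltable
paths:
  - file: ./data/titanic.csv   # a single file (hypothetical name)
  - folder: ./archive/         # every file in a folder
  - pattern: ./logs/*.csv      # all CSV files matching the glob
```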
Transformations
Read transformations
| Read Transformation | Description | Parameters |
| --- | --- | --- |
| `read_delimited` | Adds a transformation step to read the delimited text file(s) provided in `paths`. | `infer_column_types`: Boolean to infer column data types. Defaults to True. Type inference requires that the current compute can access the data source. Currently, type inference only pulls the first 200 rows.<br>`encoding`: Specify the file encoding. Supported encodings: `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom`, and `windows1252`. Default encoding: `utf8`.<br>`header`: Choose one of the following options: `no_header`, `from_first_file`, `all_files_different_headers`, `all_files_same_headers`. Defaults to `all_files_same_headers`.<br>`delimiter`: The separator used to split columns.<br>`empty_as_string`: Specify whether empty field values should load as empty strings. The default (False) reads empty field values as nulls. Passing this setting as True reads empty field values as empty strings. If the values are converted to numeric or datetime, this setting has no effect, because empty values are converted to nulls.<br>`include_path_column`: Boolean to keep path information as a column in the table. Defaults to False. This setting is useful when you read multiple files and want to know from which file a specific record originated. Additionally, you can keep useful information in the file path.<br>`support_multi_line`: By default (`support_multi_line=False`), all line breaks, including line breaks in quoted field values, are interpreted as a record break. This approach to data reading increases speed, and it offers optimization for parallel execution on multiple CPU cores. However, it might result in silent production of more records with misaligned field values. Set this value to True when the delimited files are known to contain quoted line breaks. |
| `read_parquet` | Adds a transformation step to read the Parquet-formatted file(s) provided in `paths`. | `include_path_column`: Boolean to keep path information as a table column. Defaults to False. This setting is useful when you read multiple files and want to know from which file a specific record originated. Additionally, you can keep useful information in the file path. |
| `read_delta_lake` | Adds a transformation step to read a Delta Lake folder provided in `paths`. You can read the data at a particular timestamp or version. | `timestamp_as_of`: String. Timestamp to be specified for time travel on the specific Delta Lake data. To read data at a specific point in time, the datetime string should have an RFC-3339/ISO-8601 format (for example: "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00", "2022-10-01T01:30:00-08:00").<br>`version_as_of`: Integer. Version to be specified for time travel on the specific Delta Lake data.<br>One value of `timestamp_as_of` or `version_as_of` must be provided. |
| `read_json_lines` | Adds a transformation step to read the JSON Lines file(s) provided in `paths`. | `include_path_column`: Boolean to keep path information as a column in the MLTable. Defaults to False. This setting is useful when you read multiple files and want to know from which file a particular record originated. Additionally, you can keep useful information in the file path.<br>`invalid_lines`: How to handle lines that contain invalid JSON. Supported values: `error` and `drop`. Defaults to `error`.<br>`encoding`: Specify the file encoding. Supported encodings: `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom`, and `windows1252`. Defaults to `utf8`. |
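As an illustration of these options, here's a sketch of a `read_delimited` step for CSV files known to contain quoted line breaks; all option values are illustrative rather than required:

```yaml
transformations:
  - read_delimited:
      delimiter: ','
      header: all_files_same_headers
      support_multi_line: true   # quoted field values contain line breaks
      infer_column_types: true   # the compute must be able to reach the data
      empty_as_string: false     # empty fields load as nulls
```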
Other transformations
| Transformation | Description | Parameters | Example(s) |
| --- | --- | --- | --- |
| `convert_column_types` | Adds a transformation step to convert the specified columns into their respective specified new types. | `columns`: An array of column names to convert.<br>`column_type`: The type into which you want to convert (`int`, `float`, `string`, `boolean`, `datetime`). | `- convert_column_types:`<br>`    - columns: [Age]`<br>`      column_type: int`<br>Convert the Age column to integer.<br><br>`- convert_column_types:`<br>`    - columns: date`<br>`      column_type:`<br>`        datetime:`<br>`          formats:`<br>`          - "%d/%m/%Y"`<br>Convert the date column to the format `dd/mm/yyyy`. Read `to_datetime` for more information about datetime conversion.<br><br>`- convert_column_types:`<br>`    - columns: [is_weekday]`<br>`      column_type:`<br>`        boolean:`<br>`          true_values: ["yes", "true", "1"]`<br>`          false_values: ["no", "false", "0"]`<br>Convert the is_weekday column to a boolean; yes/true/1 values in the column map to True, and no/false/0 values in the column map to False. Read `to_bool` for more information about boolean conversion. |
| `drop_columns` | Adds a transformation step to remove the specified columns from the dataset. | An array of column names to drop | `- drop_columns: ["col1", "col2"]` |
| `keep_columns` | Adds a transformation step to keep the specified columns and remove all others from the dataset. | An array of column names to keep | `- keep_columns: ["col1", "col2"]` |
| `extract_columns_from_partition_format` | Adds a transformation step to use the partition information of each path and extract it into columns based on the specified partition format. | The partition format to use | `- extract_columns_from_partition_format: {column_name:yyyy/MM/dd/HH/mm/ss}` creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm', and 'ss' are used to extract year, month, day, hour, minute, and second values for the datetime type. |
| `filter` | Filter the data, leaving only the records that match the specified expression. | An expression as a string | `- filter: 'col("temperature") > 32 and col("location") == "UK"'`<br>Leave only rows where the temperature exceeds 32 and the location is the UK. |
| `skip` | Adds a transformation step to skip the first count rows of this MLTable. | A count of the number of rows to skip | `- skip: 10`<br>Skip the first 10 rows. |
| `take` | Adds a transformation step to select the first count rows of this MLTable. | A count of the number of rows from the top of the table to take | `- take: 5`<br>Take the first five rows. |
| `take_random_sample` | Adds a transformation step to randomly select each row of this MLTable, with probability chance. | `probability`: The probability of selecting an individual row. Must be in the range [0, 1].<br>`seed`: Optional random seed. | `- take_random_sample:`<br>`    probability: 0.10`<br>`    seed: 123`<br>Take a 10 percent random sample of rows, using a random seed of 123. |
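Transformations apply in the order they're listed, so sequence them deliberately. For example, here's a minimal sketch (the dropped column name is hypothetical) that reads a CSV, removes a column, and then samples the remainder:

```yaml
transformations:
  - read_delimited:           # 1. read the raw rows
      delimiter: ','
  - drop_columns: ["notes"]   # 2. remove a column (hypothetical name)
  - take_random_sample:       # 3. sample what remains
      probability: 0.10
      seed: 123
```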
Examples
This section provides examples of MLTable use. More examples are available in the azureml-examples GitHub repository.
Quickstart
In this quickstart, you'll read the famous iris dataset from a public HTTPS server. An MLTable file must be located in a folder, so first create the folder and the MLTable file:
```bash
mkdir ./iris
cd ./iris
touch ./MLTable
```
Next, add the following contents to the MLTable file:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable
paths:
  - file: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
transformations:
  - read_delimited:
      delimiter: ','
      header: all_files_same_headers
      include_path_column: true
```
You can then materialize the table into a Pandas dataframe using:
Important
You must have the mltable Python SDK installed. Install it with:

```bash
pip install mltable
```
```python
import mltable

# Load the MLTable file from the ./iris folder
tbl = mltable.load("./iris")

# Materialize the table into a Pandas dataframe
df = tbl.to_pandas_dataframe()
```
You should see that the data includes a new column named `Path`. This column contains the data path, which is `https://azuremlexamples.blob.core.windows.net/datasets/iris.csv`.
You can create a data asset using the CLI:
```azurecli
az ml data create --name iris-from-https --version 1 --type mltable --path ./iris
```
The folder containing the MLTable file automatically uploads to cloud storage (the default Azure Machine Learning datastore).
Tip
An Azure ML data asset is similar to web browser bookmarks (favorites). Instead of remembering long URIs (storage paths) that point to your most frequently used data, you can create a data asset, and then access that asset with a friendly name.
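Once registered, jobs can consume the asset by name and version instead of the underlying URI. As a sketch, a job YAML input referencing the asset created above might look like this (the input name `iris` is illustrative):

```yaml
inputs:
  iris:
    type: mltable
    path: azureml:iris-from-https:1
```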
Delimited text files
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:
# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>

paths:
  - file: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/ # a specific file on ADLS
  # additional options:
  # - folder: ./<folder> # a specific folder
  # - pattern: ./*.csv # glob all the csv files in a folder

transformations:
  - read_delimited:
      encoding: ascii
      header: all_files_same_headers
      delimiter: ","
      include_path_column: true
      empty_as_string: false
  - keep_columns: [col1, col2, col3, col4, col5, col6, col7]
  # or you can drop_columns...
  # - drop_columns: [col1, col2, col3, col4, col5, col6, col7]
  - convert_column_types:
      - columns: col1
        column_type: int
      - columns: col2
        column_type:
          datetime:
            formats:
              - "%d/%m/%Y"
      - columns: [col1, col2, col3]
        column_type:
          boolean:
            mismatch_as: error
            true_values: ["yes", "true", "1"]
            false_values: ["no", "false", "0"]
  - filter: 'col("col1") > 32 and col("col7") == "a_string"'
  # create a column called timestamp with the values extracted from the folder information
  - extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}
  - skip: 10
  - take_random_sample:
      probability: 0.50
      seed: 1394
  # or you can take the first n records
  # - take: 200
```
Parquet
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:
# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>

paths:
  - pattern: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>/*.parquet

transformations:
  - read_parquet:
      include_path_column: false
  - filter: 'col("temperature") > 32 and col("location") == "UK"'
  - skip: 1000 # skip the first 1000 rows
  # create a column called timestamp with the values extracted from the folder information
  - extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}
```
Delta Lake
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:
# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>

paths:
  - folder: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/

# NOTE: for read_delta_lake, you are *required* to provide either
# timestamp_as_of OR version_as_of.
# The timestamp should be in RFC-3339/ISO-8601 format (for example:
# "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00",
# "2022-10-01T01:30:00-08:00").
# To read the latest data, set timestamp_as_of to a point in the future
# (for example: '2999-08-26T00:00:00Z').
transformations:
  - read_delta_lake:
      timestamp_as_of: '2022-08-26T00:00:00Z'
      # alternative:
      # version_as_of: 1
```
JSON
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable
paths:
  - file: ./order_invalid.jsonl
transformations:
  - read_json_lines:
      encoding: utf8
      invalid_lines: drop
      include_path_column: false
```