TransformationMixin Class

This class provides transformation capabilities to output datasets.

Inheritance
builtins.object
TransformationMixin

Constructor

TransformationMixin()

Methods

read_delimited_files

Transform the output dataset to a tabular dataset by reading all the output as delimited files.

read_parquet_files

Transform the output dataset to a tabular dataset by reading all the output as Parquet files.

The tabular dataset is created by parsing the Parquet file(s) pointed to by the intermediate output.

read_delimited_files

Transform the output dataset to a tabular dataset by reading all the output as delimited files.

read_delimited_files(include_path=False, separator=',', header=PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS, partition_format=None, path_glob=None, set_column_types=None)

Parameters

Name Description
include_path
Required

Boolean to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or to keep useful information encoded in the file path.

separator
Required
str

The separator used to split columns.

header
Required

Controls how column headers are promoted when reading from files. Defaults to assuming that all files have the same header.

partition_format
Required
str

The partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and run to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.
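The extraction described above can be sketched in plain Python. This is an illustration only (the actual parsing is done by the service); the regex below hand-translates the example format '/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet':

```python
import re
from datetime import datetime

# Example path from the documentation above.
path = "/Accounts/2019/01/01/data.parquet"

# '{Department}' becomes a string capture group; the 'yyyy/MM/dd' parts of
# '{PartitionDate:...}' become numeric captures combined into a datetime.
pattern = r"/(?P<Department>[^/]+)/(?P<y>\d{4})/(?P<M>\d{2})/(?P<d>\d{2})/data\.parquet"

m = re.search(pattern, path)
columns = {
    "Department": m.group("Department"),
    "PartitionDate": datetime(int(m.group("y")), int(m.group("M")), int(m.group("d"))),
}
print(columns)  # Department='Accounts', PartitionDate=2019-01-01 00:00:00
```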

path_glob
Required
str

A glob-like pattern to filter files that will be read as delimited files. If set to None, then all files will be read as delimited files.

Glob is a Unix style pathname pattern expansion: https://docs.python.org/3/library/glob.html

Examples:

  • *.csv -> selects files with the .csv file extension
  • test_*.csv -> selects files whose names start with test_ and have the .csv file extension
  • /myrootdir/project_one/*/*/*.txt -> selects files that are two subdirectories deep in /myrootdir/project_one/ and have the .txt file extension

Note: Using the ** pattern in large directory trees may consume an inordinate amount of time. In general, for large directory trees, being more specific in the glob pattern can increase performance.
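The pattern semantics can be illustrated with the standard library. This is a sketch with invented file names; pathlib's PurePosixPath.match is used because, like the glob patterns above, its '*' does not cross directory separators:

```python
from pathlib import PurePosixPath

# Invented file paths, just to demonstrate which patterns select what.
files = [
    "data.csv",
    "test_part1.csv",
    "notes.txt",
    "/myrootdir/project_one/a/readme.txt",
    "/myrootdir/project_one/a/b/deep.txt",
]

csv_files = [f for f in files if PurePosixPath(f).match("*.csv")]
test_csvs = [f for f in files if PurePosixPath(f).match("test_*.csv")]
two_deep = [f for f in files if PurePosixPath(f).match("/myrootdir/project_one/*/*/*.txt")]

print(csv_files)  # ['data.csv', 'test_part1.csv']
print(test_csvs)  # ['test_part1.csv']
print(two_deep)   # ['/myrootdir/project_one/a/b/deep.txt']
```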

set_column_types
Required

A dictionary to set column data types, where the key is the column name and the value is a DataType. Columns not in the dictionary will remain of type string. Passing None will result in no conversions. Entries for columns not found in the source data will not cause an error and will be ignored.
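The rules above (unlisted columns stay strings; entries for absent columns are ignored) can be mimicked in plain Python. This is an illustration only: the real parameter takes azureml DataType values (e.g. DataType.to_long()), not Python callables.

```python
# Toy stand-in for set_column_types: callables play the role of DataType values.
rows = [{"id": "1", "score": "3.5", "name": "a"}]
set_column_types = {"id": int, "score": float, "missing_col": int}

converted = [
    {
        col: (set_column_types[col](val) if col in set_column_types else val)
        for col, val in row.items()
    }
    for row in rows
]
print(converted)  # [{'id': 1, 'score': 3.5, 'name': 'a'}]
# 'name' stays a string; 'missing_col' is absent from the data and simply ignored.
```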

Returns

Type Description

An OutputTabularDatasetConfig instance with instructions on how to convert the output into a TabularDataset.
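A typical call might look as follows. This is a sketch assuming the azureml-core OutputFileDatasetConfig API, which mixes in this class; the output name 'scores' and the column 'score' are illustrative:

```python
from azureml.data import OutputFileDatasetConfig
from azureml.data.dataset_factory import DataType

# An intermediate output that a pipeline step writes delimited files to.
output = OutputFileDatasetConfig(name="scores")

# Ask for the step's output to be read back as a tabular dataset of CSVs.
tabular = output.read_delimited_files(
    include_path=True,
    separator=",",
    path_glob="*.csv",
    set_column_types={"score": DataType.to_float()},
)
```

The returned OutputTabularDatasetConfig can then be passed as input to a downstream step, which receives it as a TabularDataset.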

read_parquet_files

Transform the output dataset to a tabular dataset by reading all the output as Parquet files.

The tabular dataset is created by parsing the Parquet file(s) pointed to by the intermediate output.

read_parquet_files(include_path=False, partition_format=None, path_glob=None, set_column_types=None)

Parameters

Name Description
include_path
Required

Boolean to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or to keep useful information encoded in the file path.

partition_format
Required
str

The partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and run to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.

path_glob
Required
str

A glob-like pattern to filter files that will be read as parquet files. If set to None, then all files will be read as parquet files.

Glob is a Unix style pathname pattern expansion: https://docs.python.org/3/library/glob.html

Examples:

  • *.parquet -> selects files with the .parquet file extension
  • test_*.parquet -> selects files whose names start with test_ and have the .parquet file extension
  • /myrootdir/project_one/*/*/*.parquet -> selects files that are two subdirectories deep in /myrootdir/project_one/ and have the .parquet file extension

Note: Using the ** pattern in large directory trees may consume an inordinate amount of time. In general, for large directory trees, being more specific in the glob pattern can increase performance.

set_column_types
Required

A dictionary to set column data types, where the key is the column name and the value is a DataType. Columns not in the dictionary will retain the type loaded from the Parquet file. Passing None will result in no conversions. Entries for columns not found in the source data will not cause an error and will be ignored.

Returns

Type Description

An OutputTabularDatasetConfig instance with instructions on how to convert the output into a TabularDataset.