TransformationMixin Class

This class provides transformation capabilities to output datasets.

Inheritance
builtins.object
TransformationMixin

Constructor

TransformationMixin()

Methods

read_delimited_files

Transform the output dataset to a tabular dataset by reading all the output as delimited files.

read_parquet_files

Transform the output dataset to a tabular dataset by reading all the output as Parquet files.

The tabular dataset is created by parsing the Parquet file(s) pointed to by the intermediate output.

read_delimited_files

Transform the output dataset to a tabular dataset by reading all the output as delimited files.

read_delimited_files(include_path=False, separator=',', header=PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS, partition_format=None, path_glob=None, set_column_types=None)

Parameters

Name Description
include_path
Required

Boolean to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or to keep useful information encoded in the file path.

separator
Required
str

The separator used to split columns.

header
Required

Controls how column headers are promoted when reading from files. Defaults to assuming that all files have the same header.

partition_format
Required
str

The partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and run to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.
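The extraction described above can be sketched in plain Python. This is an illustration only (the actual parsing is done by the service); the regex below hand-translates the example format '/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet':

```python
import re
from datetime import datetime

# Example path from the documentation above.
path = "/Accounts/2019/01/01/data.parquet"

# '{Department}' becomes a string capture group; the 'yyyy/MM/dd' parts of
# '{PartitionDate:...}' become numeric captures combined into a datetime.
pattern = r"/(?P<Department>[^/]+)/(?P<y>\d{4})/(?P<M>\d{2})/(?P<d>\d{2})/data\.parquet"

m = re.search(pattern, path)
columns = {
    "Department": m.group("Department"),
    "PartitionDate": datetime(int(m.group("y")), int(m.group("M")), int(m.group("d"))),
}
print(columns)  # Department='Accounts', PartitionDate=2019-01-01 00:00:00
```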

path_glob
Required
str

A glob-like pattern to filter files that will be read as delimited files. If set to None, then all files will be read as delimited files.

Glob is a Unix style pathname pattern expansion: https://docs.python.org/3/library/glob.html

Examples:

  • *.csv -> selects files with the .csv file extension
  • test_*.csv -> selects files whose names start with test_ and have the .csv file extension
  • /myrootdir/project_one/*/*/*.txt -> selects files that are two subdirectories deep in /myrootdir/project_one/ and have the .txt file extension

Note: Using the ** pattern in large directory trees may consume an inordinate amount of time. In general, for large directory trees, being more specific in the glob pattern can increase performance.
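The pattern semantics can be illustrated with the standard library. This is a sketch with invented file names; pathlib's PurePosixPath.match is used because, like the glob patterns above, its '*' does not cross directory separators:

```python
from pathlib import PurePosixPath

# Invented file paths, just to demonstrate which patterns select what.
files = [
    "data.csv",
    "test_part1.csv",
    "notes.txt",
    "/myrootdir/project_one/a/readme.txt",
    "/myrootdir/project_one/a/b/deep.txt",
]

csv_files = [f for f in files if PurePosixPath(f).match("*.csv")]
test_csvs = [f for f in files if PurePosixPath(f).match("test_*.csv")]
two_deep = [f for f in files if PurePosixPath(f).match("/myrootdir/project_one/*/*/*.txt")]

print(csv_files)  # ['data.csv', 'test_part1.csv']
print(test_csvs)  # ['test_part1.csv']
print(two_deep)   # ['/myrootdir/project_one/a/b/deep.txt']
```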

set_column_types
Required

A dictionary to set column data types, where the key is the column name and the value is a DataType. Columns not in the dictionary will remain of type string. Passing None will result in no conversions. Entries for columns not found in the source data will not cause an error and will be ignored.
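The rules above (unlisted columns stay strings; entries for absent columns are ignored) can be mimicked in plain Python. This is an illustration only: the real parameter takes azureml DataType values (e.g. DataType.to_long()), not Python callables.

```python
# Toy stand-in for set_column_types: callables play the role of DataType values.
rows = [{"id": "1", "score": "3.5", "name": "a"}]
set_column_types = {"id": int, "score": float, "missing_col": int}

converted = [
    {
        col: (set_column_types[col](val) if col in set_column_types else val)
        for col, val in row.items()
    }
    for row in rows
]
print(converted)  # [{'id': 1, 'score': 3.5, 'name': 'a'}]
# 'name' stays a string; 'missing_col' is absent from the data and simply ignored.
```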

Returns

Type Description

An OutputTabularDatasetConfig instance with instructions on how to convert the output into a TabularDataset.
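A typical call might look as follows. This is a sketch assuming the azureml-core OutputFileDatasetConfig API, which mixes in this class; the output name 'scores' and the column 'score' are illustrative:

```python
from azureml.data import OutputFileDatasetConfig
from azureml.data.dataset_factory import DataType

# An intermediate output that a pipeline step writes delimited files to.
output = OutputFileDatasetConfig(name="scores")

# Ask for the step's output to be read back as a tabular dataset of CSVs.
tabular = output.read_delimited_files(
    include_path=True,
    separator=",",
    path_glob="*.csv",
    set_column_types={"score": DataType.to_float()},
)
```

The returned OutputTabularDatasetConfig can then be passed as input to a downstream step, which receives it as a TabularDataset.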

read_parquet_files

Transform the output dataset to a tabular dataset by reading all the output as Parquet files.

The tabular dataset is created by parsing the Parquet file(s) pointed to by the intermediate output.

read_parquet_files(include_path=False, partition_format=None, path_glob=None, set_column_types=None)

Parameters

Name Description
include_path
Required

Boolean to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or to keep useful information encoded in the file path.

partition_format
Required
str

The partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and run to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.

path_glob
Required
str

A glob-like pattern to filter files that will be read as parquet files. If set to None, then all files will be read as parquet files.

Glob is a Unix style pathname pattern expansion: https://docs.python.org/3/library/glob.html

Examples:

  • *.parquet -> selects files with the .parquet file extension
  • test_*.parquet -> selects files whose names start with test_ and have the .parquet file extension
  • /myrootdir/project_one/*/*/*.parquet -> selects files that are two subdirectories deep in /myrootdir/project_one/ and have the .parquet file extension

Note: Using the ** pattern in large directory trees may consume an inordinate amount of time. In general, for large directory trees, being more specific in the glob pattern can increase performance.

set_column_types
Required

A dictionary to set column data types, where the key is the column name and the value is a DataType. Columns not in the dictionary will retain the type loaded from the Parquet file. Passing None will result in no conversions. Entries for columns not found in the source data will not cause an error and will be ignored.

Returns

Type Description

An OutputTabularDatasetConfig instance with instructions on how to convert the output into a TabularDataset.