TransformationMixin Class
This class provides transformation capabilities to output datasets.
Inheritance
builtins.object -> TransformationMixin
Constructor
TransformationMixin()
Methods
Method | Description |
---|---|
read_delimited_files | Transform the output dataset to a tabular dataset by reading all the output as delimited files. |
read_parquet_files | Transform the output dataset to a tabular dataset by reading all the output as Parquet files. The tabular dataset is created by parsing the Parquet file(s) pointed to by the intermediate output. |
read_delimited_files
Transform the output dataset to a tabular dataset by reading all the output as delimited files.
read_delimited_files(include_path=False, separator=',', header=PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS, partition_format=None, path_glob=None, set_column_types=None)
Parameters
Name | Description |
---|---|
include_path | Boolean indicating whether to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or want to keep useful information from the file path. |
separator | The separator used to split columns. Defaults to ','. |
header | Controls how column headers are promoted when reading from files. Defaults to PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS, which assumes that all files have the same header. |
partition_format | Specifies the partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and run to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'. |
path_glob | A glob-like pattern to filter the files that will be read as delimited files. Defaults to None, in which case all files are read as delimited files. Glob is a Unix-style pathname pattern expansion: https://docs.python.org/3/library/glob.html. Note: Using the ** pattern in large directory trees may consume an inordinate amount of time. In general, for large directory trees, a more specific glob pattern can improve performance. |
set_column_types | A dictionary that sets column data types, where the key is the column name and the value is a DataType. Defaults to None, which results in no conversions. Columns not in the dictionary remain of type string. Entries for columns not found in the source data do not cause an error and are ignored. |
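The partition_format rule above can be made concrete with a small pure-Python sketch. This is not the SDK's implementation: the helper name `extract_partitions`, the `TOKEN_TO_STRPTIME` table, and the regex translation are all assumptions made for illustration, and the sketch only handles column names that are valid Python identifiers.

```python
import re
from datetime import datetime

# Datetime tokens named in the parameter description, mapped to strptime directives.
TOKEN_TO_STRPTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H", "mm": "%M", "ss": "%S"}
TOKEN_RE = re.compile("|".join(TOKEN_TO_STRPTIME))


def extract_partitions(path, partition_format):
    """Hypothetical helper: return {column: value} parsed from the tail of `path`."""
    regex_parts = []
    strptime_formats = {}
    # Splitting on a capturing group puts literal text at even indexes
    # and the contents of each '{...}' field at odd indexes.
    for index, piece in enumerate(re.split(r"\{([^{}]+)\}", partition_format)):
        if index % 2 == 0:
            regex_parts.append(re.escape(piece))
            continue
        name, _, date_format = piece.partition(":")
        if date_format:
            # '{column_name:yyyy/MM/dd}' -> datetime column
            strptime_formats[name] = TOKEN_RE.sub(
                lambda m: TOKEN_TO_STRPTIME[m.group()], date_format
            )
            body = TOKEN_RE.sub(lambda m: r"\d+", re.escape(date_format))
            regex_parts.append(f"(?P<{name}>{body})")
        else:
            # '{column_name}' -> string column (one path segment)
            regex_parts.append(f"(?P<{name}>[^/]+)")
    # The format runs to the end of the file path, so anchor the match there.
    match = re.search("".join(regex_parts) + "$", path)
    if match is None:
        return {}
    return {
        name: datetime.strptime(value, strptime_formats[name])
        if name in strptime_formats
        else value
        for name, value in match.groupdict().items()
    }


# The example from the parameter description: 'Department' becomes the string
# 'Accounts' and 'PartitionDate' becomes a datetime for 2019-01-01.
columns = extract_partitions(
    "../Accounts/2019/01/01/data.parquet",
    "/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet",
)
```

A path that does not end with the described layout yields an empty dictionary in this sketch; how the SDK treats such paths is not stated here.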
Returns
Type | Description |
---|---|
OutputTabularDatasetConfig | An OutputTabularDatasetConfig instance with instructions for converting the output into a TabularDataset. |
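As a rough illustration of a call, the sketch below passes a few of the documented keyword arguments. It assumes `output_config` is an object that mixes in TransformationMixin; how such an object is created is not covered on this page, so the helper name `output_as_tabular` and the tab-separated file layout are hypothetical.

```python
def output_as_tabular(output_config):
    """Hypothetical helper: `output_config` is any object exposing read_delimited_files."""
    return output_config.read_delimited_files(
        include_path=True,      # keep the source file path as a column
        separator="\t",         # tab-separated instead of the default ','
        path_glob="*.tsv",      # only read files matching this pattern
        set_column_types=None,  # no conversions, so columns stay strings
    )
```

The `header` and `partition_format` arguments are left at their defaults here. Per the Returns table above, the result is an OutputTabularDatasetConfig rather than the data itself.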
read_parquet_files
Transform the output dataset to a tabular dataset by reading all the output as Parquet files.
The tabular dataset is created by parsing the parquet file(s) pointed to by the intermediate output.
read_parquet_files(include_path=False, partition_format=None, path_glob=None, set_column_types=None)
Parameters
Name | Description |
---|---|
include_path | Boolean indicating whether to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or want to keep useful information from the file path. |
partition_format | Specifies the partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and run to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'. |
path_glob | A glob-like pattern to filter the files that will be read as Parquet files. Defaults to None, in which case all files are read as Parquet files. Glob is a Unix-style pathname pattern expansion: https://docs.python.org/3/library/glob.html. Note: Using the ** pattern in large directory trees may consume an inordinate amount of time. In general, for large directory trees, a more specific glob pattern can improve performance. |
set_column_types | A dictionary that sets column data types, where the key is the column name and the value is a DataType. Defaults to None, which results in no conversions. Columns not in the dictionary keep the type loaded from the Parquet file. Entries for columns not found in the source data do not cause an error and are ignored. |
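The three set_column_types rules above (None means no conversion, unlisted columns keep their loaded type, entries for unknown columns are ignored) can be mimicked on a single row with plain Python. This is only a model of the described behavior: `apply_column_types` is a hypothetical helper, and ordinary callables such as `float` stand in for DataType values.

```python
def apply_column_types(row, set_column_types=None):
    """Hypothetical stand-in that models the documented rules on one row (a dict)."""
    converted = dict(row)
    if set_column_types is None:
        return converted  # passing None results in no conversions
    for column, convert in set_column_types.items():
        if column in converted:  # entries for columns not in the data are ignored
            converted[column] = convert(converted[column])
    return converted


row = {"id": 7, "price": "19.90", "city": "Oslo"}
# 'price' is converted, 'id' and 'city' keep their loaded types,
# and the entry for 'missing_column' is silently skipped.
typed = apply_column_types(row, {"price": float, "missing_column": int})
```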
Returns
Type | Description |
---|---|
OutputTabularDatasetConfig | An OutputTabularDatasetConfig instance with instructions for converting the output into a TabularDataset. |
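To show the kind of filtering path_glob describes, the sketch below applies a glob-style pattern to a list of output paths using the standard library. It is an approximation under stated assumptions: `select_files` and the sample paths are hypothetical, and `PurePosixPath.match` is used as a stand-in, so the SDK's exact matching and anchoring rules may differ.

```python
from pathlib import PurePosixPath


def select_files(paths, path_glob=None):
    """Hypothetical helper: keep only the paths that match a glob-style pattern."""
    if path_glob is None:
        return list(paths)  # None means every file is read
    # A relative pattern is matched against the right-hand end of each path.
    return [p for p in paths if PurePosixPath(p).match(path_glob)]


paths = [
    "out/part-0.parquet",
    "out/part-1.parquet",
    "out/_SUCCESS",
    "out/logs/run.txt",
]
selected = select_files(paths, "part-*.parquet")
```

A narrow pattern such as `part-*.parquet` also follows the performance note above, since it avoids walking a large tree with `**`.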