OutputTabularDatasetConfig Class

Represents how to copy the output of a run and promote it as a TabularDataset.

Initialize an OutputTabularDatasetConfig.

Constructor

OutputTabularDatasetConfig(**kwargs)

Remarks

Do not call this constructor directly. Instead, create an OutputFileDatasetConfig and call one of its read_* methods to convert it into an OutputTabularDatasetConfig.

The output of an OutputTabularDatasetConfig is copied to the destination in the same way as for an OutputFileDatasetConfig. The difference is that the Dataset that is created is a TabularDataset containing all the specified transformations.
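The conversion described above can be sketched as follows. This is a minimal illustration, assuming the azureml-core package is installed and that the run writes delimited (CSV) files. The helper name to_tabular_output is hypothetical and not part of the SDK; read_delimited_files is one of the read_* methods on OutputFileDatasetConfig.

```python
def to_tabular_output(file_output):
    """Convert a file output config into a tabular output config.

    `file_output` is expected to be an OutputFileDatasetConfig.
    The helper name is illustrative only; it is not an SDK function.
    """
    # Calling a read_* method yields the tabular config, so the
    # OutputTabularDatasetConfig constructor is never called directly.
    return file_output.read_delimited_files()


# Usage (requires the azureml-core package):
# from azureml.data import OutputFileDatasetConfig
# tabular_output = to_tabular_output(OutputFileDatasetConfig(name="prepared"))
```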

Methods

as_input	Specify how to consume the output as an input in subsequent pipeline steps.
as_mount	Set the mode of the output to mount. For mount mode, the output directory will be a FUSE mounted directory. Files written to the mounted directory will be uploaded when the file is closed.
as_upload	Set the mode of the output to upload. For upload mode, files written to the output directory will be uploaded at the end of the job. If the job fails or gets canceled, then the output directory will not be uploaded.
drop_columns	Drop the specified columns from the Dataset.
keep_columns	Keep the specified columns and drop all others from the Dataset.
random_split	Split records in the dataset into two parts randomly and approximately by the percentage specified. The resulting output configs are renamed: the first has _1 appended to its name and the second has _2 appended. If this causes a name collision, or if you want custom names, set the names manually.

as_input

Specify how to consume the output as an input in subsequent pipeline steps.

as_input(name=None)

Parameters

Name	Description
name Optional	str The name of the input specific to the run. Defaults to None.

Returns

Type	Description
DatasetConsumptionConfig	A DatasetConsumptionConfig instance describing how to deliver the input data.
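As a rough sketch of how the returned DatasetConsumptionConfig might be handed to a later step, the helper below builds an argument list for the consuming script. The helper name next_step_arguments, the "--input-data" flag, and the input name "prepared_data" are all hypothetical.

```python
def next_step_arguments(tabular_output, input_name="prepared_data"):
    """Build script arguments that consume `tabular_output` as an input.

    `tabular_output` is expected to be an OutputTabularDatasetConfig.
    The "--input-data" flag is a made-up argument for the consuming script.
    """
    # as_input() returns a DatasetConsumptionConfig describing how the
    # data is delivered to the step that receives these arguments.
    return ["--input-data", tabular_output.as_input(name=input_name)]


# Usage (requires azureml-pipeline-steps): pass the list as the
# `arguments` parameter of the subsequent step, for example
# PythonScriptStep(..., arguments=next_step_arguments(tabular_output)).
```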

as_mount

Set the mode of the output to mount.

For mount mode, the output directory will be a FUSE mounted directory. Files written to the mounted directory will be uploaded when the file is closed.

as_mount()

Returns

Type	Description
OutputTabularDatasetConfig	An OutputTabularDatasetConfig instance with mode set to mount.

as_upload

Set the mode of the output to upload.

For upload mode, files written to the output directory will be uploaded at the end of the job. If the job fails or gets canceled, then the output directory will not be uploaded.

as_upload(overwrite=False, source_globs=None)

Parameters

Name	Description
overwrite Optional	bool Whether to overwrite files that already exist in the destination. Defaults to False.
source_globs Optional	list[str] Glob patterns used to filter files that will be uploaded. Defaults to None.

Returns

Type	Description
OutputTabularDatasetConfig	An OutputTabularDatasetConfig instance with mode set to upload.
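The two output modes can be contrasted in a short sketch. The helper name choose_output_mode and its upload_at_end flag are hypothetical; only as_mount and as_upload, with the signatures listed above, come from this class.

```python
def choose_output_mode(output, upload_at_end=True, overwrite=False,
                       source_globs=None):
    """Return `output` configured for upload mode or mount mode.

    `output` is expected to be an OutputTabularDatasetConfig.
    """
    if upload_at_end:
        # Upload mode: files are uploaded once, at the end of the job,
        # and nothing is uploaded if the job fails or is canceled.
        return output.as_upload(overwrite=overwrite, source_globs=source_globs)
    # Mount mode: each file is uploaded when it is closed.
    return output.as_mount()


# Usage: upload only CSV files and replace existing ones.
# configured = choose_output_mode(tabular_output, overwrite=True,
#                                 source_globs=["*.csv"])
```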

drop_columns

Drop the specified columns from the Dataset.

drop_columns(columns)

Parameters

Name	Description
columns Required	Union[str, list[str]] The name or a list of names for the columns to drop.

Returns

Type	Description
OutputTabularDatasetConfig	An OutputTabularDatasetConfig instance that records which columns to drop.

keep_columns

Keep the specified columns and drop all others from the Dataset.

keep_columns(columns)

Parameters

Name	Description
columns Required	Union[str, list[str]] The name or a list of names for the columns to keep.

Returns

Type	Description
OutputTabularDatasetConfig	An OutputTabularDatasetConfig instance that records which columns to keep.
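A small sketch of column selection with keep_columns and drop_columns follows. The helper name trim_columns and the column names in the usage comment are hypothetical. Both methods accept either a single name or a list of names, and the sketch uses the returned instance rather than assuming the original is modified in place.

```python
def trim_columns(tabular_output, keep=None, drop=None):
    """Apply optional keep/drop column selections to a tabular output.

    `tabular_output` is expected to be an OutputTabularDatasetConfig.
    `keep` and `drop` may each be a str or a list of str.
    """
    if keep is not None:
        # Keep only the named columns; all others are dropped.
        tabular_output = tabular_output.keep_columns(keep)
    if drop is not None:
        # Drop the named columns from whatever remains.
        tabular_output = tabular_output.drop_columns(drop)
    return tabular_output


# Usage with made-up column names:
# trimmed = trim_columns(tabular_output, keep=["age", "income", "label"])
# trimmed = trim_columns(tabular_output, drop="row_id")
```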

random_split

Split records in the dataset into two parts randomly and approximately by the percentage specified.

The resulting output configs are renamed: the first has _1 appended to its name and the second has _2 appended. If this causes a name collision, or if you want custom names, set the names manually.

random_split(percentage, seed=None)

Parameters

Name	Description
percentage Required	float The approximate percentage to split the dataset by. This must be a number between 0.0 and 1.0.
seed Optional	int The seed to use for the random generator. Defaults to None.

Returns

Type	Description
tuple(OutputTabularDatasetConfig, OutputTabularDatasetConfig)	Returns a tuple of two OutputTabularDatasetConfig objects representing the two Datasets after the split.
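The split and the manual renaming mentioned above might look like the sketch below. The helper name split_output and its names parameter are hypothetical, and the renaming step assumes that the name attribute of the returned configs can be assigned directly, which this page does not confirm.

```python
def split_output(tabular_output, percentage, seed=None, names=None):
    """Split a tabular output in two and optionally rename the parts.

    `tabular_output` is expected to be an OutputTabularDatasetConfig.
    `names`, if given, is a pair of custom names for the two parts.
    """
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0.0 and 1.0")
    # By default the parts are named <name>_1 and <name>_2.
    first, second = tabular_output.random_split(percentage, seed=seed)
    if names is not None:
        # Assumption: `name` is a plain, assignable attribute.
        first.name, second.name = names
    return first, second


# Usage with made-up names:
# train_cfg, test_cfg = split_output(tabular_output, 0.8, seed=42,
#                                    names=("train", "test"))
```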
