OutputFileDatasetConfig Class
Represents how to copy the output of a run and promote it as a FileDataset.
The OutputFileDatasetConfig allows you to specify how you want a particular local path on the compute target to be uploaded to the specified destination. If no arguments are passed to the constructor, we will automatically generate a name, a destination, and a local path.
An example of using the defaults, where no arguments are passed to the constructor:
from azureml.core import Workspace, Experiment, ScriptRunConfig
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
experiment = Experiment(workspace, 'output_example')
output = OutputFileDatasetConfig()
script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])
run = experiment.submit(script_run_config)
print(run)
An example of creating an output, promoting it to a tabular dataset, and registering it with the name foo:
from azureml.core import Workspace, Experiment, Datastore, ScriptRunConfig
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
experiment = Experiment(workspace, 'output_example')
datastore = Datastore(workspace, 'example_adls_gen2_datastore')
# For more information on the parameters and methods, see the corresponding documentation.
output = OutputFileDatasetConfig().read_delimited_files().register_on_complete('foo')
script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])
run = experiment.submit(script_run_config)
print(run)
- Inheritance
  - OutputDatasetConfig
    - OutputFileDatasetConfig
Constructor
OutputFileDatasetConfig(name=None, destination=None, source=None, partition_format=None)
Parameters
- name
- str
The name of the output specific to this run. This is generally used for lineage purposes. If set to None, we will automatically generate a name. The name also becomes an environment variable containing the local path where you can write your output files and folders, which will be uploaded to the destination.
- destination
- tuple
The destination to copy the output to. If set to None, we will copy the output to the workspaceblobstore datastore, under the path /dataset/{run-id}/{output-name}, where run-id is the run's ID and output-name is the output name from the name parameter above. The destination is a tuple where the first item is the datastore and the second item is the path within the datastore to copy the data to.
The path within the datastore can be a template path. A template path is a regular path with placeholders inside, which are resolved at the appropriate time. The placeholder syntax is {placeholder}, for example, /path/with/{placeholder}. Currently only two placeholders are supported, {run-id} and {output-name}. See the constructor sketch after this parameter list for an example.
- source
- str
The path within the compute target to copy the data from. If set to None, we will set this to a directory we create inside the compute target's OS temporary directory.
- partition_format
- str
Specifies the partition format of the path. Defaults to None. The partition information of each path will be extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract the year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and continue to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'. The constructor sketch after this parameter list shows this parameter in use.
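As a minimal sketch, the constructor call below combines a template destination path with a partition format. The workspaceblobstore datastore is the workspace default mentioned above; the output name, destination path, and partition layout are placeholders, not values prescribed by this API:
from azureml.core import Workspace, Datastore
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
datastore = Datastore(workspace, 'workspaceblobstore')

# {run-id} and {output-name} are resolved by the service when the run executes.
output = OutputFileDatasetConfig(
    name='model_scores',
    destination=(datastore, 'scores/{run-id}/{output-name}'),
    # Used when the output is later promoted to a tabular dataset: extracts a string
    # column 'Department' and a datetime column 'PartitionDate' from the written paths.
    partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet')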
Remarks
You can pass the OutputFileDatasetConfig as an argument to your run, and it will be automatically translated into a local path on the compute. The source argument will be used if one is specified; otherwise we will automatically generate a directory in the OS's temp folder. The files and folders inside the source directory will then be copied to the destination based on the output configuration.
By default, the output is copied to the destination storage in mount mode. For more information about mount mode, please see the documentation for as_mount.
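On the compute target, the output argument resolves to a local directory path, so the training script only needs to write files under the path it receives. The following is a minimal sketch of what a train.py like the one referenced in the examples above might do; the file name and contents are illustrative:
# train.py (illustrative)
import os
import sys

# The OutputFileDatasetConfig argument arrives as a resolved local directory path.
output_dir = sys.argv[1]
os.makedirs(output_dir, exist_ok=True)

# Anything written under this directory is uploaded to the configured destination.
with open(os.path.join(output_dir, 'predictions.csv'), 'w') as f:
    f.write('id,score\n1,0.9\n')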
Methods
- as_input
Specify how to consume the output as an input in subsequent pipeline steps.
- as_mount
Set the mode of the output to mount. For mount mode, the output directory will be a FUSE mounted directory. Files written to the mounted directory will be uploaded when the file is closed.
- as_upload
Set the mode of the output to upload. For upload mode, files written to the output directory will be uploaded at the end of the job. If the job fails or gets canceled, the output directory will not be uploaded.
as_input
Specify how to consume the output as an input in subsequent pipeline steps.
as_input(name=None)
Parameters
- name
- str
The name of the input specific to the run. Defaults to None.
Returns
A DatasetConsumptionConfig instance describing how to deliver the input data.
Return type
DatasetConsumptionConfig
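As a sketch of typical usage, the snippet below wires one pipeline step's output into the next step as an input. It assumes PythonScriptStep and Pipeline from the azureml-pipeline packages; the step names, script names, and compute target name are placeholders:
from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

workspace = Workspace.from_config()

prepared = OutputFileDatasetConfig(name='prepared_data')

prep_step = PythonScriptStep(
    name='prepare',
    script_name='prepare.py',
    arguments=[prepared],
    compute_target='cpu-cluster',
    source_directory='.')

train_step = PythonScriptStep(
    name='train',
    script_name='train.py',
    # Consume the previous step's output as an input, downloaded to the compute.
    arguments=[prepared.as_input(name='prepared_data').as_download()],
    compute_target='cpu-cluster',
    source_directory='.')

pipeline = Pipeline(workspace, steps=[prep_step, train_step])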
as_mount
Set the mode of the output to mount.
For mount mode, the output directory will be a FUSE mounted directory. Files written to the mounted directory will be uploaded when the file is closed.
as_mount(disable_metadata_cache=False)
Parameters
- disable_metadata_cache
- bool
Whether to cache metadata on the local node. If disabled, a node will not be able to see files generated by other nodes while the job is running.
Returns
An OutputFileDatasetConfig instance with mode set to mount.
Return type
OutputFileDatasetConfig
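A brief sketch of setting mount mode on an output; the output name and destination path are placeholders:
from azureml.core import Workspace, Datastore
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
datastore = Datastore(workspace, 'workspaceblobstore')

# With mount mode, files written to the mounted output directory are uploaded as each file is closed.
output = OutputFileDatasetConfig(
    name='streamed_logs',
    destination=(datastore, 'logs/{run-id}')).as_mount()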
as_upload
Set the mode of the output to upload.
For upload mode, files written to the output directory will be uploaded at the end of the job. If the job fails or gets canceled, then the output directory will not be uploaded.
as_upload(overwrite=False, source_globs=None)
Parameters
- overwrite
- bool
Whether to overwrite files that already exist in the destination. Defaults to False.
- source_globs
- list[str]
Glob patterns used to filter the files that will be uploaded. Defaults to None.
Returns
An OutputFileDatasetConfig instance with mode set to upload.
Return type
OutputFileDatasetConfig
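A brief sketch of setting upload mode and filtering the uploaded files with glob patterns; the output name, destination path, and glob are placeholders:
from azureml.core import Workspace, Datastore
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
datastore = Datastore(workspace, 'workspaceblobstore')

# With upload mode, files under the output directory are uploaded once, when the job completes successfully.
output = OutputFileDatasetConfig(
    name='model_artifacts',
    destination=(datastore, 'models/{run-id}')).as_upload(
        overwrite=True, source_globs=['*.pkl'])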