DatasetConsumptionConfig Class

Represents how to deliver the dataset to a compute target.

Inheritance
builtins.object
DatasetConsumptionConfig

Constructor

DatasetConsumptionConfig(name, dataset, mode='direct', path_on_compute=None)

Parameters

Name Description
name
Required
str

The name of the dataset in the run, which can be different from the registered name. The name will be registered as an environment variable and can be used in the data plane.

dataset
Required

The dataset to be consumed in the run. This can be a Dataset object, a PipelineParameter that ingests a dataset, a tuple of (workspace, dataset name), or a tuple of (workspace, dataset name, dataset version). If only a name is provided, the DatasetConsumptionConfig will use the latest version of the dataset.

mode
str

Defines how the dataset should be delivered to the compute target. There are four modes:

  1. 'direct': consume the dataset as a dataset object.
  2. 'download': download the dataset and consume it as a downloaded path.
  3. 'mount': mount the dataset and consume it as a mount path.
  4. 'hdfs': consume the dataset from a resolved HDFS path (currently only supported on SynapseSpark compute).
default value: direct
path_on_compute
str

The target path on the compute at which to make the data available. The folder structure of the source data will be kept; however, prefixes might be added to this folder structure to avoid collisions. Use tabular_dataset.to_path to see the output folder structure.

default value: None
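
A minimal usage sketch (the workspace config, the dataset name 'titanic', and the script name below are illustrative assumptions, not part of this API): construct the configuration directly and pass it as a script argument.

   from azureml.core import Dataset, ScriptRunConfig, Workspace
   from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

   # Assumed: a workspace config file on disk and a registered FileDataset named 'titanic'.
   workspace = Workspace.from_config()
   dataset = Dataset.get_by_name(workspace, name='titanic')

   # Deliver the dataset by downloading it under ./data/titanic on the compute target.
   dataset_input = DatasetConsumptionConfig(name='input_1', dataset=dataset,
                                            mode='download', path_on_compute='data/titanic')

   script_run_config = ScriptRunConfig(source_directory='.', script='train.py',
                                       arguments=[dataset_input])
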
Methods

as_download
Set the mode to download.

as_hdfs
Set the mode to hdfs (currently only supported on SynapseSpark compute).

as_mount
Set the mode to mount.

as_download

Set the mode to download.

In the submitted run, files in the dataset will be downloaded to a local path on the compute target. The download location can be retrieved from the argument values and from the input_datasets field of the run context.


   from azureml.core import Dataset, ScriptRunConfig
   from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
   from azureml.pipeline.core import PipelineParameter

   # Assumes an Experiment object `experiment` and a `source_directory` are already defined.
   file_dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
   file_pipeline_param = PipelineParameter(name="file_ds_param", default_value=file_dataset)
   dataset_input = DatasetConsumptionConfig("input_1", file_pipeline_param).as_download()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # The following sample code runs in the context of the submitted run:

   # The download location can be retrieved from argument values
   import sys
   download_location = sys.argv[1]

   # The download location can also be retrieved from input_datasets of the run context.
   from azureml.core import Run
   download_location = Run.get_context().input_datasets['input_1']
as_download(path_on_compute=None)

Parameters

Name Description
path_on_compute
str

The target path on the compute to make the data available at.

default value: None

Remarks

When the dataset is created from the path of a single file, the download location will be the path of the single downloaded file. Otherwise, the download location will be the path of the enclosing folder for all the downloaded files.

If path_on_compute starts with a /, it will be treated as an absolute path. If it doesn't start with a /, it will be treated as a path relative to the working directory. If you specify an absolute path, make sure that the job has permission to write to that directory.
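
As a sketch of the path_on_compute parameter (reusing file_pipeline_param from the example above; 'data/titanic' is an illustrative relative path under the working directory):

   # Download the files under <working directory>/data/titanic on the compute target.
   dataset_input = DatasetConsumptionConfig("input_1", file_pipeline_param).as_download(
       path_on_compute='data/titanic')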

as_hdfs

Set the mode to hdfs.

In the submitted Synapse run, files in the dataset will be converted to a local path on the compute target. The HDFS path can be retrieved from the argument values and from OS environment variables.


   file_dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
   file_pipeline_param = PipelineParameter(name="file_ds_param", default_value=file_dataset)
   dataset_input = DatasetConsumptionConfig("input_1", file_pipeline_param).as_hdfs()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # The following sample code runs in the context of the submitted run:

   # The hdfs path can be retrieved from argument values
   import sys
   hdfs_path = sys.argv[1]

   # The hdfs path can also be retrieved from the environment variables of the run.
   import os
   hdfs_path = os.environ['input_1']
as_hdfs()

Remarks

When the dataset is created from the path of a single file, the hdfs path will be the path of the single file. Otherwise, the hdfs path will be the path of the enclosing folder for all the files.

as_mount

Set the mode to mount.

In the submitted run, files in the dataset will be mounted to a local path on the compute target. The mount point can be retrieved from the argument values and from the input_datasets field of the run context.


   file_dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
   file_pipeline_param = PipelineParameter(name="file_ds_param", default_value=file_dataset)
   dataset_input = DatasetConsumptionConfig("input_1", file_pipeline_param).as_mount()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # The following sample code runs in the context of the submitted run:

   # The mount point can be retrieved from argument values
   import sys
   mount_point = sys.argv[1]

   # The mount point can also be retrieved from input_datasets of the run context.
   from azureml.core import Run
   mount_point = Run.get_context().input_datasets['input_1']
as_mount(path_on_compute=None)

Parameters

Name Description
path_on_compute
str

The target path on the compute to make the data available at.

default value: None

Remarks

When the dataset is created from the path of a single file, the mount point will be the path of the single mounted file. Otherwise, the mount point will be the path of the enclosing folder for all the mounted files.

If path_on_compute starts with a /, it will be treated as an absolute path. If it doesn't start with a /, it will be treated as a path relative to the working directory. If you specify an absolute path, make sure that the job has permission to write to that directory.
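
As a sketch (again reusing file_pipeline_param from the example above; '/tmp/titanic' is an illustrative absolute path, so the job must have permission to write there):

   # Mount the files at the absolute path /tmp/titanic on the compute target.
   dataset_input = DatasetConsumptionConfig("input_1", file_pipeline_param).as_mount(
       path_on_compute='/tmp/titanic')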

Attributes

name

Name of the input.

Returns

Type Description
str

Name of the input.
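
For illustration (assuming the dataset_input defined in the earlier examples), the attribute returns the name given in the constructor, which is also the lookup key in the run context's input_datasets:

   print(dataset_input.name)   # 'input_1'
   # The same name is the key used inside the submitted run:
   # Run.get_context().input_datasets['input_1']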