Version and track Azure Machine Learning datasets
APPLIES TO: Python SDK azureml v1
In this article, you'll learn how to version and track Azure Machine Learning datasets for reproducibility. Dataset versioning is a way to bookmark the state of your data so that you can apply a specific version of the dataset for future experiments.
Typical versioning scenarios:
- When new data is available for retraining
- When you're applying different data preparation or feature engineering approaches
For this tutorial, you need:
import azureml.core from azureml.core import Workspace ws = Workspace.from_config()
Register and retrieve dataset versions
By registering a dataset, you can version, reuse, and share it across experiments and with colleagues. You can register multiple datasets under the same name and retrieve a specific version by name and version number.
Register a dataset version
The following code registers a new version of the
titanic_ds dataset by setting the
create_new_version parameter to
True. If there's no existing
titanic_ds dataset registered with the workspace, the code creates a new dataset with the name
titanic_ds and sets its version to 1.
titanic_ds = titanic_ds.register(workspace = workspace, name = 'titanic_ds', description = 'titanic training data', create_new_version = True)
Retrieve a dataset by name
By default, the get_by_name() method on the
Dataset class returns the latest version of the dataset registered with the workspace.
The following code gets version 1 of the
from azureml.core import Dataset # Get a dataset by name and version number titanic_ds = Dataset.get_by_name(workspace = workspace, name = 'titanic_ds', version = 1)
Versioning best practice
When you create a dataset version, you're not creating an extra copy of data with the workspace. Because datasets are references to the data in your storage service, you have a single source of truth, managed by your storage service.
If the data referenced by your dataset is overwritten or deleted, calling a specific version of the dataset does not revert the change.
When you load data from a dataset, the current data content referenced by the dataset is always loaded. If you want to make sure that each dataset version is reproducible, we recommend that you not modify data content referenced by the dataset version. When new data comes in, save new data files into a separate data folder and then create a new dataset version to include data from that new folder.
The following image and sample code show the recommended way to structure your data folders and to create dataset versions that reference those folders:
from azureml.core import Dataset # get the default datastore of the workspace datastore = workspace.get_default_datastore() # create & register weather_ds version 1 pointing to all files in the folder of week 27 datastore_path1 = [(datastore, 'Weather/week 27')] dataset1 = Dataset.File.from_files(path=datastore_path1) dataset1.register(workspace = workspace, name = 'weather_ds', description = 'weather data in week 27', create_new_version = True) # create & register weather_ds version 2 pointing to all files in the folder of week 27 and 28 datastore_path2 = [(datastore, 'Weather/week 27'), (datastore, 'Weather/week 28')] dataset2 = Dataset.File.from_files(path = datastore_path2) dataset2.register(workspace = workspace, name = 'weather_ds', description = 'weather data in week 27, 28', create_new_version = True)
Version an ML pipeline output dataset
You can use a dataset as the input and output of each ML pipeline step. When you rerun pipelines, the output of each pipeline step is registered as a new dataset version.
ML pipelines populate the output of each step into a new folder every time the pipeline reruns. This behavior allows the versioned output datasets to be reproducible. Learn more about datasets in pipelines.
from azureml.core import Dataset from azureml.pipeline.steps import PythonScriptStep from azureml.pipeline.core import Pipeline, PipelineData from azureml.core. runconfig import CondaDependencies, RunConfiguration # get input dataset input_ds = Dataset.get_by_name(workspace, 'weather_ds') # register pipeline output as dataset output_ds = PipelineData('prepared_weather_ds', datastore=datastore).as_dataset() output_ds = output_ds.register(name='prepared_weather_ds', create_new_version=True) conda = CondaDependencies.create( pip_packages=['azureml-defaults', 'azureml-dataprep[fuse,pandas]'], pin_sdk_version=False) run_config = RunConfiguration() run_config.environment.docker.enabled = True run_config.environment.python.conda_dependencies = conda # configure pipeline step to use dataset as the input and output prep_step = PythonScriptStep(script_name="prepare.py", inputs=[input_ds.as_named_input('weather_ds')], outputs=[output_ds], runconfig=run_config, compute_target=compute_target, source_directory=project_folder)
Track data in your experiments
Azure Machine Learning tracks your data throughout your experiment as input and output datasets.
The following are scenarios where your data is tracked as an input dataset.
DatasetConsumptionConfigobject through either the
argumentsparameter of your
ScriptRunConfigobject when submitting the experiment job.
When methods like, get_by_name() or get_by_id() are called in your script. For this scenario, the name assigned to the dataset when you registered it to the workspace is the name displayed.
The following are scenarios where your data is tracked as an output dataset.
OutputFileDatasetConfigobject through either the
argumentsparameter when submitting an experiment job.
OutputFileDatasetConfigobjects can also be used to persist data between pipeline steps. See Move data between ML pipeline steps.
Register a dataset in your script. For this scenario, the name assigned to the dataset when you registered it to the workspace is the name displayed. In the following example,
training_dsis the name that would be displayed.
training_ds = unregistered_ds.register(workspace = workspace, name = 'training_ds', description = 'training data' )
Submit child job with an unregistered dataset in script. This results in an anonymous saved dataset.
Trace datasets in experiment jobs
For each Machine Learning experiment, you can easily trace the datasets used as input with the experiment
The following code uses the
get_details() method to track which input datasets were used with the experiment run:
# get input datasets inputs = run.get_details()['inputDatasets'] input_dataset = inputs['dataset'] # list the files referenced by input_dataset input_dataset.to_path()
You can also find the
input_datasets from experiments by using the Azure Machine Learning studio.
The following image shows where to find the input dataset of an experiment on Azure Machine Learning studio. For this example, go to your Experiments pane and open the Properties tab for a specific run of your experiment,
Use the following code to register models with datasets:
model = run.register_model(model_name='keras-mlp-mnist', model_path=model_path, datasets =[('training data',train_dataset)])
After registration, you can see the list of models registered with the dataset by using Python or go to the studio.
The following view is from the Datasets pane under Assets. Select the dataset and then select the Models tab for a list of the models that are registered with the dataset.