PyTorch Class
Represents an estimator for training in PyTorch experiments.
DEPRECATED. Use the ScriptRunConfig object with your own defined environment or one of the Azure ML PyTorch curated environments. For an introduction to configuring PyTorch experiment runs with ScriptRunConfig, see Train PyTorch models at scale with Azure Machine Learning.
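For reference, here is a minimal migration sketch using ScriptRunConfig with a curated PyTorch environment; the workspace config, compute target name, curated environment name, and script arguments below are assumptions to adapt to your own workspace:

```python
# Minimal ScriptRunConfig migration sketch (replaces the deprecated estimator).
# Assumptions: a workspace config file is present, a compute target named
# "gpu-cluster" exists, and the curated environment named below is available.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()
env = Environment.get(ws, name="AzureML-PyTorch-1.6-GPU")  # curated PyTorch environment

src = ScriptRunConfig(
    source_directory="./src",      # folder containing the training code
    script="train.py",             # entry script, relative to source_directory
    arguments=["--epochs", 30],    # command-line arguments for the script
    compute_target="gpu-cluster",  # hypothetical AmlCompute cluster name
    environment=env,
)

run = Experiment(ws, "pytorch-train").submit(src)
run.wait_for_completion(show_output=True)
```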
Supported versions: 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
Initialize a PyTorch estimator.
- Inheritance
- azureml.train.estimator._framework_base_estimator._FrameworkBaseEstimator
  - PyTorch
Constructor
PyTorch(source_directory, *, compute_target=None, vm_size=None, vm_priority=None, entry_script=None, script_params=None, node_count=1, process_count_per_node=1, distributed_backend=None, distributed_training=None, use_gpu=False, use_docker=True, custom_docker_base_image=None, custom_docker_image=None, image_registry_details=None, user_managed=False, conda_packages=None, pip_packages=None, conda_dependencies_file_path=None, pip_requirements_file_path=None, conda_dependencies_file=None, pip_requirements_file=None, environment_variables=None, environment_definition=None, inputs=None, source_directory_data_store=None, shm_size=None, resume_from=None, max_run_duration_seconds=None, framework_version=None, _enable_optimized_mode=False, _disable_validation=True, _show_lint_warnings=False, _show_package_warnings=False)
Parameters
- compute_target
- AbstractComputeTarget or str
The compute target where training will happen. This can either be an object or the string "local".
- vm_size
- str
The VM size of the compute target that will be created for the training. Supported values: Any Azure VM size.
- vm_priority
- str
The VM priority of the compute target that will be created for the training. If not specified, 'dedicated' is used.
Supported values: 'dedicated' and 'lowpriority'.
This takes effect only when the vm_size parameter is specified in the input.
- script_params
- dict
A dictionary of command-line arguments to pass to the training script specified in entry_script.
- node_count
- int
The number of nodes in the compute target used for training. If greater than 1, an MPI distributed job will be run. Only the AmlCompute target is supported for distributed jobs.
- process_count_per_node
- int
The number of processes per node. If greater than 1, an MPI distributed job will be run. Only the AmlCompute target is supported for distributed jobs.
- distributed_backend
- str
The communication backend for distributed training.
DEPRECATED. Use the distributed_training parameter.
Supported values: 'mpi', 'gloo', and 'nccl'.
'mpi': MPI/Horovod. 'gloo', 'nccl': native PyTorch distributed training.
This parameter is required when node_count or process_count_per_node > 1. When node_count == 1 and process_count_per_node == 1, no backend will be used unless the backend is explicitly set. Only the AmlCompute target is supported for distributed training.
- distributed_training
- Mpi or Gloo or Nccl
Parameters for running a distributed training job.
For running a distributed job with the MPI backend, use an Mpi object to specify process_count_per_node. For running a distributed job with the Gloo backend, use Gloo. For running a distributed job with the NCCL backend, use Nccl.
- use_gpu
- bool
Specifies whether the environment to run the experiment should support GPUs.
If true, a GPU-based default Docker image will be used in the environment; if false, a CPU-based image will be used. Default Docker images (CPU or GPU) are used only if the custom_docker_image parameter is not set. This setting is used only in Docker-enabled compute targets.
- use_docker
- bool
Specifies whether the environment to run the experiment should be Docker-based.
- custom_docker_base_image
- str
The name of the Docker image from which the image to use for training will be built.
DEPRECATED. Use the custom_docker_image parameter.
If not set, a default CPU-based image will be used as the base image.
- custom_docker_image
- str
The name of the Docker image from which the image to use for training will be built. If not set, a default CPU-based image will be used as the base image.
- user_managed
- bool
Specifies whether Azure ML reuses an existing Python environment. If false, Azure ML will create a Python environment based on the conda dependencies specification.
- conda_packages
- list
A list of strings representing conda packages to be added to the Python environment for the experiment.
- pip_packages
- list
A list of strings representing pip packages to be added to the Python environment for the experiment.
- conda_dependencies_file_path
- str
The relative path to the conda dependencies yaml file.
If specified, Azure ML will not install any framework related packages.
DEPRECATED. Use the conda_dependencies_file parameter.
- pip_requirements_file_path
- str
The relative path to the pip requirements text file.
This can be provided in combination with the pip_packages parameter.
DEPRECATED. Use the pip_requirements_file parameter.
- conda_dependencies_file
- str
The relative path to the conda dependencies yaml file. If specified, Azure ML will not install any framework related packages.
- pip_requirements_file
- str
The relative path to the pip requirements text file.
This can be provided in combination with the pip_packages parameter.
- environment_variables
- dict
A dictionary of environment variable names and values. These environment variables are set on the process where the user script is being executed.
- environment_definition
- Environment
The environment definition for the experiment. It includes PythonSection, DockerSection, and environment variables. Any environment option not directly exposed through other parameters to the Estimator construction can be set using this parameter. If this parameter is specified, it will take precedence over other environment-related parameters like use_gpu, custom_docker_image, conda_packages, or pip_packages. Errors will be reported on invalid combinations of parameters.
- shm_size
- str
The size of the Docker container's shared memory block. If not set, the default azureml.core.environment._DEFAULT_SHM_SIZE is used. For more information, see Docker run reference.
- resume_from
- DataPath
The data path containing the checkpoint or model files from which to resume the experiment.
- max_run_duration_seconds
- int
The maximum allowed time for the run. Azure ML will attempt to automatically cancel the run if it takes longer than this value.
- framework_version
- str
The PyTorch version to be used for executing training code.
PyTorch.get_supported_versions() returns a list of the versions supported by the current SDK.
- _enable_optimized_mode
- bool
Enable incremental environment build with pre-built framework images for faster environment preparation. A pre-built framework image is built on top of Azure ML default CPU/GPU base images with framework dependencies pre-installed.
- _disable_validation
- bool
Disable script validation before run submission. The default is True.
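As an illustration, here is a minimal construction sketch for this (deprecated) estimator; the compute target name, entry script, and package pin are assumptions:

```python
# Sketch of constructing the deprecated PyTorch estimator.
# Assumptions: ./src contains train.py and "gpu-cluster" is an existing
# AmlCompute target in the workspace.
from azureml.train.dnn import PyTorch

estimator = PyTorch(
    source_directory="./src",
    entry_script="train.py",
    script_params={"--epochs": 30},  # forwarded to train.py on the command line
    compute_target="gpu-cluster",    # hypothetical AmlCompute cluster name
    framework_version="1.6",         # one of PyTorch.get_supported_versions()
    use_gpu=True,                    # selects the GPU default Docker image
    pip_packages=["pillow==6.2.1"],  # example extra pip dependency
)
```

The resulting estimator is submitted with Experiment.submit, the same way a ScriptRunConfig is.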
Remarks
When submitting a training job, Azure ML runs your script in a conda environment within a Docker container. The PyTorch containers have the following dependencies installed.
| Dependencies | PyTorch 1.0/1.1/1.2/1.3 | PyTorch 1.4/1.5/1.6 |
| --- | --- | --- |
| Python | 3.6.2 | 3.6.2 |
| CUDA (GPU image only) | 10.0 | 10.1 |
| cuDNN (GPU image only) | 7.6.3 | 7.6.3 |
| NCCL (GPU image only) | 2.4.8 | 2.4.8 |
| azureml-defaults | Latest | Latest |
| OpenMpi | 3.1.2 | 3.1.2 |
| horovod | 0.18.1 | 0.18.1/0.19.1/0.19.5 |
| miniconda | 4.5.11 | 4.5.11 |
| torch | 1.0/1.1/1.2/1.3.1 | 1.4.0/1.5.0/1.6.0 |
| torchvision | 0.4.1 | 0.5.0 |
| git | 2.7.4 | 2.7.4 |
| tensorboard | 1.14 | 1.14 |
| future | 0.17.1 | 0.17.1 |
The Docker images extend Ubuntu 16.04.
To install additional dependencies, you can either use the pip_packages or conda_packages parameter, or specify the pip_requirements_file or conda_dependencies_file parameter. Alternatively, you can build your own image and pass it through the custom_docker_image parameter to the estimator constructor.
For more information about Docker containers used in PyTorch training, see https://github.com/Azure/AzureML-Containers.
The PyTorch estimator supports distributed training across CPU and GPU clusters using Horovod, an open-source all-reduce framework for distributed training. For examples and more information about using PyTorch in distributed training, see the tutorial Train and register PyTorch models at scale with Azure Machine Learning.
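For instance, here is a hedged sketch of a native PyTorch distributed configuration using the Nccl backend; the cluster name and node count are assumptions, and Gloo() or Mpi() follow the same pattern:

```python
# Sketch of a distributed run using native PyTorch distributed training
# over NCCL. Assumption: "gpu-cluster" is a multi-node AmlCompute target.
from azureml.train.dnn import Nccl, PyTorch

estimator = PyTorch(
    source_directory="./src",
    entry_script="train.py",
    compute_target="gpu-cluster",  # hypothetical multi-node GPU cluster
    node_count=2,                  # node_count > 1 makes this a distributed job
    distributed_training=Nccl(),   # Gloo() for CPU clusters; Mpi() for Horovod
    framework_version="1.6",
    use_gpu=True,
)
```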
Attributes
DEFAULT_VERSION
DEFAULT_VERSION = '1.4'
FRAMEWORK_NAME
FRAMEWORK_NAME = 'PyTorch'