SparkJob Class

A standalone Spark job.

Inheritance

SparkJob derives from:

azure.ai.ml.entities._job.job.Job
azure.ai.ml.entities._job.parameterized_spark.ParameterizedSpark
azure.ai.ml.entities._job.job_io_mixin.JobIOMixin
azure.ai.ml.entities._job.spark_job_entry_mixin.SparkJobEntryMixin

Constructor

SparkJob(*, driver_cores: int | str | None = None, driver_memory: str | None = None, executor_cores: int | str | None = None, executor_memory: str | None = None, executor_instances: int | str | None = None, dynamic_allocation_enabled: bool | str | None = None, dynamic_allocation_min_executors: int | str | None = None, dynamic_allocation_max_executors: int | str | None = None, inputs: Dict[str, Input | str | bool | int | float] | None = None, outputs: Dict[str, Output] | None = None, compute: str | None = None, identity: Dict[str, str] | ManagedIdentityConfiguration | AmlTokenConfiguration | UserIdentityConfiguration | None = None, resources: Dict | SparkResourceConfiguration | None = None, **kwargs: Any)

Keyword-Only Parameters

Name Description
driver_cores

The number of cores to use for the driver process, only in cluster mode.

driver_memory

The amount of memory to use for the driver process, formatted as a string with a size unit suffix ("k", "m", "g" or "t") (e.g. "512m", "2g").

executor_cores

The number of cores to use on each executor.

executor_memory

The amount of memory to use per executor process, formatted as a string with a size unit suffix ("k", "m", "g" or "t") (e.g. "512m", "2g").

executor_instances

The initial number of executors.

dynamic_allocation_enabled

Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.

dynamic_allocation_min_executors

The lower bound for the number of executors if dynamic allocation is enabled.

dynamic_allocation_max_executors

The upper bound for the number of executors if dynamic allocation is enabled.

inputs

The mapping of input data bindings used in the job.

outputs

The mapping of output data bindings used in the job.

compute

The compute resource the job runs on.

identity
Optional[Union[dict[str, str], ManagedIdentityConfiguration, AmlTokenConfiguration, UserIdentityConfiguration]]

The identity that the Spark job will use while running on compute.

resources
Optional[Union[Dict, SparkResourceConfiguration]]

The compute resource configuration for the job.

Examples

Configuring a SparkJob.


   from azure.ai.ml import Input, Output
   from azure.ai.ml.entities import SparkJob

   spark_job = SparkJob(
       code="./sdk/ml/azure-ai-ml/tests/test_configs/dsl_pipeline/spark_job_in_pipeline/basic_src",
       entry={"file": "sampleword.py"},
       conf={
           "spark.driver.cores": 2,
           "spark.driver.memory": "1g",
           "spark.executor.cores": 1,
           "spark.executor.memory": "1g",
           "spark.executor.instances": 1,
       },
       environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:33",
       inputs={
           "input1": Input(
               type="uri_file", path="azureml://datastores/workspaceblobstore/paths/python/data.csv", mode="direct"
           )
       },
       compute="synapsecompute",
       outputs={"component_out_path": Output(type="uri_folder")},
        args="--input1 ${{inputs.input1}} --output2 ${{outputs.component_out_path}}",
   )
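
A second, hedged sketch of the same kind of job, this time configured through the keyword-only parameters rather than a raw conf dictionary, with dynamic allocation, a user identity, and an explicit resource configuration for serverless Spark compute. The code path, entry file, instance type, and runtime version below are illustrative placeholders, not values from the sample above.


    from azure.ai.ml import Input
    from azure.ai.ml.entities import SparkJob, SparkResourceConfiguration, UserIdentityConfiguration

    spark_job = SparkJob(
        code="./src",  # placeholder: local folder containing the entry file
        entry={"file": "main.py"},  # placeholder entry file
        driver_cores=1,
        driver_memory="2g",
        executor_cores=2,
        executor_memory="2g",
        executor_instances=2,
        # Let Spark scale the executor count between the two bounds based on workload.
        dynamic_allocation_enabled=True,
        dynamic_allocation_min_executors=1,
        dynamic_allocation_max_executors=4,
        # Run the job with the identity of the submitting user.
        identity=UserIdentityConfiguration(),
        # Serverless Spark compute, used instead of an attached Synapse compute;
        # instance_type and runtime_version are illustrative.
        resources=SparkResourceConfiguration(instance_type="Standard_E8S_V3", runtime_version="3.4"),
        environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:33",
        inputs={
            "input1": Input(
                type="uri_file", path="azureml://datastores/workspaceblobstore/paths/python/data.csv", mode="direct"
            )
        },
    )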


Methods

dump

Dumps the job content into a file in YAML format.

filter_conf_fields

Filters out the fields of the conf attribute that are not among the Spark configuration fields listed in azure.ai.ml._schema.job.parameterized_spark.CONF_KEY_MAP and returns them in their own dictionary.

dump

Dumps the job content into a file in YAML format.

dump(dest: str | PathLike | IO, **kwargs: Any) -> None

Parameters

Name Description
dest
Required
Union[PathLike, str, IO[AnyStr]]

The local path or file stream to write the YAML content to. If dest is a file path, a new file will be created. If dest is an open file, the file will be written to directly.

Exceptions

Type Description
FileExistsError

Raised if dest is a file path and the file already exists.

IOError

Raised if dest is an open file and the file is not writable.
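
As a usage sketch, assuming the spark_job object configured in the example above:


    # Write the job's YAML representation to a new local file.
    # Raises FileExistsError if the file already exists.
    spark_job.dump("./spark_job.yaml")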

filter_conf_fields

Filters out the fields of the conf attribute that are not among the Spark configuration fields listed in azure.ai.ml._schema.job.parameterized_spark.CONF_KEY_MAP and returns them in their own dictionary.

filter_conf_fields() -> Dict[str, str]

Returns

Type Description
Dict[str, str]

A dictionary of the conf fields that are not Spark configuration fields.
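
A brief sketch of the behavior, assuming a conf dictionary that mixes recognized Spark settings with one custom key (the custom key name here is hypothetical):


    spark_job.conf = {
        "spark.driver.cores": 2,   # a recognized Spark configuration field
        "my.custom.flag": "true",  # hypothetical non-Spark entry
    }
    extra = spark_job.filter_conf_fields()
    # extra holds only the entries that are not Spark configuration fields,
    # e.g. {"my.custom.flag": "true"}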


Attributes

base_path

The base path of the resource.

Returns

Type Description
str

The base path of the resource.

creation_context

The creation context of the resource.

Returns

Type Description
Optional[SystemData]

The creation metadata for the resource.

entry

The entry point for the Spark job.

environment

The Azure ML environment to run the Spark component or job in.

Returns

Type Description
Optional[Union[Environment, str]]

The Azure ML environment to run the Spark component or job in.

id

The resource ID.

Returns

Type Description
Optional[str]

The global ID of the resource, an Azure Resource Manager (ARM) ID.

identity

The identity that the Spark job will use while running on compute.

Returns

Type Description
Optional[Union[<xref:azure.ai.ml.ManagedIdentityConfiguration>, <xref:azure.ai.ml.AmlTokenConfiguration>, <xref:azure.ai.ml.UserIdentityConfiguration>]]

The identity that the Spark job will use while running on compute.

inputs

The mapping of input data bindings used in the job.

log_files

Job output files.

Returns

Type Description
Optional[Dict[str, str]]

The dictionary of log names and URLs.

outputs

The mapping of output data bindings used in the job.

resources

The compute resource configuration for the job.

Returns

Type Description
Optional[Union[Dict, SparkResourceConfiguration]]

The compute resource configuration for the job.

status

The status of the job.

Common values returned include "Running", "Completed", and "Failed". All possible values are:

  • NotStarted - This is a temporary state that client-side Run objects are in before cloud submission.
  • Starting - The Run has started being processed in the cloud. The caller has a run ID at this point.
  • Provisioning - On-demand compute is being created for a given job submission.
  • Preparing - The run environment is being prepared and is in one of two stages:
    • Docker image build
    • conda environment setup
  • Queued - The job is queued on the compute target. For example, in BatchAI, the job is in a queued state while waiting for all the requested nodes to be ready.
  • Running - The job has started to run on the compute target.
  • Finalizing - User code execution has completed, and the run is in post-processing stages.
  • CancelRequested - Cancellation has been requested for the job.
  • Completed - The run has completed successfully. This includes both the user code execution and run post-processing stages.
  • Failed - The run failed. Usually the Error property on a run will provide details as to why.
  • Canceled - Follows a cancellation request and indicates that the run is now successfully cancelled.
  • NotResponding - For runs that have Heartbeats enabled, no heartbeat has been recently sent.

Returns

Type Description
Optional[str]

Status of the job.
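
For example, after submitting the job (a sketch assuming an authenticated MLClient named ml_client):


    # Submit the job and read back its current status (names here are illustrative).
    returned_job = ml_client.jobs.create_or_update(spark_job)
    print(returned_job.status)  # e.g. "Starting", later "Running" or "Completed"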

studio_url

Azure ML studio endpoint.

Returns

Type Description
Optional[str]

The URL to the job details page.

type

The type of the job.

Returns

Type Description
Optional[str]

The type of the job.

CODE_ID_RE_PATTERN

CODE_ID_RE_PATTERN = re.compile('\\/subscriptions\\/(?P<subscription>[\\w,-]+)\\/resourceGroups\\/(?P<resource_group>[\\w,-]+)\\/providers\\/Microsoft\\.MachineLearningServices\\/workspaces\\/(?P<workspace>[\\w,-]+)\\/codes\\/(?P<code>[\\w,-]+)\\/versions\\/(?P<version>[\\w,-]+)')