SparkJob Class

A standalone Spark job.

Inheritance

SparkJob derives from:

azure.ai.ml.entities._job.job.Job
azure.ai.ml.entities._job.parameterized_spark.ParameterizedSpark
azure.ai.ml.entities._job.job_io_mixin.JobIOMixin
azure.ai.ml.entities._job.spark_job_entry_mixin.SparkJobEntryMixin

Constructor

SparkJob(*, driver_cores: int | str | None = None, driver_memory: str | None = None, executor_cores: int | str | None = None, executor_memory: str | None = None, executor_instances: int | str | None = None, dynamic_allocation_enabled: bool | str | None = None, dynamic_allocation_min_executors: int | str | None = None, dynamic_allocation_max_executors: int | str | None = None, inputs: Dict[str, Input | str | bool | int | float] | None = None, outputs: Dict[str, Output] | None = None, compute: str | None = None, identity: Dict[str, str] | ManagedIdentityConfiguration | AmlTokenConfiguration | UserIdentityConfiguration | None = None, resources: Dict | SparkResourceConfiguration | None = None, **kwargs: Any)

Keyword-Only Parameters

Name Description
driver_cores

The number of cores to use for the driver process, only in cluster mode.

driver_memory

The amount of memory to use for the driver process, formatted as a string with a size unit suffix ("k", "m", "g" or "t") (e.g. "512m", "2g").

executor_cores

The number of cores to use on each executor.

executor_memory

The amount of memory to use per executor process, formatted as a string with a size unit suffix ("k", "m", "g" or "t") (e.g. "512m", "2g").

executor_instances

The initial number of executors.

dynamic_allocation_enabled

Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.

dynamic_allocation_min_executors

The lower bound for the number of executors if dynamic allocation is enabled.

dynamic_allocation_max_executors

The upper bound for the number of executors if dynamic allocation is enabled.

inputs

The mapping of input data bindings used in the job.

outputs

The mapping of output data bindings used in the job.

compute

The compute resource the job runs on.

identity
Optional[Union[dict[str, str], ManagedIdentityConfiguration, AmlTokenConfiguration, UserIdentityConfiguration]]

The identity that the Spark job will use while running on compute.

resources
Optional[Union[Dict, SparkResourceConfiguration]]

The compute resource configuration for the job.

Examples

Configuring a SparkJob.


   from azure.ai.ml import Input, Output
   from azure.ai.ml.entities import SparkJob

   spark_job = SparkJob(
       code="./sdk/ml/azure-ai-ml/tests/test_configs/dsl_pipeline/spark_job_in_pipeline/basic_src",
       entry={"file": "sampleword.py"},
       conf={
           "spark.driver.cores": 2,
           "spark.driver.memory": "1g",
           "spark.executor.cores": 1,
           "spark.executor.memory": "1g",
           "spark.executor.instances": 1,
       },
       environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:33",
       inputs={
           "input1": Input(
               type="uri_file", path="azureml://datastores/workspaceblobstore/paths/python/data.csv", mode="direct"
           )
       },
       compute="synapsecompute",
       outputs={"component_out_path": Output(type="uri_folder")},
        args="--input1 ${{inputs.input1}} --output2 ${{outputs.component_out_path}}",
   )
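
A second, hedged sketch of the same kind of job, this time configured through the keyword-only parameters rather than a raw conf dictionary, with dynamic allocation, a user identity, and an explicit resource configuration for serverless Spark compute. The code path, entry file, instance type, and runtime version below are illustrative placeholders, not values from the sample above.


    from azure.ai.ml import Input
    from azure.ai.ml.entities import SparkJob, SparkResourceConfiguration, UserIdentityConfiguration

    spark_job = SparkJob(
        code="./src",  # placeholder: local folder containing the entry file
        entry={"file": "main.py"},  # placeholder entry file
        driver_cores=1,
        driver_memory="2g",
        executor_cores=2,
        executor_memory="2g",
        executor_instances=2,
        # Let Spark scale the executor count between the two bounds based on workload.
        dynamic_allocation_enabled=True,
        dynamic_allocation_min_executors=1,
        dynamic_allocation_max_executors=4,
        # Run the job with the identity of the submitting user.
        identity=UserIdentityConfiguration(),
        # Serverless Spark compute, used instead of an attached Synapse compute;
        # instance_type and runtime_version are illustrative.
        resources=SparkResourceConfiguration(instance_type="Standard_E8S_V3", runtime_version="3.4"),
        environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:33",
        inputs={
            "input1": Input(
                type="uri_file", path="azureml://datastores/workspaceblobstore/paths/python/data.csv", mode="direct"
            )
        },
    )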


Methods

dump

Dumps the job content into a file in YAML format.

filter_conf_fields

Filters out the fields of the conf attribute that are not among the Spark configuration fields listed in azure.ai.ml._schema.job.parameterized_spark.CONF_KEY_MAP and returns them in their own dictionary.

dump

Dumps the job content into a file in YAML format.

dump(dest: str | PathLike | IO, **kwargs: Any) -> None

Parameters

Name Description
dest
Required
Union[PathLike, str, IO[AnyStr]]

The local path or file stream to write the YAML content to. If dest is a file path, a new file will be created. If dest is an open file, the file will be written to directly.

Exceptions

Type Description
FileExistsError

Raised if dest is a file path and the file already exists.

IOError

Raised if dest is an open file and the file is not writable.
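
As a usage sketch, assuming the spark_job object configured in the example above:


    # Write the job's YAML representation to a new local file.
    # Raises FileExistsError if the file already exists.
    spark_job.dump("./spark_job.yaml")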

filter_conf_fields

Filters out the fields of the conf attribute that are not among the Spark configuration fields listed in azure.ai.ml._schema.job.parameterized_spark.CONF_KEY_MAP and returns them in their own dictionary.

filter_conf_fields() -> Dict[str, str]

Returns

Type Description
Dict[str, str]

A dictionary of the conf fields that are not Spark configuration fields.
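
A brief sketch of the behavior, assuming a conf dictionary that mixes recognized Spark settings with one custom key (the custom key name here is hypothetical):


    spark_job.conf = {
        "spark.driver.cores": 2,   # a recognized Spark configuration field
        "my.custom.flag": "true",  # hypothetical non-Spark entry
    }
    extra = spark_job.filter_conf_fields()
    # extra holds only the entries that are not Spark configuration fields,
    # e.g. {"my.custom.flag": "true"}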


Attributes

base_path

The base path of the resource.

Returns

Type Description
str

The base path of the resource.

creation_context

The creation context of the resource.

Returns

Type Description
Optional[SystemData]

The creation metadata for the resource.

entry

The entry point for the Spark job.

environment

The Azure ML environment to run the Spark component or job in.

Returns

Type Description
Optional[Union[Environment, str]]

The Azure ML environment to run the Spark component or job in.

id

The resource ID.

Returns

Type Description
Optional[str]

The global ID of the resource, an Azure Resource Manager (ARM) ID.

identity

The identity that the Spark job will use while running on compute.

Returns

Type Description
Optional[Union[<xref:azure.ai.ml.ManagedIdentityConfiguration>, <xref:azure.ai.ml.AmlTokenConfiguration>, <xref:azure.ai.ml.UserIdentityConfiguration>]]

The identity that the Spark job will use while running on compute.

inputs

The mapping of input data bindings used in the job.

log_files

Job output files.

Returns

Type Description
Optional[Dict[str, str]]

The dictionary of log names and URLs.

outputs

The mapping of output data bindings used in the job.

resources

The compute resource configuration for the job.

Returns

Type Description
Optional[Union[Dict, SparkResourceConfiguration]]

The compute resource configuration for the job.

status

The status of the job.

Common values returned include "Running", "Completed", and "Failed". All possible values are:

  • NotStarted - This is a temporary state that client-side Run objects are in before cloud submission.
  • Starting - The Run has started being processed in the cloud. The caller has a run ID at this point.
  • Provisioning - On-demand compute is being created for a given job submission.
  • Preparing - The run environment is being prepared and is in one of two stages:
    • Docker image build
    • conda environment setup
  • Queued - The job is queued on the compute target. For example, in BatchAI, the job is in a queued state while waiting for all the requested nodes to be ready.
  • Running - The job has started to run on the compute target.
  • Finalizing - User code execution has completed, and the run is in post-processing stages.
  • CancelRequested - Cancellation has been requested for the job.
  • Completed - The run has completed successfully. This includes both the user code execution and run post-processing stages.
  • Failed - The run failed. Usually the Error property on a run will provide details as to why.
  • Canceled - Follows a cancellation request and indicates that the run is now successfully cancelled.
  • NotResponding - For runs that have Heartbeats enabled, no heartbeat has been recently sent.

Returns

Type Description
Optional[str]

Status of the job.
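
For example, after submitting the job (a sketch assuming an authenticated MLClient named ml_client):


    # Submit the job and read back its current status (names here are illustrative).
    returned_job = ml_client.jobs.create_or_update(spark_job)
    print(returned_job.status)  # e.g. "Starting", later "Running" or "Completed"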

studio_url

Azure ML studio endpoint.

Returns

Type Description
Optional[str]

The URL to the job details page.

type

The type of the job.

Returns

Type Description
Optional[str]

The type of the job.

CODE_ID_RE_PATTERN

CODE_ID_RE_PATTERN = re.compile('\\/subscriptions\\/(?P<subscription>[\\w,-]+)\\/resourceGroups\\/(?P<resource_group>[\\w,-]+)\\/providers\\/Microsoft\\.MachineLearningServices\\/workspaces\\/(?P<workspace>[\\w,-]+)\\/codes\\/(?P<code>[\\w,-]+)\\/versions\\/(?P<version>[\\w,-]+)')