Configure Apache Spark jobs in Azure Machine Learning

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

The Azure Machine Learning integration with Azure Synapse Analytics provides easy access to distributed computing capability, backed by Azure Synapse, to scale Apache Spark jobs on Azure Machine Learning.

In this article, you learn how to submit a Spark job in a few simple steps, using Azure Machine Learning serverless Spark compute, an Azure Data Lake Storage (ADLS) Gen 2 storage account, and user identity passthrough.

For more information about Apache Spark concepts in Azure Machine Learning, visit this resource.

Prerequisites

APPLIES TO: Azure CLI ml extension v2 (current)

Add role assignments in Azure storage accounts

Before we submit an Apache Spark job, we must ensure that the input and output data paths are accessible. Assign the Contributor and Storage Blob Data Contributor roles to the user identity of the logged-in user to enable read and write access.

To assign appropriate roles to the user identity:

  1. Open the Microsoft Azure portal.

  2. Search for, and select, the Storage accounts service.

  3. On the Storage accounts page, select the Azure Data Lake Storage (ADLS) Gen 2 storage account from the list. A page showing Overview of the storage account opens.

  4. Select Access Control (IAM) from the left panel.

  5. Select Add role assignment.

  6. Search for the Storage Blob Data Contributor role.

  7. Select the Storage Blob Data Contributor role.

  8. Select Next.

  9. Select User, group, or service principal.

  10. Select + Select members.

  11. In the textbox under Select, search for the user identity.

  12. Select the user identity from the list, so that it shows under Selected members.

  13. Select Select to confirm the selection.

  14. Select Next.

  15. Select Review + Assign.

  16. Repeat steps 2-15 to assign the Contributor role.

Data in the Azure Data Lake Storage (ADLS) Gen 2 storage account should become accessible once the user identity has the appropriate roles assigned.
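If you prefer the Azure CLI to the portal, you can create the same role assignments with the az role assignment create command. This is a minimal sketch; the user sign-in name and the storage account scope shown are placeholders for your own values:

az role assignment create \
  --role "Storage Blob Data Contributor" \
  --assignee "<USER_SIGN_IN_NAME>" \
  --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>"

Run the command again with --role "Contributor" to create the second assignment.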

Create parameterized Python code

A Spark job requires a Python script that accepts arguments. To build this script, you can modify the Python code developed from interactive data wrangling. A sample Python script is shown here:

# titanic.py
import argparse

import pyspark.pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--titanic_data")
parser.add_argument("--wrangled_data")

args = parser.parse_args()
print(args.wrangled_data)
print(args.titanic_data)

df = pd.read_csv(args.titanic_data, index_col="PassengerId")
df["Age"] = df["Age"].fillna(
    df["Age"].mean()
)  # Replace missing values in the Age column with the mean value
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill the Cabin column with the value "None" where missing
df.dropna(inplace=True)  # Drop rows that still have any missing value
df.to_csv(args.wrangled_data, index_col="PassengerId")

Note

  • This Python code sample uses pyspark.pandas, which is supported only by Spark runtime version 3.2.
  • Make sure that the titanic.py file is uploaded to a folder named src. The src folder should be located in the same directory where you created the Python script/notebook or the YAML specification file that defines the standalone Spark job.

The script takes two arguments: --titanic_data and --wrangled_data. These arguments pass the input data path and the output folder, respectively. The script uses the titanic.csv file, available here. Upload this file to a container created in the Azure Data Lake Storage (ADLS) Gen 2 storage account.
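For example, you could upload the file with the Azure CLI az storage fs file upload command, which works with ADLS Gen 2 file systems. This is a minimal sketch, assuming the container <FILE_SYSTEM_NAME> already exists and that the data/ folder matches the input path used in the job specification later in this article:

az storage fs file upload \
  --account-name <STORAGE_ACCOUNT_NAME> \
  --file-system <FILE_SYSTEM_NAME> \
  --path data/titanic.csv \
  --source ./titanic.csv \
  --auth-mode login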

Submit a standalone Spark job

APPLIES TO: Azure CLI ml extension v2 (current)

This example YAML specification shows a standalone Spark job. It uses Azure Machine Learning serverless Spark compute, user identity passthrough, and input/output data URIs in the abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA> format. Here, <FILE_SYSTEM_NAME> matches the container name.

$schema: http://azureml/sdk-2-0/SparkJob.json
type: spark

code: ./src 
entry:
  file: titanic.py

conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.executor.instances: 2

inputs:
  titanic_data:
    type: uri_file
    path: abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv
    mode: direct

outputs:
  wrangled_data:
    type: uri_folder
    path: abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled/
    mode: direct

args: >-
  --titanic_data ${{inputs.titanic_data}}
  --wrangled_data ${{outputs.wrangled_data}}

identity:
  type: user_identity

resources:
  instance_type: standard_e4s_v3
  runtime_version: "3.2"

In the above YAML specification file:

  • the code property defines the relative path of the folder that contains the parameterized titanic.py file.
  • the resources property defines the instance_type and the Apache Spark runtime_version values that the serverless Spark compute uses. These instance type values are currently supported:
    • standard_e4s_v3
    • standard_e8s_v3
    • standard_e16s_v3
    • standard_e32s_v3
    • standard_e64s_v3

The YAML file shown can be used in the az ml job create command, with the --file parameter, to create a standalone Spark job as shown:

az ml job create --file <YAML_SPECIFICATION_FILE_NAME>.yaml --subscription <SUBSCRIPTION_ID> --resource-group <RESOURCE_GROUP> --workspace-name <AML_WORKSPACE_NAME>
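This article also applies to the Python SDK azure-ai-ml v2, so the same standalone job can be submitted from Python. The following is a minimal sketch built with the SDK's spark() function, mirroring the YAML specification above; the angle-bracket placeholders are the same values you would use in the YAML file:

from azure.ai.ml import Input, MLClient, Output, spark
from azure.ai.ml.entities import UserIdentityConfiguration
from azure.identity import DefaultAzureCredential

# Connect to the Azure Machine Learning workspace
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

# Define a standalone Spark job that mirrors the YAML specification
spark_job = spark(
    code="./src",
    entry={"file": "titanic.py"},
    driver_cores=1,
    driver_memory="2g",
    executor_cores=2,
    executor_memory="2g",
    executor_instances=2,
    resources={"instance_type": "standard_e4s_v3", "runtime_version": "3.2"},
    inputs={
        "titanic_data": Input(
            type="uri_file",
            path="abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv",
            mode="direct",
        )
    },
    outputs={
        "wrangled_data": Output(
            type="uri_folder",
            path="abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled/",
            mode="direct",
        )
    },
    args="--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}",
    identity=UserIdentityConfiguration(),
)

returned_job = ml_client.jobs.create_or_update(spark_job)
print(returned_job.studio_url)  # URL for monitoring the job in the studio UI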

Tip

You might have an existing Synapse Spark pool in your Azure Synapse workspace. To use it, follow the instructions to attach a Synapse Spark pool in the Azure Machine Learning workspace, and then select that pool in the job specification as sketched below.
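With an attached Synapse Spark pool, the standalone job YAML selects the compute by name instead of defining a resources section. This is a minimal sketch of the change, assuming a pool attached under the hypothetical name <ATTACHED_SPARK_POOL_NAME>:

# In the YAML specification, replace the resources section with the attached pool name
compute: <ATTACHED_SPARK_POOL_NAME>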

Next steps