Configure Apache Spark jobs in Azure Machine Learning

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

The Azure Machine Learning integration with Azure Synapse Analytics provides easy access to distributed computing capability, backed by Azure Synapse, to scale Apache Spark jobs on Azure Machine Learning.

In this article, you learn how to submit a Spark job in a few simple steps, using Azure Machine Learning serverless Spark compute, an Azure Data Lake Storage (ADLS) Gen 2 storage account, and user identity passthrough.

For more information about Apache Spark concepts in Azure Machine Learning, visit this resource.

Prerequisites

APPLIES TO: Azure CLI ml extension v2 (current)

Add role assignments in Azure storage accounts

Before we submit an Apache Spark job, we must ensure that the input and output data paths are accessible. Assign the Contributor and Storage Blob Data Contributor roles to the user identity of the logged-in user to enable read and write access.

To assign appropriate roles to the user identity:

  1. Open the Microsoft Azure portal.

  2. Search for, and select, the Storage accounts service.

  3. On the Storage accounts page, select the Azure Data Lake Storage (ADLS) Gen 2 storage account from the list. A page showing Overview of the storage account opens.

  4. Select Access Control (IAM) from the left panel.

  5. Select Add role assignment.

  6. Search for the Storage Blob Data Contributor role.

  7. Select the Storage Blob Data Contributor role.

  8. Select Next.

  9. Select User, group, or service principal.

  10. Select + Select members.

  11. In the textbox under Select, search for the user identity.

  12. Select the user identity from the list, so that it shows under Selected members.

  13. Select Select to confirm the selection.

  14. Select Next.

  15. Select Review + Assign.

  16. Repeat steps 2-15 to assign the Contributor role.

Data in the Azure Data Lake Storage (ADLS) Gen 2 storage account should become accessible once the user identity has the appropriate roles assigned.
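If you prefer the Azure CLI to the portal, you can create the same role assignments with the az role assignment create command. This is a minimal sketch; the user sign-in name and the storage account scope shown are placeholders for your own values:

az role assignment create \
  --role "Storage Blob Data Contributor" \
  --assignee "<USER_SIGN_IN_NAME>" \
  --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>"

Run the command again with --role "Contributor" to create the second assignment.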

Create parameterized Python code

A Spark job requires a Python script that accepts arguments. To build this script, you can modify the Python code developed from interactive data wrangling. A sample Python script is shown here:

# titanic.py
import argparse

import pyspark.pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--titanic_data")
parser.add_argument("--wrangled_data")

args = parser.parse_args()
print(args.wrangled_data)
print(args.titanic_data)

df = pd.read_csv(args.titanic_data, index_col="PassengerId")
df["Age"] = df["Age"].fillna(
    df["Age"].mean()
)  # Replace missing values in the Age column with the mean value
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill the Cabin column with the value "None" where missing
df.dropna(inplace=True)  # Drop rows that still have any missing value
df.to_csv(args.wrangled_data, index_col="PassengerId")

Note

  • This Python code sample uses pyspark.pandas, which is supported only by Spark runtime version 3.2.
  • Make sure that the titanic.py file is uploaded to a folder named src. The src folder should be located in the same directory where you created the Python script/notebook or the YAML specification file that defines the standalone Spark job.

The script takes two arguments: --titanic_data and --wrangled_data. These arguments pass the input data path and the output folder, respectively. The script uses the titanic.csv file, available here. Upload this file to a container created in the Azure Data Lake Storage (ADLS) Gen 2 storage account.
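For example, you could upload the file with the Azure CLI az storage fs file upload command, which works with ADLS Gen 2 file systems. This is a minimal sketch, assuming the container <FILE_SYSTEM_NAME> already exists and that the data/ folder matches the input path used in the job specification later in this article:

az storage fs file upload \
  --account-name <STORAGE_ACCOUNT_NAME> \
  --file-system <FILE_SYSTEM_NAME> \
  --path data/titanic.csv \
  --source ./titanic.csv \
  --auth-mode login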

Submit a standalone Spark job

APPLIES TO: Azure CLI ml extension v2 (current)

This example YAML specification shows a standalone Spark job. It uses Azure Machine Learning serverless Spark compute, user identity passthrough, and input/output data URIs in the abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA> format. Here, <FILE_SYSTEM_NAME> matches the container name.

$schema: http://azureml/sdk-2-0/SparkJob.json
type: spark

code: ./src 
entry:
  file: titanic.py

conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.executor.instances: 2

inputs:
  titanic_data:
    type: uri_file
    path: abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv
    mode: direct

outputs:
  wrangled_data:
    type: uri_folder
    path: abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled/
    mode: direct

args: >-
  --titanic_data ${{inputs.titanic_data}}
  --wrangled_data ${{outputs.wrangled_data}}

identity:
  type: user_identity

resources:
  instance_type: standard_e4s_v3
  runtime_version: "3.2"

In the above YAML specification file:

  • the code property defines the relative path of the folder that contains the parameterized titanic.py file.
  • the resources property defines the instance_type and the Apache Spark runtime_version values that the serverless Spark compute uses. These instance type values are currently supported:
    • standard_e4s_v3
    • standard_e8s_v3
    • standard_e16s_v3
    • standard_e32s_v3
    • standard_e64s_v3

The YAML file shown can be used in the az ml job create command, with the --file parameter, to create a standalone Spark job as shown:

az ml job create --file <YAML_SPECIFICATION_FILE_NAME>.yaml --subscription <SUBSCRIPTION_ID> --resource-group <RESOURCE_GROUP> --workspace-name <AML_WORKSPACE_NAME>
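This article also applies to the Python SDK azure-ai-ml v2, so the same standalone job can be submitted from Python. The following is a minimal sketch built with the SDK's spark() function, mirroring the YAML specification above; the angle-bracket placeholders are the same values you would use in the YAML file:

from azure.ai.ml import Input, MLClient, Output, spark
from azure.ai.ml.entities import UserIdentityConfiguration
from azure.identity import DefaultAzureCredential

# Connect to the Azure Machine Learning workspace
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

# Define a standalone Spark job that mirrors the YAML specification
spark_job = spark(
    code="./src",
    entry={"file": "titanic.py"},
    driver_cores=1,
    driver_memory="2g",
    executor_cores=2,
    executor_memory="2g",
    executor_instances=2,
    resources={"instance_type": "standard_e4s_v3", "runtime_version": "3.2"},
    inputs={
        "titanic_data": Input(
            type="uri_file",
            path="abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv",
            mode="direct",
        )
    },
    outputs={
        "wrangled_data": Output(
            type="uri_folder",
            path="abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled/",
            mode="direct",
        )
    },
    args="--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}",
    identity=UserIdentityConfiguration(),
)

returned_job = ml_client.jobs.create_or_update(spark_job)
print(returned_job.studio_url)  # URL for monitoring the job in the studio UI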

Tip

You might have an existing Synapse Spark pool in your Azure Synapse workspace. To use it, follow the instructions to attach a Synapse Spark pool in the Azure Machine Learning workspace, and then select that pool in the job specification as sketched below.
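With an attached Synapse Spark pool, the standalone job YAML selects the compute by name instead of defining a resources section. This is a minimal sketch of the change, assuming a pool attached under the hypothetical name <ATTACHED_SPARK_POOL_NAME>:

# In the YAML specification, replace the resources section with the attached pool name
compute: <ATTACHED_SPARK_POOL_NAME>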

Next steps