How to create an Apache Spark job definition in Fabric
In this tutorial, learn how to create a Spark job definition in Microsoft Fabric.
Prerequisites
Before you get started, you need:
- A Fabric tenant account with an active subscription. Create an account for free.
Tip
To run the Spark job definition item, you must have a main definition file and a default lakehouse context. If you don't have a lakehouse, you can create one by following the steps in Create a lakehouse.
Create a Spark job definition
The Spark job definition creation process is quick and simple.
Options to create a Spark job definition
There are a few ways you can get started with the creation process:
Data Engineering homepage: You can easily create a Spark job definition through the Spark Job Definition card under the New section on the homepage.
Workspace view: You can also create a Spark job definition through your workspace in Data Engineering by using the New dropdown menu.
Create view: Another entry point to create a Spark job definition is the Create page under Data Engineering.
You need to give your Spark job definition a name when you create it. The name must be unique within the current workspace. The new Spark job definition is created in your current workspace.
Create a Spark job definition for PySpark (Python)
To create a Spark job definition for PySpark:
Download the sample Parquet file yellow_tripdata_2022-01.parquet and upload it to the Files section of the lakehouse.
Create a new Spark job definition.
Select PySpark (Python) from the Language dropdown.
Download the createTablefromParquet.py sample and upload it as the main definition file. The main definition file (job.Main) is the file that contains the application logic and is mandatory to run a Spark job. For each Spark job definition, you can only upload one main definition file. (A sketch of this kind of script appears after these steps.)
You can upload the main definition file from your local desktop, or you can upload it from an existing Azure Data Lake Storage (ADLS) Gen2 account by providing the full ABFSS path of the file, for example:
abfss://your-container-name@your-storage-account-name.dfs.core.windows.net/your-file-path
Upload reference files as .py files. The reference files are the Python modules imported by the main definition file. Just like the main definition file, you can upload them from your desktop or from an existing ADLS Gen2 account. Multiple reference files are supported.
Tip
If you use an ADLS Gen2 path, make sure the user account that runs the job has the proper permissions on the storage account. We suggest two ways to do this:
- Assign the user account a Contributor role for the storage account.
- Grant Read and Execute permissions to the user account for the file via the ADLS Gen2 Access Control List (ACL).
For a manual run, the account of the currently signed-in user is used to run the job.
Provide command line arguments for the job, if needed. Separate the arguments with spaces.
Add the lakehouse reference to the job. You must add at least one lakehouse reference; this lakehouse is the default lakehouse context for the job.
Multiple lakehouse references are supported. Find the non-default lakehouse name and full OneLake URL on the Spark Settings page.
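To make the flow concrete, here's a minimal sketch of the kind of PySpark script you might upload as the main definition file. It isn't the contents of the createTablefromParquet.py sample; the table name, argument handling, and relative Files path are assumptions, and it assumes the default lakehouse reference has been added to the job.

```python
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("CreateTableFromParquet").getOrCreate()

    # Command line arguments supplied in the job definition arrive via sys.argv,
    # split on spaces. Here an optional first argument overrides the source path
    # (a hypothetical convention for this sketch).
    source_path = sys.argv[1] if len(sys.argv) > 1 else "Files/yellow_tripdata_2022-01.parquet"

    # With a default lakehouse attached, a relative "Files/..." path resolves
    # against that lakehouse; otherwise, supply a full ABFSS/OneLake path.
    df = spark.read.parquet(source_path)

    # Save the data as a managed table in the default lakehouse
    # (illustrative table name).
    df.write.mode("overwrite").saveAsTable("yellow_tripdata_2022_01")

    spark.stop()
```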
Create a Spark job definition for Scala/Java
To create a Spark job definition for Scala/Java:
Create a new Spark job definition.
Select Spark (Scala/Java) from the Language dropdown.
Upload the main definition file as a .jar file. The main definition file is the file that contains the application logic of this job and is mandatory to run a Spark job. For each Spark job definition, you can only upload one main definition file. Provide the Main class name.
Upload reference files as .jar files. The reference files are the files that are referenced/imported by the main definition file.
Provide command line arguments for the job, if needed.
Add the lakehouse reference to the job. You must have at least one lakehouse reference added to the job. This lakehouse is the default lakehouse context for the job.
Create a Spark job definition for R
To create a Spark job definition for SparkR (R):
Create a new Spark job definition.
Select SparkR (R) from the Language dropdown.
Upload the main definition file as an .R file. The main definition file is the file that contains the application logic of this job and is mandatory to run a Spark job. For each Spark job definition, you can only upload one main definition file.
Upload reference files as .R files. The reference files are the files that are referenced/imported by the main definition file.
Provide command line arguments for the job, if needed.
Add the lakehouse reference to the job. You must have at least one lakehouse reference added to the job. This lakehouse is the default lakehouse context for the job.
Note
The Spark job definition will be created in your current workspace.
Options to customize Spark job definitions
There are a few options to further customize the execution of Spark job definitions.
- Spark Compute: On the Spark Compute tab, you can see the Runtime Version, which is the version of Spark used to run the job, as well as the Spark configuration settings the job runs with. You can customize the Spark configuration settings by selecting the Add button.
- Optimization: On the Optimization tab, you can enable and set up the Retry Policy for the job. When enabled, the job is retried if it fails. You can also set the maximum number of retries and the interval between retries. Each retry attempt restarts the job, so make sure the job is idempotent (see the sketch below).
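As a rough illustration of an idempotent pattern, a job can overwrite its target table rather than append to it, so a retry after a partial failure converges on the same end state. The file and table names below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RetrySafeJob").getOrCreate()

df = spark.read.parquet("Files/yellow_tripdata_2022-01.parquet")

# Overwrite instead of append so that rerunning (or retrying) the job
# never duplicates rows in the target table.
df.write.mode("overwrite").saveAsTable("yellow_tripdata_2022_01")
```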