In this tutorial, learn how to create a Spark job definition in Microsoft Fabric.
Before you get started, you need:
Tip
To run the Spark job definition item, you must have a main definition file and default lakehouse context. If you don't have a lakehouse, you can create one by following the steps in Create a lakehouse.
The Spark job definition creation process is quick and simple. There are two ways you can get started:
Workspace view: You can easily create a Spark job definition through the Fabric workspace by selecting New item > Spark Job Definition.
Fabric Home: Another entry point to create a Spark job definition is the Data analytics using a SQL ... tile on the Fabric home page. You can find the same option by selecting the General tile.
You need to give your Spark job definition a name when you create it. The name must be unique within the current workspace. The new Spark job definition is created in your current workspace.
To create a Spark job definition for PySpark:
Download the sample Parquet file yellow_tripdata_2022-01.parquet and upload it to the Files section of the lakehouse.
Create a new Spark job definition.
Select PySpark (Python) from the Language dropdown.
Download the createTablefromParquet.py sample and upload it as the main definition file. The main definition file (job.Main) is the file that contains the application logic and is mandatory to run a Spark job. For each Spark job definition, you can only upload one main definition file.
You can upload the main definition file from your local desktop, or you can upload it from an existing Azure Data Lake Storage (ADLS) Gen2 account by providing the full ABFSS path of the file. For example, abfss://your-container-name@your-storage-account-name.dfs.core.windows.net/your-file-path.
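The sample's exact contents aren't reproduced here, but a main definition file of this kind typically reads the uploaded Parquet file from the default lakehouse and saves it as a table. A minimal sketch, assuming a SparkSession-based script; the table name taxi_trips is an illustrative assumption, not taken from the sample:

```python
# Minimal sketch of a PySpark main definition file (assumed structure,
# not the verbatim createTablefromParquet.py sample).
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("CreateTableFromParquet").getOrCreate()

    # Relative "Files/..." paths resolve against the job's default lakehouse context.
    df = spark.read.parquet("Files/yellow_tripdata_2022-01.parquet")

    # Write the data as a Delta table in the default lakehouse.
    # The table name "taxi_trips" is an illustrative assumption.
    df.write.mode("overwrite").format("delta").saveAsTable("taxi_trips")

    spark.stop()
```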
Upload reference files as .py files. The reference files are the Python modules that are imported by the main definition file. Just like the main definition file, you can upload them from your desktop or from an existing ADLS Gen2 account. Multiple reference files are supported.
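As an illustration, a reference file can hold shared helper logic that the main definition file imports by module name. The module transforms.py and the function clean_columns below are hypothetical, not part of the sample:

```python
# transforms.py — hypothetical reference file uploaded alongside the main definition file.
from pyspark.sql import DataFrame

def clean_columns(df: DataFrame) -> DataFrame:
    """Normalize column names so downstream code can rely on them."""
    return df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])
```

The main definition file would then import it with a plain module import, for example from transforms import clean_columns.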
Tip
If you use an ADLS Gen2 path, make sure the file is accessible by giving the user account that runs the job the proper permissions on the storage account. For a manual run, the job runs under the account of the currently signed-in user.
Provide command line arguments for the job, if needed. Separate the arguments with spaces.
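Inside the job, the space-separated arguments arrive as ordinary process arguments. A small sketch of how a PySpark main definition file might read them, assuming a source path and a table name as hypothetical arguments:

```python
# Hypothetical argument handling in the main definition file.
import sys

# For example, command line arguments entered as:
#   Files/yellow_tripdata_2022-01.parquet taxi_trips
source_path = sys.argv[1] if len(sys.argv) > 1 else "Files/yellow_tripdata_2022-01.parquet"
table_name = sys.argv[2] if len(sys.argv) > 2 else "taxi_trips"
```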
Add the lakehouse reference to the job. You must have at least one lakehouse reference added to the job. This lakehouse is the default lakehouse context for the job.
Multiple lakehouse references are supported. Find the non-default lakehouse name and full OneLake URL on the Spark Settings page.
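Inside the job code, the default lakehouse can be addressed with relative Files/ or Tables/ paths, while a non-default lakehouse is addressed through its full OneLake URL. A sketch under that assumption; the workspace and lakehouse names below are placeholders, and the actual URL to use is the one shown on the Spark Settings page:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default lakehouse: relative paths resolve against the default lakehouse context.
df_default = spark.read.parquet("Files/yellow_tripdata_2022-01.parquet")

# Non-default lakehouse: use its full OneLake URL (copy it from the Spark Settings page).
# "MyWorkspace" and "OtherLakehouse" are illustrative placeholders.
other_path = (
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "OtherLakehouse.Lakehouse/Files/reference-data/lookup.parquet"
)
df_other = spark.read.parquet(other_path)
```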
To create a Spark job definition for Scala/Java:
Create a new Spark job definition.
Select Spark(Scala/Java) from the Language dropdown.
Upload the main definition file as a .jar file. The main definition file is the file that contains the application logic of this job and is mandatory to run a Spark job. For each Spark job definition, you can only upload one main definition file. Provide the Main class name.
Upload reference files as .jar files. The reference files are the files that are referenced/imported by the main definition file.
Provide command line arguments for the job, if needed.
Add the lakehouse reference to the job. You must have at least one lakehouse reference added to the job. This lakehouse is the default lakehouse context for the job.
To create a Spark job definition for SparkR(R):
Create a new Spark job definition.
Select SparkR(R) from the Language dropdown.
Upload the main definition file as an .R file. The main definition file is the file that contains the application logic of this job and is mandatory to run a Spark job. For each Spark job definition, you can only upload one main definition file.
Upload reference files as .R files. The reference files are the files that are referenced/imported by the main definition file.
Provide command line arguments for the job, if needed.
Add the lakehouse reference to the job. You must have at least one lakehouse reference added to the job. This lakehouse is the default lakehouse context for the job.
Note
The Spark job definition will be created in your current workspace.
There are a few options to further customize the execution of Spark job definitions.
Optimization: On the Optimization tab, you can enable and set up the Retry Policy for the job. When enabled, the job is retried if it fails. You can also set the maximum number of retries and the interval between retries. For each retry attempt, the job is restarted. Make sure the job is idempotent.
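Because each retry reruns the whole main definition file, the job's writes should be idempotent so a retried run doesn't duplicate data. One minimal way to do that in PySpark is to overwrite the target table rather than append to it; the file and table names below are assumptions carried over from the earlier sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("Files/yellow_tripdata_2022-01.parquet")

# Overwriting the target table keeps the job idempotent:
# a retried run reproduces the same end state instead of appending duplicate rows.
df.write.mode("overwrite").format("delta").saveAsTable("taxi_trips")
```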