This quickstart explains how to create a Spark Job Definition that contains Python code using Spark Structured Streaming to land data in a lakehouse and then serve it through a SQL analytics endpoint. After completing this quickstart, you'll have a Spark Job Definition that runs continuously and a SQL analytics endpoint through which you can view the incoming data.
Create a Python script
Use the following Python script to create a streaming Delta table in a lakehouse with Apache Spark Structured Streaming. The script reads a stream of generated data (one row per second) and writes it in append mode to a Delta table named streamingtable, storing both the data and the checkpoint information in the specified lakehouse.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Start Spark session
    spark = SparkSession.builder \
        .appName("RateStreamToDelta") \
        .getOrCreate()

    # Table name used for logging
    tableName = "streamingtable"

    # Define Delta Lake storage path
    deltaTablePath = f"Tables/{tableName}"

    # Create a streaming DataFrame using the rate source
    df = spark.readStream \
        .format("rate") \
        .option("rowsPerSecond", 1) \
        .load()

    # Write the streaming data to Delta
    query = df.writeStream \
        .format("delta") \
        .outputMode("append") \
        .option("path", deltaTablePath) \
        .option("checkpointLocation", f"{deltaTablePath}/_checkpoint") \
        .start()

    # Keep the stream running
    query.awaitTermination()

Save your script as a Python file (.py) on your local computer.
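For reference, the rate source used above emits exactly two columns, timestamp and value, which are the columns you'll see in the lakehouse table later. If you want to confirm this, the following minimal sketch (runnable in any Spark session, for example a Fabric notebook) prints the schema of the streaming DataFrame without starting the stream:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The rate source generates one row per second with a timestamp and a
# monotonically increasing value.
rate_df = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 1) \
    .load()

# The schema is known without starting the streaming query:
# timestamp: timestamp, value: long
rate_df.printSchema()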
Create a lakehouse
Use the following steps to create a lakehouse:
Sign in to the Microsoft Fabric portal.
Navigate to your desired workspace or create a new one if needed.
Select New item in the workspace, and then select Lakehouse in the panel that opens.
Enter a name for your lakehouse and select Create.
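If you prefer to script this step instead of using the portal, the Fabric REST API exposes a lakehouse creation endpoint. The following sketch is illustrative only and assumes you already have a workspace ID and a valid Microsoft Entra access token with the necessary permissions (both shown as placeholders); the portal steps above are all this quickstart requires.

import requests

# Placeholders -- substitute your own values.
workspace_id = "<your-workspace-id>"
access_token = "<your-access-token>"

# Create a lakehouse in the workspace through the Fabric REST API.
# "quickstartlakehouse" is an example name; use the name you want for your lakehouse.
response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/lakehouses",
    headers={"Authorization": f"Bearer {access_token}"},
    json={"displayName": "quickstartlakehouse"},
)
response.raise_for_status()
print(response.status_code, response.text)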
Create a Spark Job Definition
Use the following steps to create a Spark Job Definition:
From the same workspace where you created a lakehouse, select New item.
In the panel that opens, under Get data, select Spark Job Definition.
Enter a name for your Spark Job Definition and select Create.
Select Upload, and then select the Python file you created earlier.
Under Lakehouse Reference, choose the lakehouse you created.
Set Retry policy for Spark Job Definition
Use the following steps to set the retry policy for your Spark job definition:
From the top menu, select the Settings icon.
Open the Optimization tab and turn the Retry Policy toggle On.
Define the maximum number of retry attempts, or check Allow unlimited attempts.
Specify the time between retry attempts and select Apply.
Note
The retry policy setup has a lifetime limit of 90 days. Once the retry policy is enabled, the job is restarted according to the policy for up to 90 days. After this period, the retry policy automatically ceases to function and the job is terminated. You then need to restart the job manually, which in turn re-enables the retry policy.
Execute and monitor the Spark Job Definition
From the top menu, select the Run icon.
Verify that the Spark Job Definition was submitted successfully and is running.
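To double-check that the streaming job is landing rows, you can also read the Delta table back from a Fabric notebook attached to the same lakehouse. This is a minimal sketch that assumes the notebook's default lakehouse is the one the job writes to; the table path matches the script above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Delta table the streaming job writes to. While the job is running,
# the row count grows by roughly one row per second.
df = spark.read.format("delta").load("Tables/streamingtable")
print(df.count())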
View data using a SQL analytics endpoint
Once the Spark Job Definition is running, it creates a table named streamingtable with timestamp and value columns in the lakehouse. You can view the data using the SQL analytics endpoint:
From the workspace, open your Lakehouse.
Switch to the SQL analytics endpoint from the selector in the top-right corner.
From the left navigation pane, expand Schemas > dbo > Tables, and then select streamingtable to preview the data.
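If you prefer to stay in a notebook instead of the SQL analytics endpoint, an equivalent preview is possible with Spark SQL against the lakehouse table. This sketch assumes the notebook's default lakehouse is the one that contains streamingtable.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Preview the most recent rows written by the streaming job.
spark.sql("""
    SELECT timestamp, value
    FROM streamingtable
    ORDER BY timestamp DESC
    LIMIT 10
""").show(truncate=False)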