Get streaming data into a lakehouse and access it with the SQL analytics endpoint

This quickstart explains how to create a Spark Job Definition that contains Python code with Spark Structured Streaming to land data in a lakehouse and then serve it through a SQL analytics endpoint. After completing this quickstart, you'll have a Spark Job Definition that runs continuously and a SQL analytics endpoint through which you can view the incoming data.

Create a Python script

  1. Use the following Python code, which uses Spark Structured Streaming to land data in a lakehouse table.

    import sys
    from pyspark.sql import SparkSession
    
    if __name__ == "__main__":
        # Create (or reuse) the Spark session for this job
        spark = SparkSession.builder.appName("MyApp").getOrCreate()
    
        # The relative Tables/ path resolves against the lakehouse that the
        # Spark Job Definition references
        tableName = "streamingtable"
        deltaTablePath = "Tables/" + tableName
    
        # The built-in rate source emits one row per second (timestamp, value)
        df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
    
        # Append to the Delta table; the checkpoint lets the stream recover
        # after a restart
        query = (
            df.writeStream.outputMode("append")
            .format("delta")
            .option("path", deltaTablePath)
            .option("checkpointLocation", deltaTablePath + "/checkpoint")
            .start()
        )
        # Block until the streaming query stops or fails
        query.awaitTermination()
    
  2. Save your script as a Python file (.py) on your local computer.
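
The script above hard-codes the table name. A Spark Job Definition can pass command-line arguments to the script, so if you prefer to make the table name configurable, a minimal sketch along the following lines could work; the sys.argv handling and the fallback value are assumptions added for illustration, not part of the quickstart script.

    import sys
    from pyspark.sql import SparkSession
    
    if __name__ == "__main__":
        spark = SparkSession.builder.appName("MyApp").getOrCreate()
    
        # Hypothetical: read the table name from the job's command-line
        # arguments, falling back to the name used in this quickstart
        tableName = sys.argv[1] if len(sys.argv) > 1 else "streamingtable"
        deltaTablePath = "Tables/" + tableName
    
        df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
    
        query = (
            df.writeStream.outputMode("append")
            .format("delta")
            .option("path", deltaTablePath)
            .option("checkpointLocation", deltaTablePath + "/checkpoint")
            .start()
        )
        query.awaitTermination()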

Create a lakehouse

Use the following steps to create a lakehouse:

  1. In Microsoft Fabric, select Synapse Data Engineering.

  2. Navigate to your desired workspace or create a new one if needed.

  3. To create a lakehouse, select the Lakehouse icon under the New section in the main pane.

    Screenshot showing new lakehouse dialog

  4. Enter a name for your lakehouse and select Create.

Create a Spark Job Definition

Use the following steps to create a Spark Job Definition:

  1. From the same workspace where you created a lakehouse, select the Create icon from the left menu.

  2. Under "Data Engineering", select Spark Job Definition.

    Screenshot showing new Spark Job Definition dialog

  3. Enter a name for your Spark Job Definition and select Create.

  4. Select Upload and select the Python file that you created earlier.

  5. Under Lakehouse Reference, choose the lakehouse that you created (see the note after these steps).
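
The lakehouse reference is what lets the relative Tables/streamingtable path in the script resolve to a table in your lakehouse. If you ever need to write to a lakehouse other than the referenced one, one option is to use a fully qualified OneLake path instead of the relative path; a sketch is below, where the workspace and lakehouse names are placeholders rather than values from this quickstart.

    # Hypothetical alternative to the relative path in the script:
    # a fully qualified OneLake path (replace the placeholders with
    # your workspace and lakehouse names)
    deltaTablePath = (
        "abfss://<workspace_name>@onelake.dfs.fabric.microsoft.com/"
        "<lakehouse_name>.Lakehouse/Tables/streamingtable"
    )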

Set Retry policy for Spark Job Definition

Use the following steps to set the retry policy for your Spark job definition:

  1. From the top menu, select the Settings icon.

    Screenshot showing Spark Job Definition settings icon

  2. Open the Optimization tab and turn the Retry Policy toggle On.

    Screenshot showing Spark Job Definition optimization tab

  3. Define the maximum number of retry attempts, or check Allow unlimited attempts.

  4. Specify the time between retry attempts, and then select Apply.

Note

There is a lifetime limit of 90 days for the retry policy. Once the retry policy is enabled, the job is restarted according to the policy for up to 90 days. After this period, the retry policy automatically ceases to function and the job is terminated. You then need to restart the job manually, which in turn re-enables the retry policy.

Execute and monitor the Spark Job Definition

  1. From the top menu, select the Run icon.

    Screenshot showing Spark Job Definition run icon

  2. Verify that the Spark Job Definition was submitted successfully and is running.

View data using a SQL analytics endpoint

  1. In the workspace view, select your lakehouse.

  2. In the top-right corner, select Lakehouse and then select SQL analytics endpoint.

  3. In the SQL analytics endpoint view under Tables, select the table that your script uses to land data. You can then preview your data from the SQL analytics endpoint.
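
Alternatively, if you prefer to check the landed data from a Fabric notebook attached to the same lakehouse, a minimal PySpark sketch like the following shows the most recent rows; the table name matches the script above, and the rest is an assumption added for illustration.

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    # Read the Delta table that the streaming job writes to and show the
    # most recent rows; new rows keep arriving while the job is running
    df = spark.read.table("streamingtable")
    df.orderBy(df.timestamp.desc()).show(10, truncate=False)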