Get streaming data into a lakehouse and access it with the SQL analytics endpoint

This quickstart explains how to create a Spark Job Definition that contains Python code with Spark Structured Streaming to land data in a lakehouse and then serve it through a SQL analytics endpoint. After completing this quickstart, you'll have a Spark Job Definition that runs continuously and a SQL analytics endpoint through which you can view the incoming data.

Create a Python script

  1. Use the following Python code, which uses Spark Structured Streaming to land data in a lakehouse table.

    import sys
    from pyspark.sql import SparkSession
    
    if __name__ == "__main__":
        # Create (or reuse) the Spark session for this job
        spark = SparkSession.builder.appName("MyApp").getOrCreate()
    
        # The relative Tables/ path resolves against the lakehouse that the
        # Spark Job Definition references
        tableName = "streamingtable"
        deltaTablePath = "Tables/" + tableName
    
        # The built-in rate source emits one row per second (timestamp, value)
        df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
    
        # Append to the Delta table; the checkpoint lets the stream recover
        # after a restart
        query = (
            df.writeStream.outputMode("append")
            .format("delta")
            .option("path", deltaTablePath)
            .option("checkpointLocation", deltaTablePath + "/checkpoint")
            .start()
        )
        # Block until the streaming query stops or fails
        query.awaitTermination()
    
  2. Save your script as a Python file (.py) on your local computer.
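
The script above hard-codes the table name. A Spark Job Definition can pass command-line arguments to the script, so if you prefer to make the table name configurable, a minimal sketch along the following lines could work; the sys.argv handling and the fallback value are assumptions added for illustration, not part of the quickstart script.

    import sys
    from pyspark.sql import SparkSession
    
    if __name__ == "__main__":
        spark = SparkSession.builder.appName("MyApp").getOrCreate()
    
        # Hypothetical: read the table name from the job's command-line
        # arguments, falling back to the name used in this quickstart
        tableName = sys.argv[1] if len(sys.argv) > 1 else "streamingtable"
        deltaTablePath = "Tables/" + tableName
    
        df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
    
        query = (
            df.writeStream.outputMode("append")
            .format("delta")
            .option("path", deltaTablePath)
            .option("checkpointLocation", deltaTablePath + "/checkpoint")
            .start()
        )
        query.awaitTermination()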

Create a lakehouse

Use the following steps to create a lakehouse:

  1. In Microsoft Fabric, select Synapse Data Engineering.

  2. Navigate to your desired workspace or create a new one if needed.

  3. To create a lakehouse, select the Lakehouse icon under the New section in the main pane.

    Screenshot showing new lakehouse dialog

  4. Enter a name for your lakehouse and select Create.

Create a Spark Job Definition

Use the following steps to create a Spark Job Definition:

  1. From the same workspace where you created a lakehouse, select the Create icon from the left menu.

  2. Under "Data Engineering", select Spark Job Definition.

    Screenshot showing new Spark Job Definition dialog

  3. Enter a name for your Spark Job Definition and select Create.

  4. Select Upload and select the Python file that you created earlier.

  5. Under Lakehouse Reference, choose the lakehouse that you created (see the note after these steps).
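
The lakehouse reference is what lets the relative Tables/streamingtable path in the script resolve to a table in your lakehouse. If you ever need to write to a lakehouse other than the referenced one, one option is to use a fully qualified OneLake path instead of the relative path; a sketch is below, where the workspace and lakehouse names are placeholders rather than values from this quickstart.

    # Hypothetical alternative to the relative path in the script:
    # a fully qualified OneLake path (replace the placeholders with
    # your workspace and lakehouse names)
    deltaTablePath = (
        "abfss://<workspace_name>@onelake.dfs.fabric.microsoft.com/"
        "<lakehouse_name>.Lakehouse/Tables/streamingtable"
    )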

Set Retry policy for Spark Job Definition

Use the following steps to set the retry policy for your Spark job definition:

  1. From the top menu, select the Settings icon.

    Screenshot showing Spark Job Definition settings icon

  2. Open the Optimization tab and turn the Retry Policy toggle On.

    Screenshot showing Spark Job Definition optimization tab

  3. Define the maximum number of retry attempts, or check Allow unlimited attempts.

  4. Specify the time between retry attempts, and then select Apply.

Note

There is a lifetime limit of 90 days for the retry policy. Once the retry policy is enabled, the job is restarted according to the policy for up to 90 days. After this period, the retry policy automatically ceases to function and the job is terminated. You then need to restart the job manually, which in turn re-enables the retry policy.

Execute and monitor the Spark Job Definition

  1. From the top menu, select the Run icon.

    Screenshot showing Spark Job Definition run icon

  2. Verify that the Spark Job Definition was submitted successfully and is running.

View data using a SQL analytics endpoint

  1. In the workspace view, select your lakehouse.

  2. In the top-right corner, select Lakehouse and then select SQL analytics endpoint.

  3. In the SQL analytics endpoint view under Tables, select the table that your script uses to land data. You can then preview your data from the SQL analytics endpoint.
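
Alternatively, if you prefer to check the landed data from a Fabric notebook attached to the same lakehouse, a minimal PySpark sketch like the following shows the most recent rows; the table name matches the script above, and the rest is an assumption added for illustration.

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    # Read the Delta table that the streaming job writes to and show the
    # most recent rows; new rows keep arriving while the job is running
    df = spark.read.table("streamingtable")
    df.orderBy(df.timestamp.desc()).show(10, truncate=False)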