Tutorial: Run a real-time streaming workload

Real-time mode enables ultra-low-latency streaming, with end-to-end latency as low as five milliseconds, making it ideal for operational workloads like fraud detection and real-time personalization. This tutorial guides you through setting up your first real-time streaming query using a simple example.

For conceptual information about real-time mode, when to use it, and supported features, see Real-time mode in Structured Streaming. For configuration requirements, see Set up real-time mode.

Requirements

Before you begin, ensure you have permissions to create a classic compute cluster that uses the configuration specified in Set up real-time mode. Alternatively, contact your workspace administrator to create a real-time mode cluster for you.
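For orientation, real-time mode is typically enabled through a Spark configuration applied to the cluster. The key below is an assumption based on preview documentation, so verify the exact name and any additional settings in Set up real-time mode before using it:

```ini
# Cluster Spark config (Advanced options > Spark) — key name assumed; confirm in Set up real-time mode
spark.databricks.streaming.realTimeMode.enabled true
```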

Step 1: Create a notebook

Notebooks provide an interactive environment for developing and testing streaming queries. You use this notebook to write your real-time query and see the results update continuously.

To create a notebook:

  1. Click New in the sidebar, then click Notebook.
  2. In the compute drop-down menu, select your real-time mode cluster.
  3. Select Python or Scala as the default language.

Step 2: Run a real-time mode query

Copy and paste the following code into a notebook cell and run it. This example uses a rate source, which generates rows at a specified rate, and displays the results in real time.

Note

The display function with realTime trigger is available in Databricks Runtime 17.1 and above.

Python

inputDF = (
  spark
  .readStream
  .format("rate")
  .option("numPartitions", 2)
  .option("rowsPerSecond", 1)
  .load()
)
display(inputDF, realTime="5 minutes", outputMode="update")

Scala

import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.streaming.OutputMode

val inputDF = spark
  .readStream
  .format("rate")
  .option("numPartitions", 2)
  .option("rowsPerSecond", 1)
  .load()
display(inputDF, trigger = Trigger.RealTime(), outputMode = OutputMode.Update())

After running the code, you see a table that updates in real time as new rows are generated. The table displays a timestamp column and a value column that increments with each row.

Understanding the code

The code above demonstrates the essential components of a real-time streaming query. The key parameters for each language, and what they control, are described below:

Python

  • format("rate"): Uses the rate source, a built-in source that generates rows at a configurable rate. This is useful for testing without external dependencies.
  • numPartitions: Sets the number of partitions for the generated data.
  • rowsPerSecond: Controls how many rows are generated per second.
  • realTime="5 minutes": Enables real-time mode. The interval specifies how often the query checkpoints progress. Longer intervals mean less frequent checkpointing but potentially longer recovery times after failures.
  • outputMode="update": Real-time mode requires the update output mode.

Scala

  • format("rate"): Uses the rate source, a built-in source that generates rows at a configurable rate. This is useful for testing without external dependencies.
  • numPartitions: Sets the number of partitions for the generated data.
  • rowsPerSecond: Controls how many rows are generated per second.
  • Trigger.RealTime(): Enables real-time mode with the default checkpoint interval. You can also specify an interval, for example Trigger.RealTime("5 minutes").
  • OutputMode.Update(): Real-time mode requires the update output mode.
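To make the numPartitions, rowsPerSecond, and value semantics concrete without a cluster, here is a plain-Python sketch of what the rate source produces. Everything in it (the function name, the round-robin partition assignment) is illustrative, not a Spark API:

```python
from datetime import datetime, timedelta

def simulate_rate_source(rows_per_second: int, num_partitions: int, seconds: int):
    """Illustrative model of Spark's rate source: emits rows with a
    timestamp and a monotonically increasing value counter, assigning
    each row to a partition round-robin (assumed for illustration)."""
    start = datetime(2024, 1, 1)
    rows = []
    value = 0
    for s in range(seconds):
        for _ in range(rows_per_second):
            rows.append({
                "timestamp": start + timedelta(seconds=s),  # generation time
                "value": value,                              # increments per row
                "partition": value % num_partitions,         # illustrative placement
            })
            value += 1
    return rows

rows = simulate_rate_source(rows_per_second=1, num_partitions=2, seconds=3)
# 1 row per second for 3 seconds -> 3 rows with values 0, 1, 2
```

With rowsPerSecond=1 and numPartitions=2, as in the tutorial query, each second produces one new row and consecutive rows land on alternating partitions.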

Step 3: Validate results

When you run the query, the display function creates a table that updates in real time as the rate source generates new rows. Each row contains:

  • A timestamp column showing when the rate source generated the row.
  • A value column containing a monotonically increasing counter.

The table updates continuously with minimal latency, demonstrating how real-time mode processes data as soon as it becomes available. This is the core benefit of real-time mode: the ability to see and act on data immediately rather than waiting for batch processing.
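If you want to check those two invariants programmatically rather than by eye, you could collect a sample of (timestamp, value) pairs from the output and validate them with a helper like the following. This is an illustrative sketch, not a Databricks API:

```python
def validate_rate_rows(rows):
    """Check rate-source invariants on a list of (timestamp, value) pairs:
    value increases by exactly 1 per row, and timestamps never go backwards."""
    for (t_prev, v_prev), (t_cur, v_cur) in zip(rows, rows[1:]):
        if v_cur != v_prev + 1:   # counter must increment by one
            return False
        if t_cur < t_prev:        # timestamps must be non-decreasing
            return False
    return True

sample = [(1.0, 0), (2.0, 1), (3.0, 2)]
# validate_rate_rows(sample) -> True
```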

Additional resources

Now that you've run your first real-time query, explore these resources to build production streaming applications with Kafka, Kinesis, and other supported sources: