Real-time mode enables ultra-low-latency streaming, with end-to-end latencies as low as five milliseconds, making it well suited to operational workloads such as fraud detection and real-time personalization. This tutorial guides you through setting up your first real-time streaming query using a simple example.
For conceptual information about real-time mode, when to use it, and supported features, see Real-time mode in Structured Streaming. For configuration requirements, see Set up real-time mode.
Requirements
Before you begin, ensure you have permissions to create a classic compute cluster that uses the configuration specified in Set up real-time mode. Alternatively, contact your workspace administrator to create a real-time mode cluster for you.
Step 1: Create a notebook
Notebooks provide an interactive environment for developing and testing streaming queries. You use this notebook to write your real-time query and see the results update continuously.
To create a notebook:
- Click New in the sidebar, then click Notebook.
- In the compute drop-down menu, select your real-time mode cluster.
- Select Python or Scala as the default language.
Step 2: Run a real-time mode query
Copy and paste the following code into a notebook cell and run it. This example uses a rate source, which generates rows at a specified rate, and displays the results in real time.
Note
The `display` function with the `realTime` trigger is available in Databricks Runtime 17.1 and above.
Python

```python
inputDF = (
    spark
    .readStream
    .format("rate")
    .option("numPartitions", 2)
    .option("rowsPerSecond", 1)
    .load()
)

display(inputDF, realTime="5 minutes", outputMode="update")
```
Scala

```scala
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.streaming.OutputMode

val inputDF = spark
  .readStream
  .format("rate")
  .option("numPartitions", 2)
  .option("rowsPerSecond", 1)
  .load()

display(inputDF, trigger=Trigger.RealTime(), outputMode=OutputMode.Update())
```
After running the code, you see a table that updates in real time as new rows are generated. The table displays a timestamp column and a value column that increments with each row.
Understanding the code
The code above demonstrates the essential components of a real-time streaming query. The following tables explain the key parameters and what they control:
Python

| Parameter | Description |
|---|---|
| `format("rate")` | Uses the rate source, a built-in source that generates rows at a configurable rate. This is useful for testing without external dependencies. |
| `numPartitions` | Sets the number of partitions for the generated data. |
| `rowsPerSecond` | Controls how many rows are generated per second. |
| `realTime="5 minutes"` | Enables real-time mode. The interval specifies how often the query checkpoints progress. Longer intervals mean less frequent checkpointing but potentially longer recovery times after failures. |
| `outputMode="update"` | Real-time mode requires update output mode. |
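To make the checkpoint-interval trade-off concrete, here is a rough back-of-the-envelope sketch (an illustration, not a guarantee about Databricks recovery behavior): if progress is checkpointed once per interval, the number of rows that might need reprocessing after a failure is bounded by the generation rate times the interval length.

```python
# Sketch: upper bound on rows that may be reprocessed after a failure,
# assuming progress is only recorded once per checkpoint interval.
def max_reprocessed_rows(rows_per_second: int, checkpoint_interval_seconds: int) -> int:
    return rows_per_second * checkpoint_interval_seconds

# At 1 row/sec (as in this tutorial's rate source):
print(max_reprocessed_rows(1, 5 * 60))  # 5-minute interval -> up to 300 rows
print(max_reprocessed_rows(1, 60))      # 1-minute interval -> up to 60 rows
```

This is why shorter intervals trade more frequent checkpointing overhead for faster, cheaper recovery.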
Scala

| Parameter | Description |
|---|---|
| `format("rate")` | Uses the rate source, a built-in source that generates rows at a configurable rate. This is useful for testing without external dependencies. |
| `numPartitions` | Sets the number of partitions for the generated data. |
| `rowsPerSecond` | Controls how many rows are generated per second. |
| `Trigger.RealTime()` | Enables real-time mode with the default checkpoint interval. You can also specify an interval, for example `Trigger.RealTime("5 minutes")`. |
| `OutputMode.Update()` | Real-time mode requires update output mode. |
Step 3: Validate results
When you run the query, the display function creates a table that updates in real time as the rate source generates new rows. Each row contains:
- A timestamp for when the row was generated by the rate source.
- A monotonically increasing counter that increments with each new row.
The table updates continuously with minimal latency, demonstrating how real-time mode processes data as soon as it becomes available. This is the core benefit of real-time mode: the ability to see and act on data immediately rather than waiting for batch processing.
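If it helps to picture the output schema, the rows the rate source emits can be simulated in plain Python. This is an illustrative stand-in only: the column names `timestamp` and `value` match the rate source, but the generation logic here is a simplified sketch, not the source's implementation.

```python
from datetime import datetime, timedelta

def simulate_rate_source(rows_per_second: int, seconds: int, start: datetime):
    """Yield (timestamp, value) pairs shaped like the rate source's two columns."""
    value = 0
    for s in range(seconds):
        for i in range(rows_per_second):
            # Timestamps advance steadily; value is a monotonically increasing counter.
            ts = start + timedelta(seconds=s + i / rows_per_second)
            yield (ts, value)
            value += 1

for ts, value in simulate_rate_source(1, 3, datetime(2024, 1, 1)):
    print(ts, value)
```

Each emitted pair mirrors one row of the table you see in the notebook: a generation timestamp and an incrementing counter.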
Additional resources
Now that you've run your first real-time query, explore these resources to build production streaming applications with Kafka, Kinesis, and other supported sources: