Real-time mode reference

This page provides reference information for real-time mode in Structured Streaming, including supported environments, languages, sources, sinks, and operators. For known limitations, see Real-time mode limitations.

Supported languages

Real-time mode supports Scala, Java, and Python.

Compute types

Real-time mode supports the following compute types:

| Compute type | Supported |
| --- | --- |
| Dedicated (formerly: single user) | ✓ |
| Standard (formerly: shared) | ✓ (only Python) |
| Lakeflow Spark Declarative Pipelines Classic | Not supported |
| Lakeflow Spark Declarative Pipelines Serverless | Not supported |
| Serverless | Not supported |

Execution modes

Real-time mode supports update mode only:

| Execution mode | Supported |
| --- | --- |
| Update mode | ✓ |
| Append mode | Not supported |
| Complete mode | Not supported |

Sources and sinks

Real-time mode supports the following sources and sinks:

| Source or sink | As source | As sink |
| --- | --- | --- |
| Apache Kafka | ✓ | ✓ |
| Event Hubs (using Kafka connector) | ✓ | ✓ |
| Kinesis | ✓ (only EFO mode) | Not supported |
| AWS MSK | ✓ | Not supported |
| Delta | Not supported | Not supported |
| Google Pub/Sub | Not supported | Not supported |
| Apache Pulsar | Not supported | Not supported |
| Arbitrary sinks (using forEachWriter) | Not applicable | ✓ |
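Arbitrary sinks are reached through the foreach writer protocol, in which the runtime calls `open`, then `process` once per row, then `close`. The following is a minimal sketch of such a writer; the `RowCollector` class and its collect-to-a-list sink logic are illustrative stand-ins for a real external system:

```python
# Sketch of an arbitrary sink via the foreach writer protocol.
# The runtime calls open() per partition/epoch, process() once per
# row, and close() when the partition finishes.
class RowCollector:
    """Collects rows in memory; a real writer would push them to an
    external system instead."""

    def __init__(self):
        self.rows = []

    def open(self, partition_id, epoch_id):
        # Return True to accept rows for this partition and epoch.
        return True

    def process(self, row):
        # Called once per row; keep this fast to preserve low latency.
        self.rows.append(row)

    def close(self, error):
        # Release resources and surface any error from processing.
        if error is not None:
            raise error

# With a streaming DataFrame `df`, the writer would be attached as:
# df.writeStream.foreach(RowCollector()).outputMode("update").start()
```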

Operators

Real-time mode supports most Structured Streaming operators:

Stateless operations

| Operator | Supported |
| --- | --- |
| Selection | ✓ |
| Projection | ✓ |

UDFs

| Operator | Supported |
| --- | --- |
| Scala UDF | ✓ (with some limitations) |
| Python UDF | ✓ (with some limitations) |

Aggregation

| Operator | Supported |
| --- | --- |
| sum | ✓ |
| count | ✓ |
| max | ✓ |
| min | ✓ |
| avg | ✓ |
| Aggregation functions | ✓ |

Windowing

| Operator | Supported |
| --- | --- |
| Tumbling | ✓ |
| Sliding | ✓ |
| Session | Not supported |

Deduplication

| Operator | Supported |
| --- | --- |
| dropDuplicates | ✓ (the state is unbounded) |
| dropDuplicatesWithinWatermark | Not supported |

Joins and other operators

| Operator | Supported |
| --- | --- |
| Broadcast stream to table join (the table should be small) | ✓ |
| Stream to stream join | Not supported |
| (flat)MapGroupsWithState | Not supported |
| transformWithState | ✓ (with some differences) |
| union | ✓ (with some limitations) |
| forEach | ✓ |
| forEachBatch | Not supported |
| mapPartitions | Not supported (see limitation) |

Special considerations

Some operators and features have specific considerations or differences when used in real-time mode.

transformWithState in real-time mode

For building custom stateful applications, Databricks supports transformWithState, an API in Apache Spark Structured Streaming. See Build a custom stateful application for more information about the API and code snippets.

However, the API behaves differently in real-time mode than in traditional streaming queries that use the micro-batch architecture.

  • Real-time mode calls the handleInputRows(key: String, inputRows: Iterator[T], timerValues: TimerValues) method once for each row.
    • The inputRows iterator returns a single value. In micro-batch mode, the method is called once for each key, and the inputRows iterator returns all values for that key in the micro-batch.
    • Account for this difference when writing your code.
  • Event time timers are not supported in real-time mode.
  • In real-time mode, timers are delayed in firing depending on data arrival:
    • If a timer is scheduled for 10:00:00 but no data arrives, the timer doesn't fire immediately.
    • If data arrives at 10:00:10, the timer fires with a 10-second delay.
    • If no data arrives and the long-running batch is terminating, the timer fires before the batch terminates.
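The per-row invocation difference above can be handled by writing the handler so it never assumes the iterator carries the whole batch. The following is a pure-Python sketch of that pattern; the `RunningCount` class and its state dict are illustrative stand-ins, not the actual StatefulProcessor API:

```python
# Illustrative sketch: a per-key running count whose handler works
# whether the runtime passes one row per call (real-time mode) or
# all rows for a key in one call (micro-batch mode).
class RunningCount:
    def __init__(self):
        self.state = {}  # stand-in for managed value state

    def handle_input_rows(self, key, input_rows):
        # Consume the iterator without assuming its length: in
        # real-time mode it yields a single row; in micro-batch
        # mode it yields every row for the key in the batch.
        count = self.state.get(key, 0)
        for _ in input_rows:
            count += 1
        self.state[key] = count
        return count

# Real-time mode: one call per row.
rt = RunningCount()
for row in ["a", "a", "b"]:
    rt.handle_input_rows(row, iter([row]))

# Micro-batch mode: one call per key with all of that key's rows.
mb = RunningCount()
mb.handle_input_rows("a", iter(["a", "a"]))
mb.handle_input_rows("b", iter(["b"]))
```

Written this way, the same handler produces identical state under both invocation patterns.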

Python UDFs in real-time mode

Databricks supports the majority of Python user-defined functions (UDFs) in real-time mode:

Stateless

| UDF type | Supported |
| --- | --- |
| Python scalar UDF (User-defined scalar functions - Python) | ✓ |
| Arrow scalar UDF | ✓ |
| Pandas scalar UDF (pandas user-defined functions) | ✓ |
| Arrow function (mapInArrow) | ✓ |
| Pandas function (Map) | ✓ |

Stateful grouping (UDAF)

| UDF type | Supported |
| --- | --- |
| transformWithState | ✓ (only Row interface) |
| applyInPandasWithState | Not supported |

Non-stateful grouping (UDAF)

| UDF type | Supported |
| --- | --- |
| apply | Not supported |
| applyInArrow | Not supported |
| applyInPandas | Not supported |

Table functions

| UDF type | Supported |
| --- | --- |
| UDTF (Python user-defined table functions (UDTFs)) | Not supported |
| UC UDF | Not supported |

There are several points to consider when using Python UDFs in real-time mode:

  • To minimize latency, set the Arrow batch size (spark.sql.execution.arrow.maxRecordsPerBatch) to 1.
    • Trade-off: this configuration optimizes for latency at the expense of throughput. It is recommended for most workloads.
    • Increase the batch size only if higher throughput is needed to accommodate the input volume, accepting the potential increase in latency.
  • Pandas UDFs and functions do not perform well with an Arrow batch size of 1.
    • If you use pandas UDFs or functions, set the Arrow batch size to a higher value (for example, 100 or more).
    • This implies higher latency, so Databricks recommends using an Arrow UDF or function instead where possible.
  • Due to this pandas performance issue, transformWithState is supported only with the Row interface.
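The batch-size guidance above amounts to a single configuration setting. A configuration sketch, assuming an existing `spark` session (the values are the recommendations from this page):

```python
# Latency-first (recommended default for real-time mode):
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1")

# Throughput-leaning, or when pandas UDFs/functions are unavoidable:
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "100")
```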