Poznámka
Na prístup k tejto stránke sa vyžaduje oprávnenie. Môžete sa skúsiť prihlásiť alebo zmeniť adresáre.
Na prístup k tejto stránke sa vyžaduje oprávnenie. Môžete skúsiť zmeniť adresáre.
This page provides reference information for real-time mode in Structured Streaming, including supported environments, languages, sources, sinks, and operators. For known limitations, see Real-time mode limitations.
Supported languages
Real-time mode supports Scala, Java, and Python.
Compute types
Real-time mode supports the following compute types:
| Compute type | Supported |
|---|---|
| Dedicated (formerly: single user) | ✓ |
| Standard (formerly: shared) | ✓ (only Python) |
| Lakeflow Spark Declarative Pipelines Classic | Not supported |
| Lakeflow Spark Declarative Pipelines Serverless | Not supported |
| Serverless | Not supported |
Execution modes
Real-time mode supports update mode only:
| Execution mode | Supported |
|---|---|
| Update mode | ✓ |
| Append mode | Not supported |
| Complete mode | Not supported |
Sources and sinks
Real-time mode supports the following sources and sinks:
| Source or sink | As source | As sink |
|---|---|---|
| Apache Kafka | ✓ | ✓ |
| Event Hubs (using Kafka connector) | ✓ | ✓ |
| Kinesis | ✓ (only EFO mode) | Not supported |
| AWS MSK | ✓ | Not supported |
| Delta | Not supported | Not supported |
| Google Pub/Sub | Not supported | Not supported |
| Apache Pulsar | Not supported | Not supported |
Arbitrary sinks (using forEachWriter) |
Not applicable | ✓ |
Operators
Real-time mode supports most Structured Streaming operators:
Stateless operations
| Operator | Supported |
|---|---|
| Selection | ✓ |
| Projection | ✓ |
UDFs
| Operator | Supported |
|---|---|
| Scala UDF | ✓ (with some limitations) |
| Python UDF | ✓ (with some limitations) |
Aggregation
| Operator | Supported |
|---|---|
| sum | ✓ |
| count | ✓ |
| max | ✓ |
| min | ✓ |
| avg | ✓ |
| Aggregation functions | ✓ |
Windowing
| Operator | Supported |
|---|---|
| Tumbling | ✓ |
| Sliding | ✓ |
| Session | Not supported |
Deduplication
| Operator | Supported |
|---|---|
| dropDuplicates | ✓ (the state is unbounded) |
| dropDuplicatesWithinWatermark | Not supported |
Stream to table join
| Operator | Supported |
|---|---|
| Broadcast table join (table should be small) | ✓ |
| Stream to stream join | Not supported |
| (flat)MapGroupsWithState | Not supported |
| transformWithState | ✓ (with some differences) |
| union | ✓ (with some limitations) |
| forEach | ✓ |
| forEachBatch | Not supported |
| mapPartitions | Not supported (see limitation) |
Special considerations
Some operators and features have specific considerations or differences when used in real-time mode.
transformWithState in real-time mode
For building custom stateful applications, Databricks supports transformWithState, an API in Apache Spark Structured Streaming. See Build a custom stateful application for more information about the API and code snippets.
However, there are some differences between how the API behaves in real-time mode and traditional streaming queries that leverage the micro-batch architecture.
- Real-time mode calls the
handleInputRows(key: String, inputRows: Iterator[T], timerValues: TimerValues)method for each row.- The
inputRowsiterator returns a single value. Micro-batch mode calls it once for each key, and theinputRowsiterator returns all values for a key in the micro batch. - Account for this difference when writing your code
- The
- Event time timers are not supported in real-time mode.
- In real-time mode, timers are delayed in firing depending on data arrival:
- If a timer is scheduled for 10:00:00 but no data arrives, the timer doesn't fire immediately.
- If data arrives at 10:00:10, the timer fires with a 10-second delay.
- If no data arrives and the long-running batch is terminating, the timer fires before the batch terminates.
Python UDFs in real-time mode
Databricks supports the majority of Python user-defined functions (UDFs) in real-time mode:
Stateless
| UDF type | Supported |
|---|---|
| Python scalar UDF (User-defined scalar functions - Python) | ✓ |
| Arrow scalar UDF | ✓ |
| Pandas scalar UDF (pandas user-defined functions) | ✓ |
Arrow function (mapInArrow) |
✓ |
| Pandas function (Map) | ✓ |
Stateful grouping (UDAF)
| UDF type | Supported |
|---|---|
transformWithState (only Row interface) |
✓ |
applyInPandasWithState |
Not supported |
Non-stateful grouping (UDAF)
| UDF type | Supported |
|---|---|
apply |
Not supported |
applyInArrow |
Not supported |
applyInPandas |
Not supported |
Table functions
| UDF type | Supported |
|---|---|
| UDTF (Python user-defined table functions (UDTFs)) | Not supported |
| UC UDF | Not supported |
There are several points to consider when using Python UDFs in real-time mode:
- To minimize the latency, configure the Arrow batch size (
spark.sql.execution.arrow.maxRecordsPerBatch) to 1.- Trade-off: This configuration optimizes for latency at the expense of throughput. For most workloads, this setting is recommended.
- Increase the batch size only if a higher throughput is required to accommodate input volume, accepting the potential increase in latency.
- Pandas UDFs and functions do not perform well with an Arrow batch size of 1.
- If you use pandas UDFs or functions, set the Arrow batch size to a higher value (for example, 100 or higher).
- This implies higher latency. Databricks recommends using an Arrow UDF or function if possible.
- Due to the performance issue with pandas, transformWithState is only supported with the
Rowinterface.