I think it's because spark streaming use event time based watermark rather then processing time watermark.
Got 1-2 record delay when streaming from delta table to delta table in azure databricks
Sergey A. Volkov
21
Reputation points
I'm using structured streaming with delta lake table sink and delta lake table source.
I have noticed that there is 1 or 2 record delay in output table (regardless of passed time).
Looks like there is some sort of cache/buffer in streaming output. Could I somehow eliminate this delay?
This is how table was created and how I add data to input
%sql
create table input_test (id string, date timestamp);
insert into input_test (id, date) values ("1", current_timestamp());
Here is my streaming code:
%scala
import org.apache.spark.sql.functions._
val stream = spark.readStream
.format("delta")
.table("input_test")
.withWatermark("date", "10 seconds")
.groupBy($"id", window($"date", "1 minute", "1 minute"))
.count()
display(stream)
stream.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/tmp/_cp")
.toTable("output_test")
Display shows changes immediately, but when selecting output table I always have 1 or 2 record lag.
Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,483 questions