1-2 record delay when streaming from Delta table to Delta table in Azure Databricks

Sergey A. Volkov 21 Reputation points
2022-06-07T15:59:57.333+00:00

I'm using Structured Streaming with a Delta Lake table as both the source and the sink.

I have noticed that the output table is always 1 or 2 records behind the input, regardless of how much time has passed.

It looks like there is some sort of cache or buffer in the streaming output. Is there a way to eliminate this delay?

This is how the table was created and how I add data to the input:

%sql  
create table input_test (id string, date timestamp);  
insert into input_test (id, date) values ("1", current_timestamp());  

Here is my streaming code:

%scala
import org.apache.spark.sql.functions._

// Count records per id in 1-minute tumbling windows on the event-time column `date`.
val stream = spark.readStream
  .format("delta")
  .table("input_test")
  .withWatermark("date", "10 seconds")
  .groupBy($"id", window($"date", "1 minute", "1 minute"))
  .count()

display(stream)

// Append the windowed counts to the output Delta table.
stream.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/_cp")
  .toTable("output_test")

`display` shows changes immediately, but when I query the output table it always lags by 1 or 2 records.

Azure Databricks

1 answer

  1. Sergey A. Volkov 21 Reputation points
    2022-06-17T14:58:37.5+00:00

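    I think it's because Spark Structured Streaming uses an event-time watermark rather than a processing-time watermark. In append mode a window is written to the sink only after the watermark passes the end of that window, and the watermark advances only when records with later event times arrive, so the most recent windows stay buffered until more data comes in.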

    See https://stackoverflow.com/questions/46032001/is-watermark-based-on-processing-time-or-event-time-or-both
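
To make the event-time behaviour concrete, here is a minimal sketch in plain Scala (no Spark required) that models 1-minute tumbling windows with a 10-second watermark in append mode. It is a simplified illustration, not Spark's implementation: the names `Emitted` and `simulate` are hypothetical, event times are plain seconds, records are assumed to arrive in order, and the watermark is recomputed after every record, whereas Spark updates it per micro-batch.

```scala
// Simplified model of append mode with an event-time watermark.
// Hypothetical names; times are seconds since an arbitrary origin.
final case class Emitted(afterRecord: Int, windowStart: Long)

def simulate(eventTimes: Seq[Long],
             windowSec: Long = 60L,
             delaySec: Long = 10L): Seq[Emitted] = {
  var maxEventTime = Long.MinValue
  var open = Set.empty[Long]              // starts of windows still buffered in state
  val out = Seq.newBuilder[Emitted]
  for ((t, i) <- eventTimes.zipWithIndex) {
    open += (t / windowSec) * windowSec   // tumbling window that contains t
    maxEventTime = math.max(maxEventTime, t)
    // The watermark trails the largest event time seen so far, not the wall clock.
    val watermark = maxEventTime - delaySec
    // A window reaches the sink only once the watermark has passed its end.
    val closed = open.filter(_ + windowSec <= watermark)
    closed.toSeq.sorted.foreach(start => out += Emitted(i + 1, start))
    open --= closed
  }
  out.result()
}
```

With one record every two minutes, `simulate(Seq(0L, 120L, 240L))` emits the first window only after record 2 and the second only after record 3, so the output is always one record behind. With records 65 to 70 seconds apart, `simulate(Seq(0L, 65L, 135L))` emits nothing until record 3, which gives a two-record lag. In this model no amount of waiting changes the result, because only a newer event time moves the watermark.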
