Hi @Janice Chi

Trigger Design in Databricks Structured Streaming: In most production setups, I’ve had the best results using a time-based trigger, like trigger(processingTime="1 minute"). It gives a good trade-off between low latency and stability, especially when you're dealing with JSON payloads from CDC like you are.
Record-count-based triggers aren’t really supported natively in Structured Streaming, and they tend to be harder to control at scale anyway. A one-minute interval gives your cluster time to process batches predictably, and it plays nicely with autoscaling if you’re using it.
If you notice lag or batch processing getting spiky, you can always fine-tune this, but a minute is a solid place to start.
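For reference, here is a minimal sketch of wiring that trigger onto a Kafka CDC source; the broker address, topic, checkpoint path, and the process_batch handler are placeholder names, not part of your setup:

```python
# Minimal sketch: 1-minute processing-time trigger on a Kafka CDC source.
# Broker, topic, checkpoint path, and process_batch are placeholder names.
raw_stream = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "<broker>:9092")
        .option("subscribe", "cdc-topic")
        .option("startingOffsets", "latest")
        .load()
)

query = (
    raw_stream.writeStream
        .trigger(processingTime="1 minute")                    # time-based trigger
        .option("checkpointLocation", "/mnt/checkpoints/cdc")  # required for recovery
        .foreachBatch(process_batch)                           # your per-batch handler
        .start()
)
```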
Reconciliation Before Writing to Azure SQL: Yes, foreachBatch is the right place to do in-batch reconciliation. Inside that function, you can validate counts, do hash checks, or even compare expected vs. actual rows before pushing to Azure SQL.
Keep that logic efficient. Try to avoid joins or lookups that could slow the batch down. If you log audit results (e.g., counts, checksums) into a side table or log store, it’ll help you track issues without blocking the stream.
Just make sure whatever reconciliation you do stays within the same micro-batch; that way you don’t break consistency when writing to Hyperscale.
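As a rough illustration (the key/timestamp column names, the audit table, and the connection details are assumptions, not from your environment), the batch function could look something like this:

```python
from pyspark.sql import functions as F

def reconcile_and_write(batch_df, batch_id):
    batch_df.persist()  # the batch is reused for counts, checksums, and the write

    # Basic count check for this micro-batch
    actual_count = batch_df.count()

    # Lightweight checksum over the key + change timestamp (illustrative column names)
    checksum = (
        batch_df
        .select(F.sha2(F.concat_ws("|", "pk_col", "op_ts"), 256).alias("h"))
        .agg(F.count("h").alias("rows"), F.approx_count_distinct("h").alias("distinct_rows"))
        .first()
    )

    # Log audit results to a side Delta table so the stream is never blocked on them
    audit = spark.createDataFrame(
        [(batch_id, actual_count, checksum["distinct_rows"])],
        "batch_id LONG, row_count LONG, distinct_rows LONG",
    )
    audit.write.format("delta").mode("append").saveAsTable("audit.stream_batches")

    # ... then write the validated batch to Azure SQL
    #     (directly, or via the staging-table pattern discussed below)

    batch_df.unpersist()
```

You would then attach this with .foreachBatch(reconcile_and_write) on the writeStream shown earlier.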
Handling Late CDC Records After Domain Cutover: This is a common headache, especially when you’ve cut over Domain A but Kafka still delivers some lagged messages.
Best options here:
- Deduplicate by primary key + timestamp before writing.
- Watermarking can help drop messages that arrive too late, though you need to balance the cutoff against your latency tolerance.
- Add cutover logic in your stream: reference a "domain state" table (e.g., Domain A = live) and drop/ignore any old records that shouldn’t be processed anymore.
You don’t need to pause the stream during cutover. It’s better to handle this through logic: build your stream to dynamically check the cutover state and filter accordingly.
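A minimal sketch of that filtering plus deduplication inside the batch function, assuming a hypothetical control table control.domain_state and columns named domain, pk_col, and op_ts:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def filter_and_dedupe(batch_df):
    # 1) Drop records for domains that have already been cut over.
    #    control.domain_state is a hypothetical table of (domain, state).
    live_domains = (
        spark.table("control.domain_state")
        .filter(F.col("state") == "live")
        .select("domain")
    )
    in_scope = batch_df.join(live_domains, on="domain", how="inner")

    # 2) Keep only the latest record per primary key, using the CDC timestamp.
    w = Window.partitionBy("pk_col").orderBy(F.col("op_ts").desc())
    return (
        in_scope
        .withColumn("rn", F.row_number().over(w))
        .filter("rn = 1")
        .drop("rn")
    )
```

If you prefer watermark-based dropping instead, you could pair .withWatermark("op_ts", "10 minutes") with dropDuplicates on the streaming DataFrame itself; that bounds state, but records arriving later than the watermark are silently discarded, so pick the tolerance deliberately.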
Writing to Azure SQL Hyperscale with Minimal Latency: In streaming scenarios, writing directly into your main tables can be risky, especially if reconciliation or merge logic is still needed. A safer and more scalable pattern is:
- Write to a staging table.
- Run a stored proc or merge operation into your target tables.
This way, you can retry failed batches, audit changes, or even roll back bad loads more cleanly.
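Sketched out inside foreachBatch, it could look like the snippet below. The staging table, stored procedure, and connection variables (jdbc_url, sql_user, sql_password, odbc_conn_str) are assumptions; the proc call here uses pyodbc from the driver, but any SQL client works:

```python
import pyodbc

def stage_and_merge(batch_df, batch_id):
    # 1) Land the micro-batch in a staging table via the Spark JDBC writer.
    (
        batch_df.write
        .format("jdbc")
        .option("url", jdbc_url)                  # assumed to be defined elsewhere
        .option("dbtable", "dbo.orders_staging")  # hypothetical staging table
        .option("user", sql_user)
        .option("password", sql_password)
        .mode("append")
        .save()
    )

    # 2) Merge staging rows into the target table via a stored procedure,
    #    so retries, audits, and rollbacks happen on the SQL side.
    conn = pyodbc.connect(odbc_conn_str, autocommit=True)
    try:
        conn.execute("EXEC dbo.usp_merge_orders @batch_id = ?", batch_id)
    finally:
        conn.close()
```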
If you do choose to write directly into the main tables, just make sure you’ve already done all your business rules, transformations, and deduplication upstream. Also tune your JDBC settings (batch size, retries, etc.) so you’re not bottlenecked at write time.
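For the direct-write case, the relevant Spark JDBC writer options look roughly like this; the values are illustrative starting points, not recommendations tuned for your workload:

```python
(
    batch_df.write
    .format("jdbc")
    .option("url", jdbc_url)                     # assumed connection string
    .option("dbtable", "dbo.orders")             # hypothetical target table
    .option("batchsize", 10000)                  # rows per JDBC batch insert
    .option("numPartitions", 8)                  # parallel connections into Hyperscale
    .option("isolationLevel", "READ_COMMITTED")
    .mode("append")
    .save()
)
```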
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.