Hi Janice Chi,
Based on your setup, MERGE performance is tied more to the volume per table than to the number of topics. So when you split 200K events/min across multiple topics and each table handles only ~20K rows/min, each MERGE should complete faster. The catch is that running 200 separate MERGEs every minute adds a lot of concurrent transactional work, so you'll start seeing pressure on the transaction log and CPU rather than on row counts alone.
Hyperscale can deal with high storage and throughput, but there are limits on log rate and worker threads. Having 200 JDBC writers pushing every minute is on the high side, and in practice most designs consolidate data into staging/landing tables first and then MERGE in larger, controlled batches. That pattern tends to be more stable and avoids log bottlenecks.
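To make the staging-then-MERGE idea concrete, here is a minimal sketch, not a definitive implementation. It assumes a hypothetical staging table `stg.events` that the writers land data in and a hypothetical target `dbo.events` keyed on `event_id`; the helper `build_merge_sql` and every table and column name are illustrative and do not come from your setup. The helper only builds the T-SQL text, which you would run once per batch interval over your own connection.

```python
def build_merge_sql(target, staging, key_cols, value_cols):
    """Build a T-SQL MERGE that upserts rows from a staging table into a target.

    All names passed in are assumed to be trusted identifiers (this sketch does
    no escaping beyond wrapping column names in square brackets).
    """
    if not key_cols:
        raise ValueError("at least one key column is required")

    # Join condition on the key columns.
    on_clause = " AND ".join(f"t.[{c}] = s.[{c}]" for c in key_cols)

    # Column lists for the INSERT branch.
    all_cols = list(key_cols) + list(value_cols)
    insert_cols = ", ".join(f"[{c}]" for c in all_cols)
    insert_vals = ", ".join(f"s.[{c}]" for c in all_cols)

    lines = [
        f"MERGE INTO {target} AS t",
        f"USING {staging} AS s",
        f"    ON {on_clause}",
    ]
    if value_cols:
        # Only emit the UPDATE branch when there is something to update.
        set_clause = ", ".join(f"t.[{c}] = s.[{c}]" for c in value_cols)
        lines.append(f"WHEN MATCHED THEN UPDATE SET {set_clause}")
    lines.append(
        f"WHEN NOT MATCHED BY TARGET THEN INSERT ({insert_cols}) VALUES ({insert_vals});"
    )
    return "\n".join(lines)


# Hypothetical usage: one MERGE per target table per batch interval.
merge_sql = build_merge_sql(
    target="dbo.events",
    staging="stg.events",
    key_cols=["event_id"],
    value_cols=["payload", "updated_at"],
)
```

Note that this assumes the staging table holds at most one row per key for each batch; MERGE fails if several source rows match the same target row, so deduplicate in Databricks before landing the data.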
- Use a staging + MERGE pattern instead of direct per-topic MERGEs (recommended in Microsoft guidance).
- Tune your JDBC writes from Databricks with options like `batchsize` so data is sent in chunks, not row-by-row.
- Keep an eye on Hyperscale log rate and CPU usage. Useful reference here: Hyperscale performance diagnostics.
- If needed, scale up compute or log IO capacity to absorb the workload.
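As a rough illustration of the JDBC tuning and monitoring points above, the sketch below builds the writer options for an append into a staging table and keeps a query you could run to watch log rate and CPU. The helper `staging_jdbc_options`, the table name `stg.events`, and the batch size of 10000 are assumptions for illustration only; credentials are deliberately left out and the right batch size depends on your row width and workload.

```python
def staging_jdbc_options(jdbc_url, staging_table, batchsize=10000):
    """Return Spark JDBC writer options for appending to a staging table."""
    if batchsize <= 0:
        raise ValueError("batchsize must be a positive integer")
    return {
        "url": jdbc_url,
        "dbtable": staging_table,
        # Rows sent per JDBC batch insert; Spark's default is 1000.
        "batchsize": str(batchsize),
    }


# Recent CPU and log-write utilisation from the Azure SQL Database DMV
# sys.dm_db_resource_stats (values are percentages of the service limit).
RESOURCE_STATS_SQL = """
SELECT TOP (20) end_time, avg_cpu_percent, avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
"""

# Hypothetical usage inside a Databricks notebook, where `df` is the micro-batch
# DataFrame and authentication options are added separately:
#
#   opts = staging_jdbc_options("jdbc:sqlserver://<server>;databaseName=<db>", "stg.events")
#   df.write.format("jdbc").options(**opts).mode("append").save()
```

If `avg_log_write_percent` stays close to 100 while the writers are running, that suggests the log rate limit is the bottleneck, which is the case where fewer, larger batches or a compute scale-up is worth trying.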
Finally: smaller per-table volumes will help, but the real risk comes from the number of concurrent writers. If you do run into stability issues, moving to a staging approach and reducing the number of MERGE operations per minute is usually the way to go.
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.