EventHub rate limit with Trigger Once

Diego Poggioli
2022-07-15T09:39:14.143+00:00

In Databricks we are using the com.microsoft.azure:azure-eventhubs-spark connector to receive streaming events from Event Hubs.

Since the workload is not real-time but the cluster is started twice a day, we used the Trigger.Once() option to read the messages accumulated since the last run and then let the cluster shut down once all the messages have been processed.
However, we realized that only about 7,000 events were processed at each start, even though there were many more messages waiting.

By setting the setMaxEventsPerTrigger parameter to a very high value, many more messages are ingested.

It looks like the default maxEventsPerTrigger rate limit is 1000 (Azure Event Hubs - Azure Databricks | Microsoft Learn), but does this default apply even when the parameter is not set?

Is this solution the right way to go?
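
For context, here is roughly how the stream is set up; this is a minimal sketch, and the connection string, paths, and the 1,000,000 cap are placeholders, not our real configuration:

```scala
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf}
import org.apache.spark.sql.streaming.Trigger

// Placeholder connection string -- substitute your own namespace and key.
val connectionString =
  ConnectionStringBuilder("Endpoint=sb://mynamespace.servicebus.windows.net/;EntityPath=my-hub").build

// Raise the per-trigger cap well above the expected backlog,
// otherwise a single Trigger.Once run stops at the default limit.
val ehConf = EventHubsConf(connectionString)
  .setMaxEventsPerTrigger(1000000L)

val events = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

events.writeStream
  .format("delta")
  .option("checkpointLocation", "/mnt/checkpoints/eventhub")
  .trigger(Trigger.Once()) // process the accumulated backlog, then stop
  .start("/mnt/data/eventhub")
```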

Same issues:
amazon s3 - Spark Structured Streaming writeStream trigger set to once is recording much less data than it should - Stack Overflow
spark structured streaming - Trigger.Once with Azure Event Hubs - Stack Overflow
(Question) regarding .setMaxEventsPerTrigger() and trigger(Trigger.Once) · Issue #472 · Azure/azure-event-hubs-spark · GitHub

Thanks
Diego


1 answer

  1. Vidhya Sagar Karthikeyan
     2022-07-15T17:16:32.197+00:00

    Hi @Diego Poggioli ,

    I think you have misunderstood the 1,000 events per second. I believe that limit applies only to ingress: each publish call to Event Hubs is limited to either 1,000 events or 1 MB. For egress, on the other hand, the limit is up to 2 MB per call (the service wraps N events to fill roughly 2 MB per call), and there is no upper limit on the number of events to read. In your case, I assume your ~7,000 events amount to roughly 2 MB. Since it's Trigger.Once, when the job kick-starts it is able to pull only those 7,000 events and then it ends there.

    That said, setting maxEventsPerTrigger to a higher number lets it create micro-batches behind the scenes (each batch under 2 MB) and deliver the events to the caller. However, it is not guaranteed that you will always have at most 10,000,000 events; what if you have more than that?

    To overcome this, I would still keep the trigger as once, but instead of letting the SDK do the batching for you, put the batching logic in your own code: when the job is triggered, pull X events at a time and loop until a pass returns fewer events than the batch size, then end the logic there. If you want to keep the development simple, tuning the maxEventsPerTrigger attribute will help.
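
    The loop-until-drained idea could be sketched like this; it is only a sketch under assumptions, where `ehConf` (with a modest setMaxEventsPerTrigger), `checkpointPath`, and `outputPath` are hypothetical names you would define yourself:

    ```scala
    import org.apache.spark.sql.streaming.Trigger

    // Run one Trigger.Once pass and report how many rows it ingested.
    // The checkpoint makes each pass resume from the previous offsets.
    def runOnePass(): Long = {
      val query = spark.readStream
        .format("eventhubs")
        .options(ehConf.toMap) // ehConf caps each pass via setMaxEventsPerTrigger
        .load()
        .writeStream
        .format("delta")
        .option("checkpointLocation", checkpointPath)
        .trigger(Trigger.Once())
        .start(outputPath)
      query.awaitTermination()
      Option(query.lastProgress).map(_.numInputRows).getOrElse(0L)
    }

    // Keep draining until a pass ingests nothing, then let the cluster stop.
    var ingested = runOnePass()
    while (ingested > 0) {
      ingested = runOnePass()
    }
    ```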

    Mark this as an accepted answer if it helps you.
