Does sending an event with a specific partition key in Azure Event Hubs break multi-AZ resiliency?

Question

Andrew Citera 45

I'm trying to understand, at a lower level of detail, how Azure Event Hubs behaves during an availability zone outage. I know that topic partitions are replicated across three availability zones, that a Service Fabric model under the hood elects a leader, and that an event producer doesn't receive a successful acknowledgment until replication has occurred. I also know that when a partition key isn't specified, Azure Event Hubs writes to available partitions in a round-robin fashion, which improves availability.

My question is specifically how Azure Event Hubs handles recovery when an application does need to provide a specific partition key. I understand that if the target partition is unavailable and that partition key is supplied, the send results in an error, because the Event Hubs gateway prevents the event from being written to an unavailable partition. What I can't find specific details on is whether that partition would eventually recover, and whether it would have the same key.

Take the following example (assume the Event Hub is multi-AZ enabled):

Producer A writes to Topic A Partition 0 with Partition Key ID 0 and Partition 0 is available --> Success

Producer A writes to Topic A Partition 0 with Partition Key ID 0 and Partition 0 is unavailable --> Failure

[What happens here? How does Event Hubs recover Partition 0 and bring it back online?]

Producer A writes to Topic A Partition 0 with Partition Key ID 0 and Partition 0 is available --> Success or Failure? Does the partition maintain the same ID?

I assume some of the retry logic is handled by what is baked into the SDK, but the documentation isn't clear on whether the partition would eventually recover or whether specifying a partition key completely breaks high availability. The following documentation snippets don't seem to make sense: is partition ID versus high availability truly a complete tradeoff, or is it just that availability is reduced but Event Hubs will still recover that partition?

[1] https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-availability-and-consistency?tabs=dotnet#:~:text=Therefore%2C%20if%20high,see%20Partitions.

[2] https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/event-hubs/partitioning-in-event-hubs-and-kafka#:~:text=With%20Kafka%2C%20if,to%20unavailable%20partitions.

Andrew Citera 45 Reputation points

2024-05-13T17:36:03.0466667+00:00

Thank you, Annu, this was super helpful! It may be helpful to others if the documentation page on availability and consistency could be updated to reflect the clearer language you provided here.

In my case specifically, I was interested in availability during outages, not disasters, so it was useful to know that the partition ID is preserved and that the cluster will eventually recover from these types of issues. I'm assuming that when the documentation says partitions are a choice between availability and consistency, it was intending more to opine on total probabilistic uptime rather than describing a binary choice? I definitely recognize that uptime would be expected to be higher when not specifying a key, because the gateway will round-robin to an available partition, thus increasing the probability of hitting a healthy partition (over time).

Appreciate the response. Feel free to mark this as resolved.

AnnuKumari-MSFT 34,556 Reputation points Microsoft Employee Moderator

2024-05-13T17:47:23.5266667+00:00

Andrew Citera,

Thank you for acknowledging. Glad to know the explanation helped. Yes, if high availability is the primary concern, it is recommended to use the round-robin partitioning strategy so that data is evenly distributed across all available partitions.

Kindly feel free to accept the answer by clicking on the Accept answer button. Thank you.

Accepted answer

Answer 1

Hi Andrew Citera,

Welcome to the Microsoft Q&A platform, and thanks for your query.

I understand that you are trying to understand how Azure Event Hubs behaves when there is an availability zone outage and how it handles recovery.

It's important to note the distinction between "outages" and "disasters." An outage is the temporary unavailability of Azure Event Hubs, and can affect some components of the service, such as a messaging store, or even the entire datacenter. However, after the problem is fixed, Event Hubs becomes available again. Typically, an outage doesn't cause the loss of messages or other data. An example of such an outage might be a power failure in the datacenter. Some outages are only short connection losses because of transient or network issues.

A disaster is defined as the permanent or longer-term loss of an Event Hubs cluster, Azure region, or datacenter. The region or datacenter may or may not become available again, or may be down for hours or days. The Geo-disaster recovery feature of Azure Event Hubs is the disaster recovery solution for that case.

When an availability zone outage occurs in Azure Event Hubs, the Service Fabric model under the hood will elect a new leader for the affected partition(s) in another availability zone. The new leader will then begin replicating data to the other two availability zones to ensure that the data is fully replicated across all three zones.
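
As a rough mental model of the failover described above, here is a toy sketch in Python. It is not how Service Fabric is actually implemented; the `Partition` class, the zone names, and the `fail_zone`/`recover_zone` methods are all hypothetical and exist only to illustrate that the leader can move between zones while the partition ID stays the same.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Partition:
    """Toy model (hypothetical): one partition ID, one replica per zone."""
    partition_id: str
    zones: list = field(default_factory=lambda: ["zone-1", "zone-2", "zone-3"])
    leader: Optional[str] = "zone-1"
    down: set = field(default_factory=set)

    def fail_zone(self, zone: str) -> None:
        """Mark a zone as down; promote a surviving replica if the leader was lost."""
        self.down.add(zone)
        if self.leader == zone:
            survivors = [z for z in self.zones if z not in self.down]
            # With no survivors the partition stays unavailable until a zone returns.
            self.leader = survivors[0] if survivors else None

    def recover_zone(self, zone: str) -> None:
        """Bring a zone back; it becomes leader only if there was none."""
        self.down.discard(zone)
        if self.leader is None:
            self.leader = zone

    @property
    def available(self) -> bool:
        return self.leader is not None


p = Partition(partition_id="0")
p.fail_zone("zone-1")
# The leader has moved, but the partition ID is unchanged.
print(p.partition_id, p.leader, p.available)
```

In this sketch the partition only becomes unavailable when every zone is down, and it comes back under the same ID as soon as one zone recovers.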

If an application provides a specific partition key and that partition is unavailable, the Event Hub gateway will prevent the data from being written to an unavailable partition. The application will receive an error indicating that the partition is unavailable. The partition will eventually recover and become available again, at which point the application can retry writing to that partition with the same partition key. When a partition recovers, it will maintain the same partition ID. The partition ID is a unique identifier for the partition and is not affected by availability zone outages or other issues.
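
The "retry with the same partition key" behavior could be sketched at the application level roughly as follows. This is a minimal illustration, not the SDK's own retry logic: `PartitionUnavailableError`, `send_with_retry`, and `fake_send` are hypothetical names, and `fake_send` simply simulates a partition that is down for the first two attempts. In a real producer, `send` would wrap whatever SDK call you use, and the SDK's built-in retry policy may already cover short outages.

```python
import random
import time


class PartitionUnavailableError(Exception):
    """Hypothetical stand-in for the error a producer sees during an outage."""


def send_with_retry(send, event, partition_key,
                    max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send(event, partition_key), retrying with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(event, partition_key)
        except PartitionUnavailableError:
            if attempt == max_attempts:
                raise  # give up; the caller decides whether to buffer or drop
            # Delay doubles each attempt, plus up to 10% random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))


# Simulated sender: the partition is down for the first two attempts, then recovers.
calls = {"n": 0}


def fake_send(event, partition_key):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise PartitionUnavailableError("partition unavailable")
    return f"sent {event!r} with key {partition_key!r}"


# sleep is stubbed out so the example runs instantly.
print(send_with_retry(fake_send, "order-created", "0", sleep=lambda s: None))
```

The same partition key is passed on every attempt, so once the partition is back the event lands where it would have gone originally and per-key ordering is kept.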

It is important to note that specifying a partition key does not completely break high availability in Azure Event Hubs. The partition key is used to ensure that related events are written to the same partition, which can improve performance and ordering guarantees. However, if a partition is unavailable, the Event Hub gateway will prevent data from being written to that partition until it becomes available again. This can result in reduced availability for the affected partition(s), but the overall availability of the Event Hub is not affected.
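
To put rough numbers on "reduced availability for the affected partition(s)", here is a back-of-the-envelope sketch. It assumes each partition is independently available with probability `p`, which is a simplification (real outages can be correlated), and the value `0.999` is purely illustrative, not a published SLA. The function names are hypothetical.

```python
def keyed_success(p: float) -> float:
    """A keyed send needs its one target partition to be up."""
    return p


def round_robin_success(p: float, partitions: int) -> float:
    """An unkeyed send succeeds if at least one partition is up."""
    return 1 - (1 - p) ** partitions


p = 0.999  # illustrative per-partition availability (assumption)
for n in (1, 4, 32):
    print(n, keyed_success(p), round_robin_success(p, n))
```

Under these assumptions a keyed send is only as available as its single partition, while an unkeyed send gets closer to always succeeding as partitions are added. That is a difference in probability of success, not a switch between "highly available" and "not available".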

Hope it helps. Thank you.
