Does sending an event with a specific partition key in Azure Event Hubs break multi-AZ resiliency?

Andrew Citera 45 Reputation points
2024-05-09T17:48:49.06+00:00

I'm trying to understand, at a lower level of detail, how Azure Event Hubs behaves during an availability zone outage. I know that topic partitions are replicated across three availability zones, that a Service Fabric model under the hood elects a leader, and that an event producer doesn't receive a successful acknowledgment until replication has occurred. I also know that when a partition key isn't specified, Azure Event Hubs writes to available partitions in a round-robin fashion, which improves availability.

My question is specifically how Azure Event Hubs handles recovery when an application does need to provide a specific partition key. I understand that if the target partition is unavailable and that partition key is supplied, the send results in an error, because the Event Hubs gateway prevents the event from being written to an unavailable partition. What I can't find specific details on is whether that partition would eventually recover, and whether it would keep the same key.

Take the following example (assume the Event Hub is multi-AZ enabled):

Producer A writes to Topic A Partition 0 with Partition Key ID 0 and Partition 0 is available --> Success

Producer A writes to Topic A Partition 0 with Partition Key ID 0 and Partition 0 is available --> Success

Producer A writes to Topic A Partition 0 with Partition Key ID 0 and Partition 0 is unavailable --> Failure

[What happens here? How does event hubs recover partition 0 and bring it back online?]

Producer A writes to Topic A Partition 0 with Partition Key ID 0 and Partition 0 is available --> Success or Failure? Does the partition maintain the same ID?

I assume some of the retry logic is handled by what is baked into the SDK, but the documentation isn't clear on whether the partition would eventually recover or whether specifying a partition key completely breaks high availability. The following document snippets don't seem to make sense to me. For example, is partition ID vs. high availability truly a complete tradeoff, or is it just that availability is reduced but Event Hubs is still going to recover that partition?

[1] https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-availability-and-consistency?tabs=dotnet#:~:text=Therefore%2C%20if%20high,see%20Partitions.

[2] https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/event-hubs/partitioning-in-event-hubs-and-kafka#:~:text=With%20Kafka%2C%20if,to%20unavailable%20partitions.

Azure Event Hubs
An Azure real-time data ingestion service.

Accepted answer
  1. AnnuKumari-MSFT 31,721 Reputation points Microsoft Employee
    2024-05-13T17:26:34.2533333+00:00

    Hi Andrew Citera ,

    Welcome to the Microsoft Q&A platform, and thanks for your query.

    I understand that you want to know how Azure Event Hubs behaves during an availability zone outage and how it handles recovery.

    It's important to note the distinction between "outages" and "disasters." An outage is the temporary unavailability of Azure Event Hubs, and can affect some components of the service, such as a messaging store, or even the entire datacenter. However, after the problem is fixed, Event Hubs becomes available again. Typically, an outage doesn't cause the loss of messages or other data. An example of such an outage might be a power failure in the datacenter. Some outages are only short connection losses because of transient or network issues.

    A disaster is defined as the permanent or longer-term loss of an Event Hubs cluster, Azure region, or datacenter. The region or datacenter may or may not become available again, or may be down for hours or days. For that case, the Geo-disaster recovery feature of Azure Event Hubs is the disaster recovery solution.

    When an availability zone outage occurs in Azure Event Hubs, the Service Fabric model under the hood elects a new leader for the affected partition(s) in another availability zone. The new leader then resumes replicating data to the other zones, so that the data is again fully replicated across all three zones once the failed zone returns.

    If an application provides a specific partition key and that partition is unavailable, the Event Hub gateway will prevent the data from being written to an unavailable partition. The application will receive an error indicating that the partition is unavailable. The partition will eventually recover and become available again, at which point the application can retry writing to that partition with the same partition key. When a partition recovers, it will maintain the same partition ID. The partition ID is a unique identifier for the partition and is not affected by availability zone outages or other issues.
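
The retry behaviour described above can be sketched as a small simulation. This is a minimal illustration only, not the SDK's actual retry policy: `PartitionUnavailableError`, `SimulatedPartition`, and `send_with_retry` are hypothetical names invented for this example, and the official Event Hubs SDKs apply their own configurable retry logic. The sketch assumes the partition rejects a fixed number of sends and then recovers with the same partition ID.

```python
import time


class PartitionUnavailableError(Exception):
    """Hypothetical stand-in for the error a producer sees while a partition is offline."""


class SimulatedPartition:
    """Toy partition that rejects the first `failures` sends, then recovers."""

    def __init__(self, partition_id, failures):
        self.partition_id = partition_id  # stays the same across the simulated outage
        self._failures_left = failures
        self.events = []

    def send(self, event):
        if self._failures_left > 0:
            self._failures_left -= 1
            raise PartitionUnavailableError(
                f"partition {self.partition_id} unavailable"
            )
        self.events.append(event)


def send_with_retry(partition, event, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Retry the same partition with exponential backoff.

    Returns the 1-based attempt number that succeeded, or re-raises the
    last error once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            partition.send(event)
            return attempt
        except PartitionUnavailableError:
            if attempt == max_attempts:
                raise
            # Delay doubles each time: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Partition "0" fails twice (the simulated outage), then accepts the event.
partition0 = SimulatedPartition(partition_id="0", failures=2)
attempts_used = send_with_retry(partition0, "order-created")
```

Here the third attempt succeeds and the event lands on the same partition `"0"`. A producer that cannot tolerate the wait would need its own fallback, which this sketch does not cover.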

    It is important to note that specifying a partition key does not completely break high availability in Azure Event Hubs. The partition key is used to ensure that related events are written to the same partition, which can improve performance and ordering guarantees. However, if a partition is unavailable, the Event Hub gateway will prevent data from being written to that partition until it becomes available again. This can result in reduced availability for the affected partition(s), but the overall availability of the Event Hub is not affected.
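
To make the tradeoff concrete, here is a toy routing model of the behaviour described above. It is a sketch under stated assumptions: `partition_for_key` uses SHA-256 purely as a stand-in (it is not the hash Event Hubs uses internally), and `route`, `PARTITION_COUNT`, and `unavailable` are hypothetical names for illustration. Keyed sends always target one partition and fail while it is down; unkeyed sends skip the unavailable partition.

```python
import hashlib
from itertools import count

PARTITION_COUNT = 4
unavailable = {0}  # pretend partition 0 is temporarily offline


def partition_for_key(key):
    """Map a key to a partition. Stand-in hash, NOT the service's real one."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % PARTITION_COUNT


_round_robin = count()


def route(key=None):
    """Return the partition that accepts the event, or None if the send fails."""
    if key is not None:
        # Keyed send: the target is fixed, so an outage there means failure.
        target = partition_for_key(key)
        return None if target in unavailable else target
    # Unkeyed send: walk the partitions round-robin, skipping unavailable ones.
    for _ in range(PARTITION_COUNT):
        candidate = next(_round_robin) % PARTITION_COUNT
        if candidate not in unavailable:
            return candidate
    return None


# Find one key that hashes to partition 0 and one that does not.
candidates = [f"device-{i}" for i in range(1000)]
key_on_p0 = next(k for k in candidates if partition_for_key(k) == 0)
key_elsewhere = next(k for k in candidates if partition_for_key(k) != 0)

keyed_during_outage = route(key_on_p0)       # fails: its partition is down
keyed_elsewhere = route(key_elsewhere)       # succeeds: its partition is up
unkeyed_during_outage = [route() for _ in range(6)]

unavailable.clear()                          # partition 0 recovers
keyed_after_recovery = route(key_on_p0)      # same key, same partition again
```

In this model only events keyed to the affected partition fail during the outage, and after recovery the same key maps back to the same partition, which matches the reduced-but-not-broken availability described in the answer.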

    Hope it helps. Thank you.
