Making Service Broker Application Highly Available With AlwaysOn

Articolo
01/26/2017

If you have a Service Broker (SSB) application connecting to SQL Server using an AlwaysOn availability group listener, in the event of an unexpected failover, some messages may be lost or stuck in the transmission queue on the old primary (new secondary) after the failover. This could be an automatic failover or manual failover when the SSB application is not stopped.

Incoming (SSB) connections are severed during a availability group failover and those connections use the listener IP address which directs connections to the new primary replica. New incoming SSB connections, established through the listener, will go to the new replica. This is the correct and intended behavior.

The availability group fails over and the new inbound SSB connections reset to the new primary replica:

However, outgoing SSB connections from the old primary replica are left open after a failover, still in session with the initiator, and remain so because the IP addresses are still valid. New outgoing SSB connections are established from the new replica to the initiator, but because the connections from the old primary are still open, incoming SSB messages are still sent to the old primary.

Since Service Broker components are not contained within the availability group, special handling is required to handle the Service Broker connections when the availability group moves to a different replica instance. If you feel the post failover handling for the Service Broker connection is not feasible to be built into your application, please submit a Connect item (https://connect.microsoft.com/sql) so that we are able to understand the community need for this functionality. As an alternative to modifying the code for your application, you could choose to use a SQL Server Failover Cluster Instance which will prevent the issue described above, due to the IP Address remaining the same on failover.

UPDATE: Enhanced Compatibility When Hosting Service Broker in Availability Groups

This behavior has been improved in the following versions of SQL Server:

SQL 2014 SP1/CU12 & SP2/CU5

SQL 2016 RTM/CU6 & SP1/CU3

SQL 2012 SP4

For more information on these improvements, see the following KnowledgeBase article:

FIX: Service Broker endpoint connections aren't closed after an availability group failover in SQL Server

Troubleshooting

If, after the failover, you run the following query on the old primary SSB TARGET and see the value for is_accepted = 0 then you know you have encountered the issue. You will also likely have two or more connections returned here.

--look for is_accept = 0

select is_accept, * from sys.dm_broker_connections with(nolock)

Another side effect are messages stuck in the sys.transmission_queue on the old primary SSB TARGET.

If Profiler trace is run on the old primary SSB TARGET, you would see the following error:

The message has been dropped because the service broker in the target database is unavailable: ‘The database is in read-only mode.’

SQL Server XEvent ‘broker_message_undeliverable’ can also be collected using SQL Server extended events and will also report that messages cannot be delivered. For more information on using SQL Server extended events, see:

Create an Extended Events Session Using the Wizard (Object Explorer)

Workaround

Host the primary replica on a SQL Failover Cluster Instance (SQL FCI). In the event of a failover, SQL Server restarts and the initiator will re-establish connectivity with the availability group listener and the TARGET will re-establish connections with the initiator. There will be no residual broker connections or in-flight transactions that cannot be resolved.

In this scenario, you still have a secondary replica hosted on a standalone instance of SQL Server which provides another means of high availability.

Maintenance

When failing manually over between availability group replicas for rolling upgrade purposes, stop the Service Broker Application before doing the manual availability group failover to prevent any in-flight transactions from getting stuck.

Comments

Anonymous
March 09, 2017
https://connect.microsoft.com/SQLServer/Feedback/Details/3102417
Anonymous
March 13, 2017
Stopping and Starting the Endpoints as suggested in the workarounds on the connect item on failover ( using an alert ) work fine.
- Anonymous
  March 13, 2017
  Not always, only if at the moment of failower the queue (sys.transmission_queue) was empty
  - Anonymous
    March 17, 2017
    As we have now found out :)
Anonymous
April 20, 2017
Did this just get fixed in SQL 2014 SP1/CU12 & SP2/CU5?https://support.microsoft.com/en-us/help/4016361Seems to describe the issue above so I think so.
- Anonymous
  April 22, 2017
  Yes, fixed!!!
  - Anonymous
    April 22, 2017
    Hurried, don't working
    - Anonymous
      April 24, 2017
      That's a shame. It does seem that this fix describes this exact problem though?
    - Anonymous
      May 02, 2017
      Looking at the Connect item you raised it seems you are using "Microsoft SQL Server 2016" and the fix I linked to relates to "SQL 2014 SP1/CU12 & SP2/CU5" so might explain why you're not seeing the fix?
- Anonymous
  May 17, 2017
  That fix just got updated to include SQL 2016 RTM/CU6 & SP1/CU3.
  - Anonymous
    June 03, 2017
    Yes!!! Fixed - Microsoft SQL Server 2016 (SP1-CU3)
    - Anonymous
      July 30, 2017
      I installed SQLSERVER2016 SP1-CU3 FOR Primary Node and Secondary Node, the issue is still the same for me

Condividi tramite