Making Service Broker Application Highly Available With AlwaysOn
If you have a Service Broker (SSB) application connecting to SQL Server using an AlwaysOn availability group listener, in the event of an unexpected failover, some messages may be lost or stuck in the transmission queue on the old primary (new secondary) after the failover. This could be an automatic failover or manual failover when the SSB application is not stopped.
Incoming (SSB) connections are severed during a availability group failover and those connections use the listener IP address which directs connections to the new primary replica. New incoming SSB connections, established through the listener, will go to the new replica. This is the correct and intended behavior.
The availability group fails over and the new inbound SSB connections reset to the new primary replica:
However, outgoing SSB connections from the old primary replica are left open after a failover, still in session with the initiator, and remain so because the IP addresses are still valid. New outgoing SSB connections are established from the new replica to the initiator, but because the connections from the old primary are still open, incoming SSB messages are still sent to the old primary.
Since Service Broker components are not contained within the availability group, special handling is required to handle the Service Broker connections when the availability group moves to a different replica instance. If you feel the post failover handling for the Service Broker connection is not feasible to be built into your application, please submit a Connect item (https://connect.microsoft.com/sql) so that we are able to understand the community need for this functionality. As an alternative to modifying the code for your application, you could choose to use a SQL Server Failover Cluster Instance which will prevent the issue described above, due to the IP Address remaining the same on failover.
UPDATE: Enhanced Compatibility When Hosting Service Broker in Availability Groups
This behavior has been improved in the following versions of SQL Server:
SQL 2014 SP1/CU12 & SP2/CU5
SQL 2016 RTM/CU6 & SP1/CU3
SQL 2012 SP4
For more information on these improvements, see the following KnowledgeBase article:
Troubleshooting
If, after the failover, you run the following query on the old primary SSB TARGET and see the value for is_accepted = 0 then you know you have encountered the issue. You will also likely have two or more connections returned here.
--look for is_accept = 0
select is_accept, * from sys.dm_broker_connections with(nolock)
Another side effect are messages stuck in the sys.transmission_queue on the old primary SSB TARGET.
If Profiler trace is run on the old primary SSB TARGET, you would see the following error:
The message has been dropped because the service broker in the target database is unavailable: ‘The database is in read-only mode.’
SQL Server XEvent ‘broker_message_undeliverable’ can also be collected using SQL Server extended events and will also report that messages cannot be delivered. For more information on using SQL Server extended events, see:
Create an Extended Events Session Using the Wizard (Object Explorer)
Workaround
Host the primary replica on a SQL Failover Cluster Instance (SQL FCI). In the event of a failover, SQL Server restarts and the initiator will re-establish connectivity with the availability group listener and the TARGET will re-establish connections with the initiator. There will be no residual broker connections or in-flight transactions that cannot be resolved.
In this scenario, you still have a secondary replica hosted on a standalone instance of SQL Server which provides another means of high availability.
Maintenance
When failing manually over between availability group replicas for rolling upgrade purposes, stop the Service Broker Application before doing the manual availability group failover to prevent any in-flight transactions from getting stuck.
Comments
- Anonymous
March 09, 2017
https://connect.microsoft.com/SQLServer/Feedback/Details/3102417 - Anonymous
March 13, 2017
Stopping and Starting the Endpoints as suggested in the workarounds on the connect item on failover ( using an alert ) work fine.- Anonymous
March 13, 2017
Not always, only if at the moment of failower the queue (sys.transmission_queue) was empty- Anonymous
March 17, 2017
As we have now found out :)
- Anonymous
- Anonymous
- Anonymous
April 20, 2017
Did this just get fixed in SQL 2014 SP1/CU12 & SP2/CU5?https://support.microsoft.com/en-us/help/4016361Seems to describe the issue above so I think so.- Anonymous
April 22, 2017
Yes, fixed!!!- Anonymous
April 22, 2017
Hurried, don't working- Anonymous
April 24, 2017
That's a shame. It does seem that this fix describes this exact problem though? - Anonymous
May 02, 2017
Looking at the Connect item you raised it seems you are using "Microsoft SQL Server 2016" and the fix I linked to relates to "SQL 2014 SP1/CU12 & SP2/CU5" so might explain why you're not seeing the fix?
- Anonymous
- Anonymous
- Anonymous
May 17, 2017
That fix just got updated to include SQL 2016 RTM/CU6 & SP1/CU3.- Anonymous
June 03, 2017
Yes!!! Fixed - Microsoft SQL Server 2016 (SP1-CU3)- Anonymous
July 30, 2017
I installed SQLSERVER2016 SP1-CU3 FOR Primary Node and Secondary Node, the issue is still the same for me
- Anonymous
- Anonymous
- Anonymous