Incorrect Deployment of Database Replica in SCCM 2012 May Cause Client “Fast Channel” Notification Problems

Recently, I learned about an issue in which customers running SCCM 2012 SP1 and above may find large numbers of .BOS files in the bgb.box\bad folder on their Primary Site server. BGB is more commonly known as Client “Fast Channel” Notification, a feature introduced in SCCM 2012 SP1. If you’re having this problem, hopefully the information below will help you understand what is going on.

Background

As explained in an excellent blog article that can be found here, ConfigMgr 2012 has a system known as “Fast Channel” notification. The purpose of this feature is to allow ConfigMgr to notify clients of time-sensitive tasks such as changes in client policy. Previously, clients could wait a long time before becoming aware of changes made in ConfigMgr. Now, clients can be made aware of these changes very quickly through “Fast Notification”.

In the Fast Notification system, there are 3 components that are of relevance:

  1. Notification Manager – hosted by the site server, it generates push messages for clients and stores the results of these messages (along with client status) in the site database
  2. Notification Server – automatically installed on every Management Point. It monitors a SQL Service Broker queue for push messages from the Notification Manager and delivers them to clients through its TCP and HTTP listeners.
    1. The Notification Server also sends notification files (*.BOS files) with information about client push activities to be stored in the site server database
  3. Notification Agent – a component of CCMEXEC that establishes a persistent connection with its Notification Server (the MP) and uses that connection to receive push messages and send status information to the Notification Server

The system typically works very well with no intervention being necessary. When the MP role is configured, the Notification Server component is set up automatically. When the CCM client is initialized, the Notification Agent is also initialized automatically. As long as there are no problems, Fast Notification functions with most administrators not even knowing it’s there. But in at least one case, problems can be introduced that make intervention necessary.

The Problem

There is at least one scenario in which Fast Notification breaks. In this scenario, a growing number of online status notification files (files with a *.BOS extension) accumulate in the BGB.BOX\BAD folder on the site server, and error messages appear in bgbmgr.log. Depending on the size of the environment, this may mean thousands of bad files and an equal number of error messages in the log.

The Cause

In ConfigMgr 2012 it is possible to create a SQL database replica of the site server database on a remote Management Point. This may be done to enhance performance or allow for fault tolerance in an environment. But if it’s not done correctly, the problems mentioned above may result.

One of the essential components of the Fast Notification system is that the Management Point (the Notification Server, specifically) needs to report information back to the site database maintained by the Notification Manager. When a SQL Replica is being configured, the Management Point begins to point toward the replica database rather than the site server database. The interface where this is configured in the ConfigMgr Admin Console is shown below:

[Screenshot: Management Point database settings in the ConfigMgr Admin Console]

If the configuration of the SQL Replica on the Management Point is incomplete or has not been done properly (if the database has not successfully been replicated to the MP, for instance), Fast Notification will fail to work properly.

When this happens, the Notification Server on the MP tries to send its updates to the database replica rather than the site server database, but the problems with the SQL replica setup prevent the notification files from being processed correctly. As a result, *.BOS files are placed in the BGB.BOX\BAD folder and error messages are written to bgbmgr.log on the site server.

Under the Hood

The Fast Notification system relies on SQL Server Service Broker to communicate the change notifications being sent from the Notification Manager to the client. The Notification Server (located on the Management Point) monitors Service Broker for new notifications. In order to do this, each Notification Server needs a queue in the SQL database to monitor. These queues are stored in the SQL table BGB_Server and can be seen by running the following query:

Select * from BGB_Server

The screenshot below shows my lab environment when all of my MPs are using the site server database (DBID is a reference to the ID of the database the MP is currently using as part of the Fast Notification system):

[Screenshot: BGB_Server query results, with ServerName and DBID highlighted]

There are other fields in this table, but for our purposes we are only interested in ServerName and DBID, which have been highlighted.

As stated already, this is the default configuration. All Management Points in a site use the site database and all list a DBID pointing to the site they report into.

Along with this table, there is a view that is important in determining Fast Notification behavior in a site. The following SQL view gives very similar information to what we find in BGB_Server:

Select * from v_BgbMP

Note in the screenshot below that the ServerName and DBID fields are identical to what we found in BGB_Server:

[Screenshot: v_BgbMP query results]

When the Fast Notification system is first set up, a queue is configured for each existing Management Point to enable it to participate as a Notification Server. The queue, as shown in the BGB_Server table above, indicates the name of the Notification Server and the database instance being used. In the case of S12-MP1, it is using the database instance used by the site server for PR1.

In order for a queue to be set up (thus allowing Fast Notification to work), BGB needs to be able to verify that the SQL database to be used by the Notification Server is accessible. If not, the queue will not be set up and messages such as the following may be reported in BGBServer.log:

ERROR: Can't retrieve SQL connection. Exception: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)

ERROR: Exception: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

Using SQL Replica Databases for Management Points

The errors above are fairly rare and as long as the Management Points are configured to use their site database, these errors are unlikely to occur. However, in environments that have decided to configure database replicas for some or all of their Management Points, problems may occur.

To begin, setting up a database replica is one of the more complicated tasks in ConfigMgr 2012. There is an excellent prescriptive document (linked below) that, if followed precisely, will result in a functioning SQL replica. But because the document is very long and detailed, there is ample opportunity for mistakes that could impact the Fast Notification system.

NOTE: Even if there are no mistakes in setting up the database replica for a Management Point, Fast Notification for that MP may break if you redirect the Management Point to the database replica before setup is complete. Because setup must be paused while the SQL database replicates to the MP, administrators who redirect early should expect that site system not to participate in Fast Notification until the replica is completely configured (that is, until SQL Broker has completed its part of setup). During this time, bad .BOS files should be expected in the BAD folder beneath BGB.BOX on the site server. Once the replica is complete and all settings are confirmed, these files should stop appearing in the BAD folder.

Because this is a complicated process, it is highly recommended that you be thoroughly familiar with the process of building database replicas for Management Points (MPs) before attempting it in your environment. Making mistakes or leaving out steps in this process is likely to impact the Fast Channel Notification system, as well as cause additional issues in your environment.

The process of configuring a database replica for the MPs in your environment can be found below:

Configure Database Replicas for Management Points

How Database Replicas Impact Client Fast Channel Notification

Building a database replica has no impact on your MPs or on the Fast Channel component of ConfigMgr until you point your MP at the replica and start using it. If the database replica is configured correctly when you do this, no problems will result. If you do it prematurely, or if there have been mistakes in the deployment of your database replica, the issue noted at the start of this post can result.

In the present scenario, the database replica has deliberately been left only partially complete so that the problems described above can be illustrated. When the Management Point begins to use the database replica, Fast Notification changes as well, and bad .BOS files suddenly begin appearing in the BGB.BOX\BAD folder on the site server.

First, let’s look at the changes to the Fast Notification tables in the site server database. Previously, when querying the v_BgbMP view on the site server, we received the following result:

[Screenshot: earlier v_BgbMP query results]

Similarly, when querying BGB_Server, we saw the following:

[Screenshot: earlier BGB_Server query results]

Now, when we query these same tables, we get the following results:

Select * from v_BgbMP

[Screenshot: v_BgbMP query results after redirecting the MP to the replica]

Select * from BGB_Server

[Screenshot: BGB_Server query results after redirecting the MP to the replica]

Note that v_BgbMP now shows S12-MP1 with a new DBID, which is a hexadecimal string rather than the three-character site code we saw previously. Meanwhile, S12-MP1 has disappeared altogether from BGB_Server! Looking further, we find that .BOS files have started to appear in the bgb.box\bad folder:

[Screenshot: .BOS files in the bgb.box\bad folder]

Comparing the time when these files started appearing, we can see that they began to accumulate when we pointed our Management Point to the database replica. A new file will appear every 5 minutes for each MP that has been set to use the database replica.

Also, if you check the ConfigMgr logs, errors similar to the following will be reported (NOTE: Verbose and Diagnostic Logging are enabled on both servers, so if your results differ consider enabling additional logging):

BGBMGR.LOG on Site Server

SQL MESSAGE: spLogEntry - ERROR: Failed to send message to BGB server 50331672. SQL Error: 50000 SQL Message: ERROR 50000, Level 16, State 1, Procedure spGetSSBDialogHandle, Line 58, Message: Route is not defined for target site with service name ConfigMgrBGB_Site0x8eae7c6bf36b15ed2ea186928fcf6c4e. SMS_NOTIFICATION_MANAGER 9/9/2014 9:43:05 AM 5112 (0x13F8)

SQL MESSAGE: spLogEntry - ERROR: Failed to setup conversation for BGB server S12-MP1.w2k12-lab.local for db 0x8eae7c6bf36b15ed2ea186928fcf6c4e SQL Error: 266 SQL Message: Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 1, current count = 0. SMS_NOTIFICATION_MANAGER 9/9/2014 9:43:05 AM 5112 (0x13F8)

SQL MESSAGE: spLogEntry - ERROR: Failed to setup conversationi for S12-MP1.w2k12-lab.local with DB ID 0x8eae7c6bf36b15ed2ea186928fcf6c4e SMS_NOTIFICATION_MANAGER 9/9/2014 9:43:05 AM 5112 (0x13F8)

BGBServer.log on Management Point

Generated BGB online status DELTA report C:\SMS\MP\OUTBOXES\bgb.box\Bgbb91uq.BOS (version: 1745) at 09/09/2014 09:43:31 SMS_NOTIFICATION_SERVER 9/9/2014 9:43:31 AM 3416 (0x0D58)
Retrieving SQL connection... SMS_NOTIFICATION_SERVER 9/9/2014 9:44:18 AM 3420 (0x0D5C)
ERROR: Can't retreive SQL connection. Exception: Cannot open database "cm_mp1" requested by the login. The login failed.~~Login failed for user 'NT AUTHORITY\SYSTEM'. SMS_NOTIFICATION_SERVER 9/9/2014 9:44:18 AM 3420 (0x0D5C)
ERROR: Don't have SQL connection when get client certificate for client (Type: SCCM ID: GUID:AF48CDA6-5F4A-404B-A362-0D347619F36B) SMS_NOTIFICATION_SERVER 9/9/2014 9:44:18 AM 3420 (0x0D5C)
ERROR: Can't do post authentication without client certificate stored in regsitration. SMS_NOTIFICATION_SERVER 9/9/2014 9:44:18 AM 3420 (0x0D5C)
ERROR: Failed to authenticate with client [fe80::ccaa:80d:3d61:6148%12]:60165. SMS_NOTIFICATION_SERVER 9/9/2014 9:44:18 AM 3420 (0x0D5C)
ERROR: Don't have SQL connection when get client certificate for client (Type: SCCM ID: GUID:AF48CDA6-5F4A-404B-A362-0D347619F36B) SMS_NOTIFICATION_SERVER 9/9/2014 9:45:18 AM 3420 (0x0D5C)
ERROR: Can't do post authentication without client certificate stored in regsitration. SMS_NOTIFICATION_SERVER 9/9/2014 9:45:18 AM 3420 (0x0D5C)
ERROR: Failed to authenticate with client [fe80::ccaa:80d:3d61:6148%12]:60167. SMS_NOTIFICATION_SERVER 9/9/2014 9:45:18 AM 3420 (0x0D5C)
ERROR: Don't have SQL connection when get client certificate for client (Type: SCCM ID: GUID:AF48CDA6-5F4A-404B-A362-0D347619F36B) SMS_NOTIFICATION_SERVER 9/9/2014 9:45:23 AM 3420 (0x0D5C)
Can't verify signature in message without client certificate for client SCCM GUID:AF48CDA6-5F4A-404B-A362-0D347619F36B SMS_NOTIFICATION_SERVER 9/9/2014 9:45:23 AM 3420 (0x0D5C)
Invalid hook to be decoded. Authentication SMS_NOTIFICATION_SERVER 9/9/2014 9:45:23 AM 3420 (0x0D5C)
Failed to process SignIn message from client fe80::ccaa:80d:3d61:6148%12:60166. SMS_NOTIFICATION_SERVER 9/9/2014 9:45:23 AM 3420 (0x0D5C)
Wait 300 seconds for notifications... SMS_NOTIFICATION_SERVER 9/9/2014 9:46:28 AM 2060 (0x080C)
Retrieving push tasks from database... SMS_NOTIFICATION_SERVER 9/9/2014 9:46:28 AM 4740 (0x1284)
Retrieving SQL connection... SMS_NOTIFICATION_SERVER 9/9/2014 9:46:28 AM 4740 (0x1284)
ERROR: Can't retreive SQL connection. Exception: Cannot open database "cm_mp1" requested by the login. The login failed.~~Login failed for user 'NT AUTHORITY\SYSTEM'. SMS_NOTIFICATION_SERVER 9/9/2014 9:46:28 AM 4740 (0x1284)
ERROR: Don't have SQL connection when retrieve push tasks SMS_NOTIFICATION_SERVER 9/9/2014 9:46:28 AM 4740 (0x1284)
Wait 300 seconds to restart Task Manager... SMS_NOTIFICATION_SERVER 9/9/2014 9:46:28 AM 4740 (0x1284)
Retrieving online resync flag from database... SMS_NOTIFICATION_SERVER 9/9/2014 9:48:32 AM 3416 (0x0D58)
Retrieving SQL connection... SMS_NOTIFICATION_SERVER 9/9/2014 9:48:32 AM 3416 (0x0D58)
ERROR: Can't retreive SQL connection. Exception: Cannot open database "cm_mp1" requested by the login. The login failed.~~Login failed for user 'NT AUTHORITY\SYSTEM'. SMS_NOTIFICATION_SERVER 9/9/2014 9:48:32 AM 3416 (0x0D58)
ERROR: Don't have SQL connection when get resync flag SMS_NOTIFICATION_SERVER 9/9/2014 9:48:32 AM 3416 (0x0D58)
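
When an excerpt like the ones above runs to thousands of lines, it can help to tally the distinct errors rather than read them one by one. The following is a minimal Python sketch, not part of ConfigMgr. It assumes lines laid out as they appear in this post (message, component name, date, time, thread ID), which is not necessarily how the raw log file on disk stores them, and the names `parse_line` and `count_errors` are my own.

```python
import re
from collections import Counter

# Layout assumed from the excerpts in this post (an assumption, not a documented format):
#   <message> <COMPONENT> <m/d/yyyy> <h:mm:ss AM|PM> <thread id> (<hex thread id>)
LINE_RE = re.compile(
    r"^(?P<message>.*)\s"
    r"(?P<component>SMS_[A-Z_]+)\s"
    r"(?P<date>\d{1,2}/\d{1,2}/\d{4})\s"
    r"(?P<time>\d{1,2}:\d{2}:\d{2}\s[AP]M)\s"
    r"(?P<thread>\d+)\s\(0x[0-9A-Fa-f]+\)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match the layout."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def count_errors(lines):
    """Count messages containing 'ERROR:', keyed by (component, message)."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and "ERROR:" in entry["message"]:
            counts[(entry["component"], entry["message"])] += 1
    return counts
```

Feeding it the pasted lines and printing `count_errors(lines).most_common()` gives a quick view of which failure dominates, for example the repeated SQL login failure against the replica database.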

The obvious question is “what has happened and how do we fix it?” First, let’s discuss what has happened.

When Fast Notification is first set up, a stored procedure is run called sp_BgbSetupQueue. You can see that this ran by reviewing ConfigMgrSetup.log on your site server. In my lab, it has only run a single time even with all of the changes I’ve made to it. Its code (along with the code for all stored procedures discussed in this section) can be found in Appendix B below.

After the Fast Notification queues are set up, another stored procedure named sp_BgbSyncBgbServer runs periodically. Its purpose is to synchronize the Management Point computers with the BGB servers (the Notification Server component that is supposed to reside on every Management Point). According to the comments in the stored procedure, it does the following:

  1. Sync the BGB server list with the MP server list
  2. Delete conversations for old BGB servers
  3. Setup conversations for new BGB servers

As part of this work, sp_BgbSyncBgbServer calls another stored procedure, sp_BgbDeleteQueue. sp_BgbDeleteQueue queries for the ServerName and DBID of each row in the BGB_Server table that does not also exist in the v_BgbMP view. If the query returns any rows, it deletes those rows from BGB_Server.

NOTE: If v_BgbMP and BGB_Server return the same information (each row in v_BgbMP is also in BGB_Server and both have the same information in ServerName and DBID), then no rows are returned to sp_BgbDeleteQueue and nothing is deleted from BGB_Server

In our case we have shifted the MP to use the replica database before the replica is completed (the same result would apply if it completed but some steps were missed or done incorrectly). Because of this, the DBID field for the Management Point row in v_BgbMP is now different from the DBID field in BGB_Server. Thus, when sp_BgbDeleteQueue is called, its query returns the Management Point row, and it deletes the row representing S12-MP1 from BGB_Server.

Because BGB_Server no longer has a row for the Management Point that has started using the MP Replica, the .BOS files using that Notification Server cannot be processed and are placed in bgb.box\bad.
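
To make the comparison concrete, here is a small Python model of the delete step as described above. It only illustrates the logic and is not the stored procedure's actual code. The function name and the second Management Point (S12-MP2) are hypothetical, and the DBID values are modelled on the lab output shown earlier.

```python
def sync_bgb_server(bgb_server, v_bgbmp):
    """Split BGB_Server rows into (kept, deleted).

    A row is kept only if v_BgbMP contains the same (ServerName, DBID) pair;
    every other row is treated as stale and deleted.
    """
    expected = set(v_bgbmp)
    kept = [row for row in bgb_server if row in expected]
    deleted = [row for row in bgb_server if row not in expected]
    return kept, deleted


# BGB_Server still holds the site-code DBID for both MPs.
bgb_server = [("S12-MP1", "PR1"), ("S12-MP2", "PR1")]

# v_BgbMP now reports the replica's DBID for S12-MP1 (S12-MP2 is hypothetical).
v_bgbmp = [
    ("S12-MP1", "0x8eae7c6bf36b15ed2ea186928fcf6c4e"),
    ("S12-MP2", "PR1"),
]

kept, deleted = sync_bgb_server(bgb_server, v_bgbmp)
print(deleted)  # [('S12-MP1', 'PR1')]
```

The model covers only the deletion. In the real system the sync procedure would then try to set up a conversation for the MP under its new DBID, which is the step that fails in bgbmgr.log while the replica is incomplete, so the row is not added back.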

Resolving the Problem

To resolve this problem, there are two choices:

  1. Redirect the Management Point back to the Primary Site database.  This will quickly clear up the issue, though it doesn’t serve as a long term fix
  2. Complete (or correct) the database replica setup so that your MP can participate in Fast Channel Notification once again

To confirm your database replica is set up correctly, carefully review every step you have gone through to configure your database replica.  Pay particular attention to the following items that are easy to overlook:

  1. Ensure that all certificates are properly exchanged and that communication between the databases is functioning (both SQL Replication and SQL Broker)
  2. Verify that both the Primary Site database and the replica database are using the same SQL Broker port
  3. If a named instance has been used for the replica database, make sure the stored procedures you ran when setting up SQL Broker explicitly point to that named instance

The document referenced above on setting up database replicas goes into this in depth, so I won’t repeat everything here. Suffice it to say that once you have confirmed your database replica has been completely and successfully deployed, the Fast Channel notification issues should disappear.

Confirming the Problem is Resolved

Once you have completed all phases of setting up the Management Point database replica, you can confirm whether BGB is still experiencing the problems described above under “How Database Replicas Impact Client Fast Channel Notification”.

By looking at the following tables, we can see the effect that successful completion of the MP database replica has had on BGB.

First we start by reviewing the v_BgbMP view by running the following query:

Select * from v_BgbMP

The results of this view are much the same as before. S12-MP1 still shows a hexadecimal string rather than a three-character site code for its DBID value.

Next, we check BGB_Server with the following query:

Select * from BGB_Server

Whereas before, S12-MP1 had disappeared from this table, it has now been re-added. This signifies that BGB is once again functioning properly.

To confirm this, we can check the BGB.BOX\BAD folder to verify that no further .BOS files are being submitted. These files are submitted every 5 minutes, so if some remain from before, delete them and wait long enough to see whether more are added. The BAD folder should remain empty.
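
If you would rather not watch the folder by hand, a check along these lines could be scripted. This is a minimal Python sketch under my own assumptions: the function name `recent_bos_files` is hypothetical, you supply the full path to your BGB.BOX\BAD folder yourself, and the default window of 600 seconds is simply two of the 5-minute cycles mentioned above.

```python
import time
from pathlib import Path


def recent_bos_files(bad_folder, max_age_seconds=600, now=None):
    """Return .BOS files in bad_folder modified within max_age_seconds, newest first."""
    now = time.time() if now is None else now
    recent = []
    for path in Path(bad_folder).iterdir():
        # Match the extension case-insensitively (.BOS or .bos).
        if path.is_file() and path.suffix.lower() == ".bos":
            mtime = path.stat().st_mtime
            if now - mtime <= max_age_seconds:
                recent.append((mtime, path))
    recent.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in recent]
```

An empty list after a couple of cycles suggests that no new bad files are arriving. Older files left over from before the fix fall outside the window and are ignored.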

Conclusion

Many of you will never need to use database replicas in your environment. If you do, hopefully you won’t run into any problems (such as the one described in this article) in the process. And if you do run into them, hopefully the information in this post will help you understand what’s going on and guide you to a solution.