Event 2115 after updating SQL Server to CU15

Roman Annenko 141 Reputation points
2020-12-10T12:56:56.687+00:00

Hi!

I have SCOM 1807, which worked fine until I installed the SQL Server 2016 CU15 update on its database server.
After that, my management server started registering 2115 events with a growing time in the "...has not received a response..." message, and the data access and configuration service monitors turned grey. All 2115 events concern only Operational DB collection workflows:
Microsoft.SystemCenter.CollectDiscoveryData
Microsoft.SystemCenter.CollectEventData
Microsoft.SystemCenter.CollectAlerts
Microsoft.SystemCenter.CollectPublishedEntityState
Microsoft.SystemCenter.CollectSignatureData
Microsoft.SystemCenter.CollectPerformanceData
There's nothing about DWH workflows.
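
To see which workflows are stalling and whether the wait keeps growing, the 2115 descriptions can be tallied per workflow. This is a minimal Python sketch, assuming the event descriptions were exported to plain text and that each one contains "has not received a response in N seconds" followed by a "Workflow Id : ..." line. That layout and the names `summarize_2115` and `is_growing` are assumptions for illustration, so check them against your own events.

```python
import re
from collections import defaultdict

# Assumed layout of a 2115 description; adjust the pattern if your events differ.
EVENT_RE = re.compile(
    r"has not received a response in (\d+) seconds.*?Workflow Id\s*:\s*(\S+)",
    re.DOTALL,
)

def summarize_2115(text):
    """Return {workflow_id: [seconds, ...]} in order of appearance."""
    waits = defaultdict(list)
    for seconds, workflow in EVENT_RE.findall(text):
        waits[workflow].append(int(seconds))
    return dict(waits)

def is_growing(seconds):
    """True if every reported wait is longer than the previous one."""
    return len(seconds) > 1 and all(a < b for a, b in zip(seconds, seconds[1:]))
```

A workflow whose list keeps growing across events is stuck rather than merely slow, which matches the "growing time" symptom described above.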

I know that event 2115 usually points to a DB performance problem, but I see no sign of one.
I checked SQL Server and found it almost idle: no high CPU or disk load, no DB locks, and enough disk and DB space.
It seems the management server isn't posting data to the DB.
Perfmon counters for the OpsMgr DB Write Action modules show zeroes.
Rebooting both servers and clearing the OpsMgr health service state didn't help.
Strangely, the 2nd management server (audit collector) in this management group is still active and its services are green.
I installed the same update on the SQL Server of a test management group, and there is no problem there.

I couldn't find any errors in the event logs on either the management server or the SQL server that point to the cause.
The management server log starts filling with 2115 events 8 minutes after a health service restart.
I tried to find something in the trace logs but found only this, which doesn't seem quite relevant but repeats frequently:

[0]6632.8172::12/10/2020-10:21:43.727 [ConnectorManager] [] [Verbose] :BindDataSource::postPendingDataItems{binddatasource_cpp1056}Preparing to post item 412975 for Management Group 58a1e439-a592-f277-6546-8eb6b756e5b4, Rule 762c9f4a-582a-2bf0-cb9d-51c341ba8bf9, Target 0d828c27-f087-5562-3644-4e35484a7083
[0]6632.8172::12/10/2020-10:21:43.727 [ConnectorManager] [] [Verbose] :BindDataSource::postPendingDataItems{binddatasource_cpp1135}Peek result : 0(ERROR_SUCCESS)
[0]6632.8172::12/10/2020-10:21:43.727 [ConnectorManager] [] [Verbose] :BindDataSource::postPendingDataItems{binddatasource_cpp1056}Preparing to post item 412976 for Management Group 58a1e439-a592-f277-6546-8eb6b756e5b4, Rule 762c9f4a-582a-2bf0-cb9d-51c341ba8bf9, Target 0d828c27-f087-5562-3644-4e35484a7083
[0]6632.8172::12/10/2020-10:21:43.727 [ConnectorManager] [] [Verbose] :BindDataSource::postPendingDataItems{binddatasource_cpp1135}Peek result : 258(WAIT_TIMEOUT)
[0]6632.8172::12/10/2020-10:21:43.727 [ConnectorManager] [] [Information] :BindDataSource::postPendingDataItems{binddatasource_cpp1203}Wrote 4 data items.  Management Group 58a1e439-a592-f277-6546-8eb6b756e5b4, Rule 762c9f4a-582a-2bf0-cb9d-51c341ba8bf9, Target 0d828c27-f087-5562-3644-4e35484a7083

where "Rule 762c9f4a-582a-2bf0-cb9d-51c341ba8bf9" turned out to be the "Performance data collector" rule from the "Data Warehouse Library" management pack.
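
Since these lines repeat so often, it can help to summarize them instead of reading them one by one. This is a minimal Python sketch, assuming the formatted trace log is plain text with lines shaped like the ones quoted above. The function name `summarize_bind_trace` is hypothetical, and the patterns are derived only from those sample lines.

```python
import re
from collections import Counter

GUID = r"[0-9a-fA-F-]{36}"
# Patterns taken from the quoted BindDataSource::postPendingDataItems lines.
WROTE_RE = re.compile(
    rf"Wrote (\d+) data items\.\s+Management Group ({GUID}), Rule ({GUID}), Target ({GUID})"
)
PEEK_RE = re.compile(r"Peek result : (\d+)\((\w+)\)")

def summarize_bind_trace(lines):
    """Return (items written per rule GUID, count of each Peek result name)."""
    items_per_rule = Counter()
    peek_results = Counter()
    for line in lines:
        if "BindDataSource::postPendingDataItems" not in line:
            continue  # ignore traces from other components
        wrote = WROTE_RE.search(line)
        if wrote:
            items_per_rule[wrote.group(3)] += int(wrote.group(1))
            continue
        peek = PEEK_RE.search(line)
        if peek:
            peek_results[peek.group(2)] += 1
    return items_per_rule, peek_results
```

The rule GUIDs in the result still have to be resolved to rule names separately, as was done for 762c9f4a-582a-2bf0-cb9d-51c341ba8bf9.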

Where else should I look to find the cause of the problem?

Operations Manager

Accepted answer
  1. Roman Annenko 141 Reputation points
    2020-12-27T20:49:40.017+00:00

    Well, I've found the cause of the problem (but not the root cause).
    I started migrating all monitoring to a new management server, and after migrating the network monitoring pool I noticed that the 2115 events stopped on the old server and started on the new one.
    I found a couple of monitored network devices for which putting them into maintenance mode stopped the events and let the management server work normally again. Finally, I deleted and rediscovered all network devices, which fully fixed the problem.
    The problematic devices were not rediscovered and were excluded from further monitoring.

    So a monitored network device can break the functioning of a management server, and there is no tool to investigate the cause. Event logs, trace logs, performance counters: none of them gave a hint where to look.

    I wonder if anyone has faced the same situation?


1 additional answer

  1. Roman Annenko 141 Reputation points
    2020-12-18T13:42:26.577+00:00

    Well, I restored the SCOM database server to its state before the SQL Server 2016 CU15 install.
    This didn't fix the problem.

    I installed a second management server and moved the majority of agents there.
    The new server does not experience the 2115 problem, which confirms the problem is/was not with SQL but with the old management server, and that it arose after its reboot.

    Analyzing traces from the management server, I found that some OM subsystem continues to collect data from agents and write it to the local queue and health service store. But another subsystem, responsible for running the OM DB write workflows (like "Collect object state" or "Collect performance data"), isn't working, which causes the 2115 events and eventually overflows the item queue and HS store.
    I've confirmed that on the working management server these modules leave traces as "BaseDbWriteModule":

    [0]5532.10484::12/14/2020-16:56:39.729 [Bid2Etw_Microsoft_Mom_DatabaseWriteModules_1_Trace] [] [Information] :{Information_TextW}<BaseDbWriteModule.OnNewDbDataItamsWorker> PerformanceWriteModule module's worker thread is acking the dataitem.
    [0]5532.10484::12/14/2020-16:56:39.729 [Bid2Etw_Microsoft_Mom_DatabaseWriteModules_1_Trace] [] [Information] :{Information_TextW}<BaseDbWriteModule.OnNewDbDataItamsWorker> PerformanceWriteModule module's worker thread is calling the completion callback.
    [0]5532.10484::12/14/2020-16:56:39.729 [Bid2Etw_Microsoft_Mom_DatabaseWriteModules_1_Trace] [] [Information] :{Information_TextW}<BaseDbWriteModule.OnNewDbDataItamsWorker> PerformanceWriteModule module's worker thread is requesting the next dataitem.
    [0]5532.10484::12/14/2020-16:56:39.729 [Bid2Etw_Microsoft_Mom_DatabaseWriteModules_1_Trace] [] [Information] :{Information_TextW}<BaseDbWriteModule.OnNewDbDataItamsWorker> PerformanceWriteModule module's worker thread is done.
    [0]5532.10484::12/14/2020-16:56:39.831 [Bid2Etw_Microsoft_Mom_DatabaseWriteModules_1_Trace] [] [Information] :{Information_TextW}<BaseDbWriteModule.OnNewDbDataItams> PerformanceWriteModule module has received new dataitem.
    [0]5532.10484::12/14/2020-16:56:39.831 [Bid2Etw_Microsoft_Mom_DatabaseWriteModules_1_Trace] [] [Information] :{Information_TextW}<BaseDbWriteModule.OnNewDbDataItamsWorker> PerformanceWriteModule module's worker thread has received the dataitem.
    

    On my broken server these traces are missing. The sources for these OMDB write workflows are SCOM system provider modules like "Microsoft.SystemCenter.PublishedAlertProvider". Apparently these providers are not working: they do not supply items to the OMDB write workflows and do not trigger them. But since these providers are implemented as "native class modules", there's no information about their implementation other than:

    <Native>
       <ClassID>C3339855-80B3-4c06-B7AB-5C5D97B59A0D</ClassID>
    </Native>
    

    So my investigation stopped here. I can find neither information about which subsystem runs these providers, nor errors about it in the traces, nor a way to restore these providers.
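
The comparison between the working and the broken server traces described above can be scripted. This is a minimal Python sketch, assuming both formatted trace logs are plain text with "BaseDbWriteModule" lines shaped like the ones quoted earlier. The function names `count_db_write_traces` and `silent_modules` are hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "<BaseDbWriteModule.OnNewDbDataItams> PerformanceWriteModule module ..."
MODULE_RE = re.compile(r"<BaseDbWriteModule\.\w+>\s+(\w+) module")

def count_db_write_traces(lines):
    """Count BaseDbWriteModule trace lines per write module name."""
    counts = Counter()
    for line in lines:
        match = MODULE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def silent_modules(working_lines, broken_lines):
    """Modules that trace on the working server but never on the broken one."""
    working = count_db_write_traces(working_lines)
    broken = count_db_write_traces(broken_lines)
    return sorted(name for name in working if broken[name] == 0)
```

This only confirms which write modules are silent. It does not explain why the providers feeding them have stopped.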

    Any idea?
