SCOM HealthService service stopped on all management servers at 2:00 AM

Question

Jesus Chao 156

Hi,

We have two separate SCOM environments (one for test/dev servers and one for production) that experienced a strange issue over the past weekend. On every management server in both environments, the HealthService stopped at 2:00 AM. There are 4 management servers in Production and 2 in Test/Dev. The environments share only a domain, network, and virtualization infrastructure. Everything else is separate (different databases, management groups, etc.).

Does anyone know of any processes (clean-up, etc.) that run at 2:00 AM and may have caused the health service to stop completely? This event is recorded in the Operations Manager log just before the service stops.

Provider [Name]: Health Service ESE Store
EventID: 327 [Qualifiers: 0]
Level: 4
Task: 1
Keywords: 0x80000000000000
TimeCreated [SystemTime]: 2022-12-04T07:00:03.911475400Z
EventRecordID: 185075
Channel: Operations Manager

HealthService (2604,D,51) Health Service Store: The database engine detached a database (1, C:\Program Files\Microsoft System Center\Operations Manager\Server\Health Service State\Health Service Store\HealthServiceStore.edb). (Time=0 seconds)

Revived Cache: 0 0
Additional Data: lgposDetach = 00006B03:000F:0000

Internal Timing Sequence:
[1] 0.000012 +J(0)
[2] 0.000001 +J(0)
[3] 0.000107 +J(0)
[4] 0.000001 +J(0)
[5] 0.0 +J(0)
[6] 0.009554 -0.005641 (3) WT +J(0) +M(C:-84K, Fs:10, WS:-68K # 0K, PF:-64K # 0K, P:-64K)
[7] 0.001785 +J(0)
[8] 0.002098 -0.000857 (1) WT +J(CM:0, PgRf:0, Rd:0/0, Dy:0/0, Lg:4096/2) +M(C:0K, Fs:1, WS:4K # 0K, PF:0K # 0K, P:0K)
[9] 0.010725 -0.005817 (6) WT +J(0) +M(C:0K, Fs:3, WS:-60K # 0K, PF:-68K # 0K, P:-68K)
[10] 0.001253 +J(0)
[11] 0.000387 +J(0) +M(C:0K, Fs:2, WS:0K # 0K, PF:52K # 0K, P:52K).
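
For anyone lining this event up against the 2:00 AM stop: the SystemTime above is in UTC, so it has to be shifted to local time before comparing it with other logs. The snippet below is only a minimal sketch of that conversion. The UTC-5 offset is an assumption (it is what makes 07:00 UTC land on 2:00 AM), and `parse_event_time` is a hypothetical helper, not anything provided by SCOM.

```python
from datetime import datetime, timedelta, timezone


def parse_event_time(system_time: str) -> datetime:
    """Parse an event SystemTime such as 2022-12-04T07:00:03.911475400Z (UTC)."""
    stamp = system_time.rstrip("Z")
    whole, _, fraction = stamp.partition(".")
    # The event log reports 100 ns precision; datetime keeps microseconds only,
    # so pad or truncate the fractional part to exactly 6 digits.
    micros = (fraction + "000000")[:6]
    parsed = datetime.strptime(f"{whole}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


# Assumption: the management servers run at UTC-5 (not stated in the thread).
ASSUMED_LOCAL = timezone(timedelta(hours=-5))

event_utc = parse_event_time("2022-12-04T07:00:03.911475400Z")
event_local = event_utc.astimezone(ASSUMED_LOCAL)
print(event_local.isoformat())
```

Under that assumed offset, the detach event falls a few seconds after 2:00 AM local time, which matches the time the service stopped.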

Dwayne 1 Reputation point

2023-10-26T01:58:56.2266667+00:00

I'm having a similar issue, but it's only impacting half the fleet of SCOM 2022 UR1 servers (Windows Server 2022). That is, we get a clean shutdown of the agent and no corresponding errors in the event logs to explain why. Tracing has been unhelpful. Did you ever resolve this? I can also confirm 100% that it's not the suggested answer, as we don't see similar behavior and do not use that Veeam 'rubbish' MP. Well, that's my honest opinion from trialing it back in our old 2012 R2 test environment. Other solutions scaled much better and made less of a mess, so it never went past 'trial'.
Dwayne 1 Reputation point

2023-10-26T02:11:31.5633333+00:00

And before anyone asks about management packs etc.: the servers are split between multiple resource pools, and at least one server works while another has this issue. It's also always the same servers, so the servers that work have the same MPs deployed to them. Timing of the failure is variable.
Jesus Chao 156 Reputation points

2023-10-26T12:22:23.61+00:00

Hi Dwayne - no, a solution never presented itself; however, we have not seen this issue occur in quite some time. In the past it seemed to occur at least once a month. I assume an update resolved the issue.

Answer 1

SChalakov 10,576

Hi Jesus (@Jesus Chao ),

Can you please post some more info about the SCOM version and UR level in both environments?
Does this happen on a regular basis, or did it happen just once?
When did you install those MGs?
Do you have MPs (for example, VMware) that are installed in both environments and could cause this?

Reference:
Unstable Behavior from Ops Mgr Health Service
https://helpcenter.veeam.com/docs/mp/vmware_guide/unstable_health_service_behavior.html?ver=90

----------

(If the reply was helpful please don't forget to upvote and/or accept as answer, thank you)
Regards
Stoyan Chalakov

Jesus Chao 156 Reputation points

2022-12-08T16:01:24.18+00:00

We are on SCOM 2019 UR3 on both environments

This has happened just once that I can remember. I thought it was really odd that it happened at the exact same moment on 6 management servers, 2 of which are in a separate MG. It almost sounds like an internal SCOM process that didn't go well, hence my question, as I am unaware of any maintenance processes that might cause this.

The MGs have been active for over a year.

Naturally there are several MPs that are shared between environments that are standard MPs that MS provides (like SQL, IIS, Cluster). If you mean 3rd party MPs, we only have a few. So for example, Opslogix Ping monitor and Kevin Holman's SQL Run-As Config MP. All the others are provided by MS.

Answer 2

Dwayne 1

In general, scheduled DB maintenance should not stop the agents unless it causes connectivity issues (the OpsMgr logs would show SDK issues). All agents in a SCOM environment should not fail at once unless there is a prolonged outage to the DB. That is my experience with systems of various sizes, up to 30 management servers, over the last 12 years.

If that turns out to be the cause, Kevin Holman covers it under DAL on this page: https://kevinholman.com/2017/03/08/recommended-registry-tweaks-for-scom-2016-management-servers/

SChalakov 10,576 Reputation points

2023-10-26T09:39:36.6433333+00:00

Hi Dwayne,

I fully agree with:

in general any DB maintenance scheduled should not stop the agents unless there are connectivity issues caused by it (opsmgr logs would show sdk issues) all agents in a scom environment should not fail at once unless there is a prolonged outage to DB

The registry tweaks are also very important. I would also recommend going through this:

System Center Operations Manager (SCOM) Management Group Performance Optimizations

Hope you find it useful.

(If the reply was helpful please don't forget to upvote or accept as answer, thank you)
Regards,
Stoyan
Jesus Chao 156 Reputation points

2023-10-26T12:20:15.2733333+00:00

in general any DB maintenance scheduled should not stop the agents unless there are connectivity issues caused by it (opsmgr logs would show sdk issues) all agents in a scom environment should not fail at once unless there is a prolonged outage to DB, well from my experience in various sized systems up to 30 management servers over the last 12 years...

Wow - I posted this a long time ago. Dwayne, to be clear, the agent (HealthService) restarts ONLY occurred on the management servers and not the member agents. When I asked about an internal process, my thought was perhaps there was a clean up job that would run on the MANAGEMENT servers that may cause the issue. I thought it was odd that two different SCOM environments that are not connected in any way other than being on the same domain would have all the Management Server HealthService STOP at the same time on the same days.

To be honest - we have not seen this behavior in quite some time (that I have noticed, anyway). For a while it seemed to occur about once a month. Perhaps an update fixed the issue.
