Heartbeat alerts for multiple servers triggered at the same time. Receiving event IDs 21006 and 21016 on agent servers.

JCC 101 Reputation points
2023-03-23T13:03:21.89+00:00

Multiple servers triggered heartbeat alerts at the same time on the same day, without the "failed to connect" alert that usually accompanies them. When I logged on to one of the servers and checked the Operations Manager event log, I could see error event IDs 21006 and 21016 being logged continuously. The first thing I did to troubleshoot was stop the health service on the server, delete the health state folder, and start the health service again, but the log kept generating the same event IDs. I also cleared the health state folder on the management server, but no dice. I also confirmed that port 5723 is open from the agent server to the management server. The next thing I'm going to try is to uninstall the agent and re-install it. If this doesn't work, does anyone have any other tips I could try?

Operations Manager

1 answer

  1. SChalakov 10,261 Reputation points MVP
    2023-03-23T13:28:58.2966667+00:00

    Hi JCC,

    I don't think re-installing the agent will help here, but you can of course try it on one system.

    Are the affected agents in the same domain as the management servers, or are they communicating through a Gateway server? If they go through a gateway, please make sure the certificates used for authentication are valid on both the Gateway and the Management Servers.

    The next things you need to check:

    • Does name resolution still work for those clients? Are they able to resolve the FQDN of their management server/gateway?
    • If all of those systems are reporting to the same management server, you need to check whether it is healthy by looking at the "Operations Manager" -> "Management Servers" view, and also check the event log of the respective management server.
    • Please check which management server the agents are configured with; do this locally on the agent. At the same time, check how SCOM sees those agents and whether they are reporting to the same management server (SCOM console: Administration -> Agent Managed). Both sources should show the same management server.
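
The name resolution check above, together with the port 5723 check mentioned in the question, could be scripted on an affected agent roughly as follows. This is only a minimal sketch using the Python standard library, not an official SCOM tool; the function names are made up for illustration, and the server name in the usage note is a hypothetical placeholder.

```python
import socket


def resolve_fqdn(fqdn):
    """Return the sorted IP addresses a name resolves to, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(fqdn, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_agent_path(fqdn, port=5723, timeout=3.0):
    """Combine both checks for one management server or gateway name."""
    addresses = resolve_fqdn(fqdn)
    return {
        "fqdn": fqdn,
        "addresses": addresses,
        "port": port,
        # Only attempt the TCP connection if the name resolved at all.
        "port_open": bool(addresses) and port_is_open(fqdn, port, timeout),
    }
```

For example, `check_agent_path("scom-ms01.contoso.local")` (replace the placeholder with the FQDN the agent is actually configured with) returns a dictionary showing the resolved addresses and whether the TCP connection succeeded. A successful TCP connection only shows the port is reachable; it does not prove that authentication between agent and server works.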

    Please post an update here, let's see what we can do about this.


    (If the reply was helpful please don't forget to upvote and/or accept as answer, thank you)
    Regards
    Stoyan Chalakov