SCOM 2016 - Domain Controller agent with Server 2019 fails to authenticate
Hello guys,
I've been facing a very odd issue in my SCOM 2016 environment and, after many failed attempts, I can't figure out what to do next. To make things clear, here is what my current scenario looks like:
- This is a multi-forest environment with more than 1,000 agent-managed systems
- The Management Server is in the ACME.com forest, and we have a Gateway Server for every monitored forest since there is no trust relationship between them. Certificate authentication is used between the Management Server and each Gateway Server
- We recently deployed a new forest running Windows Server 2019. Half of its machines are working while the other half are not, even though they are all on the same subnet and in the same domain
- The non-working machines hold the Domain Controller role and log event IDs 20070 and 20071 in the Operations Manager log, while my Gateway Server logs event ID 20002. I am not sure whether that is because authentication fails, but these agents never show up under Pending Management on my Management Server, so no events about them are logged there
Note: I already have some Windows Server 2019 DCs being monitored in other forests and they all work just fine, so I did not look into any MP-related issue
Now for these particular DCs which are not working, here is what I have already checked:
-- 'NT AUTHORITY\SYSTEM' is allowed on HSLOCKDOWN.exe and HealthService is running properly using this Local System account
-- From the DCs, Test-NetConnection to port 5723 of the Gateway Server succeeds
-- DNS is hosted on Infoblox, but regardless of that, I can resolve my Gateway Server's NetBIOS name, FQDN and IP address just fine from any of these DCs
-- Checked HKLM:\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\<MyMG>\Parent Health Services\0 and both Authentication Name and Network Name are correct
-- Tried deleting HKLM:\SYSTEM\CurrentControlSet\Services\HealthService\Parameters\Management Groups\<MyMG> and then restarting the HealthService service
-- Tried uninstalling agents using MOMAgent.msi and rebooting the server
-- Tried rebooting my Management Server, Gateway Server and Agents, but still no luck
-- I do have certificates installed on these DCs (IPSEC + Client auth, plus SSL), but by design I never imported them on my agents, either with MOMCertImport.exe or manually by adding them to the registry, because our Gateway does everything required as it is marked to act as a proxy
-- Using 'StartTracing.cmd VER' I can see that the agent first tries to load certificates but does not find them (see previous note), and later tries to use Kerberos to authenticate. It even establishes a connection with my GW for a few seconds, but then closes it and the GW logs event ID 20002
-- Since this seemed Kerberos related, I also ran Test-NetConnection on port 88 using the NetBIOS name, FQDN and domain, and they all came back successful
-- These DCs also seem to have all the required SPNs: HOST, RestrictedKrbHost, TERMSRV, WSMAN for both NetBIOS name and FQDN
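For anyone who wants to repeat the name resolution and port checks above without PowerShell, here is a minimal Python sketch. The function names are only illustrative and any host name you pass in (for example gw01.example.com) is a placeholder, not my real Gateway Server. It only tests that a name resolves and that a TCP connection can be opened on a port such as 5723 or 88; it says nothing about whether Kerberos or the SCOM mutual authentication itself succeeds.

```python
import socket


def can_resolve(name):
    """Return the sorted IP addresses that `name` resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name resolution.
        return False


def run_checks(names, ports):
    """Collect resolution results per name and reachability per (name, port) pair."""
    return {
        "resolution": {name: can_resolve(name) for name in names},
        "ports": {
            (name, port): tcp_port_open(name, port)
            for name in names
            for port in ports
        },
    }
```

Calling run_checks(["gw01.example.com"], [5723, 88]) from a DC would return a dictionary with the resolved addresses and a True/False result for each name and port combination.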
Anyway, as you can see, I've tried a lot of things and none of them worked, so I am running out of ideas. Could any of you shed some light on this?
I really appreciate any support.
Best regards,
Diego