SCOM Linux agent no longer working after upgrade to 2019

Jasper Van Damme 111 Reputation points
2020-09-21T11:09:24.18+00:00

Hi everyone,

A while ago we migrated our SCOM 1807 test environment to SCOM 2019. We did this by migrating the database from SQL Server 2014 to 2016 and setting up two new management servers on a supported OS (Windows Server 2016).
We decommissioned the old SQL server and the two old management servers.

However, since the upgrade our Linux agent data collection no longer seems to be working.

Here are the steps that we have taken to migrate the Linux monitoring:

  • Cross-signed the certificates with the tool to the new management servers
  • Added the new management servers to the Linux resource pool
  • Made sure all Linux Run As profiles were migrated to the Linux pool
  • Removed the old management servers from the pool
  • Used Kevin Holman's registry keys to optimize for large environments (only 11 Linux agents are active at the moment)
  • Installed UR2 and the Linux management packs that come with it

After the upgrade I have noticed that:

  • Windows agents are working
  • Network monitoring is working
  • No errors or warnings on either management server, with the exception of the occasional SNMP error (due to ports that no longer exist)
  • Linux performance data is no longer collected
  • Monitoring of Linux machines in general no longer seems to work
  • The verbose logs of the modules on the management servers don't tell me anything
  • I can access the Linux servers from the management servers with PuTTY, using the account that is defined in the Run As profiles (the same account is used for the privileged and action accounts in test)
  • TCP 1270 is open from the management servers as well
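
The port check in the last bullet could be scripted along these lines. This is a minimal sketch, not part of SCOM: the function name `can_connect` and the host name in the usage note are made up for illustration. Note that a successful TCP connect only shows that the port is reachable; it does not exercise the web service layer that the agent uses on top of it.

```python
import socket


def can_connect(host: str, port: int = 1270, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration. It only tests TCP reachability
    and says nothing about TLS, HTTP, or any proxy in between.
    """
    try:
        # create_connection resolves the name and opens a TCP socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors
        return False
```

For example, `can_connect("linux01.example.com")` (a placeholder host name) returns `True` when the port accepts a connection from the machine running the script.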

I can also no longer install any new Linux agents, even when I uninstall the agent manually beforehand (rpm command). Windows agents install fine.

I have been working with SCOM for several years but am sort of at a loss here. We will now be making adjustments to the sudo rules for SCOM 2019, but I don't believe this will make much of a difference since this is only really needed for the elevated commands, which are not needed for performance collection for example.

Clearing the cache on the management servers and restarting the services also has no effect on the monitoring.

Any ideas?

Br,
Jasper

Accepted answer
  1. Jasper Van Damme 111 Reputation points
    2020-09-24T12:36:37.01+00:00

    Hello Stoyan,

    As shown in my previous post, I managed to resolve the issue. It was due to the proxy server. Because the Linux agent uses a web service to communicate, we had to properly configure the proxy exclusions on the management servers for communication to work.

    And to answer your question: yes, rebooting and flushing the cache on the management servers were both done. It was all due to the proxy.
    It seems that SCOM 2019 (or any version) has no way of determining whether a proxy server is preventing communication with the agent, which I find to be a major oversight in the default Linux heartbeat monitoring.
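
To illustrate how a proxy exclusion list decides whether traffic to an agent goes direct, here is a rough Python sketch. The thread does not state which proxy setting or which exclusion values were used, so the semicolon-separated list format, the `<local>` entry, and the host names are all assumptions. `is_bypassed` is a hypothetical helper that only approximates wildcard matching; it does not reproduce the behavior of any Windows component.

```python
from fnmatch import fnmatch


def is_bypassed(host: str, bypass_list: str) -> bool:
    """Return True if host matches an entry in a semicolon-separated bypass list.

    Assumed conventions (not taken from the thread):
    - entries may contain '*' wildcards, e.g. '*.corp.example'
    - '<local>' matches host names that contain no dot
    """
    host = host.lower()
    for entry in bypass_list.split(";"):
        pattern = entry.strip().lower()
        if not pattern:
            continue  # skip empty entries such as a trailing ';'
        if pattern == "<local>":
            if "." not in host:
                return True
            continue
        if fnmatch(host, pattern):
            return True
    return False
```

With an assumed list of `*.corp.example;<local>`, a call such as `is_bypassed("linux01.corp.example", "*.corp.example;<local>")` reports that the agent would be reached directly, while a host outside that pattern would be sent through the proxy.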

    br,
    Jasper


4 additional answers

  1. SChalakov 10,261 Reputation points MVP
    2020-09-21T11:28:17.387+00:00

    Hi Jasper,

    Let me first say that adjusting the permissions is very important, mainly because you cannot install new Linux agents if your permissions are incorrect or incomplete. That being said, I think you are on the right track with this step.
    When you added your new Management Servers, did you also install all the certificates (the X-plat certificates) from the other resource pool members? This is also a critical step.
    Can you please also check the certificates of your managed systems:

    Troubleshooting monitoring of UNIX and Linux computers

    Please go through this one also and ensure your certificates are configured properly:

    Troubleshoot UNIX/Linux agent discovery in Operations Manager

    If they aren't, you will have to uninstall the agent and install it again or manually sign and copy the certificate over.

    Can you please also post more details about the OS version of your managed systems?

    Regards,
    Stoyan

  2. Jasper Van Damme 111 Reputation points
    2020-09-21T11:56:57.627+00:00

    Hello Stoyan,

    Thanks for your reply. To answer your questions:

    • I have exported and imported the certificates on each management server (even the old ones that are now decommissioned), so that should be OK.
    • The certificates should not have changed, as the upgrade does not succeed. The error we are getting in the discovery wizard when trying to install an agent is 'unknown parameter s', which is very vague.
    • The distros are mostly Red Hat 7.x or CentOS 7.x; these management packs have already been updated with the UR2 rollup as well.

    The strange thing is, if it were a certificate error I would expect a certificate alert from the machine, which is not the case.

    I do see that when I run this command:

    openssl x509 -noout -in /etc/opt/microsoft/scx/ssl/scx.pem -subject -issuer -dates

    the issuer points to the management server that is decommissioned. Could that be a potential problem?
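
To compare the issuer across several agents, the text output of that openssl command could be parsed with a small script. This is only a sketch: `parse_cert_fields` and `issuer_is_current` are hypothetical helpers, the sample output and server names in the comments are invented, and the exact subject and issuer layout differs between OpenSSL versions. The check is a plain substring match, so it can give a false positive if one server name is contained in another.

```python
def parse_cert_fields(openssl_output: str) -> dict:
    """Split 'key=value' lines from `openssl x509 -noout ...` into a dict."""
    fields = {}
    for line in openssl_output.splitlines():
        # Split on the first '=' only; values may contain further '=' signs
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def issuer_is_current(openssl_output: str, current_servers: list) -> bool:
    """Return True if the issuer line mentions one of the given server names."""
    issuer = parse_cert_fields(openssl_output).get("issuer", "").lower()
    return any(name.lower() in issuer for name in current_servers)


# Invented sample output, for illustration only:
SAMPLE = """subject=CN = linux01.example.com
issuer=CN = SCX-Certificate, DC = oldms01
notBefore=Sep  1 10:00:00 2019 GMT
notAfter=Aug 31 10:00:00 2030 GMT"""
```

With the invented sample, `issuer_is_current(SAMPLE, ["newms01", "newms02"])` is false, which would flag the certificate as still signed by a server that is not in the current pool.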

    Br,
    Jasper


  3. Jasper Van Damme 111 Reputation points
    2020-09-21T12:45:44.837+00:00

    Hey Stoyan,

    The Event Viewer shows zero errors and all the agents are green, but data collection is not occurring: I have received 0 alerts and 0 performance metrics since the upgrade 3 weeks ago. None of the systems is collecting data properly.
    We have added the sudo rules on one system, but the issue still remains. This is the error we are getting during upgrade or installation of the agent:

    Unexpected DiscoveryResult.ErrorData type. Please file bug report.
    ErrorData: System.ArgumentNullException
    Value cannot be null.
    Parameter name: s
    at System.Activities.WorkflowApplication.Invoke(Activity activity, IDictionary`2 inputs, WorkflowInstanceExtensionManager extensions, TimeSpan timeout)
    at System.Activities.WorkflowInvoker.Invoke(Activity workflow, IDictionary`2 inputs, TimeSpan timeout, WorkflowInstanceExtensionManager extensions)
    at Microsoft.SystemCenter.CrossPlatform.ClientActions.DefaultDiscovery.InvokeWorkflow(IManagedObject managementActionPoint, DiscoveryTargetEndpoint criteria, IInstallableAgents installableAgents)

    Best regards,
    Jasper

  4. SChalakov 10,261 Reputation points MVP
    2020-09-24T12:16:09.937+00:00

    Hi Jasper,

    That is pretty odd indeed, especially if your agents are being monitored.
    A pretty lame question, but still: did you try to flush the Health Service cache on the Management Servers that are part of the Linux monitoring resource pool? Did you try to update the agents on the affected systems, especially after applying UR2?

    Regards,
    Stoyan