Can’t get your Linux computer discovered? Check your network configuration.

Recently I was running some tests and trying to deploy a new Linux computer to be managed by Operations Manager. The discovery wizard kept failing with “unspecified error”. That’s really helpful, right? So I did what I normally do when I have any kind of discovery problem: I started DebugView to watch what was going on behind the scenes during the discovery. It didn’t end up being very helpful either, though it did show where the process was failing. Here’s what DebugView was showing me:

Beginning ExecuteOSInformationScript on thread id: 10
10| Executing DiscoveryScript.Discovery.Task
10| Returned from DiscoveryScript.Discovery.Task
10| DiscoveryScript.Discovery.Task returned as succeeded
10| Return from DiscoveryTaskHelper.ExecuteOSInformationScript()
Return from DiscoveryTaskHelper.ExecuteSSHDiscovery()
Microsoft.MOM.UI.Console.exe Error: 0 : 10.10.10.15: ExecuteSSHDiscovery failed with exception: <stdout></stdout><stderr></stderr><exception>Unspecified problem</exception>

 

So my discovery was failing after deploying the discovery script to the computer. I still didn’t have enough info to figure out the problem, so I looked at the debug logging for the SCX modules.

Handy Tip (from Cross-Platform Logging Methods on TechNet): You can enable debug logging of the SCX modules on the OpsMgr server by going to the C:\Windows\Temp directory and creating a blank file named EnableOpsMgrModuleLogging (no extension). This enables the creation of several new log files in the same directory:

  • DeployFile.vbs.log
  • SCXCertWriteAction.log
  • SCXLogModule.log
  • SCXNameResolverProbe.log
  • SSHCommandProbe.log
  • SSHCommandWriteAction.log
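If you would rather script the tip than create the file by hand, here is a minimal Python sketch. The directory, the flag file name, and the log names come from the tip above; the function names `enable_module_logging` and `existing_logs` are my own, and the sketch assumes you run it on the OpsMgr server with rights to write to C:\Windows\Temp.

```python
from pathlib import Path

# Log names listed in the tip above.
EXPECTED_LOGS = [
    "DeployFile.vbs.log",
    "SCXCertWriteAction.log",
    "SCXLogModule.log",
    "SCXNameResolverProbe.log",
    "SSHCommandProbe.log",
    "SSHCommandWriteAction.log",
]


def enable_module_logging(temp_dir=r"C:\Windows\Temp"):
    """Create the empty EnableOpsMgrModuleLogging flag file and return its path."""
    flag = Path(temp_dir) / "EnableOpsMgrModuleLogging"
    flag.touch(exist_ok=True)  # blank file, no extension
    return flag


def existing_logs(temp_dir=r"C:\Windows\Temp"):
    """Return the expected log files that are already present in temp_dir."""
    base = Path(temp_dir)
    return [name for name in EXPECTED_LOGS if (base / name).is_file()]


if __name__ == "__main__":
    print("Flag file:", enable_module_logging())
    print("Logs present so far:", existing_logs())
```

The logs only show up once the modules actually run, so expect an empty list until you retry the discovery.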

Using the DeployFile.vbs.log and SSHCommandProbe.log files, I could see a little more detail about what was happening. Here’s what the logs showed:

Transferring file: C:\Program Files\System Center Operations Manager 2007\AgentManagement\UnixAgents\scx-1.0.4-252.centos.5.x86_64.rpm to location: /tmp/scx-root/
Verifying that file: scx-1.0.4-252.centos.5.x86_64.rpm was transferred properly
/tmp/scx-root/scx-1.0.4-252.centos.5.x86_64.rpm

SSHCommandProbe::DoProcess preparing SSH call 22 root sh /tmp/scx-$USER/GetOSVersion.sh; EC=$?; rm -rf /tmp/scx-$USER; exit $EC
Enter SSHFacade::RunCommand
ExpectedSSHFacadeException Unspecified problem

So this told me the file was getting transferred (which ruled out a basic SSH problem), but when the script was run, no output came back. Hmmm. I was still no closer to finding the answer. I logged on to the Linux computer locally and ran the GetOSVersion.sh script using the same command line as above, and got a 0 return code and the appropriate XML. That’s weird. The script works fine, OpsMgr can connect to the Linux computer via SSH, it can drop the file there and run it, but nothing comes back.

Next I got out WinSCP to transfer the agent over manually, and while opening the SSH connection, WinSCP actually timed out. This was interesting, since both machines are VMs running on the same host and sharing the same networks. Then I realized that both the OpsMgr VM and the Linux VM have two network connections: one to my corporate network (so I have Internet connectivity) and one to a private 10.x network. Perhaps the two VMs couldn’t work out which network the traffic should use? I disabled the corporate network connection on both VMs, leaving only the private 10.x network on each. Magically, WinSCP connected instantly. It couldn’t be this simple, could it? Well, if you read the title of this article, you already know the answer to that. :)
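A quick way to put a number on "times out" versus "connects instantly" is to time a plain TCP connection to the SSH port. This is only a rough sketch: `time_tcp_connect` is a name I made up, port 22 and the 10.10.10.15 address are taken from the logs above, and it measures only the TCP connect, not the SSH handshake or login, so a delay later in the session would not show up here.

```python
import socket
import time


def time_tcp_connect(host, port=22, timeout=10.0):
    """Return the seconds taken to open a TCP connection, or None if it fails or times out."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers refused connections, unreachable hosts, and timeouts
        return None


if __name__ == "__main__":
    elapsed = time_tcp_connect("10.10.10.15")  # address from the DebugView output
    if elapsed is None:
        print("Could not connect to port 22 within the timeout")
    else:
        print(f"Connected in {elapsed:.2f} seconds")
```

Running it once with both networks enabled and once with only the private network gives you a before-and-after comparison without needing WinSCP.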

Disabling the extra network connection and keeping only the single subnet active let the discovery complete without a problem. Only then did I realize that when both networks were active, discovery had been taking a really long time. I had noticed it at the time and wondered why it was so slow, but I didn’t link it to the failure until I saw how fast discovery was with a single network. Now, I know that you can’t just go disabling NICs in a production system; they’re there for a reason. But what you can do is check for conflicts with DNS, gateways, or other issues that might slow communication to the point where the discovery process times out. The tell-tale sign is a really slow discovery (it shouldn’t take more than 30-45 seconds to get back the initial discovery data).
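As a starting point for that kind of check on a multi-homed machine, the sketch below covers two things: which local interface is on the same subnet as the target, and how long a reverse DNS lookup of the target takes. Treat it as an illustration under assumptions. The interface names and the 192.168.1.20 and 10.10.10.5 addresses are hypothetical; only 10.10.10.15 comes from the logs above. It also does not read your real routing table, so you have to supply the interface addresses yourself.

```python
import ipaddress
import socket
import time


def interfaces_reaching(target, local_interfaces):
    """Return the names of local interfaces whose subnet contains the target address."""
    addr = ipaddress.ip_address(target)
    return [
        name
        for name, cidr in local_interfaces.items()
        if addr in ipaddress.ip_interface(cidr).network
    ]


def time_reverse_lookup(address):
    """Return (seconds, hostname or None) for a reverse DNS lookup of address."""
    start = time.monotonic()
    try:
        hostname = socket.gethostbyaddr(address)[0]
    except OSError:  # no PTR record, or the resolver failed
        hostname = None
    return time.monotonic() - start, hostname


if __name__ == "__main__":
    # Hypothetical addresses for the two NICs on the OpsMgr VM.
    nics = {"corporate": "192.168.1.20/24", "private": "10.10.10.5/24"}
    print(interfaces_reaching("10.10.10.15", nics))  # ['private']
    seconds, name = time_reverse_lookup("10.10.10.15")
    print(f"Reverse lookup took {seconds:.2f}s, result: {name}")
```

If the target is directly reachable through only one interface but name resolution takes many seconds, DNS on the other network is a reasonable place to look first.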

Hopefully my pain is your gain and this will be helpful to those having issues!