What if my discovery script fails?

Discovery is a core feature of a management pack, and bugs in discovery can break the whole management pack. I saw an example of this today while debugging a problem: when the user disabled a network card, the WMI calls I was making returned no data along with a fairly generic error code. I wasn’t handling that condition properly, so my discovery returned an empty list of discovered objects. The OpsMgr runtime interpreted this as the application no longer being installed, and therefore deleted the objects representing the application. From the runtime’s perspective, this is exactly the right behavior.

In case you are not familiar with how discovery data is handled, let me explain this a bit. In OpsMgr, there are two types of discovery data:

Snapshot - This is the default type of discovery data submitted by MP discoveries. When you submit this type of discovery data, you are saying "Whatever I am returning in the list of discovered objects, their properties and relationships is all that exists." Here is what that means. Let's say that you wrote a management pack to discover SQL databases. The first time the discovery ran, it returned 2 databases called A and B. The next time the same discovery ran on the same server, it returned databases A and C. When the discovery data is processed on a management server, database B will be automatically deleted and database C will be added. Snapshot discovery is essentially "last write wins": whatever discovery data was submitted the last time the discovery ran is the data that will be stored in the OpsMgr DB and used for monitoring. The nice thing about Snapshot discovery is that as an MP author you don't need any logic in your management pack to work out what was discovered last time but is no longer available. All you need to do is execute your discovery logic and let OpsMgr take care of updating the DB based on the last data that your discovery returned.

Incremental - This is an advanced way to return discovery data and in most cases does not need to be used. With incremental discovery data, you can say "Add database A" or "Remove database B" without affecting the other databases you previously discovered. Unlike Snapshot discovery, the only changes that will take place are the changes that you specifically asked for by telling OpsMgr to add or remove a particular object.
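To make the difference concrete, here is a small toy model of the two submission types, written in Python rather than against the real OpsMgr API. The function names `apply_snapshot` and `apply_incremental` are made up for illustration, and the model ignores properties, relationships and the fact that real discovery data is tracked per discovery source. It only shows how the stored set of objects changes.

```python
def apply_snapshot(stored, discovered):
    """Snapshot: the submitted list replaces everything stored before."""
    return set(discovered)


def apply_incremental(stored, add=(), remove=()):
    """Incremental: only the explicitly requested adds and removes are applied."""
    return (set(stored) | set(add)) - set(remove)


stored = {"A", "B"}

# Snapshot returning A and C: B is deleted, C is added.
after_snapshot = apply_snapshot(stored, {"A", "C"})

# Incremental "add C": A and B are left alone.
after_incremental = apply_incremental(stored, add={"C"})

# An empty snapshot deletes everything; an empty incremental changes nothing.
after_empty_snapshot = apply_snapshot(stored, set())
after_empty_incremental = apply_incremental(stored)

print(sorted(after_snapshot), sorted(after_incremental))
print(sorted(after_empty_snapshot), sorted(after_empty_incremental))
```

The last two cases are the ones that matter when a discovery fails: an empty Snapshot submission wipes the previously discovered objects, while an empty Incremental submission is a no-op.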

Now let’s get back to the original problem. Because the default type of discovery is Snapshot, every time my discovery failed to execute properly and returned an empty list of discovered objects, I was causing all the alert, performance, event and availability data to be deleted. From the operator's perspective this is a pretty bad experience: all of a sudden a bunch of alerts vanish, and the instances representing the application vanish as well. The first solution that comes to mind is very simple. If I fail to retrieve the data from WMI, let's just not return any discovery data. This solution is definitely better, as it does not cause the alerts and the objects that represent the application to vanish. The annoying side effect is that I start getting an alert saying that the OpsMgr agent ran a discovery script and it didn’t return any data. The exact title of the alert is "Script or Executable Failed to Run". The problem with this alert is that all it tells you is that the script was expected to return some discovery data and didn’t.

There is, however, a third option which puts you as the MP author in the driver's seat. What you want to do is tell OpsMgr something along the lines of "I am unable to query WMI to figure out whether the application is present or not, so I am not going to tell you anything, but I also don’t want you to start complaining and generating warnings."

In order to do this, you will need to switch to using Incremental discovery as it provides you with more granular control over the discovery data.

Here is what you want to do when you need to handle a failure in your discovery script and don't want to return any data, yet you don't want the existing data to be deleted and you don't want to see the generic warning about script execution failure.

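' Standard discovery code
' SourceId and ManagedEntityId are assumed to be set earlier, typically from the script arguments
Set oAPI = CreateObject("MOM.ScriptAPI")
Set oDiscoveryData = oAPI.CreateDiscoveryData(0, SourceId, ManagedEntityId)

If wasAbleToGetInfoFromWMI = False Then
    ' Instead of Snapshot discovery, submit Incremental discovery data
    oDiscoveryData.IsSnapshot = False
    ' Log an event that an alerting rule can pick up (severity 2 = Warning; 0 = Information, 1 = Error)
    oAPI.LogScriptEvent "SampleScript.vbs", 6125, 2, "Couldn't get data"
    ' Return the empty incremental data and stop so the normal discovery below does not run
    Call oAPI.Return(oDiscoveryData)
    WScript.Quit
End If

' Do the standard discovery

Call oAPI.Return(oDiscoveryData)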

The key line in the sample script above is the one that sets oDiscoveryData.IsSnapshot to false. That line tells OpsMgr that we are submitting Incremental rather than Snapshot discovery data. The rules of incremental discovery data are that we can tell OpsMgr to add or delete particular instances that we discovered in a past execution of the discovery. In this case, we didn’t specify that we want to remove or add any instances, so we are saying “Do nothing”. This is exactly what we want, as we can’t access the instrumentation at this point and don’t even know whether the application that we previously discovered is still there or not. The LogScriptEvent call that follows is an example of how you can log an event to the OpsMgr event log from the script. By logging the event, you are alerting the operator that something is wrong in a much more meaningful way. Now you can create a simple alerting rule which will cause a warning alert to be created in OpsMgr when you are unable to perform the discovery. In that warning alert, you can put the information that the operators need in order to troubleshoot the issue, rather than presenting them with a generic “Script execution failure” warning which can be hard to troubleshoot.

By the way, here is a link to the documentation for the LogScriptEvent method, which explains how to call it and what its parameters mean: https://msdn.microsoft.com/en-us/library/bb437630.aspx

Note: Unless you are handling a script execution failure or have a specific need for incremental discovery, you should always use Snapshot discovery.