Cluster and maintenance mode
Maintenance mode (MM) on clusters has become a big issue for most of our customers. Starting MM for a physical computer that serves as a cluster node causes false-positive alerts to be raised by the cluster infrastructure monitoring MP. Please accept our apologies; we hope this post clears things up a bit and provides some insight and tools for performing the maintenance mode process.
The following is the class hierarchy and its relationships (it is part of the management pack guide, so it should not be completely new). We can see that the managed entity Cluster contains all the managed entities that make up the cluster infrastructure (remember that hosting is a specialization of containment that introduces a lifetime dependency).
1. Manual process
The usual steps for performing maintenance on a physical computer that serves as a node of a failover cluster start with finding the cluster that contains this computer. You must start maintenance mode for this managed entity (Cluster) and for all of its contained entities as well. Putting an entity into MM together with all of its contained objects causes the node (which inherits from Windows Computer) to enter MM for every local application hosted by this computer (though in SP1 the Health Service, which is a local application, is exempt from this rule). The only thing left to do is to start MM for each health service and health service watcher associated with the cluster node computers.
2. PowerShell script
To automate the work described above, you can use the attached PowerShell script. It locates all the nodes related to the cluster whose name is passed to the script as the first argument; the second and third arguments are the duration in hours and a description. The script puts the cluster into maintenance mode recursively and then also starts maintenance mode for the health service and health service watcher associated with each of those physical computers.
# Positional arguments: cluster name, duration in hours, description
$clusterName = $args[0]
$HoursInMaintenance = $args[1]
$Description = $args[2]
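# Load the Operations Manager snap-in; errors are silenced inside this scope only
& {
    $ErrorActionPreference = "silentlycontinue"
    Add-PSSnapin "Microsoft.EnterpriseManagement.OperationsManager.Client" -ErrorVariable errSnapin;
}
# Change to the install directory and run the Command Shell startup script
$pathOpsMgr = $env:ProgramFiles + "\System Center Operations Manager 2007"
cd $pathOpsMgr
.\Microsoft.EnterpriseManagement.OperationsManager.ClientShell.Startup.ps1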
# Maintenance window, expressed in UTC
$startTime = (Get-Date).ToUniversalTime()
$endTime = $startTime.AddHours($HoursInMaintenance)
# Find the cluster by name and the node objects related to it
$clusterCriteria="Name='"+ $clusterName +"'"
$cluster = get-monitoringClass -name 'Microsoft.Windows.Cluster' | get-monitoringObject -criteria:$clusterCriteria
$clusterNodeClass=get-monitoringclass -name 'Microsoft.Windows.Cluster.Node'
$nodes=$cluster.GetRelatedMonitoringObjects($clusterNodeClass)
"Putting cluster " + $clusterName + " into maintenance mode recursively"
$cluster.ScheduleMaintenanceMode($startTime,$endTime,"PlannedOther",$Description,"Recursive")
# The health service and its watcher are handled per node
foreach ($node in $nodes)
{
    # Health service watcher for this node
    $healthServiceWatcherCriteria = "HealthServiceName='" + $node.Name + "'"
    $healthServiceWatcher = get-monitoringclass -name 'Microsoft.SystemCenter.HealthServiceWatcher' | get-monitoringobject -criteria:$healthServiceWatcherCriteria
    "Putting HSW " + $node.Name + " into maintenance mode"
    & {
        $ErrorActionPreference = "silentlycontinue"
        New-MaintenanceWindow -startTime:$startTime -endTime:$endTime -monitoringObject:$healthServiceWatcher -comment:$Description
    }
    # Health service hosted on the agent whose principal name matches the node
    $computer = Get-Agent | Where-object {$_.PrincipalName -eq $node.Name}
    $healthService = $computer.HostedHealthService
    "Putting HS " + $node.Name + " into maintenance mode"
    & {
        $ErrorActionPreference = "silentlycontinue"
        New-MaintenanceWindow -startTime:$startTime -endTime:$endTime -monitoringObject:$healthService -comment:$Description
    }
}
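The script takes its inputs positionally and does not check them, so a missing cluster name or a bad duration only shows up once the Operations Manager calls fail. The following is a minimal sketch of how the arguments could be validated and the UTC window computed up front. The function name Get-MaintenanceWindowTimes and the cluster name CLUSTER01 are hypothetical and not part of the original script.

```powershell
# Hypothetical helper (not part of the original script): validates the inputs
# and returns the UTC start and end times of the maintenance window.
function Get-MaintenanceWindowTimes {
    param(
        [string]$ClusterName,
        [double]$HoursInMaintenance
    )

    if ([string]::IsNullOrEmpty($ClusterName)) {
        throw "Cluster name is required."
    }
    if ($HoursInMaintenance -le 0) {
        throw "HoursInMaintenance must be greater than zero."
    }

    # Same calculation as in the script: the window is expressed in UTC.
    $start = (Get-Date).ToUniversalTime()
    $end = $start.AddHours($HoursInMaintenance)

    # Return both values as a two-element array.
    return $start, $end
}

# Example with an assumed cluster name and a two-hour window.
$startTime, $endTime = Get-MaintenanceWindowTimes -ClusterName "CLUSTER01" -HoursInMaintenance 2
"Window: " + $startTime + " to " + $endTime + " (UTC)"
```

With a check like this in place, the script could stop before touching any monitoring object when it is called without a cluster name or with a zero or negative duration.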
3. Recap
Entering MM for the whole cluster in this way temporarily suppresses all monitoring of the cluster infrastructure (including the cluster nodes, that is, the physical computers), because all workflows are unloaded. There will be no alerting on the cluster infrastructure, the computers, or the health services while these instances are in maintenance mode. In one of my next posts I will try to answer the question of whether it is possible to put just the active cluster node into maintenance mode.
Comments
- Anonymous
September 05, 2008
Another issue with the Cluster monitoring class is that it doesn't have any relationship with the Site monitoring class. This behavior is common to other non-hosted classes (i.e. Exchange organization, AD domain, ...). In a VAP scenario it's not uncommon to put an entire customer (Site) in MM; since clusters are not contained in any Site, extra steps need to be taken to be sure to put the entire site into MM. Obviously the lack of the (site) - (non-hosted class) relationship has other side effects; for example, alerts generated by monitors of these classes are not assigned to any Site, so it's not easy to appropriately scope customer consoles. Hopefully SP2 will address this issue, won't it? :-) Cheers Daniele