Applies to: ✔️ Linux VMs
This article lists the common causes of Pacemaker service startup issues and provides resolutions to fix them.
Scenario 1: Pacemaker startup failure because of SysRq-triggered reboot
The Pacemaker service doesn't start if the last startup was triggered by a SysRq action, but it starts successfully after a normal restart. This issue is caused by a conflict between the STONITH Block Device (SBD) msgwait time-out and the fast restart time of Azure virtual machines (VMs), as described in the /etc/sysconfig/sbd file:
## Type: yesno / integer
## Default: no
#
# Whether to delay after starting sbd on boot for "msgwait" seconds.
# This may be necessary if your cluster nodes reboot so fast that the
# other nodes are still waiting in the fence acknowledgement phase.
# This is an occasional issue with virtual machines.
#
# This can also be enabled by being set to a specific delay value, in
# seconds. Sometimes a longer delay than the default, "msgwait", is
# needed, for example in the cases where it's considered to be safer to
# wait longer than:
# corosync token timeout + consensus timeout + pcmk_delay_max + msgwait
#
# Be aware that the special value "1" means "yes" rather than "1s".
#
# Consider that you might have to adapt the startup-timeout accordingly
# if the default isn't sufficient. (TimeoutStartSec for systemd)
#
# This option may be ignored at a later point, once Pacemaker handles
# this case better.
SBD_DELAY_START=no
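You can check the msgwait and watchdog time-outs that are currently configured on the SBD device by dumping its metadata. This is an optional check; the device path is a placeholder, so use the SBD_DEVICE value from /etc/sysconfig/sbd:
sudo sbd -d /dev/disk/by-id/<your-SBD-device> dump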
Resolution for scenario 1
Put the cluster into maintenance-mode:
sudo crm configure property maintenance-mode=true
Edit the /etc/sysconfig/sbd file to change the SBD_DELAY_START parameter to yes.
Remove the cluster from maintenance mode:
sudo crm configure property maintenance-mode=false
Restart the SBD and Pacemaker services, or restart both nodes:
sudo systemctl restart sbd
sudo systemctl restart pacemaker
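After the restart, you can optionally confirm that the delay is enabled and that both services are healthy:
grep SBD_DELAY_START /etc/sysconfig/sbd
sudo systemctl status sbd pacemaker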
Scenario 2: Pacemaker doesn't start and returns code 100 after node fencing
After the cluster node is fenced, the Pacemaker service exits without starting and returns an exit status code of 100.
systemctl status pacemaker.service
● pacemaker.service - Pacemaker High Availability Cluster Manager
Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
Active: inactive (dead) since Wed 2020-05-13 23:38:21 UTC; 25s ago
Docs: man:pacemakerd
https://clusterlabs.org/Pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html
Main PID: 1570 (code=exited, status=100)
Cause for scenario 2
If a node tries to rejoin the cluster after it's fenced but before the msgwait time-out finishes, the Pacemaker service doesn't start. Instead, the service returns an exit status code of 100. To resolve the issue, enable the SBD_DELAY_START setting, and specify an msgwait delay for the startup of sbd.service. This allows more time for the node to rejoin the cluster, and it makes sure that the node can rejoin without experiencing the msgwait conflict.
Note that if the SBD_DELAY_START setting is used and the SBD msgwait value is very high, other potential issues might occur. For more information, see Settings for long timeout in SBD_DELAY_START.
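To choose a suitable delay, you can look up the values that the comment in /etc/sysconfig/sbd refers to. The following commands are a sketch; the SBD device path is a placeholder, and pcmk_delay_max appears only if it's configured on your fencing resource:
sudo corosync-cmapctl | grep -E "totem.token|totem.consensus"
sudo crm configure show | grep pcmk_delay_max
sudo sbd -d /dev/disk/by-id/<your-SBD-device> dump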
Resolution A for scenario 2
Put the cluster into maintenance-mode:
sudo crm configure property maintenance-mode=true
Edit the /etc/sysconfig/sbd file to change the SBD_DELAY_START parameter to yes.
Make a copy of sbd.service:
cp /usr/lib/systemd/system/sbd.service /etc/systemd/system/sbd.service
Edit /etc/systemd/system/sbd.service to add the following lines in the [Unit] and [Service] sections:
[Unit]
Before=corosync.service
[Service]
TimeoutSec=144
Remove the cluster from maintenance-mode:
sudo crm configure property maintenance-mode=false
Restart the SBD and Pacemaker services, or restart both nodes:
sudo systemctl restart sbd
sudo systemctl restart pacemaker
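Systemd applies changes to a copied unit file only after a daemon reload, so if the restart doesn't pick up the new TimeoutSec value, reload systemd and then verify the effective unit and the SBD setting. These checks are optional:
sudo systemctl daemon-reload
systemctl cat sbd.service
grep SBD_DELAY_START /etc/sysconfig/sbd
sudo systemctl status sbd pacemaker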
Resolution B for scenario 2
Tweak the SBD device msgwait time-out setting to be shorter than the time that's required for the SBD fencing action to finish and the sbd.service to be restored after a restart. Set the watchdog parameter to 50 percent of the new msgwait time-out value. This is an optimization that must be tuned on a system-by-system basis.
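For example, to set msgwait to 60 seconds and the watchdog time-out to 30 seconds (50 percent of msgwait), you could re-initialize the SBD metadata as shown in the following sketch. The device path and the values are placeholders, and re-creating the metadata should be done only while the cluster is in maintenance mode:
sudo sbd -d /dev/disk/by-id/<your-SBD-device> -4 60 -1 30 create
sudo sbd -d /dev/disk/by-id/<your-SBD-device> dump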
Scenario 3: Issue occurs in azure-events agent resource
Symptom for scenario 3
The crm status
output shows a "failed resource actions" error that affects the Azure Events Monitor resource:
Failed Resource Actions:
* rsc_azure-events_monitor_10000 on node1 'error' (1): call=82, status='complete', exitreason='getInstanceInfo: Unable to get instance info', last-rc-change='2024-09-26 06:51:31 +10:00', queued=0ms, exec=94ms
Cause for scenario 3
The Azure platform host (physical server) that runs the cluster node receives a platform-level host maintenance update. If the duration of this update exceeds the time that's required for the cluster resource to initiate a restart, the resource actions fail.
The following error entries are logged in /var/log/messages
on the cluster node:
2024-09-26T06:51:30.997118+10:00 node1 azure-events: WARNING: Failed to reach the server: Gone
2024-09-26T06:51:31.005469+10:00 node1 azure-events: ERROR: getInstanceInfo: Unable to get instance info
2024-09-26T06:51:31.013735+10:00 node1 pacemaker-execd[10439]: notice: rsc_azure-events_monitor_10000[2080672] error output [ ocf-exit-reason:getInstanceInfo: Unable to get instance info ]
2024-09-26T06:51:31.014026+10:00 node1 pacemaker-controld[10442]: notice: Result of monitor operation for rsc_azure-events on node1: error
2024-09-26T06:51:33.562902+10:00 node1 python3[11159]: 2024-09-25T20:51:33.562638Z WARNING MonitorHandler ExtHandler Error in SendHostPluginHeartbeat: [ResourceGoneError] [HTTP Failed] [410: Gone] b'<?xml version="1.0" encoding="utf-8"?>\n<Error xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">\n <Code>ResourceNotAvailable</Code>\n <Message>The resource requested is no longer available. Please refresh your cache.</Message>\n <Details></Details>\n</Error>' --- [NOTE: Will not log the same error for the next hour]
2024-09-26T06:51:34.040307+10:00 node1 python3[11159]: 2024-09-25T20:51:34.040196Z ERROR ExtHandler ExtHandler Error fetching the goal state: [ProtocolError] Error fetching goal state: [ResourceGoneError] [HTTP Failed] [410: Gone] b'<?xml version="1.0" encoding="utf-8"?>\n<Error xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">\n <Code>ResourceNotAvailable</Code>\n <Message>The resource requested is no longer available. Please refresh your cache.</Message>\n <Details></Details>\n</Error>'
2024-09-26T06:51:34.040396+10:00 node1 python3[11159]: Traceback (most recent call last):
2024-09-26T06:51:34.040432+10:00 node1 python3[11159]: File "/usr/lib/python3.6/site-packages/azurelinuxagent/common/protocol/wire.py", line 776, in update_goal_state
2024-09-26T06:51:34.040475+10:00 node1 python3[11159]: self._goal_state.update(silent=silent)
...
...
2024-09-26T06:51:34.040902+10:00 node1 python3[11159]: raise ResourceGoneError(response_error)
2024-09-26T06:51:34.040949+10:00 node1 python3[11159]: azurelinuxagent.common.exception.ResourceGoneError: [ResourceGoneError] [HTTP Failed] [410: Gone] b'<?xml version="1.0" encoding="utf-8"?>\n<Error xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">\n <Code>ResourceNotAvailable</Code>\n <Message>The resource requested is no longer available. Please refresh your cache.</Message>\n <Details></Details>\n</Error>'
If the resource fails to start three times, it reaches the migration threshold and is no longer allowed to run on the node. An administrator must then intervene to resolve the underlying issue and clean up the resource manually:
pacemaker-schedulerd[8995]: warning: Unexpected result (error: getInstanceInfo: Unable to get instance info) was recorded for monitor of rsc_azure-events:0 on node1 at Sep 26 06:51:31 2024
pacemaker-schedulerd[8995]: warning: cln_azure-events cannot run on node1 due to reaching migration threshold (clean up resource to allow again)
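After the underlying platform event is over, you can clear the failure count so that the clone is allowed to run on the node again. The resource name is taken from the log output above:
sudo crm resource cleanup cln_azure-events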
Note
If no relevant log entries that indicate host-level updates or connectivity issues are captured from the platform layer, configure rtmon to make sure that the OS records the necessary details to identify the root cause, even if the platform doesn't capture logs.
Resolution for scenario 3
Make sure that Pacemaker is configured correctly for Azure scheduled events. For more information, see Configure Pacemaker for Azure scheduled events.
We recommend that you modify the rsc_azure-events resource primitive configuration to incorporate a 60-second delay before the restart operation begins. This delay allows the Azure agents to finalize the host-level maintenance tasks without triggering a complete resource failure. For more information, see the following SUSE KB article: rsc_azure-events resource fails with error: Unable to get instance info.
To do this, follow these steps:
Put the cluster into maintenance mode:
sudo crm configure property maintenance-mode=true
Enter the interactive configuration mode:
sudo crm configure edit rsc_azure-events
Modify the rsc_azure-events parameters, as shown in the following output:
primitive rsc_azure-events azure-events \
    op monitor interval=10s timeout=240s \
    op start timeout=10s interval=0s start-delay=60s \
    op stop timeout=10s interval=0s \
    meta allow-unhealthy-nodes=true failure-timeout=120s
Save your changes, and exit the editor.
Verify the changes:
sudo crm config show
Remove the cluster from maintenance mode:
sudo crm configure property maintenance-mode=false
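The azure-events resource agent retrieves instance and scheduled-event information from the Azure Instance Metadata Service (IMDS). If getInstanceInfo errors persist, you can optionally check connectivity to IMDS directly from the node. The api-version values shown here are examples; other supported versions also work:
curl -s -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/instance?api-version=2021-02-01"
curl -s -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"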
Scenario 4: SAP HANA DB resource doesn't start
Scenario 4, Symptom 1 - SAP HANA DB resource failed with time-out error
The start operation of the SAP High-Performance Analytic Appliance database (HANA DB) cluster resource (for example, rsc_SAPHANA_DB01) fails and returns a time-out error. However, HANA DB can be successfully started manually (outside cluster control) while the cluster is in maintenance mode:
pacemaker-execd[xxx]: warning: rsc_SAPHANA_DB01_start_0 process (PID xxx) timed out
pacemaker-execd[xxx]: warning: rsc_SAPHANA_DB01_start_0:xxx - timed out after 3600000ms
pacemaker-execd[xxx]: notice: finished - rsc: rsc_SAPHANA_DB01 action:start call_id:25 pid:xxx exit-code:1 exec-time:3604053ms queue-time:0ms
pacemaker-controld[xxx]: error: Result of start operation for rsc_SAPHANA_DB01 on xxx: Timed Out
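To confirm that the database itself is healthy, and to gauge a realistic timeout value, you can time a manual start while the cluster is in maintenance mode. This is a sketch only; <sid> is a placeholder for your lowercase SAP system ID:
sudo su - <sid>adm -c "time HDB start"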
Cause for Scenario 4, Symptom 1
The start and stop timeout values that are configured for the HANA DB resource are shorter than the time that the database actually requires to complete these operations. Therefore, the cluster-initiated start operation times out even though the database itself can start successfully.
Resolution for Scenario 4, Symptom 1
To resolve the issue, extend both the start and stop timeout parameters for the HANA DB resource, rsc_SAPHANA_DB01, as recommended in the SUSE KB article, HANA DB resource failed to start.
Example
Put the cluster into maintenance mode:
sudo crm configure property maintenance-mode=true
To prevent the cluster from restarting HANA DB, clean up the resources from previous failures if the cluster reports an error because of a failed startup:
sudo crm resource cleanup
Edit the cluster configuration to update the timeout parameter of the start and stop operations:
sudo crm config edit
Before:
primitive rsc_SAPHANA_DB01 ocf:suse:SAPHana \
    op start interval=0 timeout=3600 \
    op stop interval=0 timeout=3600 \
After:
primitive rsc_SAPHANA_DB01 ocf:suse:SAPHana \
    op start interval=0 timeout=7200 \
    op stop interval=0 timeout=7200 \
Disable cluster maintenance mode:
sudo crm configure property maintenance-mode=false
Verify that the changes were made:
sudo crm config show
Scenario 4, Symptom 2 - SAP HANA DB resource doesn't start and returns unknown error
SAP HANA DB doesn't start, and it returns an unknown error message.
The following text shows the sudo crm status output when this issue occurs:
2 nodes configured
8 resources configured
Node node-1:
Node node-2:
Active resources:
Clone Set: cln_SAPHanaTopology [rsc_SAPHanaTopology]
rsc_SAPHanaTopology (ocf::suse:SAPHanaTopology): Started node-1
rsc_SAPHanaTopology (ocf::suse:SAPHanaTopology): Started node-2
Master/Slave Set: msl_SAPHana [rsc_SAPHana]
rsc_SAPHana_P40_HDB00 (ocf::suse:SAPHana): Stopped node-1 (Monitoring)
Resource Group: g_ip_P40_HDB00
rsc_ip_P40_HDB00 (ocf::heartbeat:IPaddr2): Started node-1
rsc_nc_P40_HDB00 (ocf::heartbeat:azure-lb): Started node-1
rsc_st_azure (stonith:fence_azure_arm): Started node-2
Failed Resource Actions:
* rsc_SAPHana_P40_HDB00_monitor_61000 on node-1 'unknown error' (1): call=32, status=complete, exitreason='',
last-rc-change='Fri Jun 10 00:33:41 2024', queued=0ms, exec=0ms
* rsc_SAPHana_P40_HDB00_start_0 on node-2 'not running' (7): call=55, status=complete, exitreason='',
last-rc-change='Fri Jun 10 00:33:41 2024', queued=0ms, exec=3093ms
Cause for Scenario 4, Symptom 2
Pacemaker can't start the SAP HANA resource if synchronization (SYNC) failures exist between the primary and secondary nodes.
The secondary cluster node is in WAITING4PRIM status.
The /var/log/messages file contains the following SRHOOK=SFAIL messages:
2024-06-10T00:31:40.106622+02:00 node-1 SAPHana(rsc_SAPHana_P2H_HDB00)[55890]: INFO: RA: SRHOOK1=
2024-06-10T00:31:40.149443+02:00 node-1 SAPHana(rsc_SAPHana_P2H_HDB00)[55890]: INFO: RA: SRHOOK2=SFAIL
2024-06-10T00:31:40.155744+02:00 node-1 SAPHana(rsc_SAPHana_P2H_HDB00)[55890]: INFO: RA: SRHOOK3=SFAIL
When you run sudo SAPHanaSR-showAttr, the following sync status of the primary and secondary DB nodes is displayed:
sudo SAPHanaSR-showAttr
Global cib-time maintenance
--------------------------------------------
global Mon Jun 10 01:47:32 2024 false
Hosts clone_state lpa_fh9_lpt node_state op_mode remoteHost
roles score site srmode sync_state version vhost
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
node-1 UNDEFINED 10 online logreplay node-2 0:P:::: 5 DC1 syncmem SFAIL 2.00.046.00.1581325702 node-1
node-2 PROMOTED 1693237652 online logreplay node-1 4:P:master1:master:worker:master 150 DC2 syncmem PRIM 2.00.046.00.1581325702 node-2
Workaround for Scenario 4, Symptom 2
Pacemaker can't start the SAP HANA resource if SYNC failures exist between the primary and secondary cluster nodes. To mitigate this issue, manually re-enable synchronization between the primary and secondary nodes.
Important
Perform steps 2, 3, and 4 by using the SAP administrator account. This is because these steps use the SAP System ID to stop, start, and re-enable replication manually.
Put the cluster into maintenance mode:
sudo crm configure property maintenance-mode=true
Check the SAP HANA DB and processes state:
Check the SAP-related processes that are running on the node. To do this, run HDB info on every node. The SAP administrator should be able to confirm that the required processes are running on both nodes and that the databases on both nodes remain synchronized.
HDB info
USER     PID   PPID  %CPU    VSZ     RSS COMMAND
a00adm  5183   5178   0.0   87684    1804 sshd: a00adm@pts/0
a00adm  5184   5183   0.0   14808    3624  \_ -sh
a00adm  5994   5184   0.0   13200    1824      \_ /bin/sh /usr/sap/A00/HDB00/HDB info
a00adm  6019   5994   0.0   26668    1356          \_ ps fx -U a00adm -o user,pid,ppid,pcpu,vsz,rss,args
a00adm  5369      1   0.0   20932    1644 sapstart pf=/usr/sap/A00/SYS/profile/A00_HDB00_node1
a00adm  5377   5369   1.8  582944  292720  \_ /usr/sap/A00/HDB00/node1/trace/hdb.sapA00_HDB00 -d -nw -f /usr/sap/A00/HDB00/node1/daemon.ini pf=/usr/sap/A00/SYS/profile/A00_HDB00_node1
a00adm  5394   5377   9.3 3930388 1146444      \_ hdbnameserver
a00adm  5548   5377  21.3 2943472  529672      \_ hdbcompileserver
a00adm  5550   5377   4.4 2838792  465664      \_ hdbpreprocessor
a00adm  5571   5377  91.6 7151116 4019640      \_ hdbindexserver
a00adm  5573   5377  21.8 4323488 1203128      \_ hdbxsengine
a00adm  5905   5377  18.9 3182120  710680      \_ hdbwebdispatcher
a00adm  2104      1   0.0  428748   27760 /usr/sap/A00/HDB00/exe/sapstartsrv pf=/usr/sap/A00/SYS/profile/A00_HDB00_node1 -D -u a00adm
a00adm  2004      1   0.0   31844    2352 /usr/lib/systemd/systemd --user
a00adm  2008   2004   0.0   63796    2620  \_ (sd-pam)
If the SAP DB and services aren't active on the node, contact your SAP administrator to review and troubleshoot the issue. Stop HANA DB:
sudo HDB stop
or
sudo sapcontrol -nr <SAPInstanceNo.> -function Stop
Replace <SAPInstanceNo.> with the number of the instance that has to be stopped.
After the stop operation is finished, start HANA DB on the primary node and then on the secondary node:
sudo HDB start
or
sudo sapcontrol -nr <SAPInstanceNo.> -function Start
Typically, the stop and start operations of HANA DB synchronize both nodes. If the database nodes are still not synchronized, the SAP administrator should troubleshoot the issue by reviewing the SAP logs.
Note
The SAP administrator must determine which node should be designated as the primary and which as the secondary to make sure that no database data is lost in the process.
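If system replication has to be re-registered on the secondary node, the SAP administrator typically does this by using hdbnsutil as the <sid>adm user. The following commands are a sketch only; the remote host, instance number, replication mode, operation mode, and site name are placeholders that must match your landscape:
hdbnsutil -sr_state
hdbnsutil -sr_register --remoteHost=node-1 --remoteInstance=00 --replicationMode=syncmem --operationMode=logreplay --name=DC2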
After you enable replication, check the system replication status by using the SAP system administrator account. In this case, the administrator account is hn1adm.
sudo su - hn1adm -c "python /usr/sap/HN1/HDB03/exe/python_support/systemReplicationStatus.py"
| Host   | Port  | Service Name | Volume ID | Site ID | Site Name | Secondary | Secondary | Secondary | Secondary | Secondary     | Replication | Replication | Replication    |
|        |       |              |           |         |           | Host      | Port      | Site ID   | Site Name | Active Status | Mode        | Status      | Status Details |
| ------ | ----- | ------------ | --------- | ------- | --------- | --------- | --------- | --------- | --------- | ------------- | ----------- | ----------- | -------------- |
| node-1 | 30007 | xsengine     | 2         | 1       | node-1    | node-2    | 30007     | 2         | node-2    | YES           | SYNC        | ACTIVE      |                |
| node-1 | 30001 | nameserver   | 1         | 1       | node-1    | node-2    | 30001     | 2         | node-2    | YES           | SYNC        | ACTIVE      |                |
| node-1 | 30003 | indexserver  | 3         | 1       | node-1    | node-2    | 30003     | 2         | node-2    | YES           | SYNC        | ACTIVE      |                |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: node-1
On the secondary node, check the output to see whether the mode is set to SYNC.
this system is either not running or not primary system replication site

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: SYNC
site id: 2
site name: node-2
active primary site: 1
primary masters: node-1
You can also verify the SAP HANA system replication by running the following command:
sudo SAPHanaSR-showAttr
Global cib-time                 maintenance
--------------------------------------------
global Mon Jun 10 01:57:32 2024 false

Hosts  clone_state lpa_fh9_lpt node_state op_mode   remoteHost roles                            score site srmode  sync_state version                vhost
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
node-1 PROMOTED    1693237652  online     logreplay node-2     4:P:master1:master:worker:master 150   DC1  syncmem PRIM       2.00.046.00.1581325702 node-1
node-2 DEMOTED     10          online     logreplay node-1     4:P:master1:master:worker:master 100   DC2  syncmem SOK        2.00.046.00.1581325702 node-2
Exit the SAP Admin account, and remove the cluster from maintenance mode:
sudo crm configure property maintenance-mode=false
Make sure that the Pacemaker cluster resources are running successfully.
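For example, you can confirm the cluster and replication state by running the standard status commands; neither command changes the configuration:
sudo crm status
sudo SAPHanaSR-showAttr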
Third-party information disclaimer
The third-party products that this article discusses are manufactured by companies that are independent of Microsoft. Microsoft makes no warranty, implied or otherwise, about the performance or reliability of these products.
Contact us for help
If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to Azure feedback community.