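Troubleshoot Azure fence agent startup issues in SUSE

Applies to: ✔️ Linux VMs

This article lists the common causes of Microsoft Azure fence agent startup issues, explains how to identify the cause by reviewing logs, and provides a resolution for each issue.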

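How the Azure fence agent works

The Azure fence agent uses a Python program that's located at /usr/sbin/fence_azure_arm. The program is based on the Azure API and performs VM power-off or start actions. When a failed cluster node is detected, the cluster resource agent (RA) calls this program with the appropriate parameters to fence the node (also known as STONITH).

As documented in SUSE - Create Azure Fence agent STONITH device, the custom role should give the fence agent permission to perform the following actions: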

  • powerOff
  • start

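If a virtual machine (VM) is detected as unhealthy, the fence agent uses these actions to power off the VM and then start it again.

Symptoms

The Azure fence agent resource doesn't start. When you run the sudo crm status command to check the status of the cluster resources, the command output reports an "unknown error".

The following is sample output from crm status: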

Stack: corosync
Current DC: VM2 (version 2.0.1+20190417.13d370ca9-3.6.1-2.0.1+20190417.13d370ca9) - partition with quorum
Last updated: Mon Apr  6 13:58:59 2020
Last change: Mon Apr  6 13:58:53 2020 by root via crm_attribute on VM1
 
2 nodes configured
7 resources configured
 
Online: [ VM1 VM2 ]
 
Full list of resources:
 
Clone Set: cln_SAPHanaTopology_SS2_HDB00 [rsc_SAPHanaTopology_SS2_HDB00]
	 Started: [ VM1 VM2 ]
Clone Set: msl_SAPHana_SS2_HDB00 [rsc_SAPHana_SS2_HDB00] (promotable)
	 Main: [ VM1 ]
	 Sub: [ VM2 ]
Resource Group: g_ip_SS2_HDB00
	 rsc_ip_SS2_HDB00   (ocf::heartbeat:IPaddr2):       Started VM1
	 rsc_nc_SS2_HDB00   (ocf::heartbeat:azure-lb):      Started VM1
rsc_st_azure   (stonith:fence_azure_arm):      Stopped
 
Failed Resource Actions:
* rsc_st_azure_start_0 on VM2 'unknown error' (1): call=102, status=complete, exitreason='',
	last-rc-change='Mon Apr  6 13:50:57 2020', queued=0ms, exec=1790ms
* rsc_st_azure_start_0 on VM1 'unknown error' (1): call=121, status=complete, exitreason='',
	last-rc-change='Mon Apr  6 13:50:59 2020', queued=0ms, exec=1760ms
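
On a cluster that has many resources, it can help to filter the status output down to the failed actions for the fence resource. The following is a minimal sketch, not part of the documented procedure. The function name failed_fence_actions is hypothetical, the default resource name rsc_st_azure is taken from the sample output above and might differ on your cluster, and the pattern assumes the "* <resource>_<action>_<interval> on <node>" line format shown in that sample.

```shell
# Print the "Failed Resource Actions" entries for one resource.
# Reads `crm status` output on stdin. $1 is the resource name
# (defaults to rsc_st_azure, the name used in the sample output).
failed_fence_actions() {
    rsc="${1:-rsc_st_azure}"
    # Failed actions are listed as "* <resource>_<action>_<interval> on <node> ..."
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E "^\* ${rsc}_[a-z]+_[0-9]+ on " || true
}

# Usage (on a cluster node):
#   sudo crm status | failed_fence_actions rsc_st_azure
```

An empty result means that no failed action in this format was reported for that resource name.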

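Cause 1: Endpoint connectivity or credential issue

To identify this issue, check the /var/log/messages log. If the log contains an entry that includes "Azure Error: AuthenticationFailed" (as shown in the following example), the issue is likely related to endpoint connectivity or credentials.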

/var/log/messages
2021-03-15T20:23:15.441083+00:00 NodeName pacemaker-fenced[2550]:  warning: fence_azure_arm[21839] stderr: [ 2021-03-15 20:23:15,398 ERROR: Failed: Azure Error: AuthenticationFailed ]
2021-03-15T20:23:15.441260+00:00 NodeName pacemaker-fenced[2550]:  warning: fence_azure_arm[21839] stderr: [ Message: Authentication failed. ]

Resolution

  1. Make sure that outbound connectivity exists on port 443 to the following Azure management API endpoints:

    • management.azure.com
    • login.microsoftonline.com

    You can test connectivity by using nc, telnet, or curl (replace the endpoint value as needed):

    nc -z -v <endpoint> 443
    
    telnet <endpoint> 443
    
    curl -v telnet://<endpoint>:443
    
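  2. Make sure that a valid username and password are set for the STONITH resource. One of the main causes of STONITH resource failure is an invalid username or password value when a service principal is used. You can test the values by using the fence_azure_arm command, as shown in the following example. To set the username and password for the STONITH resource, see Create Azure Fence agent STONITH device.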

    sudo /usr/sbin/fence_azure_arm --action=list --username='<user name>' --password='<password>' --tenantId=<tenant ID> --resourceGroup=<resource group> 
    

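    This command should return the node names of the VMs in the cluster. If the command isn't successful, rerun it with the -v flag to enable verbose output and the -D flag to enable debug output, as shown in the following example: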

    sudo /usr/sbin/fence_azure_arm --action=list --username='<user name>' --password='<password>' --tenantId=<tenant ID> --resourceGroup=<resource group> -v -D /var/tmp/debug-fence.out 
    

    If you use a managed identity in the STONITH resource, run the following command:

    sudo /usr/sbin/fence_azure_arm --action=list --msi --resourceGroup=<resource group> -v -D /var/tmp/debug-fence.out
    

    Note

    In these commands, replace the <user name>, <password>, <tenant ID>, and <resource group> values as needed.
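
The connectivity test in step 1 can be scripted so that both endpoints are checked in one pass. The following is a minimal sketch under stated assumptions: the function name check_azure_endpoints is hypothetical, and the five-second timeout (-w 5) is an illustrative choice that isn't part of the original commands. The endpoints and the use of nc -z on port 443 come from step 1.

```shell
# Check outbound TCP connectivity on port 443 to each Azure endpoint.
# Prints "ok <endpoint>" or "FAILED <endpoint>" for each one and
# returns non-zero if any endpoint can't be reached.
check_azure_endpoints() {
    rc=0
    for endpoint in management.azure.com login.microsoftonline.com; do
        # -z: only test the connection, send no data
        # -w 5: give up after five seconds (assumed value)
        if nc -z -w 5 "$endpoint" 443 >/dev/null 2>&1; then
            echo "ok $endpoint"
        else
            echo "FAILED $endpoint"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (on each cluster node):
#   check_azure_endpoints || echo "Check firewall, proxy, and NSG rules"
```

A successful TCP connection shows only that the port is reachable. It doesn't validate the credentials, so the fence_azure_arm --action=list test in step 2 is still needed.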

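Cause 2: Authentication failure

Check the /var/log/messages log. If the log contains an entry that includes "unauthorized_client", as shown in the following example, the issue is likely related to an authentication failure.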

/var/log/messages
2020-04-06T10:06:47.779470+00:00 VM1 pacemaker-controld[29309]: notice: Result of probe operation for rsc_st_azure on VM1: 7 (not running)
2020-04-06T10:06:51.045519+00:00 VM1 pacemaker-execd[29306]: notice: executing - rsc:rsc_st_azure action:start call_id:52
2020-04-06T10:06:52.826702+00:00 VM1 /fence_azure_arm: Failed: AdalError: Get Token request returned http error: 400 and server response: {"error":"unauthorized_client","error_description":"AADSTS700016: Application with identifier '<app-id>'
was not found in the directory '<directory-id>. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant.
You may have sent your authentication request to the wrong tenant.\r\nTrace ID: <directory-id>\r\nCorrelation ID: 7ID\r\nTimestamp:2020-04-06 10:06:52Z","error_codes":[700016],"timestamp":"2020-04-06 10:06:52Z","trace_id":"<directory-id>",
"correlation_id":"ID","error_uri":"https://login.microsoftonline.com/error?code=700016 "}
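
The AADSTS code in the entry identifies the specific sign-in failure. In the excerpt above, the code is AADSTS700016, and its description says that the application identifier wasn't found in the directory. The following is a minimal sketch for pulling these codes out of a log. The function name aadsts_codes is hypothetical, and the sketch assumes that your grep supports the -o option (GNU and BSD grep do).

```shell
# Print each distinct AADSTS error code found on stdin, one per line.
aadsts_codes() {
    # -o prints only the matching part of each line
    grep -o 'AADSTS[0-9][0-9]*' | sort -u
}

# Usage:
#   sudo grep fence_azure_arm /var/log/messages | aadsts_codes
```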

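Resolution

In the Azure portal, verify the tenant ID, application ID, login, and password details of the Microsoft Entra ID app. Then follow these steps:

  1. After you verify or update the IDs, reconfigure the fence agent in the cluster: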

    sudo crm configure property maintenance-mode=true
    sudo crm configure edit <fencing agent resource>
    
  2. Change the parameters as needed, save the changes, and then take the cluster out of maintenance mode:

     sudo crm configure property maintenance-mode=false
    
  3. Check the cluster status to verify that the fence agent issue is fixed:

    crm status
    

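Cause 3: Insufficient permissions

Check the /var/log/messages log. If the log contains an entry that includes "does not have authorization to perform action", as shown in the following example, the issue is likely related to insufficient permissions: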

/var/log/messages
Apr 2 00:49:56 VM1 fence_azure_arm: Please use '-h' for usage
Apr 2 00:49:57 VM1 stonith-ng[105424]: warning: fence_azure_arm[109393] stderr: [ 2020-04-02 00:49:56,978 ERROR: Failed: Azure Error: AuthorizationFailed ]
Apr 2 00:49:57 VM1 stonith-ng[105424]: warning: fence_azure_arm[109393] stderr: [ Message: The client 'client-id' with object id '<client-id>' does not have authorization to perform action 'Microsoft.Compute/virtualMachines/read' over scope '/subscriptions/<sub-id>/resourceGroups/<rg-name>/providers/Microsoft.Compute' or the scope is invalid.If access was recently granted, please refresh your credentials. ]
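
The action that's named in the message shows which permission the fence agent's identity is missing. In the excerpt above, that action is Microsoft.Compute/virtualMachines/read. The following is a minimal sketch that lists the denied actions found in a log. The function name denied_actions is hypothetical, and the pattern assumes the message wording shown in the excerpt.

```shell
# Print each distinct action named in "does not have authorization"
# messages on stdin, one per line.
denied_actions() {
    sed -n "s/.*does not have authorization to perform action '\([^']*\)'.*/\1/p" | sort -u
}

# Usage:
#   sudo grep fence_azure_arm /var/log/messages | denied_actions
```

Compare the listed actions with the actions that are granted by the custom role for the fence agent.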

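Resolution

  1. Verify that the custom role definition is configured for the fence agent, as described in Create a custom role for the fence agent.
  2. Verify that the necessary custom role is assigned to the fence agent on the affected VMs. If the role isn't assigned to the agent, use Access control to assign the role to the VM.
  3. Run crm status to check the cluster status and make sure that the fence agent issue is resolved.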

Cause 4: SSL handshake failure

If the log contains an entry that includes "SSLError: HTTPSConnectionPool" and "Max retries exceeded with url", as shown in the following example, the issue is likely related to an SSL handshake failure:

/var/log/messages
warning: fence_azure_arm[28114] stderr: [ 2021-06-24 07:59:29,832 ERROR: Failed: Error occurred in request., SSLError: HTTPSConnectionPool(host='management.azure.com ', port=443): Max retries exceeded with url: /subscriptions/<sub-id>/resourceGroups/<RG-name>/providers/Microsoft.Compute/virtualMachines?api-version=2019-03-01 (Caused by SSLError(SSLError('bad handshake: SysCallError(-1, 'Unexpected EOF')',),)) ]

Resolution

  1. Use the following openssl command to test connectivity from the affected node:

    openssl s_client -connect management.azure.com:443
    
  2. Check whether the output lacks a complete certificate handshake, as in the following example:

    CONNECTED(00000003)
    write:errno=0
    ---
    no peer certificate available
    ---
    No client certificate CA names sent
    ---
    SSL handshake has read 0 bytes and written 176 bytes
    Verification: OK
    ---
    New, (NONE), Cipher is (NONE)
    Secure Renegotiation IS NOT supported
    Compression: NONE
    Expansion: NONE
    No ALPN negotiated
    SSL-Session:
        Protocol  : TLSv1.2
        Cipher    : 0000
        Session-ID:
        Session-ID-ctx:
        Master-Key:
        PSK identity: None
        PSK identity hint: None
        SRP username: None
        Start Time: 1625235527
        Timeout   : 7200 (sec)
        Verify return code: 0 (ok)
        Extended master secret: no
    

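    These errors are most likely caused by a network appliance or firewall that performs packet inspection or modifies Transport Layer Security (TLS) connections in a way that interrupts certificate verification. These issues might also be caused by the maximum transmission unit (MTU) reaching its size limit.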

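  3. If Azure Firewall is in front of the nodes, make sure that the following tags are added to the application or network rules:

    • Application rules: ApiManagement, AppServiceManagement, AzureCloud
    • Network rules: AppServiceEnvironment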
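
The output check in step 2 can be automated. The following is a minimal sketch under stated assumptions: the function name handshake_incomplete is hypothetical, and the two marker strings are taken from the failed-handshake sample output in step 2.

```shell
# Return 0 if the `openssl s_client` output on stdin shows that no
# server certificate was received (the failed-handshake pattern
# shown in step 2). Return non-zero otherwise.
handshake_incomplete() {
    grep -q -e 'no peer certificate available' -e 'Cipher is (NONE)'
}

# Usage (stdin is redirected so that s_client exits after the handshake):
#   if openssl s_client -connect management.azure.com:443 </dev/null 2>&1 | handshake_incomplete; then
#       echo "TLS handshake did not complete"
#   fi
```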

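Cause 5: Missing fence-agents-azure-arm package

Check the /var/log/messages log. The following log entries indicate that the fence agent can't read or find the fence-agents-azure-arm package on the system.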

/var/log/messages
2024-09-03T02:30:36.264033+00:00 node1 lrmd[5772]:    error: Unknown fence agent: fence_azure_arm
2024-09-03T02:30:36.271111+00:00 node1 stonith-ng[5771]:   error: Unknown fence agent: fence_azure_arm
2024-09-03T02:30:36.271426+00:00 node1 stonith-ng[5771]:   error: Agent fence_azure_arm not found or does not support meta-data: Invalid argument (22)
2024-09-03T02:30:36.271620+00:00 node1 stonith-ng[5771]:   error: Could not retrieve metadata for fencing agent fence_azure_arm
2024-09-03T02:30:36.271800+00:00 node1 stonith-ng[5771]: warning: Cannot execute '/usr/sbin/fence_azure_arm': No such file or directory (2)
2024-09-03T02:30:37.271549+00:00 node1 stonith-ng[5771]: warning: Cannot execute '/usr/sbin/fence_azure_arm': No such file or directory (2)
2024-09-03T02:30:39.271843+00:00 node1 stonith-ng[5771]: message repeated 2 times: [  warning: Cannot execute '/usr/sbin/fence_azure_arm': No such file or directory (2)]
2024-09-03T02:30:39.272240+00:00 node1 stonith-ng[5771]:  notice: Operation 'monitor' [0] for device 'rsc_st_azure' returned: -61 (No data available)
2024-09-03T02:30:39.272486+00:00 node1 lrmd[5772]:   notice: finished - rsc:rsc_st_azure action:start call_id:67  exit-code:1 exec-time:3008ms queue-time:0ms
2024-09-03T02:30:39.272722+00:00 node1 crmd[5776]:    error: Unknown fence agent: fence_azure_arm
2024-09-03T02:30:39.272970+00:00 node1 crmd[5776]:    error: Agent fence_azure_arm not found or does not support meta-data: Invalid argument (22)
2024-09-03T02:30:39.273207+00:00 node1 crmd[5776]:  warning: Failed to get metadata for rsc_st_azure (stonith:(null):fence_azure_arm)
2024-09-03T02:30:39.273439+00:00 node1 crmd[5776]:    error: Result of start operation for rsc_st_azure on node1: Error
2024-09-03T02:30:39.274704+00:00 node1 crmd[5776]:  warning: Action 9 (rsc_st_azure_start_0) on node1 failed (target: 0 vs. rc: 1): Error
2024-09-03T02:30:39.274984+00:00 node1 crmd[5776]:   notice: Transition 91369 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-2563.bz2): Complete
2024-09-03T02:30:39.307439+00:00 node1 pengine[5775]:  warning: Processing failed start of rsc_st_azure on node1: unknown error
2024-09-03T02:30:39.307786+00:00 node1 pengine[5775]:  warning: Processing failed start of rsc_st_azure on node1: unknown error
/var/log/messages
2024-08-20T13:28:24.043272+00:00 node1 crmd[6692]:    error: Unknown fence agent: fence_azure_arm
2024-08-20T13:28:24.043453+00:00 node1 crmd[6692]:    error: Agent fence_azure_arm not found or does not support meta-data: Invalid argument (22)
2024-08-20T13:28:24.043554+00:00 node1 crmd[6692]:  warning: Failed to get metadata for rsc_st_azure (stonith:(null):fence_azure_arm)
2024-08-20T13:28:24.044608+00:00 node1 crmd[6692]:    error: Unknown fence agent: fence_azure_arm
2024-08-20T13:28:24.044711+00:00 node1 crmd[6692]:    error: Agent fence_azure_arm not found or does not support meta-data: Invalid argument (22)
2024-08-20T13:28:24.044833+00:00 node1 crmd[6692]:  warning: Failed to get metadata for rsc_st_azure (stonith:(null):fence_azure_arm)
2024-08-20T13:28:26.160617+00:00 node1 crmd[6692]:    error: Unknown fence agent: fence_azure_arm
2024-08-20T13:28:26.160895+00:00 node1 crmd[6692]:    error: Agent fence_azure_arm not found or does not support meta-data: Invalid argument (22)
2024-08-20T13:28:26.161008+00:00 node1 crmd[6692]:  warning: Failed to get metadata for rsc_st_azure (stonith:(null):fence_azure_arm)
2024-08-20T13:28:26.162073+00:00 node1 crmd[6692]:    error: Unknown fence agent: fence_azure_arm
2024-08-20T13:28:26.162193+00:00 node1 crmd[6692]:    error: Agent fence_azure_arm not found or does not support meta-data: Invalid argument (22)
2024-08-20T13:28:26.162294+00:00 node1 crmd[6692]:  warning: Failed to get metadata for rsc_st_azure (stonith:(null):fence_azure_arm)

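Resolution

SUSE rebuilt the Azure fence agent package, fence-agents-azure-arm, against Python 3.11. For more information, see Azure fence agent fails to start after the Python 3.11 interpreter is installed. To resolve this issue, follow these steps to install the package:

  1. Put the cluster into maintenance mode: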
      sudo crm configure property maintenance-mode=true
    
  2. Install the following package on all nodes of the cluster:
     sudo zypper in fence-agents-azure-arm
    
  3. Take the cluster out of maintenance mode:
     sudo crm configure property maintenance-mode=false
    
  4. Make sure that the fence agent issue is resolved. To do this, run crm status to check the cluster status.
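
Before or after the installation, you can confirm on each node whether the agent executable is actually present. The following is a minimal sketch. The function name fence_agent_present is hypothetical, and the default path is the one named in the "Cannot execute" log entries above.

```shell
# Return 0 if the fence agent executable exists at the given path
# (default: /usr/sbin/fence_azure_arm, the path named in the logs).
fence_agent_present() {
    [ -x "${1:-/usr/sbin/fence_azure_arm}" ]
}

# Usage (on each cluster node):
#   rpm -q fence-agents-azure-arm
#   if fence_agent_present; then echo "agent found"; else echo "agent missing"; fi
```

The rpm -q query reports whether the package is installed, and the function checks that the file is in place and executable.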

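Next steps

If you need additional help, open a support request by using the following instructions:

Contact us for help

If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to Azure feedback community.

When you submit the request, attach a copy of debug-fence.out for troubleshooting.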

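Third-party information disclaimer

The third-party products that are mentioned in this article are provided by companies other than Microsoft. Microsoft makes no warranty, implied or otherwise, about the performance or reliability of these products.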