Review alerts on Azure Stack Edge
APPLIES TO: Azure Stack Edge Pro - GPU, Azure Stack Edge Pro 2, Azure Stack Edge Pro R, Azure Stack Edge Mini R
This article describes how to view alerts and interpret alert severity for events on your Azure Stack Edge devices. The alerts generate notifications in the Azure portal. The article includes a quick-reference for Azure Stack Edge alerts.
Overview
The Alerts blade for an Azure Stack Edge device lets you review device-related alerts in real time. From this blade, you can centrally monitor health issues on your Azure Stack Edge devices and across the overall Microsoft Azure Stack Edge solution.
The initial display is a high-level summary of alerts at each severity level. You can drill down to see individual alerts at each severity level.
Alert severity levels
Alerts have different severity levels, depending on the impact of the alert situation and the need for a response to the alert. The severity levels are:
- Critical – This alert is in response to a condition that is affecting the successful performance of your system. Action is required to ensure that the Azure Stack Edge service is not interrupted.
- Warning – This condition could become critical if not resolved. You should investigate the situation and take any action required to resolve the issue.
- Informational – This alert contains information that can be useful in tracking and managing your system.
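As a rough illustration of how these levels might drive triage, the following Python sketch counts alerts per severity level and orders them from most to least urgent. The `Alert` record, the function names, and the sample list are hypothetical and are not part of any Azure SDK; only the three severity names and the sample alert texts come from this article.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower value = more urgent, so an ascending sort puts Critical first.
    CRITICAL = 0
    WARNING = 1
    INFORMATIONAL = 2


@dataclass(frozen=True)
class Alert:
    # Hypothetical record used only for this illustration.
    text: str
    severity: Severity


def summarize(alerts):
    """Return the number of alerts at each severity level."""
    counts = Counter(a.severity for a in alerts)
    return {level.name.capitalize(): counts.get(level, 0) for level in Severity}


def triage(alerts):
    """Return alerts ordered from most to least urgent (stable within a level)."""
    return sorted(alerts, key=lambda a: a.severity)


sample = [
    Alert("Device password has changed", Severity.INFORMATIONAL),
    Alert("Lost heartbeat from your device.", Severity.CRITICAL),
    Alert("IoT Edge agent is not running.", Severity.WARNING),
    Alert("Edge compute is unhealthy.", Severity.CRITICAL),
]

print(summarize(sample))
for alert in triage(sample):
    print(f"{alert.severity.name.capitalize():<13} {alert.text}")
```

The summary mirrors the high-level view described in the overview, and the sorted list mirrors drilling down into individual alerts.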
Configure alert notifications
You can also send alert notifications by email for events on your Azure Stack Edge devices. To manage these alert notifications, you create action rules. The action rules can trigger or suppress alert notifications for device events within a resource group, an Azure subscription, or on a device. For more information, see Using action rules to manage alert notifications.
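To illustrate the three scopes, the sketch below treats a scope as an Azure resource ID prefix and checks whether a device falls under it. This is a simplified model, not a description of how Azure evaluates action rules. The function name and the sample subscription, resource group, and device names are hypothetical, and the `Microsoft.DataBoxEdge/dataBoxEdgeDevices` resource type is an assumption made for the example.

```python
def in_scope(device_id: str, scope: str) -> bool:
    """Return True if the device resource ID sits at or under the given scope.

    The comparison is case-insensitive and works on whole path segments,
    so a scope for resource group 'rg-edge' does not match 'rg-edge-2'.
    """
    device_parts = device_id.strip("/").lower().split("/")
    scope_parts = scope.strip("/").lower().split("/")
    return device_parts[: len(scope_parts)] == scope_parts


# Hypothetical identifiers for illustration only.
SUBSCRIPTION = "/subscriptions/00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = SUBSCRIPTION + "/resourceGroups/rg-edge"
DEVICE = (
    RESOURCE_GROUP
    + "/providers/Microsoft.DataBoxEdge/dataBoxEdgeDevices/edge-device-01"
)

for scope in (SUBSCRIPTION, RESOURCE_GROUP, DEVICE):
    print(scope, "->", in_scope(DEVICE, scope))

# A rule scoped to a different resource group does not cover the device.
print(in_scope(DEVICE, SUBSCRIPTION + "/resourceGroups/rg-other"))
```

A rule at subscription scope covers every device in the subscription, while a rule scoped to a single device covers only that device.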
Alerts quick-reference
The following tables list some of the Azure Stack Edge alerts that you might run across, with descriptions and recommended actions. The alerts are grouped in the following categories:
- Cloud connectivity alerts
- Edge compute alerts
- Local Azure Resource Manager alerts
- Performance alerts
- Storage alerts
- Security alerts
- Key vault alerts
- Hardware alerts
- Update alerts
- Virtual machine alerts
Note
In the alerts tables below, some alerts are triggered by more than one event type. If the events have different recommended actions, the table has an alert entry for each of the events.
Cloud connectivity alerts
The following alerts are raised by a failed connection to an Azure Stack Edge device or when no heartbeat is detected.
Alert text | Severity | Description / Recommended action |
---|---|---|
Could not connect to the Azure. | Critical | Check your internet connection. In the local web UI of the device, go to Troubleshooting > Diagnostic tests. Run the Internet connectivity diagnostic test. |
Lost heartbeat from your device. | Critical | If your device is offline, then the device is not able to communicate with the Azure service. This could be due to one of the following reasons:
|
Edge compute alerts
The following alerts are raised for Edge compute or the compute acceleration card, which can be a Graphics Processing Unit (GPU) or Vision Processing Unit (VPU) depending on the device model.
Alert text | Severity | Description / Recommended action |
---|---|---|
Edge compute is unhealthy. | Critical | Restart your device to resolve the issue. In the local web UI of your device, go to Maintenance > Power settings and click Restart. If the problem persists, contact Microsoft Support. |
Edge compute ran into an issue with name resolution. | Critical | Ensure that your DNS server {15} is online and reachable. If the problem persists, contact your network administrator. |
Compute acceleration card configuration has an issue.* | Critical | We've detected an unsupported compute acceleration card configuration. Before you contact Microsoft Support, follow these steps:
|
Compute acceleration card configuration has an issue.* | Critical | We've detected an unsupported compute acceleration card. Before you contact Microsoft Support, follow these steps:
|
Compute acceleration card configuration has an issue.* | Critical | This may be due to one of the following reasons:
If the issue persists, do the following:
|
Compute acceleration card configuration has an issue.* | Critical | This is due to an internal error. Before you contact Microsoft Support, follow these steps:
|
Compute acceleration card configuration has an issue.* | Critical | As your Azure IoT Machine Learning module starts up, you may see this transient issue. Wait a few minutes to see if the issue resolves. If the issue persists, do the following:
|
Compute acceleration card driver software is not running. | Critical | This is due to an internal error. Before you contact Microsoft Support, follow these steps:
|
Compute acceleration card on your device is unhealthy. | Critical | This is due to an internal error. Before you contact Microsoft Support, follow these steps:
|
Shutting down the compute acceleration card as the card temperature has exceeded the operating limit! | Critical | This is due to an internal error. Before you contact Microsoft Support, follow these steps:
|
Compute acceleration card performance is degraded. | Warning | This might be because the compute acceleration card has a high usage. Consider stopping or reducing the workload on the Azure IoT Machine Learning module. Before you contact Microsoft Support, follow these steps:
|
Compute acceleration card temperature is rising. | Warning | This might be because the compute acceleration card has a high usage. Consider stopping or reducing the workload on the Azure IoT Machine Learning module. Before you contact Microsoft Support, follow these steps:
|
Edge compute couldn’t access data on share {16}. | Warning | Verify that you can access share {16}. If you can access the share, it indicates an issue with Edge compute. To resolve the issue, restart your device. In the local web UI of your device, go to Maintenance > Power settings and click Restart. If the issue persists, contact Microsoft Support. |
Edge compute couldn’t access data on share {16}. This may be because the share doesn’t exist anymore. | Warning | If the share {16} does not exist, restart your device to resolve the issue. In the local web UI of your device, go to Maintenance > Power settings and click Restart. If the problem persists, contact Microsoft Support. |
IoT Edge agent is not running. | Warning | Restart your device to resolve the issue. In the local web UI of your device, go to Maintenance > Power settings and click Restart. If the problem persists, contact Microsoft Support. |
IoT Edge service is not running. | Warning | Restart your device to resolve the issue. In the local web UI of your device, go to Maintenance > Power settings and click Restart. If the problem persists, contact Microsoft Support. |
Storage used by Edge compute is getting full. | Warning | Contact Microsoft Support for next steps. |
Your Edge compute module {20} is disconnected from IoT Edge | Warning | Restart your device to resolve the issue. In the local web UI of your device, go to Maintenance > Power settings and click Restart. If the problem persists, contact Microsoft Support. |
Your Edge compute module(s) may be using a local mount point {15} that is different than the local mount point used by a share. | Warning | Ensure that the local mount point {15} used is the one that is mapped to the share. If the alert persists, contact Microsoft Support. |
* This alert is triggered by more than one event type, with different recommended actions.
Local Azure Resource Manager (ARM) alerts
The following alerts are raised by the local Azure Resource Manager (ARM), which is used to connect to the local APIs on Azure Stack Edge devices.
Alert text | Severity | Description / Recommended action |
---|---|---|
Specified service authentication certificate with thumbprint '{0}' does not have a private key | Critical | If the issue persists, contact Microsoft Support. |
Certificate with thumbprint '{0}' at location '{1}' is not found or not accessible. | Critical | If the issue persists, contact Microsoft Support. |
Unable to connect endpoint: '{0}' | Critical | If the issue persists, contact Microsoft Support. |
Error occurred during web request: '{0}' | Critical | If the issue persists, contact Microsoft Support. |
Request timed out for url: '{0}' | Critical | If the issue persists, contact Microsoft Support. |
Unable to get Token using login endpoint '{0}' for resource '{1}' | Critical | If the issue persists, contact Microsoft Support. |
Unknown error occurred. ErrorCode:'{0}'. Details: '{1}' | Critical | If the issue persists, contact Microsoft Support. |
Could not start the VM service on the device. | Critical | If you see this alert, contact Microsoft Support. |
VM service is not running on the device. | Critical | If you see this alert, contact Microsoft Support. |
Performance alerts
The following alerts indicate performance issues related to storage or to CPU, memory, or disk usage on an Azure Stack Edge device.
Alert text | Severity | Description / Recommended action |
---|---|---|
The CPU utilization on your device has exceeded the threshold for an extended duration. | Critical | Reduce workloads or modules running on your device. If the problem persists, contact Microsoft Support. |
The CPUs reserved for the virtual machines on your device exceeds the configured threshold. | Critical | Take one of the following steps:
|
The memory used by the virtual machines on your device exceeds the configured threshold. | Critical | Take one of the following steps:
|
The data volume on the device is {0}% full. Writes into the device are being throttled. | Critical |
|
The memory used by the virtual machines on node {0} of your device exceeds the configured threshold. | Critical | The device will try to balance load across other nodes. Consider reducing some virtual machine workloads from your device. If the problem persists, contact Microsoft Support. |
Your device is almost out of storage space. If a disk fails, then you may not be able to restore data on this device. | Critical | Delete data to free up capacity on your device. |
The CPU utilization on node {0} of your device has exceeded the threshold for an extended duration.* | Warning | The device will try to balance load across other nodes. Consider reducing some virtual machine workloads from your device. If the problem persists, contact Microsoft Support. |
The CPU utilization on node {0} of your device has exceeded the threshold for an extended duration.* | Warning | Reduce workloads or modules running on your device. If the problem persists, contact Microsoft Support. |
The node {0} on your device is using more memory than expected. | Warning | If the problem persists, contact Microsoft Support. |
The CPUs reserved for the virtual machines on node {0} of your device exceeds the configured threshold. | Warning | Take one of the following steps:
|
The memory used by the virtual machines on your device exceeds the configured threshold. | Warning | Take one of the following steps:
|
Too many virtual machines are active on node {0} of your device. | Warning | The device will try to balance load across other nodes. Consider reducing some virtual machine workloads from your device. If the problem persists, contact Microsoft Support. |
The virtual hard disk {0} is nearing its capacity. | Warning | Delete some data to free capacity. |
* This alert is triggered by more than one event type, with different recommended actions.
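Many alert texts in these tables contain positional placeholders such as `{0}`, and some alert texts map to more than one recommended action, so looking up a concrete alert message takes a little pattern matching. The following Python sketch shows one possible approach. The rows in `QUICK_REFERENCE` are copied from the table above, while the function names and the sample node and disk names are hypothetical.

```python
import re

# (alert text, severity, recommended action) rows copied from this article.
QUICK_REFERENCE = [
    ("The CPU utilization on node {0} of your device has exceeded the "
     "threshold for an extended duration.",
     "Warning",
     "The device will try to balance load across other nodes. Consider "
     "reducing some virtual machine workloads from your device. If the "
     "problem persists, contact Microsoft Support."),
    ("The CPU utilization on node {0} of your device has exceeded the "
     "threshold for an extended duration.",
     "Warning",
     "Reduce workloads or modules running on your device. If the problem "
     "persists, contact Microsoft Support."),
    ("The virtual hard disk {0} is nearing its capacity.",
     "Warning",
     "Delete some data to free capacity."),
]

_PLACEHOLDER = re.compile(r"\{(\d+)\}")


def template_to_regex(template: str) -> re.Pattern:
    """Turn an alert template into a regex that captures each placeholder."""
    parts, last, seen = [], 0, set()
    for m in _PLACEHOLDER.finditer(template):
        parts.append(re.escape(template[last:m.start()]))
        name = f"p{m.group(1)}"
        if name in seen:
            # A repeated placeholder must match the same value again.
            parts.append(f"(?P={name})")
        else:
            parts.append(f"(?P<{name}>.+?)")
            seen.add(name)
        last = m.end()
    parts.append(re.escape(template[last:]))
    return re.compile("".join(parts))


def lookup(message: str):
    """Return every quick-reference row whose template matches the message."""
    results = []
    for template, severity, action in QUICK_REFERENCE:
        match = template_to_regex(template).fullmatch(message)
        if match:
            results.append((severity, action, match.groupdict()))
    return results


message = ("The CPU utilization on node node-1 of your device has exceeded "
           "the threshold for an extended duration.")
for severity, action, values in lookup(message):
    print(severity, values, action)
```

Returning a list rather than a single row keeps both recommended actions visible for alerts marked with an asterisk, leaving the choice between them to the reader.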
Storage alerts
The following alerts are for issues that occur when accessing or uploading data to Azure Storage.
Alert text | Severity | Description / Recommended action |
---|---|---|
Could not access volume {0}.* | Critical | This could happen when the volume is offline, or too many drives or servers have failed or are disconnected. Take the following steps:
|
Could not access volume {0}.* | Critical | In the local web UI of the device, go to Troubleshooting > Diagnostic tests, and click Run diagnostic tests. Resolve the reported issues. If the issue persists, contact Microsoft Support. |
Could not find volume {0}.* | Critical | If the issue persists, contact Microsoft Support. |
Could not find volume {0}.* | Critical, Warning | Expand the volume or migrate workloads to other volumes. |
Some data on this volume {0} is not fully resilient. It remains accessible. | Informational | Restoring resiliency of the data. |
Could not upload {0} files(s) from share {1}. | Critical | This could be due to one of the following reasons:
|
Could not connect to the storage account '{0}'.* | Critical | This may be because the storage account access keys have been regenerated. If the keys have been regenerated, you will need to synchronize the new keys. To fix the issue, in the Azure portal go to Shares, select the share, and refresh the storage keys. |
Could not connect to the storage account '{0}'.* | Critical | This may be due to Internet connectivity issues. The device is not able to communicate with the storage account service. In the local web UI of the device, go to Troubleshooting > Diagnostic tests and click Run diagnostic tests. Resolve the reported issues. |
The device has {0} files. A maximum of {1} files are supported. | Critical | Consider deleting some files from the device. |
Low throughput to and from Azure Storage detected. | Warning | In the local web UI of the device, go to Troubleshooting > Diagnostic tests and click Run diagnostic tests. Resolve the reported issues. If the issue persists, contact Microsoft Support. |
* This alert is triggered by more than one event type, with different recommended actions.
Security alerts
The following alerts signal access issues related to passwords, certificates, or keys, or report attempts to access an Azure Stack Edge device.
Alert text | Severity | Description / Recommended action |
---|---|---|
{0} from {1} expires in {2} days. | Critical, Warning | Check your certificate and upload a new certificate before the expiration date. |
{0} of type {1} is not valid. | Critical | Check your certificate. If the certificate is not valid, upload a new certificate. |
Internal certificate rotation failure | Critical | Couldn't rotate the internal certificates. If services are impaired, contact Microsoft Support. |
Could not login '{0}'. Number of failed attempts : '{1}'. | Critical, Warning, Informational | Make sure that you have entered the correct password. An authorized user may be attempting to connect to your device with an incorrect password. Verify that these attempts were from a legitimate source. If you continue to see failed login attempts, contact your network administrator. |
Rotate SED key protector on node {0}, did not complete in time. | Warning | The attempt to rotate the SED key protector to the new default did not complete in time. Check that the node and its physical disks are in a healthy state. The system will retry. |
Device password has changed | Informational | The device administrator password has changed. This is a required action as part of the first-time device setup or regular password reset. No further action is required. |
A support session is enabled. | Informational | This is an informational alert so that administrators can verify that enabling the support session is legitimate. No action is needed. |
A support session has started. | Informational | This is an informational alert so that administrators can verify that the support session is legitimate. No action is needed. |
Key Vault alerts
The following alerts relate to your Azure Key Vault configuration.
Alert text | Severity | Description / Recommended action |
---|---|---|
Key Vault is not configured* | Critical Warning |
|
Key Vault is not configured* | Warning | Configure the Key Vault for your Azure Stack Edge resource. For detailed steps, see Create a key vault. |
Key Vault is deleted | Critical | If the key vault is deleted and the purge protection duration of 90 days hasn't elapsed, follow the steps to Recover your key vault. |
Couldn’t retrieve secret(s) from the Key Vault | Critical |
|
Couldn’t access the Key Vault | Critical |
|
* This alert is triggered by more than one event type, with different recommended actions.
Hardware alerts
The following alerts indicate an issue with a hardware component, such as physical disk, NIC, or power supply unit, on an Azure Stack Edge device.
Alert text | Severity | Description / Recommended action |
---|---|---|
{0} on {1} has failed. | Critical | This is because the power supply is not connected properly or has failed. Take the following steps to resolve this issue:
|
Could not reach {1}. | Critical | If the controller is turned off, restart the controller. Make sure that the power supply is functional. For information on monitoring the power supply LEDs, go to https://www.microsoft.com/. If the issue persists, contact Microsoft Support. |
{0} is powered off. | Warning | Connect the Power Supply Unit to a Power Distribution Unit. |
One or more device components are not working properly. | Critical | Contact Microsoft Support for next steps. |
Could not replace {0}. | Warning | Contact Microsoft Support for next steps. |
Started the replacement of {0}. | Informational | No action is required from you. |
Successfully replaced {0} | Informational | No action is required from you. |
{0} is disconnected. | Warning | Verify that '{0}' is cabled properly and the network interface is up. |
{0} has failed.* | Critical | The device needs to be replaced. Contact Microsoft Support to replace the device. |
{0} has failed.* | Critical | Verify that '{0}' is cabled properly and the network interface is up. In the local web UI of the device, go to Troubleshooting > Diagnostic tests and click Run diagnostic tests. Resolve the reported issues. If the issue persists, contact Microsoft Support at https://aka.ms/getazuresupport. |
Some data on the cache physical disk {0} on node {1} can't be read, preventing us from moving it onto capacity drives. | Warning | Replace the physical disk. |
The cache physical disk {0} on node {1} failed some reads or writes, so to protect your data we've moved it onto capacity drives. | Warning | Replace the physical disk. |
The physical disk {0} on node {1} failed to read or write multiple times in the last couple of days. If this keeps happening, it could mean that the drive is malfunctioning, damaged, or beginning to fail. | Warning | If the issue persists, consider replacing the physical disk. |
The physical disk {0} on node {1} has issues with reads or writes. | Warning | If the issue persists, consider replacing the physical disk. |
The physical disk {0} on node {1} has reached 100% of its rated write endurance and is now read-only, meaning it cannot perform any more writes. | Warning | Consider replacing the physical disk. |
The physical disk {0} on node {1} has failed. | Warning | Replace the physical disk. |
The physical disk {0} on node {1} has issues.* | Warning | The physical disk has encountered multiple bad blocks during writes in the last couple of days. This could mean that the drive is malfunctioning, damaged, or beginning to fail. If the issue persists, consider replacing the physical disk. |
The physical disk {0} on node {1} has issues.* | Warning | The physical disk {0} on node {1} encountered multiple bad blocks during writes in the last couple of days. This could mean that the drive is malfunctioning, damaged, or beginning to fail. If the issue persists, consider replacing the physical disk. |
The physical disk {0} on node {1} has problems. | Warning | If the issue persists, consider replacing the physical disk. |
The physical disk {0} on node {1} is wearing out. It may become read-only, meaning it cannot perform any more writes, when it reaches 100% of its rated endurance. | Warning | Consider replacing the physical disk. |
The physical disk {0} on node {1} is performing slowly. | Warning | If the issue persists, consider replacing the physical disk. |
There is no connectivity to the physical disk {0} on node {1}. | Warning | Make sure that the physical disk is working and is properly connected. |
{0} has failed or is missing. | Critical | Your device is degraded. The device will become unhealthy if one more disk fails. Contact Microsoft Support to order a replacement disk. Replace the disk. |
The physical disk {0} on node {1} could fail soon. | Warning | Replace the physical disk. |
A disk replacement operation is being performed. PercentComplete = {0}, Disk = {2}. | Critical | This is an informational event. No action is required at this time. |
The physical disk {0} on node {1} has failed. | Warning | Replace the physical disk. |
The physical disk {0} on node {1} is not responding intermittently. | Warning | Replace the physical disk. |
The physical disk {0} on node {1} does not have current default SED key protector set on it. | Warning | The system will attempt to update the SED key protector to the latest default. If the issue persists, check that the drive is in a healthy state. |
The physical disk {0} on node {1} has failed rotation of SED key protector. | Warning | The attempt to rotate the SED key protector to the new default has failed. Check that the physical disk is in a healthy state. The system will retry. If the issue persists, replace the drive. |
The physical disk {0} on node {1} has unrecognized metadata. | Critical | The disk may contain data from an unknown storage pool. Replace this disk with a Microsoft supported disk for your device that does not contain any data. |
The physical disk {0} on node {1} is running an unsupported firmware version. | Warning | Contact Microsoft Support. |
The physical disk {0} on node {1} is not a supported disk. | Warning | Replace the physical disk with supported hardware. |
The temperature sensor on the motherboard of server {0} has raised a warning. | Warning | Check the node temperature. |
* This alert is triggered by more than one event type, with different recommended actions.
Update alerts
The following alerts relate to Microsoft updates and firmware updates for physical device components.
Alert text | Severity | Description / Recommended action |
---|---|---|
Could not download the updates. Error message : '{0}'. | Critical | {0} |
Could not install the updates. Error message : '{0}'. | Critical | Resolve the error : {0} |
Could not scan for updates. Error message : '{0}'. | Critical | Resolve the error : {0} |
{0} update(s) available. | Informational | We strongly recommend that you install these updates. For more information, see How to install updates. |
Could not update the disk firmware. | Critical | Contact Microsoft Support for next steps. |
Could not update the firmware on physical disk {0} on node {1}. | Warning | Contact Microsoft Support. |
Could not make progress as a firmware rollout is in progress. | Warning | Verify all storage spaces are healthy, and that no fault domain is currently in maintenance mode. |
Canceled the firmware rollout due to unreadable or unexpected version information after applying the firmware update. | Warning | Restart the firmware rollout after the firmware issue is resolved. |
Canceled the firmware rollout as firmware update on too many physical disks failed. | Warning | Restart the firmware rollout after the firmware issue is resolved. |
Started a disk firmware update. | Informational | No action is required from you. |
Successfully updated the disk firmware. | Informational | No action is required from you. |
A physical disk firmware rollout is in progress. PercentComplete = {0}. | Informational | This is an informational event. No action is required at this time. |
Virtual machine alerts
The following alerts are raised for virtual machines on an Azure Stack Edge device.
Alert text | Severity | Description / Recommended action |
---|---|---|
The virtual machine {0} is not healthy. | Warning | To troubleshoot the virtual machine, see https://aka.ms/vmtroubleshoot. |
The virtual machine {0} is not operating properly. | Warning | To troubleshoot the virtual machine, see https://aka.ms/vmtroubleshoot. |
Your virtual machine {0} is not running. | Warning | If the issue persists, delete and redeploy the virtual machine. |
The guest operating system in the virtual machine {0} is unhealthy. | Warning | To troubleshoot the virtual machine, see https://aka.ms/vmtroubleshoot. |
Your virtual machine {0} is almost out of memory. | Warning | Reduce the memory usage on your virtual machine. |
Your virtual machine {0} is not responding to host requests. | Warning | To troubleshoot the virtual machine, see https://aka.ms/vmtroubleshoot. |