Best practices for remote troubleshooting of Azure Sphere devices

As you manage your devices remotely, you may encounter issues that prevent them from operating properly. This article includes a list of questions and flowcharts to help you triage your situation and determine what went wrong. Working through this guide can reduce device downtime and help you troubleshoot quickly on your own.

Start by walking through this preliminary checklist for your connectivity infrastructure:

  1. Ensure your network infrastructure is configured to allow the endpoints that Azure Sphere devices require by following the instructions in Azure Sphere OS networking requirements:
    1. To confirm the endpoints are properly configured, run the diagnostic checks in Solution design considerations.
    2. To determine if a device is connecting to Azure Sphere Security Services (AS3), run the command az sphere device list. Check the lastUpdateRequestUTC field, which provides the last time the device requested an update from Azure Sphere Security Services.
    3. If you are running custom NTP, ensure that your NTP server is up, that its time is within 24 hours of global time, and that it is set to the correct time zone.
  2. Check your application's Wi-Fi configuration settings.
  3. Check IoT Hub:
    1. Ensure your Azure Sphere Security Service certificate on IoT Hub is up to date.
    2. Check that IoT Hub servers are operational.
  4. Check that your devices are receiving enough power per your hardware solution's specifications.
  5. Check that Microsoft's Network Connectivity Status Indicator (NCSI) service is up and reachable by requesting http://www.msftconnecttest.com/connecttest.txt.
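
The lastUpdateRequestUTC check in step 1 can be automated once the output of az sphere device list is saved as JSON. The following is a minimal sketch, not an official tool. It assumes the output is a list of objects, that each object carries a hypothetical "deviceId" key alongside the lastUpdateRequestUTC field named above, and that 24 hours is a reasonable staleness threshold for your fleet. Adjust the key names and threshold to match what your CLI version actually returns.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(value):
    """Parse an ISO 8601 timestamp into a timezone-aware UTC datetime."""
    # Older Python versions do not accept a trailing "Z", so normalize it.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def stale_devices(records, now, max_age=timedelta(hours=24),
                  id_key="deviceId", time_key="lastUpdateRequestUTC"):
    """Return the IDs of devices with no update request within max_age.

    id_key is a hypothetical key name; time_key follows the field name
    used in this article. Both are parameters so they can be changed.
    """
    stale = []
    for record in records:
        raw = record.get(time_key)
        # A missing timestamp is treated as "never checked in".
        if not raw or now - parse_utc(raw) > max_age:
            stale.append(record.get(id_key))
    return stale
```

For example, load the saved output with json.load, then call stale_devices(records, datetime.now(timezone.utc)) to get the devices that deserve a closer look.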

Before checking other aspects of device health, consider the following preliminary questions:

How many devices are impacted? Is this the only device, or are there other devices?

  1. If a small number of devices are impacted, obtain their device IDs, run az sphere catalog download-error-report in the CLI, and analyze the report. See Collect and interpret error data for information about how to interpret the report.
  2. If many devices are impacted, continue on to the next section.

Triage device health

The following are some areas of consideration to help you triage the situation.

Check your devices' connectivity by tracing through the following flowchart: connectivity flowchart.

First, check your firewall settings. If you manage your firewall settings, follow the guidance in Azure Sphere OS networking requirements to check that your networking settings comply with Azure Sphere's requirements. For more information, see Troubleshoot network problem. If you do not manage your firewall settings, reach out to your firewall administrator for further guidance.

Next, look at northbound connectivity. If you use Wi-Fi to connect to the internet, answer these questions:

  • Are your devices in a crowded area? If they are, ensure that your settings use targeted scan. For more on targeted scan, see WifiConfig_SetTargetedScanEnabled Function. If they are not, reach out to Microsoft Support for further guidance.
  • Do you use EAP-TLS? If you do, check with your provider on certificate lifecycle management and refer to EAP-TLS certificate renewal. If you do not, ensure that your SSID and password haven't been changed.

If you use cellular to connect to the internet, ask your systems integrators or cellular service provider if your devices are showing up on the network.
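
The questions above can be written down as a small decision helper so that the same checks are applied to every report. This is only a sketch of one reading of the text, not a reproduction of the flowchart itself. The function name, parameter names, and the "wifi" and "cellular" labels are hypothetical, and the sketch covers only those two link types.

```python
def connectivity_checks(link_type, manages_firewall,
                        crowded_area=False, uses_eap_tls=False):
    """Return the suggested checks, in order, for one deployment."""
    checks = []

    # Firewall settings come first, whatever the link type.
    if manages_firewall:
        checks.append("Verify firewall settings against Azure Sphere OS "
                      "networking requirements.")
    else:
        checks.append("Ask your firewall administrator for guidance.")

    if link_type == "wifi":
        if crowded_area:
            checks.append("Ensure targeted scan is enabled "
                          "(see WifiConfig_SetTargetedScanEnabled).")
        else:
            checks.append("Reach out to Microsoft Support for guidance.")
        if uses_eap_tls:
            checks.append("Review certificate lifecycle management with "
                          "your provider.")
        else:
            checks.append("Confirm the SSID and password have not changed.")
    elif link_type == "cellular":
        checks.append("Ask your systems integrator or cellular provider "
                      "whether the devices appear on the network.")
    else:
        # Other link types are outside the scope of this sketch.
        raise ValueError(f"unsupported link type: {link_type!r}")
    return checks
```

Returning a list rather than printing keeps the helper easy to reuse in a ticketing script or a test.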

What's the scope of the issue? Trace through the following flowchart: Scale of problem flowchart.

How many devices are encountering problems? If only a few devices are impacted, first check the Connectivity flowchart. Next, check the physical environment of the devices: are the devices unplugged, or has some change been made to the devices' hardware?

If the devices are plugged in and no change has been made to their hardware, get two or three device IDs and check the catalog error logs in one of these ways:

  • Run the command az sphere catalog download-error-report.
  • In the Azure portal, go to the resource menu and select the Device insights tab under the Monitoring heading.

Check the Description field. If the description includes any of the following, check the customer application logs for further guidance:

  • AppCrash
  • AppUpdate
  • AppExit

However, if the description includes any of the following, reach out to Microsoft Support:

  • SystemAppCrash
  • Kernel Panic
  • Kernel Oops
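
When the report covers more than a handful of events, the two lists above can be applied mechanically. The sketch below assumes the downloaded report is CSV text with a column named "Description", matching the field mentioned above; the column name is a parameter in case your report differs. The function names and the three category labels are hypothetical. Note that "SystemAppCrash" contains "AppCrash", so the system-level markers must be tested first.

```python
import csv
import io
from collections import Counter

APP_LEVEL = ("AppCrash", "AppUpdate", "AppExit")
SYSTEM_LEVEL = ("SystemAppCrash", "Kernel Panic", "Kernel Oops")


def classify_description(description):
    """Map one Description value to a suggested next step."""
    text = description.lower()
    # Check system-level markers first: "SystemAppCrash" would otherwise
    # match the application-level marker "AppCrash".
    if any(marker.lower() in text for marker in SYSTEM_LEVEL):
        return "contact-microsoft-support"
    if any(marker.lower() in text for marker in APP_LEVEL):
        return "check-application-logs"
    return "unclassified"


def summarize_report(csv_text, description_column="Description"):
    """Count suggested next steps across all rows of a CSV error report."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(
        classify_description(row.get(description_column) or "")
        for row in reader
    )
```

A report dominated by "check-application-logs" points at your own application, while any "contact-microsoft-support" rows are worth escalating.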

If all devices have been affected, follow these steps:

  1. Have devices recently taken an OS update? Depending on which feed your device group is part of, you may have received an OS update notification. If the devices have taken an OS update, contact Microsoft Support. If they haven't, refer to the Connectivity flowchart. For more information on OS feeds, see Azure Sphere OS feeds.
  2. Have devices recently taken an application update? If they have, redeploy or roll back to a previous version of the application. If they haven't, contact Microsoft Support. For more information on over-the-air updates, see About over-the-air updates.

If you can get physical access to the devices

With physical access to the devices, you may wish to take these local troubleshooting steps:

  1. Can you rule out connectivity issues at that specific location? For example, is the building having issues with connectivity?
  2. Check the Ethernet section of the Connectivity flowchart. If you use Ethernet to connect to the internet, check your switch port. If the switch port is lighting up, power cycle the device. If it is not lighting up, check your firewall settings.
  3. Are the devices unplugged, or has some change been made on the devices' hardware? For example, are the sensors overexerted, or is the USB connector broken?
  4. Run the command az sphere get-support-data.
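
If you collect support data from several devices on a bench, wrapping the last step in a script keeps the file names consistent. This is a hedged sketch: it assumes the az CLI with the Azure Sphere extension is on your PATH and that the command accepts a --destination option naming the output archive. The function names and default file name are hypothetical. On Windows the az launcher is a .cmd file, so you may need to adapt how the command is invoked.

```python
import subprocess


def support_data_command(destination="support-data.zip"):
    """Build the argument list for az sphere get-support-data."""
    # Assumption: --destination sets the path of the output archive.
    return ["az", "sphere", "get-support-data", "--destination", destination]


def collect_support_data(destination="support-data.zip",
                         runner=subprocess.run):
    """Run the command and return True if it exited with status 0.

    runner is injectable so the function can be exercised without a
    device attached or the CLI installed.
    """
    completed = runner(support_data_command(destination), check=False)
    return completed.returncode == 0
```

Passing the arguments as a list avoids shell quoting problems when the destination path contains spaces.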