Collect and interpret error data

Error and event data is uploaded to the Azure Sphere Security Service daily. Anyone who has access to a particular catalog can then download the data for that catalog. The report covers all the devices in the catalog.

Each report contains a maximum of 1,000 events or 14 days of data, whichever is reached first. Data can be written to a file or piped to a script or application. The CLI returns at most 1,000 events; to control the maximum number of events returned per page, use the Azure Sphere Public API.

You can download data about the errors and other events that affect your devices in the following ways:

  • By using the az sphere catalog download-error-report command, which downloads a CSV file containing information on errors and events reported by devices in the current catalog.

  • By using the Azure Sphere Public API for error reporting. The API endpoint returns a JSON object that you can parse according to your needs.
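
Because the report data can be piped to a script, a small reader is a natural first tool. The following is a minimal Python sketch, not an official utility: it reads CSV text from any stream and totals the event counts per event class and type. The column names EventClass, EventType, and EventCount are assumptions based on the fields described in this article, and the sample rows are made up; check both against the header row of your own report.

```python
import csv
import io
from collections import Counter


def summarize_events(stream):
    """Total EventCount per (EventClass, EventType) pair.

    Assumes the CSV header uses the column names EventClass, EventType,
    and EventCount; adjust these if your report names them differently.
    """
    totals = Counter()
    for row in csv.DictReader(stream):
        key = (row["EventClass"], row["EventType"])
        totals[key] += int(row["EventCount"])
    return totals


# Made-up sample rows for illustration only; not real report data.
SAMPLE = io.StringIO(
    "EventClass,EventType,EventCount,Description\n"
    "App,Unplanned,3,AppCrash (exit_status=11)\n"
    "App,Planned,1,AppUpdate\n"
    "App,Unplanned,2,AppExit (exit_code=1)\n"
)

summary = summarize_events(SAMPLE)
for (event_class, event_type), count in sorted(summary.items()):
    print(f"{event_class:5} {event_type:10} {count}")
```

To process piped data instead of the sample, pass sys.stdin to summarize_events.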

No error reporting data is collected from RTApps. If you want to log errors from RTApps, you'll need to implement inter-core communications to communicate errors from the RTApps to the high-level application, from which the error data can be logged to network services.

Types of data available

The data returned for each error or event includes the following:

| Data | Description |
|---|---|
| Device ID | ID of the device that encountered the event. |
| Event Type | Whether the event was planned or unplanned. OS and app updates are considered planned events, whereas errors are unplanned events. |
| Event Class | Software component that encountered the event: OS or application. |
| Event Count | Number of times the event occurred within the period delimited by StartTime and EndTime. |
| Description | Information about the event, taken from the first occurrence of the event in the time window. This field is generic and varies depending on the event and its source. For applications, it may contain the exit code, signal status, and signal code, but the exact contents of the field are not fixed. |
| Start Time | Date and time (in UTC) at which the event window began. |
| End Time | Date and time (in UTC) at which the event window ended. |

The Start Time and End Time define a window of time during which event data is aggregated. The window for any aggregated group of events can be up to 24 hours, and at most 8 occurrences are recorded per time window.

Application events

Application events include cloud-loaded app updates along with crashes, exits, and other types of application failures.

Application updates are planned events. For an AppUpdate event, the Description field contains AppUpdate.

Application crashes, exits, start-up failures, and similar events are unplanned events. For an unplanned event, the contents of the Description field depend on the application that encountered the event. The following table lists the fields that may be present in the Description field for an unplanned event.

| Data | Description |
|---|---|
| exit_status or exit_code | Exit status or code reported by the application. |
| signal_status | Integer that describes the high-level reason for the crash, returned by the OS. You can find a list of statuses in the Man 7 documentation or other Linux resources. |
| signal_code | Integer that indicates the detailed crash status within the parent signal status. See the Man 7 documentation or other Linux resources for details. |
| component_id | GUID of the software component that crashed. |
| image_id | GUID of the image that was running at the time of the error. |

The specific information in an AppCrash description depends on the source of the crash. For most crashes, the description looks similar to the following:

AppCrash (exit_status=11; signal_status=11; signal_code=3; component_id=685f13af-25a5-40b2-8dd8-8cbc253ecbd8; image_id=7053e7b3-d2bb-431f-8d3a-173f52db9675)

In some cases, a crash triggers additional error data, such as the following, which supplements the data in the previous example:

AppCrash (pc=BEEED2EE; lr=BEEED2E5; sp=BEFFDE58; signo=11; errno=0; code=0; component_id=685f13af-25a5-40b2-8dd8-8cbc253ecbd8; pc_modulename+offset=appname+80000; lr_modulename+offset=app+100CC)

| Data | Description |
|---|---|
| pc | Program Counter. Points to the address of the instruction that triggered the crash. |
| lr | Link Register. Possibly points to the return address in the calling function. |
| sp | Stack Pointer. Points to the top of the call stack. |
| signo | POSIX signal. Indicates error type. |
| errno | POSIX errno. Indicates an error. |
| code | Indicates the detailed crash status within the parent signal status. |
| component_id | GUID of the software component that crashed. |
| pc_modulename+offset | Name of the module and offset into the module containing the code where the crash occurred. |
| lr_modulename+offset | Name of the module and offset into the module that might have been the calling function. |
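
Both description formats follow the same pattern: an event name followed by semicolon-separated key=value pairs in parentheses. The following Python sketch shows one way to split such a string into its parts. It assumes that pattern holds for the descriptions you receive; because the contents of the field are not fixed, the hypothetical helper parse_description returns an empty dictionary when no fields are found.

```python
import re

# Matches "Name (key=value; key=value)"; the parenthesized part is optional
# because some descriptions, such as AppUpdate, carry no fields.
_DESCRIPTION = re.compile(r"^\s*(?P<name>[^()]+?)\s*(?:\((?P<fields>.*)\))?\s*$")


def parse_description(description):
    """Split a Description string into (event_name, {field: value})."""
    match = _DESCRIPTION.match(description)
    if match is None:
        return description.strip(), {}
    fields = {}
    for part in (match.group("fields") or "").split(";"):
        key, sep, value = part.partition("=")
        if sep:  # skip fragments that are not key=value pairs
            fields[key.strip()] = value.strip()
    return match.group("name"), fields


name, fields = parse_description(
    "AppCrash (exit_status=11; signal_status=11; signal_code=3; "
    "component_id=685f13af-25a5-40b2-8dd8-8cbc253ecbd8; "
    "image_id=7053e7b3-d2bb-431f-8d3a-173f52db9675)"
)
print(name, fields["signal_status"], fields["signal_code"])
```

All values are returned as strings, so convert fields such as signal_status to integers before comparing them with signal numbers.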

Interpret AppCrashes

You can find most of the information about an AppCrash in the signal_status and signal_code. Follow these steps:

  1. Using the Man 7 documentation for signal_status, first look at the table labeled "Signal Numbering for Standard Signals." In the x86/ARM column, search for the value assigned to the signal_status in the error report CSV file. Once found, note the corresponding Signal name in the leftmost column.
  2. Scroll up to the table labeled "Standard Signals." Match the previously determined Signal name and use the table to gather more information about what the signal indicates.
  3. In the Man 7 documentation for signal_code, locate the list of si_codes that corresponds to the Signal name you previously found.
  4. Use the value assigned to the signal_code in the error report CSV file to determine which code matches the error message.

For example, consider the following AppCrash description:

AppCrash (exit_status=11; signal_status=11; signal_code=3; component_id=685f13af-25a5-40b2-8dd8-8cbc253ecbd8; image_id=7053e7b3-d2bb-431f-8d3a-173f52db9675)

Using the Man 7 documentation, you can discover the following additional information about the AppCrash:

  1. Signals are described in the 10th section of the description of the signal(7) man page. A signal_status of value 11 corresponds to a SIGSEGV signal.
  2. SIGSEGV indicates that an invalid memory reference occurred (this can often be a null pointer).
  3. The si_codes are described in the 3rd section of the description of the sigaction(2) man page, grouped by signal. Though the page does not list an index number for each si_code, you can count within each signal's list, beginning at index 1. By looking at the list of si_codes for SIGSEGV (beginning at index 1), you can see that the third is SEGV_BNDERR.
  4. SEGV_BNDERR indicates that a failed address bound check occurred.
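
The lookup steps above can be sketched as a pair of tables in Python. This is an illustrative subset only: the signal numbers come from the x86/ARM column of signal(7) and the si_code names from sigaction(2), and the helper describe_signal is a hypothetical name. Extend the tables from the man pages if you need more signals.

```python
# Signal numbers from the x86/ARM column of signal(7). Only a handful of
# entries are included; this is not a complete list.
SIGNALS = {
    4: "SIGILL",
    7: "SIGBUS",
    8: "SIGFPE",
    9: "SIGKILL",
    11: "SIGSEGV",
}

# si_code values from sigaction(2), numbered from 1 within each signal.
SI_CODES = {
    "SIGSEGV": {1: "SEGV_MAPERR", 2: "SEGV_ACCERR", 3: "SEGV_BNDERR"},
    "SIGBUS": {1: "BUS_ADRALN", 2: "BUS_ADRERR", 3: "BUS_OBJERR"},
}


def describe_signal(signal_status, signal_code):
    """Return (signal name, si_code name), using 'unknown' for gaps."""
    name = SIGNALS.get(signal_status, "unknown")
    code = SI_CODES.get(name, {}).get(signal_code, "unknown")
    return name, code


# The example description above has signal_status=11 and signal_code=3.
print(describe_signal(11, 3))
```

Values that are missing from the tables are reported as unknown rather than guessed, so an unfamiliar code is a prompt to consult the man pages.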

Note

A commonly encountered AppCrash includes a signal_status value of 9, which is a SIGKILL signal, along with the SEND_SIG_PRIV si_code. This status indicates that the OS killed the application because it exceeded its memory usage limit. To learn more about application memory limits, see Memory use in high-level applications.

Interpret AppExits

When an app exits without error, the signal_status and signal_code fields are not present, and instead of an exit_status, the Description contains an exit code:

AppExit (exit_code=0; component_id=685f13af-25a5-40b2-8dd8-8cbc253ecbd8; image_id=0a7cc3a2-f7c2-4478-8b02-723c1c6a85cd)

AppExits can occur for a number of reasons, such as an application update, a device being unplugged, or the use of the power down API, among others. It is important to implement exit codes so that you can gain insight into the reasons for an AppExit.

To interpret an AppExit, use the exit_code value in the Description field of the error report. If your app returns distinct exit codes, search the application code for the exit code that matches the reported value, then find the function that returns it and determine why it did so. The return statement and its context often reveal the reason for the exit.
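
If the application defines its exit codes in a C enum with explicit values, the search can be partly automated. The following Python sketch builds a value-to-name table from source text. The enum, its member names, and the ExitCode_ prefix are hypothetical; substitute the definitions from your own application.

```python
import re

# Hypothetical excerpt from an application's source; the enum and its
# member names are illustrative, not taken from a specific app.
SOURCE = """
typedef enum {
    ExitCode_Success = 0,
    ExitCode_TermHandler_SigTerm = 1,
    ExitCode_Init_OpenButton = 2,
    ExitCode_Main_EventLoopFail = 3
} ExitCode;
"""

# Matches enum members that are assigned an explicit integer value.
_MEMBER = re.compile(r"\b(ExitCode_\w+)\s*=\s*(\d+)")


def exit_code_names(source):
    """Map each explicit exit code value to its enum member name."""
    return {int(value): name for name, value in _MEMBER.findall(source)}


names = exit_code_names(SOURCE)
# Look up the exit_code value reported in the Description field.
print(names.get(2, "unknown exit code"))
```

Searching the application source for the resulting member name then leads to the return statement that produced the exit code.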

OS events

Error data also includes underlying OS and hardware events that may impact your application by causing it to fail or restart. Such events can include the following:

  • Unplanned device reboots caused by kernel errors
  • Cloud OS updates
  • Transient hardware problems

OS events are included in the data to help you determine whether application errors are the result of an OS or hardware problem or reflect problems with the application itself. If the event data shows that a device booted to Safe Mode, your apps might be unable to start.

Explore error data

If you plan to develop scripts or tools for analyzing error data, but you don't have a large number of devices available to report errors, you can use the Azure Sphere sample applications to generate such data for testing. The Tutorials/ErrorReporting sample in the Azure Sphere samples repo explains how to analyze errors reported when the application crashes. Follow the instructions in the readme to build the sample using Visual Studio, Visual Studio Code, or the command line.

When you deploy the app from the command line without a debugger, the OS restarts it each time it fails. Similar events are aggregated so that one frequently failing device doesn't mask errors from others, and at most eight occurrences are recorded per time window. You can deploy the sample from the command line without debugging, as follows:

az sphere device sideload deploy --image-package <path to image package for the app>

Generate and download error report

Error and event data is uploaded to the Azure Sphere Security Service daily. Make sure that the Azure Sphere device is connected to the internet using Wi-Fi or Ethernet for communicating with the Azure Sphere Security Service.

  1. Run the following command to download the report to a CSV file:

    az sphere catalog download-error-report --destination error.csv
    
  2. Open the downloaded CSV file and look for your component ID. You should see an error description similar to the following:

    AppExit (exit_code=0; component_id=685f13af-25a5-40b2-8dd8-8cbc253ecbd8; image_id=6d2646aa-c0ce-4e55-b7d6-7c206a7a6363)
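
Looking for a component ID by eye becomes tedious in larger reports. The following Python sketch filters report rows by component ID. It assumes the report has a column named Description (DeviceId is likewise an assumed name), and the sample rows are made up to stand in for the contents of error.csv.

```python
import csv
import io


def events_for_component(stream, component_id):
    """Yield report rows whose Description mentions the component ID.

    Assumes the report has a column named Description.
    """
    needle = f"component_id={component_id.lower()}"
    for row in csv.DictReader(stream):
        if needle in (row.get("Description") or "").lower():
            yield row


# Made-up rows that stand in for the contents of error.csv.
SAMPLE = io.StringIO(
    "DeviceId,Description\n"
    "device-a,AppExit (exit_code=0; component_id=685f13af-25a5-40b2-8dd8-8cbc253ecbd8; "
    "image_id=6d2646aa-c0ce-4e55-b7d6-7c206a7a6363)\n"
    "device-b,AppUpdate\n"
)

matches = list(events_for_component(SAMPLE, "685f13af-25a5-40b2-8dd8-8cbc253ecbd8"))
print(len(matches), matches[0]["Description"])
```

To run the filter against a downloaded report, replace SAMPLE with open("error.csv", newline="").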

You can also use the Azure Sphere Public API for error reporting.

Note

  • It may take up to 24 hours for recently reported events to be available for download.
  • If an event or error occurs before the device connects with an NTP server, the timestamp for the event contained in the telemetry uploaded to the Azure Sphere Security Service (AS3) may be incorrect. This will be reflected in an incorrect entry in the StartTime column in the subsequent report downloaded from AS3. In this situation, use the EndTime field of the report to help estimate when the event occurred. This field contains the time that the cloud services received the uploaded telemetry and will always have a valid date.
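
A script that processes the report can apply the EndTime fallback automatically. The following Python sketch uses a heuristic that is an assumption, not documented behavior: because an aggregation window can be up to 24 hours, a StartTime that is later than EndTime, or more than 24 hours earlier, is treated as unreliable. The function name estimate_event_time and the sample dates are illustrative.

```python
from datetime import datetime, timedelta

# The article states that an aggregation window can be up to 24 hours.
MAX_WINDOW = timedelta(hours=24)


def estimate_event_time(start_time, end_time, max_window=MAX_WINDOW):
    """Return StartTime when it is plausible, otherwise fall back to EndTime.

    Heuristic (an assumption, not documented behavior): a StartTime that is
    later than EndTime, or earlier than EndTime by more than the maximum
    window, was probably recorded before the device synchronized its clock.
    """
    if start_time > end_time or end_time - start_time > max_window:
        return end_time
    return start_time


end = datetime(2023, 5, 17, 8, 30)
print(estimate_event_time(datetime(2023, 5, 17, 2, 0), end))  # plausible start
print(estimate_event_time(datetime(1970, 1, 1, 0, 5), end))   # falls back to end
```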

Format error data

The timestamps and data columns in the error report file are formatted differently from a typical CSV file. If you want to view the results in Excel, you can reformat the data by creating new columns and adding custom formulas.

To format the timestamps in the exported CSV file to work with Excel:

  1. Create a new Timestamp column and create a custom format for it:

    yyyy/mm/dd hh:mm:ss

  2. Add the following formula to the cells in the new Timestamp column, changing the F2 cell value to match your column and row:

    =(DATEVALUE(LEFT(RawErrorReport!F2,10))+TIMEVALUE(RIGHT(RawErrorReport!F2,8)))
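
If you process the report with a script rather than Excel, the same conversion can be sketched in Python. Like the formula, this example takes the first 10 characters as the date and the last 8 as the time. The yyyy-mm-dd date layout is an assumption about the report format, and the sample value is illustrative.

```python
from datetime import datetime


def parse_report_timestamp(raw):
    """Parse a report timestamp the same way the Excel formula does.

    Uses the first 10 characters as the date and the last 8 as the time.
    The yyyy-mm-dd date layout is an assumption about the report format.
    """
    text = raw.strip()
    return datetime.strptime(f"{text[:10]} {text[-8:]}", "%Y-%m-%d %H:%M:%S")


# Illustrative value, not taken from a real report.
stamp = parse_report_timestamp("2023-05-17T08:30:15")
print(stamp.strftime("%Y/%m/%d %H:%M:%S"))
```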

To split the Description field into separate columns, follow these steps, changing the F2 cell value to match your column and row:

  1. Create a new column named Shortname or something similar, and add the following formula to the cells:

    =TRIM(LEFT(F2,FIND("(",F2)-1))

  2. Create columns in which the row 1 headers have the same names as the parameters in the Description field (for example, exit_code or signal_status), and add the following formula to the cells in each of the columns:

    =IF(ISERROR(FIND("; " & H$1 & "=", SUBSTITUTE($F2,"(","; "))), "", MID($F2, FIND("; " & H$1 & "=", SUBSTITUTE($F2,"(","; ")) + (LEN(H$1) + 2), FIND("; ", SUBSTITUTE($F2,")","; "), FIND("; " & H$1 & "=", SUBSTITUTE($F2,"(","; "))) - FIND("; " & H$1 & "=", SUBSTITUTE($F2,"(","; ")) - (LEN(H$1) + 2)))