High-level core application design suggestions

High-level (HL) core applications run containerized on the Azure Sphere OS. During code and design reviews of customers' solutions, we've found several typical issues with HL-core applications. This topic discusses suggestions for design improvements to address these issues.

General fundamentals

To build an HL-core application on solid foundations, you should use fundamental best practices. The following are the most relevant:

  • Initialization and termination: Always make sure to handle the Azure Sphere OS's SIGTERM signal, and properly initialize and destroy all handlers (such as those for peripherals) upon exit, whether from a crash or an error. For more details, see Initialization and termination and the GNU documentation on Termination Signals.
  • Always use exit codes: Making sure the HL-core application always provides a meaningful return code upon exit or crash (for example, using the SIGTERM handler) is essential for properly diagnosing the device's behavior, especially from the device's crash dump telemetry. For more information, see Exit codes and Collect and interpret error data.
  • Ensure that failure cases always result in an application exit or crash rather than in a deadlock state: Elaborate failure recovery logic can be counterproductive as it can introduce bugs or behaviors resulting in a deadlock or a state that is difficult to diagnose. A well-designed Azure Sphere application should always prefer crashing or exiting (with a non-zero exit code) to a potential deadlock situation, as this results in both:
    • Error telemetry, enabling diagnostics for this issue
    • A chance of immediate recovery to a working state, since the Azure Sphere OS will restart the application
  • Error handling and logging: Precise error handling and logging are at the core of quality application development. Quick implementations of functionality can remain buried in layers of code and then be built over as the application grows to full scale. For more information on best practices, see Error handling and logging.
  • Use a system timer as a watchdog: One of the most crucial best practices is to implement a "watchdog timer" callback (much like the hardware ones available in bare-metal MCUs) that tracks critical application states, detecting deadlock and acting accordingly (for example, exiting and sending telemetry). For more information, see Use a system timer as a watchdog.
  • Never deploy production applications that have been built targeting a beta release toolset: Using beta release toolsets is not recommended because it cannot be guaranteed that the beta subset won't change in subsequent OS versions. Beta toolsets are released solely for testing new features in advance of an official SDK release.

Handling concurrency

  • Use EventLoop whenever possible: Threads and synchronization objects (that is, mutexes, semaphores, and so on) are used to accomplish nearly concurrent tasks, but within embedded systems these are expensive in terms of system resource usage. Therefore, to improve performance, consider using epolls instead of threads for tasks that are not strictly time critical and are not sensitive to mutual blocking. See Applibs eventloop.h for information on how to monitor and dispatch events with EventLoop, including related samples.
  • Look for efficiency in concurrent tasks: It's important to keep blocking operations and timeouts to a minimum within epoll callbacks; otherwise, all other epoll callbacks will be affected.
  • When to use threads (pthread): For specific scenarios, such as when blocking calls are unavoidable, using threads can be beneficial, although typically those scenarios would have a limited lifetime and should be scoped to specific tasks. For example, given that the Azure Sphere OS (running Linux) does not expose IRQs to HL-core applications (this is available only for RT-core Apps), using a combination of epoll and pthread tasks could be optimal in handling, for example, a downstream serial communication while downloading data from the internet.

Important

The Azure Sphere OS might interrupt time-sensitive operations, especially when it is performing device attestation, checking for updates, or uploading telemetry. For time-critical control tasks, consider moving them to the M4 cores and coordinating them with an appropriate protocol through the inter-core mailbox. For more information, see the Inter-core communication sample.

In addition to these suggestions, review the Azure Sphere documentation on Asynchronous events and concurrency.

Connectivity monitoring

A well-designed, high-level (HL) core application must implement a proper connectivity health-check task based on a robust state machine that regularly checks the status of the internet connection (for instance, using an epoll timer) by leveraging the Networking_IsNetworkingReady API. In some cases you can also use the Networking_GetInterfaceConnectionStatus function, which provides a more in-depth status of the connectivity state related to a specific network interface that the HL-core application can use to better address its state. This comes at a cost, however: it is not recommended to call it more frequently than every 90 seconds.

The state machine callback should typically have the following attributes:

  • Execute as quickly as possible.
  • Its polling interval must be carefully designed, based on the specific application scenario and overall solution requirements (such as constant time, incremental delay, and so on).
  • Once a disconnection is detected, it may be useful to call Networking_GetInterfaceConnectionStatus once to log the specific network interface's state, which can be used to diagnose the problem and notify the user through a UI (such as LEDs, display, terminal). A sample of this approach can be found in the main code of the Azure Sphere DHCP Sample.
  • Activate a mechanism (for example, through a global variable) that halts all other tasks in the HL-core application that perform (or are tied to) network communications to optimize resource consumption until a connection is reestablished.
  • cURL has recently updated its callback behavior and best practices. While Azure Sphere has taken efforts to ensure older versions of cURL behavior continue to work as expected, it is recommended to follow the latest guidance for security and reliability when using curl_multi, since recursive callbacks can result in unexpected crashes, connectivity outages, and potential security vulnerabilities. If a TimerCallback fires with a timeout of 0 ms, treat it as a timeout of 1 ms to avoid recursive callbacks. Also be sure to call curl_multi_socket_action explicitly at least once after calls to curl_multi_add_handle.
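
A minimal sketch of such a state machine is shown below. The isNetworkReady input stands in for the value filled in by Networking_IsNetworkingReady on-device, and the transition bodies mark where an application would log the interface status and gate its network-bound tasks:

```c
#include <stdbool.h>

typedef enum {
    ConnState_Unknown,
    ConnState_Connected,
    ConnState_Disconnected,
} ConnState;

// Global flag that other tasks check before attempting network I/O.
bool networkTasksEnabled = false;

// Called once per poll (for example, from an epoll timer callback).
// Returns the new connectivity state.
ConnState ConnectivityStep(ConnState current, bool isNetworkReady)
{
    if (isNetworkReady && current != ConnState_Connected) {
        // Reconnected (or first successful check): re-enable network tasks.
        networkTasksEnabled = true;
        return ConnState_Connected;
    }
    if (!isNetworkReady && current != ConnState_Disconnected) {
        // Just disconnected: this is the point to call
        // Networking_GetInterfaceConnectionStatus once for diagnostics,
        // notify the user (LEDs, display, terminal), and halt network tasks.
        networkTasksEnabled = false;
        return ConnState_Disconnected;
    }
    return current; // no transition
}
```

Because actions run only on transitions, the expensive diagnostic call happens once per disconnection rather than on every poll.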
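
The curl_multi guidance in the last bullet reduces to a small clamp applied inside the CURLMOPT_TIMERFUNCTION callback. It is sketched here as a standalone helper so it can be shown without linking libcurl; TimerCallback and ArmEpollTimer in the comment are hypothetical names:

```c
// Clamp the timeout passed to a CURLMOPT_TIMERFUNCTION callback: libcurl may
// ask for a 0 ms timeout, and servicing that by calling straight back into
// libcurl can recurse. A timeout of -1 means "delete the timer" and must be
// preserved unchanged.
long ClampCurlTimeoutMs(long timeoutMs)
{
    return (timeoutMs == 0) ? 1 : timeoutMs;
}

// Inside the real callback it would be used like this (libcurl types elided):
//
//   static int TimerCallback(CURLM *multi, long timeout_ms, void *userp)
//   {
//       ArmEpollTimer(ClampCurlTimeoutMs(timeout_ms)); // hypothetical helper
//       return 0;
//   }
//
// And after curl_multi_add_handle(), call curl_multi_socket_action() with
// CURL_SOCKET_TIMEOUT at least once to start the transfer's event flow.
```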

In addition to the previous suggestions, you should consider the following scenarios for power management:

  • Power down the Azure Sphere chip after sending data. For details, see Manage Power Down state for Azure Sphere devices.
  • Since several issues can result from long exponential back-off timeouts, it is critical to track the total uptime and set a shutdown timer to a reasonable limit, so the battery isn't drained in conditions where connectivity is no longer possible due to external outages or other factors beyond the application's control.
  • During connectivity outages, the Wi-Fi transceiver can be powered down by disabling the wlan0 network interface (see Networking_SetInterfaceState) and waiting until the next connectivity check, saving approximately 100 mW.

Memory management and usage

On memory-constrained platforms, applications that perform frequent memory allocations and de-allocations can make the OS's memory management less efficient, causing excessive fragmentation and eventually running out of memory. Specifically on the Azure Sphere MT3620, this can lead to out-of-memory conditions that trigger the Azure Sphere OS's cgroup OOM (out-of-memory) killer.

Understandably, applications are often developed starting from an initial proof-of-concept, which becomes more comprehensive with features required for progressive releases, eventually neglecting minor features that were initially included. The following are suggestions and optimizations that have proven effective for many scenarios analyzed in the field:

  • Especially within HL-core applications that make intensive use of memory, it is essential to track application memory usage through the Azure Sphere API, described in Determine run-time application RAM usage. Typically this is implemented in an epoll-timer watchdog and the application reacts accordingly to unexpected memory usage in order to restart in a reasonable manner; for example, exiting with the appropriate exit code.

    Several customers and partners have found it useful to use the Heap Tracker memory tracking utility, which is published in the Azure Sphere Gallery. This library transparently links to an existing HL-core application and tracks memory allocations and their related pointers, allowing simplified detection of most cases of memory leaks and pointer misuses.
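
The memory-tracking watchdog described above might look like the following sketch. The usage value would come from the run-time RAM usage API (such as Applications_GetTotalMemoryUsageInKB) on-device; the thresholds, type names, and result values here are illustrative:

```c
#include <stddef.h>

typedef struct {
    size_t warnLimitKb;  // log a warning above this
    size_t exitLimitKb;  // exit (and let the OS restart the app) above this
    size_t peakSeenKb;   // highest usage observed, useful in telemetry
} MemoryWatchdog;

typedef enum { MemCheck_Ok, MemCheck_Warn, MemCheck_Exit } MemCheckResult;

// Periodic check, typically run from an epoll-timer callback. On MemCheck_Exit
// the application should exit with a dedicated exit code so the restart is
// visible in crash dump telemetry.
MemCheckResult MemoryWatchdogCheck(MemoryWatchdog *wd, size_t usageKb)
{
    if (usageKb > wd->peakSeenKb) {
        wd->peakSeenKb = usageKb; // track the peak for later diagnostics
    }
    if (usageKb > wd->exitLimitKb) {
        return MemCheck_Exit; // restart cleanly before the OOM killer acts
    }
    if (usageKb > wd->warnLimitKb) {
        return MemCheck_Warn; // log/queue telemetry only
    }
    return MemCheck_Ok;
}
```

Exiting at a self-chosen threshold, rather than waiting for the OOM killer, keeps the failure observable and the exit code meaningful.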

Important

This practice can reduce apparently unexplained device unresponsiveness or failures that are often reported from the field. Such failures are usually caused by memory leaks or overruns that are not properly handled by the HL-core application and that lead the OOM killer to shut down the application's process. Combined with poor connectivity that blocks the Azure Sphere OS from sending telemetry, this can lead to field incidents that can only be diagnosed by pulling the Azure Sphere OS's diagnostic logs.

  • On memory-constrained platforms, it's generally preferable to avoid dynamic memory allocation whenever possible, especially within frequently called functions. This will greatly reduce the heap's memory fragmentation and the likelihood of subsequent heap allocation failures. Also consider a paradigm shift from repetitively allocating temporary work buffers to directly accessing the stack (for variables of reasonable sizes) or globally allocated buffers, which increase in size (through realloc) upon overflow (see Dynamic containers and buffers). If there is a requirement to offload memory, consider taking advantage of unused memory on the M4 cores (see Memory available on Azure Sphere), which have 256KiB each, with a lightweight RT-core application for data caching. Alternatively, you could use external SD cards or flash. Samples can be found at the following repos:
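
The grow-on-overflow buffer mentioned above (a single long-lived buffer resized through realloc, instead of per-call temporary allocations) can be sketched as follows; the initial capacity of 256 bytes is an arbitrary choice for illustration:

```c
#include <stddef.h>
#include <stdlib.h>

// One long-lived work buffer, grown geometrically instead of being allocated
// and freed on every call; this keeps heap fragmentation low.
static char *workBuffer = NULL;
static size_t workCapacity = 0;

// Returns a buffer of at least 'needed' bytes, or NULL if allocation fails.
char *EnsureWorkBuffer(size_t needed)
{
    if (needed <= workCapacity) {
        return workBuffer; // already large enough: no allocation at all
    }
    size_t newCapacity = (workCapacity == 0) ? 256 : workCapacity;
    while (newCapacity < needed) {
        newCapacity *= 2; // geometric growth amortizes realloc cost
    }
    char *grown = realloc(workBuffer, newCapacity);
    if (grown == NULL) {
        return NULL; // caller decides: degrade gracefully or exit with a code
    }
    workBuffer = grown;
    workCapacity = newCapacity;
    return workBuffer;
}

size_t WorkBufferCapacity(void)
{
    return workCapacity;
}
```

After an initial warm-up, steady-state calls perform no allocations, which also makes the application's worst-case footprint easy to measure and reserve.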

Following the above suggestions can also help in estimating and reserving the memory that would be needed for the HL-core application to work at full capacity across its lifecycle while allowing you to better estimate the application's overall memory footprint for later design optimizations. For more information about optimizing memory usage in HL-core applications, including features in the Azure Sphere OS and Visual Studio, refer to the following articles: