This article describes how to scale an Internet of Things (IoT) solution with a scale-out pattern on the Azure IoT Hub platform. The scale-out pattern solves scaling challenges by adding instances to a deployment, rather than increasing instance size. The implementation guidance here shows you how to scale an IoT solution with millions of devices and account for the service and subscription limits in Azure. The article outlines the low-touch and zero-touch deployment models of the scale-out pattern that you can adopt depending on your needs. For more information, see these articles:
Note
This document does not cover the Azure IoT Operations platform, which scales based on the hosting Kubernetes platform configuration.
You should always gather requirements before implementing a new IoT solution. Starting with the requirements helps ensure the implementation meets your business objectives. The business objectives and operational environment should drive the requirements you need to gather. At a minimum, you should know the following requirements.
Know the types of devices you want to deploy. IoT encompasses a wide range of solutions, from simple microcontrollers (MCUs) to midlevel system-on-chip (SoC) and microprocessor (MPU) solutions, to full PC-level designs. The device-side software capabilities directly influence the design of the solution.
Know how many devices you need to deploy. Some of the basic principles of implementing IoT solutions apply at all scales. Knowing the scale helps avoid designing a solution that is more complicated than necessary. A solution for 1,000 devices will have some fundamental differences from a solution for 1 million devices. A proof-of-concept (PoC) solution for 10,000 devices might not scale appropriately to 10 million devices if the target scale wasn't considered at the start of the solution design.
Knowing how many devices you need to deploy helps you pick the right Azure IoT service. IoT Hub and IoT Hub device provisioning service (DPS) scale differently. By design, a single DPS instance can route to multiple IoT Hub instances, so you need to consider the scale of each service individually with respect to the number of devices. However, limits don't exist in a vacuum. If limits are a concern on one service, they're usually a concern on others. Treat service limits as distinct, but related, quotas.
Document the anticipated device locations. Include not just physical location, but also power availability and internet connectivity. A solution that's deployed in a single geography (such as only in North America) is designed differently from a global solution. Likewise, an industrial IoT solution deployed in factories with full-time power differs from a fleet management solution that's deployed in motor vehicles with variable power and location. Also, the protocol you use and bandwidth available for device communications, to a gateway or directly to cloud services, affect design scalability at each layer. Hidden in this aspect is connectivity availability. Are the devices expected to remain connected to Azure, or do they run in a disconnected mode for extended periods?
Investigate data locality requirements. There might be legal, compliance, or customer requirements on where you can store data (such as telemetry) or metadata (data about the data, such as what devices exist) for the solution. These restrictions, if they exist, are a key input to the solution's geographical design.
Determine data exchange requirements. A solution that sends basic telemetry such as “current temperature” once an hour is different from a solution that uploads 1-MB sample files once every 10 minutes. A solution that's primarily a one-way, device-to-cloud (D2C) solution differs from a bidirectional device-to-cloud and cloud-to-device (C2D) solution. Also, product scalability limitations treat message size and message quantity as different dimensions.
Document expected high availability and disaster recovery requirements. Like any production solution, full IoT solution designs include availability (uptime) requirements. The design needs to cover both planned maintenance scenarios and unplanned downtime, including user error, environmental factors, and solution bugs. Such designs also need to have a documented recovery point objective (RPO) and recovery time objective (RTO) if a disaster occurs, such as a permanent region loss or malicious actors. Because this article focuses on device scale, there’s only a limited amount of information around high availability and disaster recovery (HA/DR) concerns.
Decide on a customer tenancy model (if appropriate). In a multitenant independent software vendor (ISV) solution, where the solution developer is creating a solution for external customers, the design must take into account how customer data is segregated and managed. The Azure Architecture Center discusses general patterns and has IoT-specific guidance.
Part of creating the solution is choosing which Azure IoT components, and possibly which other supporting Azure services, you use. Much of that effort is architectural, which is the focus of this document. Properly using the Azure IoT Hub and Azure IoT Hub DPS services can help you scale your solutions to millions of devices.
Azure IoT Hub is a managed service hosted in the cloud that acts as a central message hub for communication between an IoT application and its attached devices. It can be used alone or with Azure IoT Hub DPS.
Azure IoT Hub scales based on the combination of functionality desired and the number of messages per day or data per day desired. Three inputs are used for selecting the scaling of an instance: the tier, which determines the available functionality; the size, which determines the daily message and data limits for each unit; and the unit count, which multiplies the capacity of the selected size.
Other than the daily limits tied to the size and unit count, and the general functionality limits tied to the tier, there are per-second limits on throughput. There's also a soft limit of 1 million devices per IoT Hub instance. Although that limit is soft, a hard limit also exists. Even if you intend to request a limit increase, you should design with the soft limit as your design limit to avoid issues in the future. The data exchange requirements help guide the solution here. For more information, see other limits.
The requirements for your solution drive the necessary size and number of IoT hubs as a starting point. If you use IoT Hub DPS, Azure helps you distribute your workloads across multiple IoT Hub instances.
Azure IoT Hub device provisioning service (DPS) is a helper service for IoT Hub that enables zero-touch, just-in-time provisioning to the right IoT hub without requiring human intervention. There's a hard limit of 10 DPS instances per Azure subscription and a hard limit of 1 million registrations per service instance. Account for these service limits in your workload design to avoid issues in the future.
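The limits quoted in this article can be turned into a rough first-pass capacity estimate. The following is a minimal sketch, assuming that device count is the only constraint. It ignores the daily message quotas and per-second throughput limits, which can require more hubs than the device count alone suggests. The function name `plan_capacity` and the `headroom` parameter are illustrative and aren't part of any Azure SDK.

```python
import math

# Limits as quoted in this article; confirm them against current Azure documentation.
DEVICES_PER_IOT_HUB = 1_000_000     # soft limit, used here as the design limit
REGISTRATIONS_PER_DPS = 1_000_000   # hard limit per DPS instance
DPS_PER_SUBSCRIPTION = 10           # hard limit per Azure subscription


def plan_capacity(device_count: int, headroom: float = 0.0) -> dict:
    """Return minimum instance counts for a target fleet size.

    headroom is a hypothetical safety margin (0.5 means plan for 50% more devices).
    """
    planned = math.ceil(device_count * (1 + headroom))
    dps_instances = max(1, math.ceil(planned / REGISTRATIONS_PER_DPS))
    return {
        "planned_devices": planned,
        "iot_hubs": max(1, math.ceil(planned / DEVICES_PER_IOT_HUB)),
        "dps_instances": dps_instances,
        # More than 10 DPS instances implies more than one subscription.
        "subscriptions": math.ceil(dps_instances / DPS_PER_SUBSCRIPTION),
    }
```

For example, a fleet of 25 million devices needs at least 25 IoT hubs and 25 DPS instances under these assumptions, and the DPS instances must be spread across at least three subscriptions.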
Service instances for DPS are geographically located, but by default have a global public endpoint. Specific instances are accessed through ID scope. Because instances are in specific regions and each instance has its own ID scope, you should be able to configure ID scope for your devices.
A few critical shared resiliency concepts that you need to consider are transient fault handling, device location impact, and, for ISVs, software as a service (SaaS) data resiliency.
Understand transient fault handling. Any production distributed solution, whether it's on-premises or in the cloud, must be able to recover from transient (temporary) faults. Transient faults are sometimes considered more likely in a cloud solution because of:
As described at the Azure Architecture Center, transient fault handling requires that you have a retry capability built into your device code. There are multiple retry strategies (for example, exponential backoff with randomization, also known as exponential backoff with jitter) described in Transient fault handling. This article refers to those patterns without any further explanation. So refer to that page if you aren't familiar with them.
Different factors can affect the network connectivity of a device:
All these concerns also affect the timing of device availability and connectivity. For example, line-powered devices that are common in dense, urban environments (such as smart speakers) might go offline in large numbers all at once, and then come back online all at once. Possible scenarios include:
Because of the "many devices booting at once" scenario, cloud service concerns such as throttling (limiting the traffic allowed to a service) can affect even scenarios with assumed near-100% network connectivity.
Beyond network and quota issues, it’s also necessary to consider Azure service outages, which can affect a single service or a whole region. Whereas some services (such as IoT Hub) are geo-redundant, other services (such as DPS) store their data in a single region. Although this single-region storage might seem to restrict regional redundancy, it’s important to realize that you can link a single IoT hub to multiple DPS instances.
If regional redundancy is a concern, use the geode pattern, which is where you host a heterogeneous group of resources across different geographies. Similarly, a deployment stamp (also known as a scale stamp) applies this pattern to operate multiple workloads or tenants. For more information, see Deployment stamp patterns.
Understand device location impact. When architects select components, they must also understand that most Azure services are regional, even the ones like DPS with global endpoints. Exceptions include Azure Traffic Manager and Microsoft Entra ID. So the decisions you make for device location, data location, and metadata location (data about data: for example, Azure resource groups) are important inputs in your design.
The Azure Cloud Adoption Framework includes guidance on regional selection.
Understand independent software vendor (ISV) SaaS concerns. As an ISV offering SaaS, it's important to meet customers' expectations for availability and resiliency. ISVs must architect Azure services to be highly available, and they must consider the cost of resiliency and redundancy when billing the customer.
Segregate the cost of goods sold (COGS) based on customer data segregation for each software customer. This distinction is important when the end user isn't the same as the customer. For example, in a smart TV platform, the platform vendor's customer might be the television vendor, but the end user is the purchaser of the television. This segregation, driven by the customer tenancy model from the requirements, requires separate DPS and IoT Hub instances. The provisioning service must also have a unique customer identity, which can be indicated through a unique endpoint or device authentication process. For more information, see IoT multitenant guidance.
When discussing scaling IoT solutions, it’s appropriate to look at each service and how the services might interrelate. You can scale out your IoT solution across multiple DPS instances or by scaling out Azure IoT Hub.
Given DPS service limits, it’s often necessary to expand to multiple DPS instances. There are several ways you can approach device provisioning across multiple DPS instances, which break down into two broad categories: zero-touch and low-touch provisioning.
All the following approaches apply the previously described “stamp” concept for resiliency and for scaling out. This approach includes deploying Azure App Service in multiple regions with a tool such as Azure Traffic Manager or the global load balancer. For simplicity, it isn't shown in the following diagrams.
(1) Zero-touch provisioning with multiple DPS instances: For zero-touch (automated) provisioning, a proven strategy is for the device to request a DPS ID scope from a web API, which understands and balances devices across the horizontally scaled-out DPS instances. This action makes the web app a critical part of the provisioning process, so it must be scalable and highly available. There are three primary variations to this design.
The following diagram illustrates the first option: using a custom provisioning API that manages how to map the device to the appropriate DPS pool, which in turn maps (through standard DPS load balancing mechanisms) to the appropriate IoT Hub instance:
This design requires the device software to include the DPS SDK and manage the DPS enrollment process, which is the typical design for an Azure IoT device. But in a microcontroller environment, where device software size is a critical component of the design, it might not be acceptable, which would lead to another design.
(2) Zero-touch provisioning with a provisioning API: The second design moves the DPS call to the provisioning API. In this model, the device authentication against DPS is contained in the provisioning API, as is most of the retry logic. This process allows more advanced queuing scenarios and potentially simpler provisioning code in the device itself. It also allows for caching the assigned IoT hub to facilitate faster cloud-to-device messaging. The messages are sent without needing to interrogate DPS for the assigned Hub information:
The device makes a request to a provisioning API that’s hosted in an instance of Azure App Service. The provisioning API checks with its persistent database to see which instance is best to assign the device to based on existing device inventory, and then it determines the DPS ID scope. In this case, the database that’s proposed is an Azure Cosmos DB instance with multi-master write enabled (for cross-region high availability) that stores each device's assigned DPS. You can then use this database for tracking use of the DPS instances for all appropriate metrics (such as provision requests per minute, total provisioned devices, and so on). The database also allows you to supply a reprovision request by using the same DPS ID scope when appropriate. Authenticate the provisioning API in some way to prevent inappropriate provisioning requests. You can likely do this by using the same authentication that the provisioning service uses against DPS, for example, with a private key for an issued certificate. But other options are possible. For example, FastTrack for Azure (FTA) has worked with a customer that uses hardware unique identifiers as part of its service authentication process. The device manufacturing partner regularly supplies a list of unique identifiers to the device vendor to load into a database, which references the service behind the custom provisioning API.
The provisioning API performs the DPS provisioning process with the assigned ID scope, effectively acting as a DPS proxy.
The DPS results are forwarded to the device.
The device stores the IoT hub connection information in persistent storage, ideally in a secured storage location because the ID scope is part of the authentication against the DPS instance. The device uses this IoT hub connection information for later requests into the system.
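The assignment step that the provisioning API performs against its persistent database can be sketched as follows. This is a minimal illustration and not the design's actual implementation. An in-memory dictionary stands in for the Azure Cosmos DB store. Authentication, concurrency control, and the proxied DPS call itself are omitted. The class names, the fill-ratio balancing rule, and the ID scope values used with it are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DpsInstance:
    id_scope: str        # hypothetical ID scope value
    capacity: int        # design limit for registrations on this instance
    assigned: int = 0    # devices assigned so far


@dataclass
class ProvisioningDirectory:
    """In-memory stand-in for the persistent store behind the provisioning API."""

    instances: List[DpsInstance]
    assignments: Dict[str, str] = field(default_factory=dict)

    def assign(self, device_id: str) -> str:
        """Return the DPS ID scope for a device, reusing any earlier assignment."""
        if device_id in self.assignments:
            # Reprovision request: answer with the same ID scope as before.
            return self.assignments[device_id]
        candidates = [i for i in self.instances if i.assigned < i.capacity]
        if not candidates:
            raise RuntimeError("All DPS instances are at their design limit")
        # Choose the instance with the lowest fill ratio to keep the pool balanced.
        target = min(candidates, key=lambda i: i.assigned / i.capacity)
        target.assigned += 1
        self.assignments[device_id] = target.id_scope
        return target.id_scope
```

Because each assignment is recorded, the same store can also report per-instance usage, which is the tracking role that the steps above give to the database.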
This design avoids the need to reference the DPS SDK or the DPS service. It also avoids the need for storing or maintaining a DPS scope on the device. It allows for transfer of ownership scenarios as a result because the provisioning service can direct to the appropriate final customer DPS instance. However, it causes the provisioning API to somewhat duplicate DPS in concept, which might not be ideal.
(3) Zero-touch provisioning with transfer of ownership: A third possible zero-touch provisioning design is to use a factory-configured DPS instance as a starting point, and then redirect as necessary to other DPS instances. This design allows for provisioning without a custom provisioning API but it requires a management application to track DPS instances and supply redirection as necessary.
The management application requirements include tracking which DPS should be the active DPS for each specific device. You can use this approach for “transfer of ownership” scenarios, where the device vendor transfers ownership of the device from the vendor to the end device customer.
(4) Low-touch provisioning with multiple DPS instances: In some cases, such as in consumer-facing scenarios or with field deployment team devices, a common choice is to offer low-touch (user-assisted) provisioning. Examples of low-touch provisioning include a mobile application on an installer’s phone or a web-based application on a device gateway. In this case, the proven approach is to perform the same operations as in the zero-touch provisioning process, but the provisioning application transfers the details to the device.
There are other possible variations not detailed in this article. For example, you can adapt the architecture shown here by moving the DPS call to the provisioning API, as described earlier under zero-touch provisioning with a provisioning API. The goal is to make sure each tier is scalable, configurable, and readily deployable.
General DPS provisioning guidance: You should apply the following recommendations to your DPS deployment, which represent general best practices for this Azure service:
Don't provision on every boot. The DPS documentation specifies that the best practice is to avoid provisioning on every boot. For small use cases, it might seem reasonable to provision at every boot because that’s the shortest path to deployment. However, when scaling up to millions of devices, DPS can become a bottleneck, given its hard limit of 1,000 registrations per minute per service instance. Even device registration status lookup can be a bottleneck because it has a limit of 5 to 10 polling operations per second. Provisioning results are usually a static mapping to an IoT hub. So, unless your requirements include automated reprovisioning requests, it's best to provision only on demand. If you anticipate more traffic, scaling out to multiple DPS instances might be the only way to support such scenarios.
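A quick calculation shows why the registration rate matters at scale. The following sketch uses the 1,000 registrations per minute per instance figure quoted above and assumes an ideal case in which every request succeeds on the first try and the load is spread evenly across instances. The function name is illustrative.

```python
import math

REGISTRATIONS_PER_MINUTE_PER_DPS = 1_000  # limit as quoted in this article


def minutes_to_register(device_count: int, dps_instances: int = 1) -> int:
    """Best-case minutes needed for every device to register once."""
    if dps_instances < 1:
        raise ValueError("dps_instances must be at least 1")
    per_minute = REGISTRATIONS_PER_MINUTE_PER_DPS * dps_instances
    return math.ceil(device_count / per_minute)
```

Under these assumptions, 1 million devices that all provision through one DPS instance need at least 1,000 minutes (roughly 16.7 hours), and the same fleet spread across 10 instances needs at least 100 minutes. Retries and throttling only lengthen these figures.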
Use a staggered provisioning schedule. One recommendation for mitigating some of the time-based limitations is using a staggered provisioning schedule. For an initial provisioning, depending on the deployment requirements, this schedule might be based on a random offset of a few seconds, or it might be a maximum of many minutes.
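One way to stagger initial provisioning is for each device to wait a random offset inside an agreed window before it sends its first request. The sketch below is an assumption about how that might look on a device that can run Python. The function name and the injectable random generator are illustrative choices.

```python
import random
from typing import Optional


def provisioning_delay_seconds(max_window_seconds: float,
                               rng: Optional[random.Random] = None) -> float:
    """Pick a uniformly random start offset inside the provisioning window."""
    if max_window_seconds < 0:
        raise ValueError("max_window_seconds must be non-negative")
    rng = rng or random.Random()
    return rng.uniform(0.0, max_window_seconds)
```

A device would sleep for the returned number of seconds before it starts provisioning. Spreading N devices across a window of W seconds averages about N / W requests per second, so the window can be sized against the service limits described earlier.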
Always query status before requesting provisioning. As a best practice, devices should always query their status before requesting provisioning by using the Device Registration Status Lookup API. This call doesn't currently count as a billed item, and the limit is independent of the registration limit. The query operation is relatively quick compared to a provision request, which means that the device can validate its status and move on to the normal device workload more quickly. The appropriate device registration logic is documented in the large-scale deployment documentation.
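The query-first flow can be sketched as a small decision function. The `ProvisioningClient` protocol below is a hypothetical wrapper around the two DPS operations and isn't an Azure SDK type. The dictionary shape and the "assigned" status string are assumptions made for illustration.

```python
from typing import Optional, Protocol


class ProvisioningClient(Protocol):
    """Hypothetical wrapper around the DPS calls; not an Azure SDK type."""

    def lookup_status(self) -> Optional[dict]:
        """Return the current registration record, or None if there is none."""
        ...

    def register(self) -> dict:
        """Send a full provisioning (registration) request."""
        ...


def ensure_provisioned(client: ProvisioningClient) -> dict:
    """Reuse an existing assignment when one exists; register only otherwise."""
    current = client.lookup_status()
    if current and current.get("status") == "assigned":
        return current           # cheap path: no registration request is spent
    return client.register()     # no usable assignment yet, so provision
```

With this shape, a device that was already assigned never consumes a registration request on boot, and only devices without a usable assignment reach the rate-limited registration path.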
Follow provisioning API considerations. Several of the designs proposed here include a provisioning API. The provisioning API needs a backing metadata store such as Azure Cosmos DB. At these scale levels, it's best to implement a globally available and resilient design pattern for this API and its backing data store. The built-in multi-master, geo-redundant capabilities and latency guarantees in Azure Cosmos DB make it an excellent choice for this scenario. The key responsibilities of this API include:
Compared to scaling out DPS, scaling out Azure IoT Hub is relatively straightforward. As mentioned earlier, one of the benefits of DPS is that an instance can be linked to many IoT Hub instances. If DPS is used as is recommended for all Azure IoT solutions, scaling out IoT Hub is a matter of:
There are many best practices to follow and device-side considerations for scalable device design. Some of them are directly derived from anti-patterns experienced in the field. This section describes concepts that are key to a successfully scaled deployment.
Estimate workloads across different parts of the device lifecycle and scenarios within the lifecycle. Device registration workloads can vary greatly between development phases (pilot, development, production, decommissioning, end of life). In some cases, they can also vary based on external factors such as the previously mentioned blackout scenario. Designing for the “worst case” workload helps ensure success at scale.
Support reprovisioning on demand. You can offer this feature through a device command and an administrative user request, which is mentioned in the product documentation. This option enables transfer-of-ownership scenarios and factory-default scenarios.
Don’t reprovision when it’s not necessary. It’s unusual for an active, working device in the field to need reprovisioning because provisioning information is relatively static. Don’t reprovision without a good reason.
Check provisioning status if you must reprovision often, for example at every device boot. This situation can arise, for example, when a device doesn't have persistent storage available for the provisioning results. If the device provisioning status is in doubt, begin by querying the provisioning status. The query operation counts against a different quota than a provisioning operation and is faster than a provisioning operation. This query allows the device to validate the provisioning status before proceeding.
Ensure a good retry logic strategy. The device must have appropriate retry algorithms built in to the device code for both initial provisioning and later reprovisioning, such as the previously mentioned "exponential backoff with randomization." These scenarios might be different for the two use cases. Initial provisioning, by definition, might need to be more aggressive in the retry process than reprovisioning, depending on the use case. When throttled, DPS returns an HTTP 429 ("Too many requests") error code, like most Azure services. The Azure Architecture Center has guidance about retry and, more importantly, anti-patterns to avoid with respect to retry scenarios. The DPS documentation also has information on how to know what the service is recommending for a retry interval and how to calculate an appropriate jitter as part of its scaling guidance. The device location stability and connectivity access also influence the appropriate retry strategy. For example, if a device is known to be offline for periods of time, and the device can detect that it's offline, there’s no point in retrying online operations while the device is offline.
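The following is a minimal sketch of exponential backoff with randomization for a throttled provisioning call. `ThrottledError` is a hypothetical exception that stands in for however your client surfaces an HTTP 429 response. The base delay, cap, attempt count, and the way the service-suggested interval is combined with jitter are assumptions to tune for your deployment, not values from the DPS documentation.

```python
import random
import time
from typing import Callable, Optional


class ThrottledError(Exception):
    """Hypothetical error raised when the service answers HTTP 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("HTTP 429: Too many requests")
        self.retry_after = retry_after   # seconds suggested by the service, if any


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0, rng=random) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(operation: Callable[[], object], max_attempts: int = 6,
                    sleep: Callable[[float], None] = time.sleep, rng=random):
    """Run operation, retrying throttled calls with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError as err:
            if attempt == max_attempts - 1:
                raise                        # retry budget exhausted
            delay = backoff_delay(attempt, rng=rng)
            if err.retry_after is not None:
                # Never retry sooner than the service asked; add jitter on top.
                delay += err.retry_after
            sleep(delay)
```

The `sleep` and `rng` parameters are injectable so that the logic can be tested without real waiting. As the paragraph above notes, a device that can detect that it's offline should skip this loop entirely until connectivity returns.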
Support over-the-air (OTA) updates. Two simple update models are the use of device twin properties with automatic device management and the use of simple device commands. For more sophisticated update scenarios and reporting, see the preview of the Azure Device Update service. OTA updates allow for correcting defects in device code and for fundamental service reconfiguration (for example, DPS ID scope) if necessary.
Architect for certificate changes at all layers and all certificate uses. This recommendation is tied to the OTA update best practice. You must consider certificate rotation. The IoT Hub DPS documentation touches on this scenario from a device identity certificate viewpoint. However, it's important to remember as part of a device solution that other certificates are being used, such as for IoT Hub access, App Service access, and Azure Storage account access. The root certificate change across the Azure platform shows that you must anticipate changes at all layers. Also, use certificate pinning with caution, especially when certificates are outside the device manufacturer’s control.
Consider a reasonable "default" state. To resolve initial provisioning failures, have a reasonable disconnected or unprovisioned configuration, depending on the circumstances. If the device has a heavy interaction component as part of initial provisioning, the provision process can occur in the background while the user performs other provisioning tasks. In any case, implied in the use of a default is the use of an appropriate retry pattern and the proper use of the circuit breaker architectural pattern.
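A circuit breaker lets the device fall back to its default state instead of calling a failing provisioning endpoint repeatedly. The following is a minimal, single-threaded sketch of the pattern. The class name, thresholds, and injectable clock are illustrative assumptions.

```python
import time


class CircuitBreaker:
    """Blocks calls after repeated failures, then allows a trial call after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None    # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True          # closed: calls flow normally
        # Open: block until the timeout elapses, then let a trial call through.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None    # close the circuit again

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()   # open, or reopen after a failed trial
```

Device code would check `allow_request()` before each provisioning attempt, run with the unprovisioned default configuration while the circuit is open, and call `record_success()` or `record_failure()` after each attempt.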
Include endpoint configuration capabilities where appropriate. Allow configuration of the DPS ID scope, the DPS endpoint, or the custom provisioning service endpoint. The DPS endpoint isn't expected to change, but because you can change it on the device, you have greater flexibility. For example, consider the case of automated validation of the device provisioning process through integration testing without direct Azure access. Or consider the possibility of future provisioning scenarios not in place today, such as through a provisioning proxy service.
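Configurable endpoints can be modeled as a small settings object with defaults. In this sketch, the JSON field names and the `ProvisioningConfig` type are hypothetical, and the default value is the global DPS device endpoint. A test environment could override `dps_endpoint` or set `provisioning_api_url` without a firmware change.

```python
import json
from dataclasses import dataclass
from typing import Optional

DEFAULT_DPS_ENDPOINT = "global.azure-devices-provisioning.net"


@dataclass(frozen=True)
class ProvisioningConfig:
    id_scope: str                                   # DPS ID scope for this device
    dps_endpoint: str = DEFAULT_DPS_ENDPOINT        # overridable for tests or proxies
    provisioning_api_url: Optional[str] = None      # set only for custom-API designs


def load_config(raw_json: str) -> ProvisioningConfig:
    """Build the configuration from stored JSON, applying defaults for missing keys."""
    data = json.loads(raw_json)
    return ProvisioningConfig(
        id_scope=data["id_scope"],
        dps_endpoint=data.get("dps_endpoint", DEFAULT_DPS_ENDPOINT),
        provisioning_api_url=data.get("provisioning_api_url"),
    )
```

Because the values are data and not constants compiled into the firmware, an OTA configuration update can change them in the field.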
Use the Azure IoT SDKs for provisioning. Whether the DPS calls are on the device itself or in a custom provisioning API, using the Azure IoT SDKs means you get some best practices in the implementation "for free," and it allows cleaner support experiences. Because the SDKs are all published open source, it's possible to review how they work and to suggest changes. A combination of which device hardware you select and the available runtime or runtimes on the device primarily drive which SDK you select.
Device deployment is a key part of the device lifecycle, but it's outside the scope of this article because it’s dependent on the use case. The previously referenced discussion points around “transfer of ownership” might apply to the deployment and the patterns that involve a provisioning application (for example, a mobile application), but you select it based on the IoT device type in use.
An important part of your overall deployment is to monitor the solution from start to finish to make sure that the system performs appropriately. Because this article is explicitly focused on architecture and design and not on the operational aspects of the solution, discussing monitoring in detail is out of scope. However, at a high level, monitoring tools are built into Azure through Azure Monitor to ensure that the solution isn't hitting limits. For details, see these articles:
You can use these tools individually or as part of a more sophisticated Security Information and Event Management (SIEM) solution like Microsoft Sentinel.
The documentation includes the following monitoring patterns for monitoring the usage of DPS over time:
Scaling up an IoT solution to support millions, or even tens or hundreds of millions, of devices isn't a straightforward task. There are many factors to consider and various ways to solve the issues that arise at those scales. This article summarizes the concerns and supplies approaches for how to address those concerns in a successful deployment.
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal author:
Other contributors: