Reliability in Community Training
Community Training is a cloud-based solution that enables delivery of large-scale, far-spread training programs with high quality and efficiency. Community Training is an Azure-powered online learning platform to enable organizations of all sizes and types to run large scale training programs for their internal and external communities such as first line workers, extended workforces, partner ecosystem, volunteer network, program beneficiaries, etc.
This article describes reliability support in Community Training, and covers both regional resiliency with availability zones and disaster recovery and business continuity. For a more detailed overview of a reliability principles in Azure, see Azure reliability.
Availability zone support
Control Plane
The control plane is zone redundant in the primary regions of availability.
Data Plane
Azure availability zones are physical locations within an Azure region that provide increased high availability and protection against datacenter failures. Each zone is made up of one or more datacenters with independent power, cooling, and networking. Go here for further details.
Community Training uses Azure availability zones to provide high availability and fault tolerance within an Azure region. It offers both Zonal and zone-redundant high availability. Customer can choose either of the redundancy options.
Prerequisites
Community Training offers following SKU:
- Standard (lower scale of users )
- Premium (high scale of users )
List of regions where community training will be supported:
Americas | Europe | Middle East | Africa | Asia Pacific |
---|---|---|---|---|
West US 2 | UK South | Southeast Asia | ||
South Central US | West Europe | Australia East | ||
East US 2 | North Europe | |||
East US | Sweden Central | |||
West US 3 |
Note: In the secondary region, zonal availability will be provided.
Zonal failover support
Community Training provide configuration for redundancy at the time of creation. Customer should choose zone-redundant high availability to avoid any zonal failure. Community Training internally uses standard Azure services like App service, Azure SQL etc. Choosing zonal redunancy option will create zonal redundant offering of underlying azure resources which will take care of the fault tolerance.
Zone down experience
Zonal
If customer configures Community training as zonal and if any zone is down then Community Training will not be available during the zone down. Service availabitlity depend on various factors (datacenter down, specific service not available in that zone etc), either complete service or part of features may not be available.
Zone redundant
In a zone-wide outage scenario, users should experience no impact on provisioned resources in a zone-redundant deployment. During a zone-wide outage, customers should be prepared to experience a brief interruption in communication with provisioned resources. Clients typically receive 409 error codes in this situation, and retry logic attempts to re-establish connections at appropriate intervals. New requests are directed to healthy nodes with zero impact on the user. During zone-wide outages, users are able to create new resources and successfully scale existing ones.
Disaster recovery and business continuity
Control Plane:
Regions with paired regions: The control plane will be manually failed over to the paired region in case of a disaster. The service might experience degradation in the time before the failover occurs. After the failover, only read-only operations will be supported until the disaster region is back online. The service will be manually failed back to the original region once it is back online and all operations will resume. Regions without paired regions: The product is not available in regions without paired regions.
RPO: 10 minutes
RTO: 24 hours*
Data plane:
Community Training offers Microsoft managed disaster recovery. The customer has to Enable Disaster Recovery, when the Community Training instance is created in Azure. Once the customer enables eisaster recovery, we keep the backup of storage and database in the paired region.
Note
The Disaster recover feature is only available for regions which support paired regions.
- RPO: 12 hour
- RTO: 48 hours
RTO depends on database and storage size, latency between the paired region, database or storage VM capacity (SKU).
RPO of community training dependends on underlying azure resources like Azure SQL and Azure storage. Go here for further details.
Once we detect an Community Training instance is down in any region, we take customer consent for failover to a paired region, explaining the list of features availabe during the disaster recovery. Once we get consent from the customer, their Community Training instance is failed over to paired the region.
In the paired region, the following features will not be available:
Learners will not be able to consume Audio/Video features until primary site is operational (As AMS is removing its support from non-popular regions from January).
Recommendation for customer is to avoid media upload operations until primary site is operational.
Outage detection, notification, and management
A Community Training health check will detect any outage of any service. On detecting outage, we will reach out to the customer if disaster recovery is enabled. On receiving consent from the customer, the disaster recovery procedure will be managed by the Community Training team. Disaster recover is neither active-active nor active-passive, it is based on recovery from the most recent available backup of azure services. We manually create all resources in the paired region from the data backup.
Capacity and proactive disaster recovery resiliency
Microsoft and it's customers operate under the shared responsibility model. Once any region is down, not only is the community training instance migrated to the paired region but other product and customer workload is also migrated. This may cause shortage for resources in the paired region or datacenter. Hence, Disaster recovery availability depends on the available capacity of underlying azure resources etc.
Please reach out to us via HelpDesk if you have any further queries.
Feedback
https://aka.ms/ContentUserFeedback.
Coming soon: Throughout 2024 we will be phasing out GitHub Issues as the feedback mechanism for content and replacing it with a new feedback system. For more information see:Submit and view feedback for