Prepare

Completed

You'll be adding your own enhancements to an existing architecture that meets the organization's high reliability requirements. Here, we'll discuss the background context needed to be successful with the exercises.

Problem context

Contoso Shoes needs to be ready for their next high-profile product launch, which is expected to create a significant increase in traffic. In the last two years, there have been several incidents causing the website to be offline for as long as half a day. The system wasn't tested completely in the dev/test environment and some bugs crept into production. Troubleshooting and remediation took a long time because the operators weren't able to identify the root causes quickly.

There have been some challenges when certain components aren't available. The scale-out operations on compute were impacted when Azure Key Vault was misconfigured. Also, there aren't any strategies in place for regional outages. In a recent incident, the entire West Europe region went down. Because the workload was only running in that region, they had to bear financial loss until the region was back up.

Current architecture

For you to complete this challenge, you need to have a good understanding of Contoso Shoe's current architecture. The focus will be on their API layer.

Diagram of the basic architecture for a web application.

Components

All components of this architecture are deployed to a single region.

  • Azure App Service plan Standard S2 provides the compute platform that hosts the app. Autoscaling is enabled. In the development environment, Basic B1 SKU is used.

  • Azure App Service provides the application platform that runs the API code in a container. The App Service Authentication feature is enabled for authorization.

  • Deployment slots let you stage a deployment and then swap it with the production deployment. They're used in production only.

  • Azure Container Registry stores the containerized API code and is pushed through Continuous Integration/Continuous Delivery (CI/CD) pipelines created and managed by the workload team. The container registry is used by both production and dev/test environments.

  • Azure Cosmos DB with SQL API** stores all state related to the workload. The Cosmos DB database account has a single database that contains a few containers in the Shared throughput model. The Azure Cosmos account uses the Serverless capacity mode. There's one instance for production and one for dev/test.

  • Azure Key Vault stores secrets needed for the API to make an HTTP POST call to an external, 3rd-party API as part of one request flow. The application accesses the secrets through a Key Vault reference in the Azure App Service’s app configuration. There's one Key Vault for production and one for dev/test.

  • Azure Log Analytics is used as a unified sink to store logs and metrics for all Azure Diagnostics settings for all components used in the solution. There's one workspace for production and one for dev/test.

  • Azure Application Insights is used for capturing telemetry and logs from the API. Application Insights uses the self-contained mode, not writing to a dedicated log analytics workspace. Production and dev/test don't share a common instance.

  • Azure Pipelines is used for CI/CD that builds, tests, and deploys the workload in preproduction and production environments. The pipelines are managed by the workload team, which also manages all of the infrastructure in their solution. Bicep is the choice of technology for Infrastructure-as-Code (IaC).

Design choices

In the list of components, the deployment stamp consists of services that participate in processing of a request: Those services include App Services and the API code and Cosmos DB. The stamp also includes non-functional components: Key Vault, Container Registry. The application has a third-party dependency on a performance and resilience framework. System-managed identities are used between components of the stamp.

In the stamp, App Services is configured to automatically scale based on load.

Separate environments are used for production and dev/test. The production environment uses App Service plan Standard SKU. This choice was made to have the capability of prewarming the application to a slot before deploying it to production. In the dev/test environment, the SKU is lowered to the Basic SKU for cost optimization. Both environments have their own instances of services. Only the Container Registry is shared between the environments.

The containerized API code is delivered in a single container image that runs in App Service. The API has multiple HTTP endpoints that are used by various frontends for both reads and writes. The frontends are out of scope for this module. However, they're in scope in the big picture for the mission critical status of this solution. The code was instrumented with Application Insights to capture some basic telemetry. The team that developed this code also manages the CI/CD pipeline for the API container image and the CI/CD pipelines.

Tradeoffs

However, as with everything, there are tradeoffs with the current architecture. Business requirements prioritized cost optimization over reliability and operations. To keep within the cost limits, the architecture hasn't evolved. The components fall short when taking advantage of the reliability capabilities offered by the platform. For example, the choice of SKU for compute prevents the workload from using Availability Zones. For telemetry, an older version of Application Insights is used that isn't integrated with Log Analytics.

Also, access to the workload is overly pervasive. For example, without any virtual network integration, all Azure services can be directly reached over the public internet.

When the solution was developed, the app development team used a single Azure Subscription, colocating dev/test and production in the same subscription. The choice was made to make tooling easy for the DevOps teams. But, production resources and dev/test resources aren't completely isolated. Some resources are shared between the two environments. They did get an isolated subscription from the rest of the solutions from Contoso Shoes.

Also, the dev/test environment is a single environment that is shared across all members of the development and QA team. The choice was justified given the size of the teams and coordination between them didn't need a higher degree of isolation. As the team and solution evolved, the single dev/test environment increasingly caused integration complexity as workstream lifecycles collide. The churn and its impact on reliability have been expensive.

Project specification

The company wants to add capabilities to their solution architecture so that it's able to handle the expected increase in load. Here are the business requirements:

  • Build the capability to withstand regional failures by extending the architecture to multiple regions
  • Improve customer experience by serving clients faster in a region geographically closer to them
  • Align with Azure roadmap and take advantage of the latest reliability features offered by Azure services
  • Catch issues early and detect their cascading impact in the system by building an overall health model

Those requirements are only the prioritized list of their improvement plans. The application team is aware that all design areas must be considered to bring this solution's reliability up to mission critical standards. Rest assured, they won't stop improving their solution and operations after you've helped them with the aspects covered in the upcoming exercises. Welcome to the team, Contoso Shoes is looking forward to hearing your recommendations.

Setup

In this Challenge Project, you'll be taking on the role of an architect who will help Contoso shoes achieve their reliability outcomes, starting with the prioritized items above.

  • We recommend that you use the architecture diagramming tool to visualize the architecture.
  • You don't need an Azure subscription for this challenge if you're comfortable with the services and their features.