Refine your application platform for streamlined application development and infrastructure management

A big part of improving your organization's platform engineering practices is to evaluate your application platform. An application platform includes all the tools and services used to support development, deployment, and application management in your organization.

Simplify and standardize

Infrastructure as code (IaC) and automation tools can be combined with templates to standardize infrastructure and application deployment. To reduce the burden of platform specifics on end users, abstract platform details by breaking down choices into relatable naming conventions, for example:

  • Resource type categories (high compute, high memory)
  • Resource size categories (t-shirt sizing: small, medium, and large)
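The categories above can be sketched as a simple lookup that translates relatable names into tested presets. This is an illustrative sketch only; the SKU names and preset values are hypothetical, not real cloud SKUs.

```python
# Map relatable (type, size) categories to concrete, pre-tested platform
# presets so end users never deal with raw SKU names directly.
# All SKU names and values below are hypothetical examples.
PRESETS = {
    ("high-compute", "small"):  {"sku": "compute-opt-4cpu", "cpu": 4, "memory_gb": 8},
    ("high-compute", "medium"): {"sku": "compute-opt-8cpu", "cpu": 8, "memory_gb": 16},
    ("high-memory", "small"):   {"sku": "memory-opt-4cpu", "cpu": 4, "memory_gb": 32},
    ("high-memory", "medium"):  {"sku": "memory-opt-8cpu", "cpu": 8, "memory_gb": 64},
}

def resolve_preset(resource_type, size):
    """Translate a (type, size) choice into a tested preset configuration."""
    try:
        return PRESETS[(resource_type, size)]
    except KeyError:
        valid = ", ".join(f"{t}/{s}" for t, s in PRESETS)
        raise ValueError(f"Unknown category {resource_type}/{size}. Valid: {valid}")
```

Dev teams then request, say, `high-memory`/`small` and receive the vetted configuration without reviewing individual options.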

Templates should represent general requirements that have been tested with preset configurations, so dev teams can get started immediately by supplying minimal parameters, without needing to review every option. However, there will be occasions when teams need to change more options on published templates than are available or desirable. For example, an approved design might need a specific configuration that is outside the supported template defaults. In this instance, operations or platform engineering teams can create a one-off configuration, and then decide whether the template should incorporate those changes as a default.
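The merge order described above (tested defaults, then minimal user parameters, then a reviewed one-off exception) can be sketched as follows. This is a conceptual sketch, not any particular IaC tool's behavior; all parameter names are hypothetical.

```python
def render_deployment(template_defaults, user_params, approved_overrides=None):
    """Merge minimal user-supplied parameters over tested preset defaults.

    approved_overrides represents a one-off configuration created by the
    operations or platform engineering team after review, layered last.
    """
    config = dict(template_defaults)   # start from the tested baseline
    config.update(user_params)         # the few values dev teams must supply
    if approved_overrides:             # reviewed exception path, applied last
        config.update(approved_overrides)
    return config
```

For example, a team supplies only `{"app_name": "billing"}` and inherits the tested `node_count` and SKU; a one-off override replaces a single default without forking the whole template.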

You can track changes using IaC tools with drift detection features that can automatically remediate drift (GitOps). Examples of these tools are Terraform and cloud-native IaC tools (for example, Cluster API, Crossplane, and Azure Service Operator v2). Beyond IaC tool drift detection, there are cloud configuration tools that can query for resource configurations, such as Azure Resource Graph. These serve two purposes: you can monitor for changes made outside of your infrastructure code, and you can review changes to preset configurations. To avoid being too rigid, you can also implement tolerances in deployments with predefined limits. For example, you can use Azure Policy to limit the number of Kubernetes nodes that a deployment can have.
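At its core, drift detection compares the declared configuration against the live one and resets any differences. The sketch below illustrates that loop in miniature; real tools like Terraform or Crossplane do this against actual cloud APIs, and the property names here are hypothetical.

```python
def detect_drift(desired, actual):
    """Return properties whose live value differs from the IaC-declared value."""
    drift = {}
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            drift[key] = {"declared": want, "actual": have}
    return drift

def remediate(actual, drift):
    """GitOps-style remediation: reset drifted properties to declared values."""
    fixed = dict(actual)
    for key, change in drift.items():
        fixed[key] = change["declared"]
    return fixed
```

For example, if someone manually scales a cluster from 3 to 5 nodes outside of the infrastructure code, `detect_drift` surfaces the change and `remediate` restores the declared value.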

Self-managed or managed?

In public clouds you have the choice to consume SaaS, PaaS, or IaaS. To learn more about SaaS, PaaS, and IaaS, see the training module Describe cloud concepts. PaaS services offer streamlined development experiences but are more prescriptive with their app models. Ultimately, there's a trade-off between ease of use and control that you need to evaluate.

During platform design, evaluate and prioritize your organization's service needs. For example, whether you build apps directly on Azure Kubernetes Service (AKS) or through Azure Container Apps (ACA) depends on your requirements for the service and on your in-house capacity and skill set. The same goes for function-style services like Azure Functions or Azure App Service. ACA, Azure Functions, and App Service reduce complexity, while AKS provides more flexibility and control. More experimental app models like the OSS Radius incubation project try to provide a balance between the two, but are generally in earlier stages of maturity than cloud services with full support and a presence in established IaC formats.

The problems you identified when you planned should help you evaluate which end of this scale is right for you. Be sure to factor in your existing internal skill set as you make the decision.

Shared vs. dedicated resources

Within your organization, there are resources that can be shared by multiple applications to increase utilization and cost effectiveness. Each shared resource has its own set of considerations. For example, these are considerations for sharing Kubernetes clusters, but some apply to other types of resources as well:

  • Organization: Sharing resources like clusters within, rather than across, organizational boundaries can improve how they align with organizational direction, requirements, and priority.
  • Application tenancy: Applications can have different tenancy isolation requirements; review each application's security and regulatory compliance requirements to determine whether it can coexist with other applications. For example, in Kubernetes, applications can use namespace isolation. You should also consider application tenancy across environment types. For example, it's often best to avoid mixing test applications with production applications on the same clusters, to avoid unexpected impacts from misconfigurations or security issues. Alternatively, you might opt to first test and tune on dedicated Kubernetes clusters to track down these issues before deploying to a shared cluster. Regardless, consistency in your approach is the key to avoiding confusion and mistakes.
  • Resource consumption: Understand each application's resource usage and spare capacity, and project whether sharing is viable. Also be aware of limits on the resources consumed (such as data center capacity or subscription limits). The goal is to avoid having to move your application and its dependencies due to resource constraints in a shared environment, or having live-site incidents when capacity runs out. Use resource limits, representative testing, monitoring, alerting, and reporting to identify resource consumption and protect against applications consuming too many resources.
  • Optimize shared configurations: Shared resources such as shared clusters require extra consideration and configuration. These considerations include cross-charging, resource allocation, permissions management, workload ownership, data sharing, upgrade coordination, workload placement, capacity management, and establishing, managing, and iterating a baseline configuration. Shared resources have benefits, but if the standard configurations are too restrictive and don't evolve, they become obsolete.
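The resource consumption projection described above can be sketched as a simple viability check. The growth factor and headroom values are illustrative assumptions, not recommendations; in practice you would derive them from monitoring data and representative testing.

```python
def sharing_is_viable(cluster_capacity, app_usages, growth_factor=1.3, headroom=0.2):
    """Project whether a set of apps fits on a shared cluster.

    growth_factor inflates current usage into a forecast; headroom reserves
    spare capacity so onboarding new apps doesn't cause live-site incidents.
    All units are abstract capacity units (e.g., CPU cores or memory GB).
    """
    projected = sum(app_usages) * growth_factor
    usable = cluster_capacity * (1 - headroom)
    return projected <= usable
```

If the projection fails, the application is a candidate for a dedicated resource instead of the shared one.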

Implement governance strategies

Governance is a key part of enabling self-service with guardrails, but applying compliance rules in a way that doesn't impact time to business value for applications is a common challenge. There are two parts of governance:

  • Initial deployment compliance (start right): This can be achieved with standardized IaC templates that are made available through catalogs, with permission management and policies to ensure only allowed resources and configurations can be deployed.
  • Maintaining compliance (stay right): Policy-based tools can prevent or alert you when resources change. Beyond your core infrastructure, consider tools that also support compliance inside resources like Kubernetes, along with the OSs used in your containers or VMs. For example, you might want to enforce a locked-down OS configuration or install security tooling using Windows Group Policy, SELinux, AppArmor, Azure Policy, or Kyverno. If developers only have access to IaC repositories, you can add approval workflows to review proposed changes and prevent direct access to resource control planes (for example, the Azure control plane).

Maintaining compliance requires tooling to assess, report on, and act on issues. For example, Azure Policy can be used with many Azure services for auditing, reporting, and remediation. It also has different modes, such as Audit, Deny, and DeployIfNotExists, depending on your needs.
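Conceptually, these policy modes differ only in the effect applied to a non-compliant resource. The sketch below is a generic illustration of audit-versus-deny semantics, not Azure Policy's actual engine or rule language; the node-count rule is a hypothetical example.

```python
def evaluate_policy(resource, rule, mode):
    """Evaluate one policy rule against a resource configuration.

    'audit' records the violation for reporting; 'deny' blocks the deployment.
    """
    if rule(resource):
        return {"effect": "allow"}
    if mode == "deny":
        return {"effect": "deny", "reason": "resource violates policy"}
    return {"effect": "audit", "reason": "non-compliant, flagged for review"}

# Hypothetical rule: cap Kubernetes node counts, similar in spirit to the
# Azure Policy node-limit example in the text.
def max_nodes_rule(resource):
    return resource.get("node_count", 0) <= 10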

While policies can enforce compliance, they can also break applications unexpectedly. Therefore, consider evolving to a policy as code (PaC) practice when operating at scale. As a key part of your start right and stay right approach, PaC provides:

  • Centrally managed standards
  • Version control for your policies
  • Automated testing & validation
  • Reduced time to roll out
  • Continuous deployment

PaC can help to minimize the blast radius of a potentially bad policy with capabilities such as:

  • Policy definitions stored as code in a repository that is reviewed and approved.
  • Automation to provide testing and validation.
  • Ring-based gradual rollout of policies and remediation on existing resources.
  • Remediation tasks with built-in safety controls, such as stopping the task if more than 90 percent of deployments fail.
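The ring-based rollout and safety-stop behavior above can be sketched as follows. This is a conceptual sketch only; real PaC platforms implement this against live resources, and the 90 percent cutoff mirrors the example in the text.

```python
def ring_rollout(rings, apply_remediation, failure_cutoff=0.9):
    """Roll a policy remediation out ring by ring.

    rings: list of rings, each a list of deployment identifiers.
    apply_remediation: callable returning True on success, False on failure.
    Halts if more than failure_cutoff (here 90 percent) of a ring fails,
    limiting the blast radius of a bad policy.
    """
    for ring in rings:
        results = [apply_remediation(deployment) for deployment in ring]
        failure_rate = results.count(False) / len(results)
        if failure_rate > failure_cutoff:
            return f"halted: {failure_rate:.0%} failures in ring"
    return "rollout complete"
```

Early rings are typically small, low-risk environments; a halted rollout never reaches the later, larger rings.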

Implement role-specific observability and logging

To support your applications and infrastructure, you need observability and logging across your entire stack.

Illustration of a Grafana dashboard showing observability and logging.

Requirements differ per role. For example, platform engineering and operations teams require dashboards to review the health and capacity of the infrastructure, with suitable alerts. Developers require application metrics, logs, and traces for troubleshooting, and customized dashboards that show application and infrastructure health. A problem either of these roles might encounter is cognitive overload from too much information, or knowledge gaps due to a lack of useful information.

To resolve these challenges, consider the following:

  • Standards: Apply logging standards to make it easier to create and reuse standardized dashboards and simplify ingestion processing through something like the OpenTelemetry observability framework.
  • Permissions: Provide team or application-level dashboards using something like Grafana to provide rolled-up data for anyone interested, along with a facility for trusted members of application teams to securely access logs when needed.
  • Retention: Retaining logs and metrics can be expensive, and can create unintended risks or compliance violations. Establish retention defaults and publish them as a part of your start right guidance.
  • Monitor resource limits: Operations teams should be able to identify and track any limitations for a given type of resource. These limitations should be factored into IaC templates or policies using tools like Azure Policy. Operations should then proactively monitor using dashboards in something like Grafana and expand shared resources where automated scaling isn't possible or enabled. For example, monitor the number of K8s cluster nodes for capacity as apps are onboarded and modified over time. Alerting is needed, and these definitions should be stored as code so they can be programmatically added to resources.
  • Identify key capacity and health metrics: Monitor and alert on OS and shared resource metrics (for example, CPU, memory, and storage) to detect starvation, with metrics collection using something like Prometheus or Azure Container Insights. You can monitor sockets/ports in use, network bandwidth consumption by chatty apps, and the number of stateful applications hosted on the cluster.
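The threshold-based alerting described above reduces to comparing collected metrics against limits defined as code. The sketch below shows that shape; metric names and thresholds are hypothetical, and a real setup would express these as, say, Prometheus alerting rules rather than Python.

```python
def capacity_alerts(metrics, thresholds):
    """Compare collected metrics against alert thresholds defined as code.

    metrics: current values, e.g., scraped from a collector like Prometheus.
    thresholds: the limits stored as code alongside the resource definitions.
    Returns the list of alerts that should fire.
    """
    alerts = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value >= limit:
            alerts.append(f"{name} at {value} breaches threshold {limit}")
    return alerts
```

Because the thresholds live in code, they can be programmatically attached to every new shared resource as apps are onboarded.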

Build in security with the principle of least privilege, unified security management, and threat detection

Security is required at every layer, from code to container to cluster to cloud/infrastructure. These are the minimum recommended security steps:

  • Follow the principle of least privilege.
  • Unify your DevOps security management across multiple pipelines.
  • Ensure contextual insights are visible so you can identify and remediate your most critical risks.
  • Enable detection and response to modern threats across your cloud workloads at runtime.

To help resolve problems in this area, you need to evaluate tools that work across your engineering systems, applications, resources, and services across clouds and hybrid environments (for example, Microsoft Defender for Cloud). Beyond application security, evaluate the following:

Permissions requirements can differ by environment. For example, in some organizations, individual teams aren't allowed to access production resources, and new applications can't automatically deploy until reviews are complete. However, automated resource and app deployment and access to clusters for troubleshooting might be permitted in dev and test environments.

Managing identity access to services, applications, and infrastructure at scale can be challenging. Identity providers create, maintain, and manage identity information. Your plan should include authentication services for applications and services that integrate with role-based access control (RBAC) systems at scale. For example, you can use Microsoft Entra ID to provide authentication and authorization at scale for Azure services like Azure Kubernetes Service without needing to set up permissions directly on every individual cluster.

Applications might need access to an identity to access cloud resources like storage. You need to review requirements and assess how your identity provider can support this in the most secure way possible. For example, within AKS, cloud native apps can utilize a workload identity that federates with Microsoft Entra ID to allow containerized workloads to authenticate. This approach allows applications to access cloud resources without secret exchanges within application code.

Reduce costs by identifying workload owners and tracking resources

Managing cost is another part of platform engineering. To properly manage your application platform, you need a way to identify workload owners. You want a way to get an inventory of resources that maps to owners through a particular set of metadata. For example, within Azure, you can use AKS labels and Azure Resource Manager tags, along with concepts like projects in Azure Deployment Environments, to group your resources at different levels. For this to work, the chosen metadata must include mandatory properties (enforced using something like Azure Policy) when workloads and resources are deployed. This helps with cost allocation, solution resource mapping, and ownership identification. Consider running regular reports to track orphaned resources.
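The owner-mapping and orphaned-resource reporting above can be sketched as a grouping pass over tagged resources. This is an illustrative sketch; the tag key and resource shapes are hypothetical, and in Azure you would query this via Azure Resource Graph rather than in-memory lists.

```python
def inventory_by_owner(resources, owner_key="owner"):
    """Group resources by an owner tag for cost allocation and reporting.

    Resources missing the mandatory owner tag are flagged as orphaned,
    feeding the regular orphaned-resource report.
    """
    by_owner, orphaned = {}, []
    for res in resources:
        owner = res.get("tags", {}).get(owner_key)
        if owner:
            by_owner.setdefault(owner, []).append(res["name"])
        else:
            orphaned.append(res["name"])
    return by_owner, orphaned
```

Enforcing the tag at deployment time (for example, with Azure Policy) keeps the orphaned list short; the report catches anything that slips through.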

Beyond tracking, you might need to assign cost to individual application teams for their resource usage, using this same metadata with cost management systems like Microsoft Cost Management. While this method tracks resources provisioned by the application teams, it doesn't cover the cost of shared resources such as your identity provider, logging and metric storage, and networking bandwidth consumption. For shared resources, you can divide the operational costs equally among the individual teams, or provide dedicated systems (for example, logging storage) where there's nonuniform consumption. Some shared resource types can provide insights on resource consumption; for example, Kubernetes tools such as OpenCost or Kubecost can help.
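The two split strategies above (equal division, or proportional division when per-team usage data is available from something like OpenCost) can be sketched as:

```python
def allocate_shared_cost(total_cost, usage_by_team):
    """Split a shared resource's cost across teams.

    With usage data (e.g., from a tool like OpenCost), split proportionally;
    when no usage is recorded, fall back to an equal split.
    """
    total_usage = sum(usage_by_team.values())
    if total_usage == 0:
        share = total_cost / len(usage_by_team)
        return {team: share for team in usage_by_team}
    return {team: total_cost * use / total_usage
            for team, use in usage_by_team.items()}
```

The team names and numbers are purely illustrative; the point is that the same owner metadata drives both the tracking and the cross-charge.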

You should also look for cost analysis tooling where you can review current spending. For example, in Azure portal there are cost alerting and budgets alerts that can track consumption of resources in a group and send notifications when you hit preset thresholds.

Decide when and where to invest

If you have more than one application platform, it can be tricky to decide when and where to invest in improvements that solve problems like high costs or poor observability. If you're starting fresh, the Azure Architecture Center has several potential patterns for you to evaluate. But beyond that, here are a few questions to consider as you begin to plan what you want to do:

Question: Do you want to adapt your existing application platform, start fresh, or use a combination of these approaches?
Tips: Even if you're happy with what you have now or are starting fresh, think about how to adapt to change over time. Immediate changes rarely work. Your application platforms are a moving target; your ideal system changes as time passes. Factor this thinking and any related migration plans into your go-forward design.

Question: If you want to change what you're doing today, what products, services, or investments are you happy with?
Tips: As the saying goes, "if it isn't broken, don't fix it." Don't change things without a reason to do so. However, if you have any home-grown solutions, consider whether it's time to move towards an existing product to save on long-term maintenance. For example, if you're operating your own monitoring solution, do you want to remove that burden from your ops team and migrate to a managed product?

Question: Where do you see the most change happening over time? Are any of these in areas that are common to all (or most) of your organization's app types?
Tips: Areas that you or your internal customers aren't happy with and that aren't likely to change frequently are great places to start. These have the biggest return on investment over the long term. This can also help you work out how to facilitate migrating to a new solution. For example, app models tend to be fluid, but log analysis tools tend to have a longer shelf life. You can also start with new projects and applications while you confirm that the change of direction has the desired returns.

Question: Are you investing in custom solutions in areas with the highest value-add? Do you feel strongly that a unique app infrastructure platform capability is part of your competitive advantage?
Tips: If you've identified gaps, before building something custom, consider which areas vendors are most likely to invest in, and focus your custom efforts elsewhere. Start by thinking of yourself as an integrator rather than a custom app infrastructure or app model provider. Anything you build has to be maintained, and maintenance dwarfs up-front costs over the long term. If you feel an urgent need to custom-build a solution in an area you suspect vendors will cover long term, plan for sunsetting or long-term support. Your internal customers will typically be as happy (if not happier) with an off-the-shelf product as with a custom one.

Adapting your existing application platform investments can be a good way to get going. When you make updates, consider starting with new applications to simplify piloting ideas before any kind of roll-out. Factor in this change through IaC and application templating. Invest in custom solutions for your unique needs in high impact, high value areas. Otherwise, try to use an off-the-shelf solution. As with engineering systems, focus on automating provisioning, tracking, and deployment rather than assuming one rigid path to help you manage change over time.