This article describes the considerations for an Azure Kubernetes Service (AKS) cluster that's configured in accordance with the Payment Card Industry Data Security Standard (PCI-DSS 3.2.1).
This article is part of a series. Read the introduction.
Like any cloud solution, a PCI workload is subject to network, identity, and data threats. Common examples of sources that take advantage of workload and system vulnerabilities are viruses or software updates that produce undesirable results. Detect threats early and respond with mitigation in a timely manner. Build critical alerts for workload activities and extend those alerts to core system processes. Antivirus or file-integrity monitoring (FIM) tools must always be running. Have an accountable response plan and a team that investigates the alerts and takes action.
The guidance and the accompanying implementation build on the AKS baseline architecture. That architecture is based on a hub-and-spoke topology. The hub virtual network contains the firewall to control egress traffic, gateway traffic from on-premises networks, and a third network for maintenance. The spoke virtual network contains the AKS cluster that provides the cardholder data environment (CDE) and hosts the PCI DSS workload.
GitHub: Azure Kubernetes Service (AKS) Baseline Cluster for Regulated Workloads demonstrates a regulated infrastructure. The implementation illustrates the setup of security tooling at various phases of the architecture and development lifecycle. This includes examples of bring-your-own in-cluster security agents and Azure-provided security tooling, for instance Microsoft Defender for Cloud.
Maintain a Vulnerability Management Program
Requirement 5—Protect all systems against malware and regularly update anti-virus software or programs
AKS feature support
AKS doesn't behave like a traditional application host. Node VMs in an AKS cluster have limited exposure and are designed to not be accessed directly. Because node VMs do not equate to traditional VMs, you can't use common VM tools. So, the recommendations in this section are applied through native Kubernetes constructs. Applying these requirements directly at the VM level might cause your cluster to be out of support.
You'll have to deploy antimalware software of your choice in DaemonSets that will run in a pod on every node.
Make sure the software is specialized in Kubernetes and containers. There are several third-party software options. Popular choices include Prisma Cloud and Aquasec. There are also open-source options such as Falco. It's your responsibility to have processes in place that keep the third-party software up to date. Monitoring and alerting of the solutions is also your responsibility.
|Requirement 5.1|Deploy anti-virus software on all systems commonly affected by malicious software (particularly personal computers and servers).|
|Requirement 5.2|Ensure that all anti-virus mechanisms are maintained as follows:|
|Requirement 5.3|Ensure that anti-virus mechanisms are actively running, and cannot be disabled or altered by users, unless specifically authorized by management on a case-by-case basis for a limited time period.|
|Requirement 5.4|Ensure that security policies and operational procedures for protecting systems against malware are documented, in use, and known to all affected parties.|
Requirement 6—Develop and maintain secure systems and applications
AKS feature support
Much like other Azure services, AKS follows Microsoft SDL (Security Development Lifecycle) processes for security throughout the phases of the development process. Various components are scanned starting in the early stages of development and security gaps are covered as early as possible.
AKS images follow a FedRAMP SLA approach, which requires vulnerabilities in images to be patched within 30 days. To enforce this requirement, all images are sanitized through a DevSecOps pipeline.
Weekly, AKS provides new images for the node pools. It's your responsibility to apply them to ensure patching and updating of Virtual Machine Scale Sets worker nodes. For more information, see Azure Kubernetes Service (AKS) node image upgrade.
For the AKS control plane, AKS installs or upgrades the security patches. They're updated every 24 hours.
The AKS control plane and worker nodes are hardened against Center for Internet Security (CIS) benchmarks, specifically AKS CIS, Ubuntu CIS, and Windows CIS.
AKS is integrated with Azure Container Registry (ACR). Use ACR with continuous scanning features in Microsoft Defender for Cloud to identify vulnerable images and applications at various risk levels. For information about image scan and risk control, see Microsoft Defender for Containers.
|Requirement 6.1|Establish a process to identify security vulnerabilities, using reputable outside sources for security vulnerability information, and assign a risk ranking (for example, as "high", "medium", or "low") to newly discovered security vulnerabilities.|
|Requirement 6.2|Ensure that all system components and software are protected from known vulnerabilities by installing applicable vendor-supplied security patches. Install critical security patches within one month of release.|
|Requirement 6.3|Develop internal and external software applications (including web-based administrative access to applications) securely.|
|Requirement 6.4|Follow change control processes and procedures for all changes to system components.|
|Requirement 6.5|Address common coding vulnerabilities in software-development processes.|
|Requirement 6.6|For public-facing web applications, address new threats and vulnerabilities on an ongoing basis and ensure these applications are protected against known attacks.|
|Requirement 6.7|Ensure that security policies and operational procedures for developing and maintaining secure systems and applications are documented, in use, and known to all affected parties.|
Deploy anti-virus software on all systems commonly affected by malicious software, in particular personal computers and servers.
It's your responsibility to protect the workload, the infrastructure, and the deployment pipelines by choosing an appropriate antimalware software.
Because access to AKS node VMs is restricted, protect the system at layers that can inject malware to node VMs. Include detection and prevention at cluster nodes, container images, and runtime interactions with the Kubernetes API server. In addition to the cluster, protect these components that interact with cluster and can have antivirus software installed in a traditional way:
- Jump boxes
- Build agents
Align your scanning activities with the Security Development Lifecycle (SDL). Following the SDL makes sure that scanning of the various components of the architecture starts in the early stages of development and that security gaps are covered as early as possible.
Make sure that anti-virus programs are capable of detecting, removing, and protecting against all known types of malicious software.
Learn about the feature set of each software offering and the depth of scanning that it can do. The software should block common threats and monitor new threats. Make sure the software is regularly updated, tested, and replaced if it's found unsuitable. Consider software developed by reputable vendors.
Monitoring tools that detect cluster vulnerabilities.
In AKS, you can't run traditional agent-based VM solutions directly on node VMs. You'll have to deploy antimalware software in DaemonSets that will run in a pod on every node.
Choose software that's specialized in Kubernetes and containers. There are several third-party software options. Popular choices include Prisma Cloud and Aquasec. There are also open-source options such as Falco.
When deployed, they run as agents in the cluster that scan all user and system node pools. Even though AKS uses system node pools for its runtime system binaries, the underlying compute is still your responsibility.
The purpose of running the agent is to detect unusual cluster activities. For example, is an application trying to call the API server? Some solutions generate a log of API calls between pods, generate reports, and generate alerts. Make sure you review those logs and take necessary actions.
Install security agents immediately after bootstrapping of the cluster to minimize unmonitored gaps between the cluster and AKS resource deployment.
Security agents run with high privileges, and they scan everything that runs on the cluster and not just the workload. They must not become a data exfiltration source. Also, supply chain attacks are common for containers. Use defense-in-depth strategies and make sure the software and all the dependencies are trusted.
Also run antivirus software on external assets that participate in cluster operations. Some examples include jump boxes, build agents, and container images that interact with the cluster.
When the agent scans, it shouldn't block or interfere with the critical operations of the cluster, such as by locking files. Misconfiguration might cause stability issues and could render your cluster out of support.
The reference implementation provides a placeholder DaemonSet deployment to run an antimalware agent. The agent will run on every node VM in the cluster. Place your choice of antimalware software in this deployment.
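As a rough illustration of what such a placeholder might contain, the following sketch builds a minimal DaemonSet manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The agent name, namespace, and image reference are hypothetical placeholders and aren't taken from the reference implementation. A real agent usually needs more settings (host mounts, security context, resource limits) that depend on the vendor.

```python
import json


def build_agent_daemonset(name: str, image: str, namespace: str = "security") -> dict:
    """Return a DaemonSet manifest that schedules one agent pod on every node."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Tolerate every taint so the agent also runs on system node pools.
                    "tolerations": [{"operator": "Exists"}],
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


# Hypothetical name and image; replace with your chosen antimalware software.
manifest = build_agent_daemonset(
    "antimalware-agent", "registry.example.com/security/agent:1.0.0"
)
print(json.dumps(manifest, indent=2))
```

The blanket toleration is a design choice made here so that tainted system node pools aren't left unscanned. Narrow it if your agent vendor recommends otherwise.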
Maintaining container safety. Run container-scanning tools in the pipeline to detect threats that might come through container images, such as the CI/CD vulnerability scanning in Microsoft Defender for Containers. Third-party choices include Trivy and Clair. When you're building images, always strive for distroless images. These images only contain the essential binaries in the base Linux image and reduce the surface area for attacks. Use a continuous scanning solution like vulnerability assessment in Microsoft Defender for Containers for ongoing scanning of images already at rest in your repositories.
For systems not commonly targeted or affected by malicious software, perform periodic evaluations to identify and evaluate evolving malware threats to confirm whether they continue to not require anti-virus software.
Common vulnerabilities might affect components outside the cluster. Keep track of security vulnerabilities by watching CVEs and other security alerts from the Azure platform. Check for Azure updates for new features that can detect vulnerabilities and run antivirus solutions on Azure-hosted services.
For example, blob storage should have malware reputation screening to detect suspicious uploads. A new feature, Microsoft Defender for Storage, includes malware reputation screening. Also, consider whether an antivirus solution is required for such a service.
Make certain that all anti-virus mechanisms are maintained as follows:
- Are kept current.
- Perform periodic scans.
- Generate audit logs, which are retained per PCI DSS Requirement 10.7.
- Make sure the cluster is protected against new attacks by using the latest version of antivirus software. There are two types of updates to consider:
  - The antivirus software must keep up with the latest feature updates. One way is to schedule updates as part of your platform updates.
  - Security intelligence updates must be applied as soon as they're available to detect and identify the latest threats. Opt for automatic updates.
- Validate that the vulnerability scans are running as scheduled.
- Retain the logs that are generated as a result of scanning and that indicate healthy and unhealthy components. The recommended retention period is given in Requirement 10.7, which is a year.
- Have a process in place that triages and remediates the detected issues.
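The scan-schedule and retention checks in the preceding list can be automated. The sketch below is illustrative only: the function names are hypothetical, the timestamps are sample values, and the one-year minimum mirrors the retention period cited in the list. How you obtain the last scan time and the configured retention depends on your agent and log store.

```python
from datetime import datetime, timedelta, timezone

MIN_RETENTION_DAYS = 365  # one year, matching the retention period cited above


def scan_is_overdue(last_scan: datetime, now: datetime, max_interval: timedelta) -> bool:
    """Return True when more time than max_interval has passed since the last scan."""
    return now - last_scan > max_interval


def retention_is_sufficient(retention_days: int) -> bool:
    """Return True when scan logs are kept for at least one year."""
    return retention_days >= MIN_RETENTION_DAYS


# Sample values: a scan that is expected daily but last ran two days ago.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
last_scan = datetime(2024, 1, 8, tzinfo=timezone.utc)
print(scan_is_overdue(last_scan, now, timedelta(days=1)))
print(retention_is_sufficient(90))
```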
For information about how Microsoft Defender antivirus updates are applied, see Manage Microsoft Defender antivirus updates and apply baselines.
Anti-virus features should be actively running and can't be disabled or altered by users, except when authorized by management on a case-by-case basis for a limited time period.
You're responsible for setting up monitoring and alerting of the security agent. Build critical alerts for not only the workload but also the core system processes. The agent must always be running. Respond to the alerts raised by the antimalware software.
- Keep a log trail of scanning activities. Make certain that the scanning process doesn't log any cardholder data scraped from disk or memory.
- Set alerts for activities that might cause an unexpected lapse in compliance. The alerts shouldn't be turned off inadvertently.
- Restrict the permissions to modify the deployment of the agent (and other critical security tooling). Keep those permissions separate from the workload deployment permissions.
- Do not deploy workloads if the security agents aren't running as expected.
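One way to enforce the last point is a pipeline gate that refuses to deploy unless the agent's DaemonSet reports a ready pod on every scheduled node. The sketch below assumes you've already fetched the DaemonSet's `status` object (for example, from `kubectl get daemonset <name> -o json`). `desiredNumberScheduled` and `numberReady` are fields of the Kubernetes DaemonSet status. The function name and sample data are hypothetical.

```python
def agents_ready(daemonset_status: dict) -> bool:
    """Return True only when every node that should run the agent has a ready pod."""
    desired = daemonset_status.get("desiredNumberScheduled", 0)
    ready = daemonset_status.get("numberReady", 0)
    # Zero desired pods means nothing is monitored, so treat that as not ready.
    return desired > 0 and ready == desired


# Sample status for a three-node cluster where one agent pod isn't ready.
sample_status = {"desiredNumberScheduled": 3, "numberReady": 2}
if not agents_ready(sample_status):
    print("Blocking workload deployment: security agents are not fully running.")
```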
Verify that security policies and operational procedures for protecting systems against malware are documented, used, and communicated to all affected parties.
It's critical that you maintain thorough documentation about the process and policies, especially details about the antivirus solution used to protect the system. Include information such as where in the product cycle the security intelligence updates are maintained, the frequency of the scans, and information about real-time scanning capabilities.
Have retention policies for storing logs. You might want to have long-term storage for compliance purposes.
Maintain documentation about standard operating procedures for assessing and remediating issues. People who operate regulated environments must be educated, informed, and incentivized to support the security assurances. This is important for people who are part of the approval process from a policy perspective.
Establish a process to identify security vulnerabilities, using reputable outside sources for security vulnerability information, and assign a risk ranking (for example, as high, medium, or low) to newly discovered security vulnerabilities.
Have processes that check the detected vulnerabilities and rank them appropriately. Microsoft Defender for Cloud shows recommendations and alerts based on resource type, severity, and environment. Most alerts have MITRE ATT&CK® tactics that can help you understand the kill chain intent. Make sure you have a remediation plan to investigate and mitigate the problem.
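If your vulnerability sources report CVSS base scores, a simple ranking function can apply the high, medium, and low labels consistently. The thresholds below follow the CVSS v3 qualitative severity bands, with the critical band folded into high. That folding, and the treatment of everything below 4.0 as low, are assumptions of this sketch. Adjust the mapping to your own risk policy.

```python
def risk_ranking(cvss_score: float) -> str:
    """Map a CVSS base score (0.0 to 10.0) to a high, medium, or low ranking."""
    if not 0.0 <= cvss_score <= 10.0:
        raise ValueError(f"CVSS score out of range: {cvss_score}")
    if cvss_score >= 7.0:
        return "high"  # covers the CVSS v3 High and Critical bands
    if cvss_score >= 4.0:
        return "medium"
    return "low"


for score in (9.8, 5.3, 2.1):
    print(score, risk_ranking(score))
```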
In AKS, you can use Azure Container Registry in combination with continuous scanning to identify vulnerable images and applications at various risk levels. You can view the results in Microsoft Defender for Cloud.
For more information, see Container Registry.
Ensure that all system components and software are protected from known vulnerabilities by installing applicable vendor-supplied security patches. Install critical security patches within one month of release.
To prevent supply chain attacks from third-party vendors, make sure all the dependencies are trusted. It's important that you choose a vendor that's reputable and trusted.
Weekly, AKS provides new images for the node pools. Those images aren't applied automatically. Apply them as soon as they're available. You can update manually or automatically through Node Image Update. For more information, see Azure Kubernetes Service (AKS) node image upgrade.
For the AKS control plane, AKS installs or upgrades the security patches.
Every 24 hours, AKS nodes automatically download and install operating system and security patches, individually. Your firewall must not block this traffic if you want to receive those updates.
Consider enabling reporting capabilities on the security agent to get information about the applied updates. Some security updates require a restart. Be sure to review the alerts and take action to ensure minimum or zero application downtime with those restarts. An open-source option to perform restarts in a controlled manner is Kured (Kubernetes reboot daemon).
Extend the patching process to resources outside the cluster that you provision, such as jump boxes and build agents.
Stay current with the supported AKS versions. If your design uses a version that has reached the end of life, upgrade to a current version. For more information, see Supported AKS versions.
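To track the one-month patching window, you can compare each patch's release date with the date it was applied, or with today's date if it's still pending. This is a minimal sketch. The function name is hypothetical, and "one month" is approximated as 30 days, which is an assumption you should align with your own policy.

```python
from datetime import date, timedelta
from typing import Optional

PATCH_WINDOW_DAYS = 30  # "within one month of release", approximated as 30 days


def patch_overdue(released: date, today: date, applied: Optional[date] = None) -> bool:
    """Return True when a patch was applied late or is still pending past the window."""
    deadline = released + timedelta(days=PATCH_WINDOW_DAYS)
    if applied is not None:
        return applied > deadline
    return today > deadline


# Sample dates: a patch released on March 1 and still pending on April 15.
print(patch_overdue(date(2024, 3, 1), today=date(2024, 4, 15)))
```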
Develop internal and external software applications (including web-based administrative access to applications) securely, as follows:
- In accordance with PCI DSS (for example, secure authentication and logging)
- Based on industry standards and/or best practices.
- Incorporating information security throughout the software-development life cycle that applies to all software developed internally, including bespoke or custom software developed by a third party.
Integrate and prioritize security choices as part of the workload life cycle and operations.
Several industry frameworks map to the life cycle, such as the NIST framework. NIST functions—Identify, Protect, Detect, Respond, and Recover—provide strategies for preventive controls in each phase.
Microsoft SDL (Security Development Lifecycle) provides best practices, tools, and processes for security throughout the phases of the development process. Microsoft SDL practices are followed for all Azure services, including AKS. We also follow the Operational Security Assurance (OSA) framework for operating cloud services. Ensure that you have a similar process. These practices are published to help you secure your applications.
Remove development, test and/or custom application accounts, user IDs, and passwords before applications become active or are released to customers.
As part of cluster creation, multiple local Kubernetes users are created by default. Those users can't be audited because they don't represent a unique identity. Some of them have high privileges. Disable those users by using the Disable local accounts feature of AKS.
For other considerations, refer to the guidance in the official PCI-DSS 3.2.1 standard.
Review custom code prior to release to production or customers in order to identify any potential coding vulnerability (using either manual or automated processes) to include the following:
- Code changes are reviewed by individuals other than the originating code author, and by individuals knowledgeable about code-review techniques and secure coding practices.
- Code reviews ensure code is developed according to secure coding guidelines
- Appropriate corrections are implemented prior to release.
- Code-review results are reviewed and approved by management prior to release.
All software installed in the cluster is sourced from your container registry. Similar to the application code, have processes and people scrutinize Azure and third-party images (Dockerfile and OCI). Also:
- Start scanning container images from the initial stages when the cluster is created. Make the scanning process a part of your continuous integration/continuous deployment pipelines.
- Ensure your deployment pipelines are gated in such a way that both cluster bootstrapping images and your workload have passed through a review and/or quarantine gate. Maintain history about how and what processes were used before they were pulled to the cluster.
- Reduce the image size. Usually, images contain more binaries than what's required. Reducing the image size not only has performance benefits but also limits the attack surface. For example, using distroless will minimize the base Linux images.
- Use static analysis tools that verify the integrity of the Dockerfile and the manifests. Third-party options include Dockle and Trivy.
- Only use signed images.
- Understand (and accept) the base image provided by Azure and how it complies with CIS Benchmarks. For more information, see Center for Internet Security (CIS) Benchmarks.
Azure Container Registry with continuous scanning in Microsoft Defender for Cloud will help identify vulnerable images and various risks it can pose to the workload. For more information about image scan and risk control, see Container security.
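The image checks described above can be combined into a single pipeline gate. The sketch below is hypothetical: the `ImageRecord` fields stand in for whatever your signing and scanning tools actually report, and the registry name is a placeholder. It only shows the shape of a gate that rejects images that come from outside your registry, are unsigned, are unscanned, or have high-severity findings.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ImageRecord:
    """Facts about an image, gathered from your registry and scanners (hypothetical)."""
    reference: str
    signed: bool
    scanned: bool
    high_findings: int


def gate_failures(image: ImageRecord, trusted_registry: str) -> List[str]:
    """Return the reasons an image must not be deployed; an empty list means it passes."""
    failures = []
    if not image.reference.startswith(trusted_registry + "/"):
        failures.append("image is not sourced from the trusted registry")
    if not image.signed:
        failures.append("image is not signed")
    if not image.scanned:
        failures.append("image has not been scanned")
    elif image.high_findings > 0:
        failures.append("image has unresolved high-severity findings")
    return failures


candidate = ImageRecord("registry.example.com/web/api:2.4.1", True, True, 0)
print(gate_failures(candidate, "registry.example.com"))
```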
Follow change control processes and procedures for all changes to system components.
Make sure you document change control processes and design the deployment pipelines according to those processes. Include other processes for detecting situations where the processes and actual pipelines do not align.
Requirement 6.4.1, 6.4.2
- Separate development/test environments from production environments, and enforce the separation with access controls.
- Separation of duties between development/test and production environments.
Maintain separate production and pre-production environments and roles that operate in those environments.
Don't use your production cluster for development/test purposes. For example, don't install Bridge to Kubernetes in your production clusters. Use dedicated clusters for non-production workloads.
Make sure that your production environments don't allow network access to pre-production environments, and vice versa.
Don't reuse system identities in pre-production and production environments.
Use Azure AD groups for groups such as cluster administrators or pipeline principals. Don't use generalized or common groups as access controls. Don't reuse those groups between pre-production and production clusters. One way is to use the cluster name (or other opaque identifier) in the group name, to be explicit on memberships.
Use Azure role-based access control (RBAC) roles appropriately between environments. Typically, more roles and rights are assigned in pre-production environments.
Identities in pre-production only (granted to pipelines or software engineering teams) shouldn't be granted access in production. Conversely, any production-only identities (such as pipelines) shouldn't be granted access in pre-production clusters.
Do not use the same user-managed identity for any resource in pre-production and in production. This recommendation applies to all resources that support user-managed identity, not just the resource deployed in your cluster. As a rule, Azure resources that require identities should have their own distinct identity instead of sharing it with other resources.
Use just-in-time (JIT) access for high-privilege access, including on pre-production clusters if possible. Use conditional access policies on both pre-production and production clusters.
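A periodic audit can confirm that no identity or group is shared across environments. The sketch below assumes you've exported the identifiers (for example, managed identity or group object IDs) granted access in each environment. The export step and the sample IDs are hypothetical.

```python
from typing import Iterable, List


def shared_identities(pre_production: Iterable[str], production: Iterable[str]) -> List[str]:
    """Return identifiers that are granted access in both environments, sorted."""
    return sorted(set(pre_production) & set(production))


# Sample identifiers; in practice, export these from your role assignments.
pre_prod_ids = ["id-pipeline-dev", "id-cluster-dev", "id-shared-ops"]
prod_ids = ["id-pipeline-prod", "id-cluster-prod", "id-shared-ops"]

overlap = shared_identities(pre_prod_ids, prod_ids)
if overlap:
    print("Identities reused across environments:", overlap)
```

Any non-empty result points to an identity that should be split into separate pre-production and production identities.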
Production data (live PANs) aren't used for testing or development.
Make sure cardholder data (CHD) doesn't flow into the dev/test environment. Have clear documentation that provides the procedure for moving data from production to dev/test. Removal of real data must be included in that procedure and approved by accountable parties.
Removal of test data and accounts from system components before the system becomes active / goes into production.
Remove default configuration data, sample data, and well-known test data in the system before deploying to production. Do not use cardholder data for test purposes.
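As one safeguard, you can scan test fixtures and exports for values that look like primary account numbers (PANs) before they reach a dev/test environment. The sketch below is a heuristic, not a complete data-discovery tool: it flags digit runs of 13 to 19 characters that pass the Luhn checksum. It doesn't handle numbers formatted with spaces or dashes, and it also flags well-known test card numbers because they're Luhn-valid too. The function names are hypothetical.

```python
import re
from typing import List

# Candidate PANs: standalone runs of 13 to 19 digits.
CANDIDATE = re.compile(r"\b\d{13,19}\b")


def luhn_valid(digits: str) -> bool:
    """Return True when the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_possible_pans(text: str) -> List[str]:
    """Return digit runs in the text that could be card numbers."""
    return [match for match in CANDIDATE.findall(text) if luhn_valid(match)]


sample = "card=4111111111111111 ref=1234567890123"
print(find_possible_pans(sample))
```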
Change control procedures for the implementation of security patches and software modifications must include the following:
- 6.4.5.1 Documentation of impact.
- 6.4.5.2 Documented change approval by authorized parties.
- 6.4.5.3 Functionality testing to verify that the change does not adversely impact the security of the system.
- 6.4.5.4 Back-out procedures.
These guidance points map to the preceding requirements:
Document the infrastructure changes that are expected as a result of the security patches and software modifications. That process is easier with the infrastructure-as-code (IaC) approach. For example, with an Azure Resource Manager template (ARM template) for deployment, you can preview the changes with a what-if operation. For more information, see ARM template deployment what-if operation for your infrastructure changes.
Implement gates in your deployment pipelines that validate approval of changes for regular deployments. Document the justification for emergency deployments where gates might have been bypassed.
Define the levels and depth of changes. Make sure the team agrees on the definition of significant changes, as opposed to minor changes. If practical, automate the discovery of some of those changes. Reviewers for the workload, infrastructure, and pipeline must have a clear understanding of the levels and validate against those criteria.
Test the security affordances. Make sure synthetic transactions are testing security (both allow and deny) concerns. Also make sure that those synthetic tests are running in pre-production environments.
Have a back-out process in case a security fix has unexpected results. A common strategy is to deploy a prior state by using blue-green deployments. For workloads, including databases, have a strategy that works for your specific topology and is scoped to your units of deployment.
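A deployment gate can verify that each change record carries the four elements listed in the requirement before the change proceeds. The record shape below is hypothetical. In practice, these fields would come from your change-management or pull-request system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeRecord:
    """The four change-control elements from the requirement (hypothetical field names)."""
    impact: str = ""
    approved_by: str = ""
    security_test_evidence: str = ""
    back_out_plan: str = ""


def missing_elements(record: ChangeRecord) -> List[str]:
    """Return the names of required elements that are empty or whitespace only."""
    return [name for name, value in vars(record).items() if not value.strip()]


change = ChangeRecord(
    impact="Node image upgrade restarts all nodes in the user node pool.",
    approved_by="change-advisory-board",
    security_test_evidence="",
    back_out_plan="Redeploy the previous node image version.",
)
print(missing_elements(change))
```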
Address common coding vulnerabilities in software-development processes as follows:
- Train developers at least annually in up-to-date secure coding techniques, including how to avoid common coding vulnerabilities.
- Develop applications based on secure coding guidelines.
It's critical that application teams and operations teams are educated, informed, and incentivized to support scanning activities of the workload and infrastructure.
For public-facing web applications, address new threats and vulnerabilities on an ongoing basis. Make sure these applications are protected against known attacks by either of the following methods:
Review public-facing web applications using manual or automated application vulnerability security assessment tools or methods. Perform a vulnerability assessment at least annually and after any changes.
This assessment isn't the same as the vulnerability scans performed as part of Requirement 11.2.
Install an automated solution that detects and prevents web-based attacks. For example, a web-application firewall. Deploy in front of public-facing web applications and actively evaluate all traffic.
Have checks in place to detect traffic coming from the public internet by using a web application firewall (WAF). In this architecture, Azure Application Gateway checks all incoming traffic by using its integrated WAF. The WAF is based on Core Rule Set (CRS) from the Open Web Application Security Project (OWASP). If technical controls aren't in place, have compensating controls. One way is through manual code inspection.
Make sure you're using the latest versions of the rule set, and apply rules that are relevant to your workload. The rules should run in Prevention mode. You can enforce that requirement by adding an Azure Policy instance that checks if the WAF is enabled and is operating in that mode.
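Alongside an Azure Policy assignment, a pipeline step can check an exported WAF configuration before deployment. The sketch below is illustrative only. The key names `enabled` and `firewallMode` are modeled on the Application Gateway WAF configuration, but treat them as assumptions and confirm them against your exported template or WAF policy, which might use a different shape.

```python
from typing import List


def waf_violations(waf_config: dict) -> List[str]:
    """Return reasons the WAF configuration doesn't meet the expected protection status."""
    problems = []
    if not waf_config.get("enabled", False):
        problems.append("WAF is not enabled")
    if waf_config.get("firewallMode") != "Prevention":
        problems.append("WAF is not in Prevention mode")
    return problems


# Sample configuration: enabled, but only detecting rather than blocking.
sample_config = {"enabled": True, "firewallMode": "Detection"}
print(waf_violations(sample_config))
```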
Keep logs generated by the Application Gateway WAF to get details about the detected threats. Fine-tune the rules as needed.
Conduct penetration testing focused on the application code. That way, practitioners who aren't part of the application team will find security gaps (such as SQL injection and directory traversal) by gathering information, analyzing vulnerabilities, and reporting. In this exercise, the practitioners might need access to sensitive data. To make sure intent isn't misused, follow the guidance provided in Penetration Testing Rules of Engagement.
Make sure that security policies and operational procedures for developing and maintaining secure systems and applications are documented, in use, and known to all affected parties.
It's critical that you maintain thorough documentation about the processes and policies. Your teams should be trained to prioritize security choices as part of the workload life cycle and operations.
Microsoft SDL provides best practices, tools, and processes for security throughout the phases of the development process. Microsoft SDL practices are followed strictly internally in how we build software at Microsoft. We also follow the Operational Security Assurance (OSA) framework for operating cloud services. These practices are published to help you secure your applications.
Keep thorough documentation for penetration testing that describes the scope of the test, triage processes, and remediation strategy for the detected issues. If an incident happens, incorporate the evaluation of Requirement 6 as part of root-cause analysis. If gaps are detected (for example, an OWASP rule violation is detected), close those gaps.
In the documentation, have clear guidelines about the expected WAF protection status.
People who operate regulated environments must be educated, informed, and incentivized to support the security assurances. This is important for people who are part of the approval process from a policy perspective.
Restrict access to cardholder data by business need to know. Identify and authenticate access to system components. Restrict physical access to cardholder data.