Agentic CLI for AKS frequently asked questions

This article provides answers to some of the most common questions about the agentic CLI for Azure Kubernetes Service (AKS).

What is the agentic CLI for AKS?

The agentic CLI for AKS is an AI-powered command-line tool designed to help AKS users troubleshoot cluster issues efficiently. It analyzes telemetry signals (logs, metrics, events), correlates them across infrastructure and workloads, and provides actionable insights. The agent takes natural language queries as input and returns diagnostic summaries, root cause analyses, and remediation suggestions. The agentic CLI doesn't include the AI models, so you need to provide your own large language model (LLM) API keys for the agent to work.

What can the agentic CLI for AKS do?

The agentic CLI for AKS acts as a local assistant that interprets natural language queries, runs diagnostic commands, and returns actionable insights. It integrates with AKS-native tools and telemetry sources such as Kubernetes events, logs, Inspektor Gadget, and the Azure and AKS APIs. Each of these sources is enabled natively as a toolset in az aks agent.

The agent respects Azure role-based access control (RBAC) and identity controls because it inherits the user's permissions from the Azure CLI. It operates in read-only mode by default. You can configure your AI provider (for example, OpenAI, Azure OpenAI, or Anthropic) and the model. You can also configure the agent to include the raw toolset outputs in its response.

The outputs of az aks agent include:

  • An AI-synthesized summary response to the user query.
  • Root cause analysis with supporting evidence.
  • Remediation suggestions tailored to AKS best practices.
  • The diagnostic traces and tool outputs.

What are the intended uses for the agentic CLI for AKS?

The agentic CLI for AKS has the following intended uses:

  • Human-in-the-loop interactions with your AKS clusters to help you efficiently detect, diagnose, and resolve issues.
  • Read-only interactions with the Kubernetes and AKS APIs. You can get resource information, understand the health of AKS cluster resources, and follow general Kubernetes and AKS best practices.

The agentic CLI for AKS isn't meant to be used as a generic coding or AI agent beyond the scope of AKS interactions. It can't access the internet to answer generic questions.

The agentic CLI for AKS is optimized for AKS-specific scenarios. It integrates with tools like kubectl, the Azure CLI, Inspektor Gadget, and Azure Monitor, but it can make mistakes. The agent might occasionally miss subtle signals, misinterpret noisy telemetry, or suggest mitigations that require human validation. For example, it might misattribute a Domain Name System (DNS) failure to a network policy when the root cause is a misconfigured upstream DNS server. This scenario might occur especially if telemetry is incomplete or permissions are restricted.

To avoid automation bias, treat the agent's output as a helpful starting point and not a final verdict. The agent excels at surfacing likely causes and guiding investigation, but human oversight is essential, especially in complex or high-stakes environments.

As for AI models, we recommend that you use a model deployed in Azure OpenAI, such as GPT-4o or o3. You can also use a model directly from the OpenAI API platform, or any LLM provider supported by Open API specifications, such as Anthropic and Gemini.

How was the agentic CLI for AKS evaluated? What metrics are used to measure performance?

The agentic CLI for AKS is being evaluated through a combination of internal testing and programmatic evaluations designed to ensure that its diagnostic capabilities are accurate, relevant, and meaningful.

For programmatic evaluations, we measure standard responsible AI metrics such as groundedness, resistance to user prompt injected attack (UPIA) and cross-domain prompt injected attack (XPIA) jailbreaks, harmful content, and conversation quality (such as coherence and fluency).

These tests help us identify gaps in reasoning, tooling integration, and prompt execution. A core metric for success is the accuracy of the agent's diagnosis and the relevance of its recommendations. Did the agent correctly identify the root cause and suggest actionable, context-aware mitigations?

We conduct internal bug bashes and red-team exercises to rigorously test the agent's behavior across a range of scenarios, including node health degradation, DNS failures, upgrade disruptions, and pod scheduling problems.

We recognize the dynamic nature of agentic-AI interactions, and we welcome your feedback as part of the preview. You can share feedback directly with us at aksagentcli@service.microsoft.com. You can also open a GitHub issue.

What are the limitations of the agentic CLI for AKS? How can I minimize the effect of these limitations when I use the system?

The agentic CLI for AKS is purpose-built for diagnosing and resolving issues in AKS clusters, but it has a few important limitations that you should be aware of to ensure effective and responsible use:

  • The agent's ability to access and analyze data is directly dependent on your permissions and the availability of telemetry. If you lack sufficient access rights, or if telemetry sources such as logs, metrics, or events are missing or incomplete, the agent might not be able to generate accurate or complete diagnostics.
  • The system is subject to token limits when processing large datasets, such as time-series metrics. These limitations can constrain the depth or breadth of analysis in complex troubleshooting scenarios.
  • In its current minimum viable product (MVP) state, the agentic CLI offers limited support for managed Azure experiences. Certain workflows, such as Azure Monitor alerts integration, might not be fully supported.
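
To illustrate the token-limit constraint in the list above, the following minimal Python sketch estimates how many tokens a time series would consume in a prompt and averages samples into coarser buckets until the series fits a budget. It is not how the agentic CLI handles metrics internally. The four-characters-per-token ratio, the function names, and the doubling strategy are all assumptions made for illustration.

```python
from statistics import mean

# Assumption: roughly 4 characters per token; real tokenizers vary by model.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Crude token estimate for a piece of text (illustrative heuristic only)."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def render(series: list[tuple[int, float]]) -> str:
    """Serialize (timestamp, value) samples as one 'timestamp,value' line each."""
    return "\n".join(f"{ts},{value:.2f}" for ts, value in series)


def downsample(series: list[tuple[int, float]], bucket_size: int) -> list[tuple[int, float]]:
    """Average consecutive samples in fixed-size buckets, keeping each bucket's first timestamp."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be >= 1")
    return [
        (series[i][0], mean(v for _, v in series[i:i + bucket_size]))
        for i in range(0, len(series), bucket_size)
    ]


def fit_to_budget(series: list[tuple[int, float]], max_tokens: int) -> list[tuple[int, float]]:
    """Double the bucket size until the rendered series fits the token budget."""
    bucket_size = 1
    reduced = series
    while estimate_tokens(render(reduced)) > max_tokens and len(reduced) > 1:
        bucket_size *= 2
        reduced = downsample(series, bucket_size)
    return reduced
```

Averaging trades resolution for coverage: short spikes can disappear from the reduced series, which is one way a token limit can narrow the depth of an analysis.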

To minimize the effect of these limitations, you can take several proactive steps:

  • Ensure that required diagnostic tools, such as Azure Monitor, are properly configured to help the agent access richer telemetry and perform more comprehensive diagnostics.
  • Extend the capabilities of the agentic CLI by using it with Azure Model Context Protocol (MCP) or AKS MCP servers. For more information, see Integrate the AKS MCP server with the agentic CLI for AKS.
  • Use the latest-generation reasoning or general-purpose models, such as GPT-4o and o3, for the best results. The agentic CLI for AKS doesn't include AI models, so you supply them yourself.

What operational factors and settings allow for effective and responsible use of the agentic CLI for AKS?

To use the agentic CLI for AKS effectively and responsibly, several operational settings play a key role. The agent is designed to operate in read-only mode by default, which ensures safe diagnostics without making changes to the cluster. When write operations are needed, such as deploying debug pods or executing remediation steps, they require explicit user approval to maintain user control and minimize unintended effects.
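
The read-only default with explicit approval for writes can be pictured as a small gate in front of command execution. The sketch below is a hypothetical illustration in Python, not the agent's actual implementation. The allowlist of kubectl verbs, the function names, and the approve callback are assumptions.

```python
# Hypothetical allowlist of read-only kubectl verbs; not the agent's real policy.
READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain", "version", "api-resources"}


def is_read_only(command: list[str]) -> bool:
    """Return True when a kubectl command line uses an allowlisted read-only verb.

    The verb is expected directly after 'kubectl'. Anything else, including
    commands with flags before the verb, is treated as potentially mutating.
    """
    if len(command) < 2 or command[0] != "kubectl":
        return False
    return command[1] in READ_ONLY_VERBS


def authorize(command: list[str], approve) -> bool:
    """Allow read-only commands; ask the approve callback about everything else."""
    if is_read_only(command):
        return True
    return bool(approve(" ".join(command)))
```

In an interactive session, approve could be a function that prompts the user and returns True only on an explicit yes. Failing closed on anything unrecognized keeps the default behavior safe at the cost of a few extra prompts.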

The agent runs locally on your machine and supports bring-your-own AI providers, so you configure your own LLM API keys. This setup ensures that you can use your organization's approved AI providers and endpoints. Data processing happens locally, and the only data that leaves your machine goes to the AI model that you connect, which helps preserve data privacy and align with enterprise security standards.

The agent also offers configurable verbosity settings, which you can use to toggle between concise summaries and detailed diagnostic outputs depending on your needs. This flexibility supports the gathering of both quick insights and full transparency into the agent's reasoning and tool execution.

Integration with Azure identity and RBAC further ensures that the agent accesses only resources that you're authorized to view. This restriction simplifies setup and enforces secure access boundaries. Together, these settings create a secure, privacy-conscious, and user-controlled environment for troubleshooting AKS clusters with AI assistance.

How can I provide feedback or get help with the agentic CLI for AKS?

You can provide feedback or get help with the agentic CLI for AKS through several channels:

  • GitHub issues and pull requests on the agentic CLI repository.
  • Internal channels during the preview phase.
  • Azure support tickets or direct engagement with the AKS product team.

What are plugins, and how does the agentic CLI for AKS use them?

In the context of the agentic CLI for AKS, plugins are modular extensions that enhance the agent's diagnostic capabilities by integrating external tools, data sources, and domain-specific logic into its troubleshooting workflows. These plugins allow the agent to go beyond static command execution and incorporate dynamic, scenario-aware reasoning. The agent supports the following types of plugins:

  • Toolset integrations: You can extend the capabilities of the agent with toolsets that connect to observability platforms like Prometheus, Datadog, and Azure Monitor. These toolsets expose metrics, logs, and alerts that the agent can query and analyze in real time. For instance, a Prometheus toolset might allow the agent to fetch CPU and memory usage trends for a failing pod. An Azure Monitor integration could surface recent alerts or activity logs relevant to a node health issue.
  • MCP servers: Model Context Protocol servers act as intermediaries that expose diagnostic tools and prompt templates to AI agents. In the agentic CLI for AKS, MCP servers provide structured access to Kubernetes and Azure resources. The agent can then run commands like kubectl describe and az aks show or even deploy debug pods. These servers also help standardize how tools are invoked and how data is returned, which makes it easier to scale the agent's capabilities across environments.

What data can the agentic CLI for AKS provide to plugins? What permissions do plugins have?

All plugins are pull only. They allow the agentic CLI for AKS to pull data from various sources or to use custom runbooks, which it embeds in the LLM prompts to improve its diagnostic capabilities. The only outward data flow is to the AI models that you connect to the agentic CLI for AKS.

What kinds of issues might arise when I use the agentic CLI for AKS enabled with plugins?

When you use the agentic CLI for AKS with plugins, several types of issues might arise that can affect the reliability or accuracy of the troubleshooting experience.

One common challenge is the incorrect invocation of tools because of misconfigured prompts. Plugins often rely on prompt templates to guide the AI's reasoning and tool selection. Even small errors in prompt logic or structure can lead to the wrong tools being triggered or the right tools being used in the wrong context. The result might be misleading diagnostics or incomplete investigations.

Another risk is the generation of fabricated or incorrect outputs, especially when plugins return incomplete, outdated, or ambiguous data. In such cases, the AI might attempt to "fill in the gaps" with plausible-sounding but incorrect explanations. Errors can also occur when telemetry is missing or when the plugin is used in a cluster configuration it doesn't support. For example, a private cluster might lack access to certain APIs or tools.

To mitigate these risks, the agentic CLI for AKS includes several safeguards. Verbose logging and error reporting can help you trace exactly what tools were invoked, what data was returned, and how the AI interpreted it. The reports make it easier to spot and correct issues. You can also manually override or disable specific plugins if you suspect they're causing problems or returning unreliable data.
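
As a rough picture of what such a trace captures, the following Python sketch wraps a tool function so that each call logs its arguments, duration, and result size. It is an illustration only and doesn't reflect the agent's real logging format. The logger name, the traced helper, and the log messages are assumptions.

```python
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("tool_trace")  # hypothetical logger name


def traced(tool_name: str, tool: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a tool so every call logs its arguments, duration, and result size."""

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        started = time.perf_counter()
        logger.info("invoke %s args=%r kwargs=%r", tool_name, args, kwargs)
        try:
            result = tool(*args, **kwargs)
        except Exception:
            # Record the failure with a traceback, then let the caller handle it.
            logger.exception("tool %s failed", tool_name)
            raise
        elapsed_ms = (time.perf_counter() - started) * 1000
        logger.info(
            "tool %s returned %d chars in %.1f ms",
            tool_name, len(str(result)), elapsed_ms,
        )
        return result

    return wrapper
```

With a record like this, a reader can compare what a tool actually returned against the explanation that the model produced, which is where fabricated details tend to show up.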

Finally, clear documentation and community support are essential for plugin development and maintenance. Well-documented plugins with examples, version compatibility notes, and known limitations help you understand how to use them responsibly and contribute improvements when needed. Using the latest-generation LLMs or reasoning models from leading AI providers also reduces the risk of incorrect information.