Deploy and test inference models with the AI toolchain operator (KAITO) in Visual Studio Code

In this article, you learn how to use the AI toolchain operator (KAITO) add-on in the Azure Kubernetes Service (AKS) extension for Visual Studio Code. KAITO automatically provisions right-sized GPU nodes and sets up an inference server as an endpoint for your AI models, so you can test and experiment with AI on AKS with ease.

Prerequisites

  • An Azure subscription and an existing AKS cluster.
  • Visual Studio Code with the Azure Kubernetes Service (AKS) extension installed.

Install KAITO on your cluster

  1. In the Kubernetes tab, under Clouds > Azure > your subscription, right-click your cluster and select Deploy a LLM with KAITO > Install KAITO.
  2. On the page that opens, select Install KAITO to start the installation process.
  3. When the installation completes, you see a Generate Workspace button that redirects you to the model deployment page. To verify the installation from outside the extension, see the sketch that follows.

Screenshot showing the KAITO install screen.
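
Optionally, you can confirm the installation from outside the extension. The following is a minimal sketch using the official Kubernetes Python client; it assumes your kubeconfig already points at the cluster and that the KAITO controllers run in the kube-system namespace (as they do for the AKS managed add-on), either of which may differ in your setup.

```python
# pip install kubernetes
from kubernetes import client, config

# Load credentials from your local kubeconfig
# (for example, after running `az aks get-credentials`).
config.load_kube_config()

v1 = client.CoreV1Api()

# Assumption: the KAITO controllers (workspace controller and GPU
# provisioner) run in kube-system, as with the AKS managed add-on.
for pod in v1.list_namespaced_pod(namespace="kube-system").items:
    if "kaito" in pod.metadata.name:
        print(f"{pod.metadata.name}: {pod.status.phase}")
```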

Create a KAITO workspace

When creating a KAITO workspace, you can either deploy the default workspace CRD directly into your AKS cluster or save the CRD and customize it for your needs.

  1. In the Kubernetes tab, under Clouds > Azure > your subscription, right-click your cluster and select Deploy a LLM with KAITO > Create KAITO workspace.
  2. Find and select the model you want to deploy.
  3. Select Deploy default workspace CRD to deploy the model as-is, or select Customize workspace CRD to save the workspace specification and edit it before deploying (a sketch of a workspace resource follows the screenshot).
  4. After the deployment starts, the extension tracks its progress and notifies you when the model deploys successfully. It also notifies you if the deployment fails or if the model is already deployed on your cluster.
  5. When the deployment completes, you see a View Deployed Models button that redirects you to the deployment management page.

Screenshot showing the model select screen.
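
If you select Customize workspace CRD, the file you save describes a KAITO Workspace custom resource. As an illustration only, the sketch below applies such a resource with the official Kubernetes Python client; the workspace name, GPU instance type, labels, and preset name are placeholders (they follow the falcon-7b example from the KAITO project), and it assumes the kaito.sh/v1alpha1 API. Use the specification the extension generates for your model rather than these values.

```python
# pip install kubernetes
from kubernetes import client, config

config.load_kube_config()

# Placeholder Workspace resource for illustration; the extension generates
# the real one. Field layout follows the KAITO Workspace API (kaito.sh/v1alpha1).
workspace = {
    "apiVersion": "kaito.sh/v1alpha1",
    "kind": "Workspace",
    "metadata": {"name": "workspace-falcon-7b"},
    "resource": {
        "instanceType": "Standard_NC12s_v3",  # GPU SKU for KAITO to provision
        "labelSelector": {"matchLabels": {"apps": "falcon-7b"}},
    },
    "inference": {"preset": {"name": "falcon-7b"}},
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kaito.sh",
    version="v1alpha1",
    namespace="default",  # assumption: workspace lives in the default namespace
    plural="workspaces",
    body=workspace,
)
```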

Manage KAITO models

The Manage KAITO models page shows all models deployed in your AKS cluster along with their deployment status (ongoing, successful, or failed).

  1. In the Kubernetes tab, under Clouds > Azure > your subscription, right-click your cluster and select Deploy a LLM with KAITO > Manage KAITO models.

  2. From this page, you can perform any of the following actions:

    • Get logs: Select Get Logs to access the latest logs from the KAITO workspace pods for your deployment. This action generates a new text file containing the most recent 500 lines of logs (a programmatic equivalent is sketched after the screenshot).
    • Delete a model: Select Delete Workspace (or Cancel for an ongoing deployment). For a failed deployment, select Redeploy Default CRD to remove the current deployment and restart the deployment process from scratch.
    • Test a model: Select Test to open a new page where you can interact with the deployed model through a chat interface.

Screenshot showing the manage models screen.
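
Get Logs corresponds roughly to tailing the workspace pods. Here's a minimal sketch with the Kubernetes Python client; it assumes the inference pods carry the labelSelector from the placeholder workspace shown earlier (apps=falcon-7b) and run in the default namespace, so adjust both for your deployment.

```python
# pip install kubernetes
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Assumption: the workspace's inference pods carry the labelSelector from
# the workspace spec and run in the default namespace.
pods = v1.list_namespaced_pod(namespace="default", label_selector="apps=falcon-7b")

for pod in pods.items:
    # Mirror the extension's Get Logs action: fetch the most recent 500 lines.
    logs = v1.read_namespaced_pod_log(
        name=pod.metadata.name,
        namespace="default",
        tail_lines=500,
    )
    print(f"--- {pod.metadata.name} ---\n{logs}")
```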

Test your model

  1. In the Kubernetes tab, under Clouds > Azure > your subscription, right-click your cluster and select Deploy a LLM with KAITO > Manage KAITO models.

  2. Select Test. This action opens a new page where you can interact with the deployed model through the Prompt box in the chat interface.

  3. Optionally, adjust the following parameters (the sketch after the screenshot shows how they map onto an inference request):

    • Temperature: Controls the randomness of the model's output. A low temperature is good for tasks needing precision, like math problems, while a high temperature is better for tasks like creative writing.
    • Top P: Limits the next-word choices to a dynamic subset of the vocabulary, determined by a cumulative probability threshold.
    • Top K: Limits the next-word selection to the top K most probable words. Smaller K values lead to more predictable outputs, while larger values increase variability.
    • Repetition Penalty: Penalizes the model for repeating the same phrases, words, or sequences. This is useful for avoiding repetitive or looping outputs, especially in longer generations.
    • Max Length: Defines the maximum number of tokens (words or subwords) in the generated output.

Screenshot showing the test models screen.
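
You can also exercise these parameters outside the chat interface. The sketch below is illustrative, not the extension's own mechanism: it assumes you've port-forwarded the workspace's inference service to your machine (for example, kubectl port-forward svc/workspace-falcon-7b 8000:80, reusing the placeholder workspace name from earlier) and that the preset serves an OpenAI-style /v1/completions route, as KAITO's vLLM runtime does. On other runtimes, the route and parameter names may differ.

```python
# pip install requests
import requests

# Assumption: the inference service is port-forwarded locally, e.g.
#   kubectl port-forward svc/workspace-falcon-7b 8000:80
URL = "http://localhost:8000/v1/completions"

response = requests.post(
    URL,
    json={
        "model": "falcon-7b",       # served model name; check GET /v1/models
        "prompt": "Explain Kubernetes in one paragraph.",
        "temperature": 0.7,         # randomness of the output
        "top_p": 0.9,               # nucleus sampling threshold
        "top_k": 50,                # vLLM extension: top-K sampling
        "repetition_penalty": 1.2,  # vLLM extension: discourage repetition
        "max_tokens": 256,          # corresponds to the Max Length setting
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```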

For more information, see AKS extension for Visual Studio Code features.

Delete your model inference deployment

  1. When you finish testing your models and want to free up the allocated GPU resources on your cluster, go to the Kubernetes tab, and under Clouds > Azure > your subscription, right-click your cluster and select Deploy a LLM with KAITO > Manage KAITO models.
  2. For each deployed model, select Delete Workspace to clear all resources that the inference deployment allocated (a programmatic equivalent is sketched after these steps).
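
Deleting the Workspace custom resource performs the same cleanup: once the resource is gone, KAITO tears down the inference deployment, and the GPU nodes it provisioned can be removed. A minimal sketch with the Kubernetes Python client, reusing the placeholder workspace name and namespace from the earlier sketches:

```python
# pip install kubernetes
from kubernetes import client, config

config.load_kube_config()

# Placeholder name and namespace from the earlier sketches;
# substitute the values for your own workspace.
client.CustomObjectsApi().delete_namespaced_custom_object(
    group="kaito.sh",
    version="v1alpha1",
    namespace="default",
    plural="workspaces",
    name="workspace-falcon-7b",
)
```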

Product support and feedback

If you have a question or want to offer product feedback, open an issue on the AKS extension GitHub repository.

Next steps

To learn more about other AKS add-ons and extensions, see Add-ons, extensions, and other integrations for AKS.