Get started with AI Toolkit for Visual Studio Code

2025-05-30

The AI Toolkit for VS Code (AI Toolkit) is a VS Code extension that enables you to download, test, fine-tune, and deploy AI models with your apps or in the cloud. For more information, see the AI Toolkit overview.

Note

Additional documentation and tutorials for the AI Toolkit for VS Code are available in the VS Code documentation: AI Toolkit for Visual Studio Code. You'll find guidance on Playground, working with AI models, fine-tuning local and cloud-based models, and more.

In this article, you'll learn how to:

Install the AI Toolkit for VS Code
Download a model from the catalog
Run the model locally using the playground
Integrate an AI model into your application using REST or the ONNX Runtime

Prerequisites

VS Code must be installed. For more information, see Download VS Code and Getting started with VS Code.

When utilizing AI features, we recommend that you review: Developing Responsible Generative AI Applications and Features on Windows.

Install

The AI Toolkit is available in the Visual Studio Marketplace and can be installed like any other VS Code extension. If you're unfamiliar with installing VS Code extensions, follow these steps:

In the Activity Bar in VS Code select Extensions
In the Extensions Search bar type "AI Toolkit"
Select the "AI Toolkit for Visual Studio code"
Select Install

Once the extension has been installed you'll see the AI Toolkit icon appear in your Activity Bar.

Download a model from the catalog

The primary sidebar of the AI Toolkit is organized into My Models, Catalog, Tools, and Help and Feedback. The Playground, Bulk Run, Evaluation, and Fine tuning features are available in the Tools section. To get started select Models from the Catalog section to open the Model Catalog window:

A screenshot of the AI Toolkit model catalog window in VS Code

You can use the filters at the top of the catalog to filter by Hosted by, Publisher, Tasks, and Model type. There's also a Fine-Tuning Support switch that you can toggle on to only show models that can be fine tuned.

Tip

The Model type filter allows you to only show models that will run locally on the CPU, GPU, or NPU or models that support only Remote access. For optimized performance on devices that have at least one GPU, select model type of Local run w/ GPU. This helps to find a model optimized for the DirectML accelerator.

To check whether you have a GPU on your Windows device, open Task Manager and then select the Performance tab. If you have GPU(s), they will be listed under names like "GPU 0" or "GPU 1".

Note

For Copilot+ PCs with a Neural Processing Unit (NPU), you can select models that are optimized for the NPU accelerator. The Deepseek R1 Distilled model is optimized for the NPU and available to download on Snapdragon powered Copilot+ PCs running Windows 11. For more information, see Running Distilled DeepSeek R1 models locally on Copilot+ PCs, powered by Windows AI Foundry.

The following models are currently available for Windows devices with one or more GPUs:

Mistral 7B (DirectML - Small, Fast)
Phi 3 Mini 4K (DirectML - Small, Fast)
Phi 3 Mini 128K (DirectML - Small, Fast)

Select the Phi 3 Mini 4K model and click Download:

Note

The Phi 3 Mini 4K model is approximately 2GB-3GB in size. Depending on your network speed, it could take a few minutes to download.

Run the model in the playground

Once your model has downloaded, it will appear in the My Models section under Local models. Right-click the model and select Load in Playground from the context menu:

A screenshot of the Load in Playground context menu item

In the chat interface of the playground enter the following message followed by the Enter key:

Playground selection

You should see the model response streamed back to you:

Generation response

Warning

If you do not have a GPU available on your device but you selected the Phi-3-mini-4k-directml-int4-awq-block-128-onnx model, the model response will be very slow. You should instead download the CPU optimized version: Phi-3-mini-4k-cpu-int4-rtn-block-32-acc-level-4-onnx.

It is also possible to change:

Context Instructions: Help the model understand the bigger picture of your request. This could be background information, examples/demonstrations of what you want or explaining the purpose of your task.
Inference parameters:
- Maximum response length: The maximum number of tokens the model will return.
- Temperature: Model temperature is a parameter that controls how random a language model's output is. A higher temperature means the model takes more risks, giving you a diverse mix of words. On the other hand, a lower temperature makes the model play it safe, sticking to more focused and predictable responses.
- Top P: Also known as nucleus sampling, is a setting that controls how many possible words or phrases the language model considers when predicting the next word
- Frequency penalty: This parameter influences how often the model repeats words or phrases in its output. The higher the value (closer to 1.0) encourages the model to avoid repeating words or phrases.
- Presence penalty: This parameter is used in generative AI models to encourage diversity and specificity in the generated text. A higher value (closer to 1.0) encourages the model to include more novel and diverse tokens. A lower value is more likely for the model to generate common or cliche phrases.

Integrate an AI model into your application

There are two options to integrate the model into your application:

The AI Toolkit comes with a local REST API web server that uses the OpenAI chat completions format. This enables you to test your application locally - using the endpoint http://127.0.0.1:5272/v1/chat/completions - without having to rely on a cloud AI model service. Use this option if you intend to switch to a cloud endpoint in production. You can use OpenAI client libraries to connect to the web server.
Using the ONNX Runtime. Use this option if you intend to ship the model with your application with inferencing on device.

Local REST API web server

The local REST API web server allows you to build-and-test your application locally without having to rely on a cloud AI model service. You can interact with the web server using REST, or with an OpenAI client library:

Here is an example body for your REST request:

{
    "model": "Phi-3-mini-4k-directml-int4-awq-block-128-onnx",
    "messages": [
        {
            "role": "user",
            "content": "what is the golden ratio?"
        }
    ],
    "temperature": 0.7,
    "top_p": 1,
    "top_k": 10,
    "max_tokens": 100,
    "stream": true
}'

Note

You may need to update the model field to the name of the model you downloaded.

You can test the REST endpoint using an API tool like Postman or the CURL utility:

curl -vX POST http://127.0.0.1:5272/v1/chat/completions -H 'Content-Type: application/json' -d @body.json

Install the OpenAI Python library:

pip install openai

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:5272/v1/",
    api_key="x" # required by API but not used
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "what is the golden ratio?",
        }
    ],
    model="Phi-3-mini-4k-directml-int4-awq-block-128-onnx",
)

print(chat_completion.choices[0].message.content)

Add the Azure OpenAI client library for .NET to your project using NuGet:

dotnet add {project_name} package Azure.AI.OpenAI --version 1.0.0-beta.17

Add a C# file called OverridePolicy.cs to your project and paste the following code:

// OverridePolicy.cs
using Azure.Core.Pipeline;
using Azure.Core;

internal partial class OverrideRequestUriPolicy(Uri overrideUri)
    : HttpPipelineSynchronousPolicy
{
    private readonly Uri _overrideUri = overrideUri;

    public override void OnSendingRequest(HttpMessage message)
    {
        message.Request.Uri.Reset(_overrideUri);
    }
}

Next, paste the following code into your Program.cs file:

// Program.cs
using Azure.AI.OpenAI;

Uri localhostUri = new("http://localhost:5272/v1/chat/completions");

OpenAIClientOptions clientOptions = new();
clientOptions.AddPolicy(
    new OverrideRequestUriPolicy(localhostUri),
    Azure.Core.HttpPipelinePosition.BeforeTransport);
OpenAIClient client = new(openAIApiKey: "unused", clientOptions);

ChatCompletionsOptions options = new()
{
    DeploymentName = "Phi-3-mini-4k-directml-int4-awq-block-128-onnx",
    Messages =
    {
        new ChatRequestSystemMessage("You are a helpful assistant. Be brief and succinct."),
        new ChatRequestUserMessage("What is the golden ratio?"),
    }
};

StreamingResponse<StreamingChatCompletionsUpdate> streamingChatResponse
    = await client.GetChatCompletionsStreamingAsync(options);

await foreach (StreamingChatCompletionsUpdate chatChunk in streamingChatResponse)
{
    Console.Write(chatChunk.ContentUpdate);
}

Note

If you downloaded the CPU version of the Phi3 model, you need to update the model field to Phi-3-mini-4k-cpu-int4-rtn-block-32-acc-level-4-onnx.

ONNX Runtime

The ONNX Runtime Generate API provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. You can call a high level generate() method, or run each iteration of the model in a loop, generating one token at a time, and optionally updating generation parameters inside the loop.

It has support for greedy/beam search and TopP, TopK sampling to generate token sequences and built-in logits processing like repetition penalties. The following code is an example of how you can leverage the ONNX runtime in your applications.

Please refer to the example shown in Local REST API web server. The AI Toolkit REST web server is built using the ONNX Runtime.

Install Numpy:

pip install numpy

Next, install the ONNX Runtime Python package into your project according to your platform and GPU availability:

Platform	GPU Available	PyPI
Windows	Yes (AMD, NVIDIA, Intel, Qualcomm, plus others supported)	`pip install --pre onnxruntime-genai-directml`
Linux	Yes (Nvidia CUDA)	`pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/`
Windows Linux	No	`pip install --pre onnxruntime-genai`

Tip

We recommend installing Python packages into a virtual environment using either venv or conda.

Next, copy-and-paste the following code into a Python file named app.py:

# app.py
import onnxruntime_genai as og
import argparse

def main(args):
    print("Loading model...")
    model = og.Model(f'{args.model}')
    print("Model loaded")
    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()
    search_options = {
        'max_length': 2048
    }

    chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

    # Keep asking for input prompts in a loop
    while True:
        text = input("Input: ")
    
        # If there is a chat template, use it
        prompt = f'{chat_template.format(input=text)}'

        input_tokens = tokenizer.encode(prompt)

        params = og.GeneratorParams(model)
        params.set_search_options(**search_options)
        params.input_ids = input_tokens
        
        generator = og.Generator(model, params)
        print("\nOutput: ", end='', flush=True)
        while not generator.is_done():
            generator.compute_logits()
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(new_token), end='', flush=True)
              
        print()
        print()

        # Delete the generator to free the captured graph for the next generator, if graph capture is enabled
        del generator


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', '--model', type=str, required=True, help='Onnx model folder path (must contain config.json and model.onnx)')
    args = parser.parse_args()
    main(args)

To run the Python app use the following code:

python app.py --model ~/.aitk/models/{path_to_folder_containing_onnx_file}

Note

The AI Toolkit caches model downloads into a hidden folder named .aitk in your user directory - you'll need to update the path used for the --model parameter to the location of the folder containing the ONNX model file. For example ~/.aitk/models/microsoft/Phi-3-mini-4k-instruct-onnx/directml/Phi-3-mini-4k-directml-int4-awq-block-128-onnx/

Install the ONNX Runtime NuGet package into your project according to your platform and GPU availability:

Platform	GPU Available	Nuget
Windows	Yes (AMD, NVIDIA, Intel, Qualcomm, plus others supported)	Microsoft.ML.OnnxRuntimeGenAI.DirectML
Linux	Yes (Nvidia CUDA)	Microsoft.ML.OnnxRuntimeGenAI.Cuda
Windows Linux	No	Microsoft.ML.OnnxRuntimeGenAI

Copy-and-paste the following code into your C# file:

using Microsoft.ML.OnnxRuntimeGenAI;

// update user_name and path placeholders
string modelPath = "C:\\Users\\{user_name}\\.aitk\\models\\{path}"; 
Console.Write("Loading model from " + modelPath + "...");
using Model model = new(modelPath);
Console.Write("Done\n");
using Tokenizer tokenizer = new(model);
using TokenizerStream tokenizerStream = tokenizer.CreateStream();

while (true)
{
    Console.Write("User:");
   
    string? input = Console.ReadLine();
    string prompt = "<|user|>\n" + input + "<|end|>\n<|assistant|>";

    var sequences = tokenizer.Encode(prompt);

    using GeneratorParams generatorParams = new GeneratorParams(model);
    generatorParams.SetSearchOption("max_length", 200);
    generatorParams.SetInputSequences(sequences);

    Console.Out.Write("\nAI:");
    using Generator generator = new(model, generatorParams);
    while (!generator.IsDone())
    { 
        generator.ComputeLogits();
        generator.GenerateNextToken();
        Console.Out.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
        Console.Out.Flush();
    }
    Console.WriteLine();
}

Note

The AI Toolkit caches model downloads into a hidden folder named .aitk in your user directory - you'll need to update the modelPath in the code to the location of the folder containing the ONNX model file. For example ~/.aitk/models/microsoft/Phi-3-mini-4k-instruct-onnx/directml/Phi-3-mini-4k-directml-int4-awq-block-128-onnx/

Next Step

Fine-tune a model with AI Toolkit for VS Code

Share via

Get started with AI Toolkit for Visual Studio Code

Prerequisites

Install

Download a model from the catalog

Run the model in the playground

Integrate an AI model into your application

Local REST API web server

ONNX Runtime

Next Step

Feedback

Additional resources