Compile Hugging Face models to run on Foundry Local

Important

  • Foundry Local is available in preview. Public preview releases provide early access to features that are in active development.
  • Features, approaches, and processes can change or have limited capabilities before General Availability (GA).

Foundry Local runs ONNX models on your device with high performance. While the model catalog offers out-of-the-box precompiled options, you can use any model in the ONNX format.

To compile existing models in Safetensors or PyTorch format into the ONNX format, you can use Olive. Olive is a tool that converts and optimizes models to the ONNX format, making them suitable for deployment in Foundry Local. It uses techniques like quantization and graph optimization to improve performance.

This guide shows you how to:

  • Convert and optimize models from Hugging Face to run in Foundry Local. You'll use the Llama-3.2-1B-Instruct model as an example, but you can use any generative AI model from Hugging Face.
  • Run your optimized models with Foundry Local.

Prerequisites

  • Python 3.10 or later

Install Olive

Install Olive with the auto-opt extra, which provides the automatic optimization workflow used in this guide:

pip install olive-ai[auto-opt]

Tip

For best results, install Olive in a virtual environment using venv or conda.
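
For example, a minimal setup using venv (the .venv directory name is just illustrative):

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install olive-ai[auto-opt]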

Sign in to Hugging Face

This guide optimizes the Llama-3.2-1B-Instruct model, which requires Hugging Face authentication to download:

huggingface-cli login

Note

You must create a Hugging Face access token and request access to the model before proceeding.
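
If you prefer a noninteractive sign-in, the CLI also accepts the token directly (this example assumes your token is stored in the HF_TOKEN environment variable):

huggingface-cli login --token "$HF_TOKEN"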

Compile the model

Step 1: Run the Olive auto-opt command

Use the Olive auto-opt command to download, convert, quantize, and optimize the model:

olive auto-opt \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --trust_remote_code \
    --output_path models/llama \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_ort_genai \
    --precision int4 \
    --log_level 1

Note

The compilation process takes approximately 60 seconds, plus extra time for model download.

The command uses the following parameters:

  • model_name_or_path: The model source. A Hugging Face ID, a local path, or an Azure AI Model registry ID.
  • output_path: Where to save the optimized model.
  • device: The target hardware: cpu, gpu, or npu.
  • provider: The execution provider (for example, CPUExecutionProvider or CUDAExecutionProvider).
  • precision: The model precision: fp16, fp32, int4, or int8.
  • use_ort_genai: Creates the inference configuration files.
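
For example, to target a CUDA-capable GPU instead of the CPU, you'd adjust the device, provider, and precision parameters along these lines (a sketch only; the right provider and precision depend on your hardware and installed ONNX Runtime packages):

olive auto-opt \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --trust_remote_code \
    --output_path models/llama \
    --device gpu \
    --provider CUDAExecutionProvider \
    --use_ort_genai \
    --precision fp16 \
    --log_level 1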

Tip

If you have a local copy of the model, you can use a local path instead of the Hugging Face ID. For example, --model_name_or_path models/llama-3.2-1B-Instruct. Olive handles the conversion, optimization, and quantization automatically.

Step 2: Rename the output model

Olive places the output files in a generic directory named model. Rename it to make it easier to reference:

cd models/llama
mv model llama-3.2
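
If your shell doesn't have mv (for example, the Windows Command Prompt), a Python one-liner run from models/llama does the same rename:

python -c "import shutil; shutil.move('model', 'llama-3.2')"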

Step 3: Create chat template file

A chat template is a structured format that defines how input and output messages are processed for a conversational AI model. It specifies the roles (for example, system, user, assistant) and the structure of the conversation, ensuring that the model understands the context and generates appropriate responses.

Foundry Local requires a chat template JSON file called inference_model.json to generate appropriate responses. The template has two properties: the model name and a PromptTemplate object, which contains a {Content} placeholder that Foundry Local fills with the user prompt at runtime.

{
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{Content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
  }
}

To create the chat template file, you can use the apply_chat_template method from the Hugging Face library:

Note

The following example uses the Hugging Face transformers library to create a chat template. transformers is a dependency of Olive, so if you're using the same Python virtual environment, you don't need to install it separately. If you're using a different environment, install it with pip install transformers.

# generate_inference_model.py
# This script generates the inference_model.json file for the Llama-3.2 model.
import json
import os
from transformers import AutoTokenizer

model_path = "models/llama/llama-3.2"

tokenizer = AutoTokenizer.from_pretrained(model_path)
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "{Content}"},
]


template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

json_template = {
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": template
  }
}

json_file = os.path.join(model_path, "inference_model.json")

with open(json_file, "w") as f:
    json.dump(json_template, f, indent=2)

Run the script using:

python generate_inference_model.py
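
To confirm the file landed where Foundry Local expects it (the path follows the directory layout used in the earlier steps), print it:

cat models/llama/llama-3.2/inference_model.json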

Run the model

You can run your compiled model using the Foundry Local CLI, REST API, or OpenAI Python SDK. First, change the model cache directory to the models directory you created earlier:

foundry cache cd models
foundry cache ls  # should show llama-3.2

Caution

Remember to change the model cache back to the default directory when you're done by running:

foundry cache cd ./foundry/cache/models

Using the Foundry Local CLI

foundry model run llama-3.2 --verbose

Using the OpenAI Python SDK

The OpenAI Python SDK is a convenient way to interact with the Foundry Local REST API. Install it along with the Foundry Local SDK:

pip install openai
pip install foundry-local-sdk

Then, you can use the following code to run the model:

import openai
from foundry_local import FoundryLocalManager

modelId = "llama-3.2"

# Create a FoundryLocalManager instance. This will start the Foundry 
# Local service if it is not already running and load the specified model.
manager = FoundryLocalManager(modelId)

# The remaining code uses the OpenAI Python SDK to interact with the local model.

# Configure the client to use the local Foundry service
client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key  # API key is not required for local usage
)

# Set the model to use and generate a streaming response
stream = client.chat.completions.create(
    model=manager.get_model_info(modelId).id,
    messages=[{"role": "user", "content": "What is the golden ratio?"}],
    stream=True
)

# Print the streaming response
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Tip

You can use any language that supports HTTP requests. For more information, read the Integrated inferencing SDKs with Foundry Local article.
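
For example, here's a minimal sketch that calls the service with the Python requests library instead of the OpenAI SDK. It assumes the endpoint exposes an OpenAI-compatible /chat/completions route (which is what the SDK call above uses) and that requests is installed:

import requests
from foundry_local import FoundryLocalManager

# Start the Foundry Local service (if needed) and load the model, as before.
manager = FoundryLocalManager("llama-3.2")

# Call the OpenAI-compatible chat completions route directly.
response = requests.post(
    f"{manager.endpoint}/chat/completions",
    json={
        "model": manager.get_model_info("llama-3.2").id,
        "messages": [{"role": "user", "content": "What is the golden ratio?"}],
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])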

Finishing up

After you're done using the custom model, you should reset the model cache to the default directory using:

foundry cache cd ./foundry/cache/models

Next steps