Compile Hugging Face models to run on Foundry Local

Important

  • Foundry Local is available in preview. Public preview releases provide early access to features that are in active development.
  • Features, approaches, and processes can change or have limited capabilities before General Availability (GA).

Foundry Local runs ONNX models on your device with high performance. While the model catalog offers out-of-the-box precompiled options, you can use any model in the ONNX format.

To compile existing models in Safetensors or PyTorch format into the ONNX format, you can use Olive. Olive is a tool that converts and optimizes models to the ONNX format, making them suitable for deployment in Foundry Local. It uses techniques like quantization and graph optimization to improve performance.

This guide shows you how to:

  • Convert and optimize models from Hugging Face to run in Foundry Local. You'll use the Llama-3.2-1B-Instruct model as an example, but you can use any generative AI model from Hugging Face.
  • Run your optimized models with Foundry Local.

Prerequisites

  • Python 3.10 or later

Install Olive

Install Olive with the auto-opt extra, which provides the automatic optimization workflow used in this guide:

pip install olive-ai[auto-opt]

Tip

For best results, install Olive in a virtual environment using venv or conda.
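
For example, a minimal setup using venv (the .venv directory name is just illustrative):

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install olive-ai[auto-opt]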

Sign in to Hugging Face

This guide optimizes the Llama-3.2-1B-Instruct model, which requires Hugging Face authentication to download:

huggingface-cli login

Note

You must create a Hugging Face access token and request access to the model before proceeding.
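
If you prefer a noninteractive sign-in, the CLI also accepts the token directly (this example assumes your token is stored in the HF_TOKEN environment variable):

huggingface-cli login --token "$HF_TOKEN"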

Compile the model

Step 1: Run the Olive auto-opt command

Use the Olive auto-opt command to download, convert, quantize, and optimize the model:

olive auto-opt \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --trust_remote_code \
    --output_path models/llama \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_ort_genai \
    --precision int4 \
    --log_level 1

Note

The compilation process takes approximately 60 seconds, plus extra time for model download.

The command uses the following parameters:

  • model_name_or_path: The model source. A Hugging Face ID, a local path, or an Azure AI Model registry ID.
  • output_path: Where to save the optimized model.
  • device: The target hardware: cpu, gpu, or npu.
  • provider: The execution provider (for example, CPUExecutionProvider or CUDAExecutionProvider).
  • precision: The model precision: fp16, fp32, int4, or int8.
  • use_ort_genai: Creates the inference configuration files.
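
For example, to target a CUDA-capable GPU instead of the CPU, you'd adjust the device, provider, and precision parameters along these lines (a sketch only; the right provider and precision depend on your hardware and installed ONNX Runtime packages):

olive auto-opt \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --trust_remote_code \
    --output_path models/llama \
    --device gpu \
    --provider CUDAExecutionProvider \
    --use_ort_genai \
    --precision fp16 \
    --log_level 1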

Tip

If you have a local copy of the model, you can use a local path instead of the Hugging Face ID. For example, --model_name_or_path models/llama-3.2-1B-Instruct. Olive handles the conversion, optimization, and quantization automatically.

Step 2: Rename the output model

Olive places the output files in a generic directory named model. Rename it to make it easier to reference:

cd models/llama
mv model llama-3.2
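
If your shell doesn't have mv (for example, the Windows Command Prompt), a Python one-liner run from models/llama does the same rename:

python -c "import shutil; shutil.move('model', 'llama-3.2')"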

Step 3: Create chat template file

A chat template is a structured format that defines how input and output messages are processed for a conversational AI model. It specifies the roles (for example, system, user, assistant) and the structure of the conversation, ensuring that the model understands the context and generates appropriate responses.

Foundry Local requires a chat template JSON file called inference_model.json to generate appropriate responses. The template has two properties: the model name and a PromptTemplate object, which contains a {Content} placeholder that Foundry Local fills with the user prompt at runtime.

{
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{Content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
  }
}

To create the chat template file, you can use the apply_chat_template method from the Hugging Face library:

Note

The following example uses the Hugging Face transformers library to create a chat template. transformers is a dependency of Olive, so if you're using the same Python virtual environment, you don't need to install it separately. If you're using a different environment, install it with pip install transformers.

# generate_inference_model.py
# This script generates the inference_model.json file for the Llama-3.2 model.
import json
import os
from transformers import AutoTokenizer

model_path = "models/llama/llama-3.2"

tokenizer = AutoTokenizer.from_pretrained(model_path)
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "{Content}"},
]


template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

json_template = {
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": template
  }
}

json_file = os.path.join(model_path, "inference_model.json")

with open(json_file, "w") as f:
    json.dump(json_template, f, indent=2)

Run the script using:

python generate_inference_model.py
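
To confirm the file landed where Foundry Local expects it (the path follows the directory layout used in the earlier steps), print it:

cat models/llama/llama-3.2/inference_model.json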

Run the model

You can run your compiled model using the Foundry Local CLI, REST API, or OpenAI Python SDK. First, change the model cache directory to the models directory you created earlier:

foundry cache cd models
foundry cache ls  # should show llama-3.2

Caution

Remember to change the model cache back to the default directory when you're done by running:

foundry cache cd ./foundry/cache/models

Using the Foundry Local CLI

foundry model run llama-3.2 --verbose

Using the OpenAI Python SDK

The OpenAI Python SDK is a convenient way to interact with the Foundry Local REST API. Install it along with the Foundry Local SDK:

pip install openai
pip install foundry-local-sdk

Then, you can use the following code to run the model:

import openai
from foundry_local import FoundryLocalManager

modelId = "llama-3.2"

# Create a FoundryLocalManager instance. This will start the Foundry 
# Local service if it is not already running and load the specified model.
manager = FoundryLocalManager(modelId)

# The remaining code uses the OpenAI Python SDK to interact with the local model.

# Configure the client to use the local Foundry service
client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key  # API key is not required for local usage
)

# Set the model to use and generate a streaming response
stream = client.chat.completions.create(
    model=manager.get_model_info(modelId).id,
    messages=[{"role": "user", "content": "What is the golden ratio?"}],
    stream=True
)

# Print the streaming response
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Tip

You can use any language that supports HTTP requests. For more information, read the Integrated inferencing SDKs with Foundry Local article.
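
For example, here's a minimal sketch that calls the service with the Python requests library instead of the OpenAI SDK. It assumes the endpoint exposes an OpenAI-compatible /chat/completions route (which is what the SDK call above uses) and that requests is installed:

import requests
from foundry_local import FoundryLocalManager

# Start the Foundry Local service (if needed) and load the model, as before.
manager = FoundryLocalManager("llama-3.2")

# Call the OpenAI-compatible chat completions route directly.
response = requests.post(
    f"{manager.endpoint}/chat/completions",
    json={
        "model": manager.get_model_info("llama-3.2").id,
        "messages": [{"role": "user", "content": "What is the golden ratio?"}],
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])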

Finishing up

After you're done using the custom model, you should reset the model cache to the default directory using:

foundry cache cd ./foundry/cache/models

Next steps