Quickstart: Deploy your first model and run inference on Foundry Local

This article shows you how to use an existing Foundry Local environment to browse the model catalog, create your first model deployment, and send inference requests.

Important

  • Foundry Local is available in preview. Preview releases provide early access to features that are in active development.
  • Features, approaches, and processes can change or have limited capabilities before general availability (GA).

Prerequisites

Before you begin, make sure you have:

  • Preview deployment access. Foundry Local on Azure Local is currently available by request during preview. Submit the access request form: Request preview deployment access. After your request is reviewed, you receive guidance on next steps for deployment.
  • An active Azure subscription. If you don't have one, create one before you begin.
  • A Kubernetes cluster (version 1.29 or later) connected to Azure Arc, or a direct Kubernetes deployment.
  • kubectl installed and configured for your cluster.
  • Foundry Local deployed to your Kubernetes cluster. For deployment steps, see Deploy Foundry Local as an Azure Arc extension. Helm is also a supported deployment option, and installation instructions are provided during preview access onboarding.
  • Authentication configured for your Foundry Local deployment. For setup steps, see Configure Entra ID authentication, or use API key authentication as described in Authentication and authorization.
  • (Optional) A namespace strategy if you plan to deploy models outside the default foundry-local-operator namespace. Namespace configuration must be set during installation. For more information, see Namespace configuration for model deployments.

List available models

After you deploy Foundry Local and complete authentication, you can browse the model catalog. Foundry Local supports two approaches for managing models:

  • kubectl — Work directly with Kubernetes custom resources (ModelDeployment CRDs).
  • Foundry Local REST API — Use HTTP endpoints exposed by the inference operator.

This quickstart uses kubectl. To see which models are available for deployment, view the full model catalog:

kubectl get configmap foundry-local-catalog -n foundry-local-operator -o jsonpath="{.data['catalog\.json']}"

For a table-style view of the catalog, pipe the output through PowerShell cmdlets (this command requires PowerShell):

kubectl get configmap foundry-local-catalog -n foundry-local-operator -o jsonpath="{.data['catalog\.json']}" | ConvertFrom-Json | Select-Object -ExpandProperty models | Format-Table alias, displayName, task, framework
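
In shells without PowerShell cmdlets, a short script can produce a similar table. The following is a minimal sketch, not an official tool. It assumes the catalog JSON has a top-level models array whose entries carry the alias, displayName, task, and framework fields used in the command above. The function names fetch_catalog and format_catalog are illustrative.

```python
import json
import subprocess

# Fields taken from the Format-Table command above.
FIELDS = ("alias", "displayName", "task", "framework")


def fetch_catalog(namespace="foundry-local-operator"):
    """Return the catalog JSON text by running the kubectl command shown above."""
    result = subprocess.run(
        [
            "kubectl", "get", "configmap", "foundry-local-catalog",
            "-n", namespace,
            "-o", r"jsonpath={.data['catalog\.json']}",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def format_catalog(catalog_json):
    """Render the (assumed) models array as a fixed-width text table."""
    models = json.loads(catalog_json).get("models", [])
    rows = [FIELDS] + [tuple(str(m.get(f, "")) for f in FIELDS) for m in models]
    # Each column is as wide as its longest cell, including the header.
    widths = [max(len(row[i]) for row in rows) for i in range(len(FIELDS))]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]
    return "\n".join(lines)
```

With kubectl configured for the cluster, print(format_catalog(fetch_catalog())) prints the table. Missing fields are shown as empty cells rather than raising an error.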

Deploy a model

Choose the model you want from the catalog and create a deployment.

  1. Create a YAML file (for example, model-deployment.yaml) with a ModelDeployment resource. Replace the placeholder values with the model name from the catalog and your desired configuration:

    apiVersion: foundrylocal.azure.com/v1
    kind: ModelDeployment
    metadata:
      name: <deployment-name>
      namespace: foundry-local-operator
    spec:
      model:
        catalog:
          name: <model-name-from-catalog>
          version: "latest"
      compute: gpu              # or cpu
      runtime: vllm             # or onnx-genai
      workloadType: generative
      replicas: 1
      resources:
        requests:
          cpu: "2"
          memory: "32Gi"
        limits:
          cpu: "4"
          memory: "64Gi"
          gpu: 1
    
  2. Apply the manifest to deploy the model:

    kubectl apply -f model-deployment.yaml
    

Verify the deployment status

Confirm the model deployment is ready before sending inference requests.

Check whether a specific model deployment is ready:

kubectl get modeldeployment <deployment-name> -n foundry-local-operator

For detailed status information including events and conditions:

kubectl describe modeldeployment <deployment-name> -n foundry-local-operator

To list all deployed models across all namespaces:

kubectl get modeldeployment -A
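
A script can repeat the status check instead of rerunning it by hand. The sketch below polls the first command in this section until its output contains the word Running. It assumes that the Running status appears as a separate word in the default kubectl get output. The function names, the timeout, and the polling interval are illustrative choices, not documented values.

```python
import subprocess
import time


def get_status_text(name, namespace="foundry-local-operator"):
    """Return the raw output of `kubectl get modeldeployment <name>`."""
    result = subprocess.run(
        ["kubectl", "get", "modeldeployment", name, "-n", namespace],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def wait_until_running(name, read_status=get_status_text,
                       timeout_s=600, interval_s=10, sleep=time.sleep):
    """Poll until the status output mentions Running; return False on timeout."""
    waited = 0
    while True:
        # Assumption: "Running" shows up as its own word in the kubectl output.
        if "Running" in read_status(name).split():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

For example, wait_until_running("<deployment-name>") returns True once the deployment reports Running, or False after about ten minutes. The read_status and sleep parameters exist so that the loop can be exercised without a cluster.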

Send an inference request

When the model's status shows Running, you can send inference requests.

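  1. Set up port forwarding to the model's service. The command runs in the foreground, so leave it running and use a new terminal for the remaining steps:

    kubectl port-forward svc/<deployment-name> -n foundry-local-operator 5000:5000
    
  2. In the new terminal, retrieve the model's API key (the value is Base64-encoded):

    kubectl get secret <deployment-name>-api-keys -n foundry-local-operator -o jsonpath="{.data.primary-key}"
    
  3. Decode the key (for example, by piping the value to base64 --decode on Linux) and send a chat completion request. The -k flag tells curl to skip TLS certificate verification for the port-forwarded endpoint: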

    curl -k -X POST https://localhost:5000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "api-key: <your-decoded-api-key>" \
      -d '{"model":"<model-name>","messages":[{"role":"user","content":"Hello, what can you do?"}],"max_tokens":256}'
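
The key decoding and the request can also be scripted. The following is a minimal sketch that uses only the Python standard library and mirrors the curl command above: the same URL, headers, and JSON body. The function names are illustrative, and the response is assumed to be a JSON document. Certificate verification is turned off to match curl -k, which is suitable only for local testing through the port forward.

```python
import base64
import json
import ssl
import urllib.request


def decode_api_key(encoded_key):
    """Decode the Base64 value returned by the kubectl get secret command."""
    return base64.b64decode(encoded_key).decode("utf-8")


def build_chat_request(api_key, model, prompt,
                       base_url="https://localhost:5000", max_tokens=256):
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the response body parsed as JSON."""
    # Equivalent of curl -k: do not verify the server certificate.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the port forward from step 1 running, a call such as send_chat_request(build_chat_request(decode_api_key("<base64-value>"), "<model-name>", "Hello, what can you do?")) returns the parsed response.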