Best practices and troubleshooting guide for Foundry Local

Important

  • Foundry Local is available in preview. Public preview releases provide early access to features that are in active deployment.
  • Features, approaches, and processes can change or have limited capabilities before General Availability (GA).

This document provides best practices and troubleshooting tips for Foundry Local.

Security best practices

  • Use Foundry Local in environments that comply with your organization's security policies.
  • When handling sensitive data, ensure your device meets your organization's security requirements.
  • Use disk encryption on devices where cached models might contain sensitive fine-tuning data (see the example after this list).
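
For example, to see which models are cached on the device, and are therefore covered by your disk-encryption policy, you can list the local cache. This uses the same foundry cache list command that appears in the troubleshooting table later in this guide; the exact output varies by CLI version.

foundry cache list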

Licensing considerations

When using Foundry Local, be aware of the licensing implications of the models you run. You can view the full license terms for each model in the model catalog by running:

foundry model info <model> --license
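
For example, to review the license terms for a specific model before downloading it (the alias shown here is a placeholder; substitute any model name or alias from your catalog):

foundry model info phi-3.5-mini --license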

Production deployment scope

Foundry Local is designed for on-device inference. It isn't intended for distributed, containerized, or multi-machine production deployments.

Troubleshooting

Common issues and solutions

Issue: Slow inference
Possible cause: CPU-only model with a large parameter count
Solution: Use GPU-optimized model variants when available.

Issue: Model download failures
Possible cause: Network connectivity issues
Solution: Check your internet connection and run foundry cache list to verify the cache status.

Issue: The service fails to start
Possible cause: Port conflicts or permission issues
Solution: Try foundry service restart, or report an issue with logs by using foundry zip-logs (see the example sequence after this table).

Issue: Qualcomm NPU error (Qnn error code 5005: "Failed to load from EpContext model. qnn_backend_manager.")
Possible cause: Qualcomm NPU error
Solution: Under investigation.
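
When the service fails to start or you need to file an issue, a typical diagnostic sequence is sketched below. The foundry service restart and foundry zip-logs commands come from the table above; foundry service status is an assumption and might differ in your installed CLI version.

foundry service status     # check whether the service is running (assumed command)
foundry service restart    # clear port conflicts or stale state
foundry zip-logs           # package logs into an archive to attach to an issue report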

Improving performance

If you experience slow inference, consider the following strategies:

  • Running ONNX models through the AI Toolkit for VS Code at the same time causes resource contention. Stop the AI Toolkit inference session before running Foundry Local.
  • Use GPU acceleration when available.
  • Identify bottlenecks by monitoring memory usage during inference.
  • Try more heavily quantized model variants (like INT8 instead of FP16); see the example after this list.
  • Adjust batch sizes for non-interactive workloads.
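
As a starting point, the sketch below shows how you might find and switch to a different model variant from the command line. The foundry cache list command appears earlier in this guide; foundry model list and foundry model run aren't referenced elsewhere here, so treat them as assumptions and confirm them against your installed CLI's help output. <model> is a placeholder for a name taken from the list output.

foundry model list            # browse catalog models and the variants offered for your hardware
foundry cache list            # see which variants are already downloaded locally
foundry model run <model>     # run a specific variant, such as a more heavily quantized build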