Run ONNX inference in WebAssembly (WASM) data flow graphs

This article shows how to embed and run small Open Neural Network Exchange (ONNX) models inside your WebAssembly modules to perform in-band inference as part of Azure IoT Operations data flow graphs. Use this approach for low-latency enrichment and classification directly on streaming data without calling external prediction services.

Prerequisites

Before you start, make sure you have the following items:

An Azure IoT Operations deployment with data flow graphs capability.
A container registry like Azure Container Registry and a container registry endpoint configured. To learn more, see Configure registry endpoints.
A development environment set up for WebAssembly module development. For options and detailed environment setup instructions, see Develop WebAssembly modules.

Important

Data flow graphs currently only support MQTT (Message Queuing Telemetry Transport), Kafka, and OpenTelemetry endpoints. Other endpoint types like Azure Data Lake, Microsoft Fabric OneLake, Azure Data Explorer, and local storage aren't supported. For more information, see Known issues.

Benefits of in-band ONNX inference

With Azure IoT Operations data flow graphs, you can embed small ONNX model inference directly in the pipeline instead of calling an external prediction service. This approach offers the following benefits:

Low latency: Perform real-time enrichment or classification in the same operator path where data arrives. Each message requires only local CPU inference, avoiding network round trips.
Inline with stream processing: Run inference alongside multi-source stream processing where features are already collocated in the graph, and align with event-time semantics so inference uses the same timestamps as other operators.
Simple updates: Ship a new module with WASM and embedded model, then update the graph definition reference. You don't need a separate model registry or external endpoint change.
Horizontal scaling: Inference scales as the data flow graph scales. When the runtime adds more workers for throughput, each worker loads the embedded model and participates in load balancing.

Inference runs on a CPU through the WebAssembly System Interface (WASI) wasi-nn interface. For supported model formats and hardware constraints, see Limitations of ONNX inference in WASM data flow graphs.

When should you use in-band ONNX inference?

Use in-band ONNX inference in data flow graphs when you need the following capabilities:

Low latency to enrich or classify messages inline at ingestion time.
Small, efficient models such as MobileNet-class vision models.
Inference that aligns with event-time processing and uses the same timestamps as other operators.
Simple model updates by shipping a new module version.

Don't use in-band ONNX inference when you need the following capabilities:

Large transformer models, GPU or TPU acceleration, or sophisticated A/B rollouts.
Models that require multiple tensor inputs, key-value caching, or unsupported ONNX operators.

Note

Keep modules and embedded models small. Large models and memory-heavy workloads aren't supported. Use compact architectures and small input sizes like 224×224 for image classification.
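
To get a feel for why input size matters, the following sketch estimates the buffer size of a single dense input tensor. The helper name tensor_bytes is illustrative and isn't part of any SDK, and the figures cover only the input buffer, not the model weights or intermediate activations.

```rust
/// Bytes needed for a dense tensor with the given dimensions and element size.
/// `tensor_bytes` is an illustrative helper, not part of any SDK.
const fn tensor_bytes(dims: &[usize], bytes_per_element: usize) -> usize {
    let mut total = bytes_per_element;
    let mut i = 0;
    while i < dims.len() {
        total *= dims[i];
        i += 1;
    }
    total
}

// A 1x3x224x224 float32 input: 150,528 elements x 4 bytes = 602,112 bytes.
const INPUT_224: usize = tensor_bytes(&[1, 3, 224, 224], 4);
// Doubling the side length to 448 quadruples the buffer to 2,408,448 bytes.
const INPUT_448: usize = tensor_bytes(&[1, 3, 448, 448], 4);
```

Because the buffer grows with the square of the image side length, reducing the input resolution is often the cheapest way to lower per-message memory use.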

Architecture pattern for ONNX inference in data flow graphs

The common pattern for ONNX inference in data flow graphs includes the following stages:

Preprocess data: Transform raw input data to match your model's expected format. For image models, this process typically involves:
- Decoding image bytes.
- Resizing to a target dimension (for example, 224×224).
- Converting the color space (for example, RGB to BGR).
- Normalizing pixel values to the expected range (0–1 or -1 to 1).
- Arranging data in the correct tensor layout: NCHW (batch, channels, height, width) or NHWC (batch, height, width, channels).
Run inference: Convert preprocessed data into tensors using the wasi-nn interface, load your embedded ONNX model with the CPU backend, set input tensors on the execution context, invoke the model's forward pass, and retrieve output tensors containing raw predictions.
Postprocess outputs: Transform raw model outputs into meaningful results. Common operations:
- Apply softmax to produce classification probabilities.
- Select top-K predictions.
- Apply a confidence threshold to filter low-confidence results.
- Map prediction indices to human-readable labels.
- Format results for downstream consumption.
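
As a minimal sketch of the postprocessing stage, the following Rust functions apply softmax to raw logits, then select the top-K predictions above a confidence threshold and map them to labels. The function names and the "unknown" fallback label are assumptions made for illustration; they aren't taken from the samples.

```rust
/// Numerically stable softmax: subtract the largest logit before exponentiating.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Returns up to `k` (label, probability) pairs whose probability is at
/// least `threshold`, ordered from most to least likely.
fn top_k<'a>(
    probs: &[f32],
    labels: &[&'a str],
    k: usize,
    threshold: f32,
) -> Vec<(&'a str, f32)> {
    let mut indexed: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    // Sort by probability, highest first.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed
        .into_iter()
        .take(k)
        .filter(|&(_, p)| p >= threshold)
        // Fall back to a placeholder if the label file is shorter than the output.
        .map(|(i, p)| (labels.get(i).copied().unwrap_or("unknown"), p))
        .collect()
}
```

For example, calling softmax on the logits [1.0, 3.0, 2.0] and then top_k with k set to 2 and labels ["cat", "dog", "bird"] yields "dog" followed by "bird".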

In the IoT samples for Rust WASM operators, you can find two samples that follow this pattern:

Data transformation "format" sample: decodes and resizes images to RGB24 224×224.
Image/Video processing "snapshot" sample: embeds a MobileNet v2 ONNX model, runs CPU inference, and computes softmax.
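
Interleaved RGB24 bytes, like the output described for the "format" sample, usually still need to be normalized and rearranged before they match a model's tensor layout. The following sketch converts RGB24 data to a planar NCHW float32 buffer with values scaled to the 0–1 range. The function name is hypothetical, and the scaling is an assumption: check your model's documentation, because some models expect a different range or per-channel mean and standard deviation normalization.

```rust
/// Converts interleaved RGB24 bytes (HWC order) into a planar NCHW float32
/// buffer with batch size 1, scaling each value from 0..=255 to 0.0..=1.0.
fn rgb24_to_nchw(rgb: &[u8], width: usize, height: usize) -> Result<Vec<f32>, String> {
    let pixels = width * height;
    if rgb.len() != pixels * 3 {
        return Err(format!("expected {} bytes, got {}", pixels * 3, rgb.len()));
    }
    let mut out = vec![0.0f32; pixels * 3];
    for (i, px) in rgb.chunks_exact(3).enumerate() {
        for (c, &value) in px.iter().enumerate() {
            // Channel `c` occupies its own contiguous plane of `pixels` values.
            out[c * pixels + i] = f32::from(value) / 255.0;
        }
    }
    Ok(out)
}
```

For a 224×224 image, the returned vector holds 150,528 values: the full red plane, then the green plane, then the blue plane.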

Configure graph definition

To enable ONNX inference in your data flow graph, configure both the graph structure and module parameters. The graph definition specifies the pipeline flow, while module configurations allow runtime customization of preprocessing and inference behavior.

Enable the wasi-nn feature

To enable the WebAssembly Neural Network (wasi-nn) interface for ONNX inference, add the wasi-nn feature to your graph definition:

moduleRequirements:
  apiVersion: "1.1.0"
  runtimeVersion: "1.1.0"
  features:
    - name: "wasi-nn"

Define operations for the inference pipeline

Configure the operations that form your ONNX inference pipeline. This example shows a typical image classification workflow:

operations:
  - operationType: "source"
    name: "camera-input"
  - operationType: "map"
    name: "module-format/map"
    module: "format:1.0.0"
  - operationType: "map"
    name: "module-snapshot/map"
    module: "snapshot:1.0.0"
  - operationType: "sink"
    name: "results-output"

connections:
  - from: { name: "camera-input" }
    to: { name: "module-format/map" }
  - from: { name: "module-format/map" }
    to: { name: "module-snapshot/map" }
  - from: { name: "module-snapshot/map" }
    to: { name: "results-output" }

This configuration creates a pipeline where:

camera-input receives raw image data from a source
module-format/map preprocesses images (decode, resize, format conversion)
module-snapshot/map runs ONNX inference and postprocessing
results-output emits classification results to a sink

Configure module parameters

Define runtime parameters to customize WASM module behavior without rebuilding. The runtime passes these parameters to your WASM modules at initialization.

For details, see Module configuration parameters.
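
The following sketch shows one way a module might turn key-value parameters into typed settings with defaults. The parameter names top_k and confidence_threshold, the InferenceConfig type, and the slice-of-pairs input shape are all assumptions for illustration; the actual way parameters reach your module depends on the SDK.

```rust
/// Hypothetical settings for an inference operator; the names are illustrative.
#[derive(Debug, PartialEq)]
struct InferenceConfig {
    top_k: usize,
    confidence_threshold: f32,
}

impl Default for InferenceConfig {
    fn default() -> Self {
        Self { top_k: 5, confidence_threshold: 0.0 }
    }
}

/// Builds a config from key-value pairs, keeping the default for any
/// missing key and rejecting values that fail to parse.
fn parse_config(params: &[(&str, &str)]) -> Result<InferenceConfig, String> {
    let mut config = InferenceConfig::default();
    for (key, value) in params {
        match *key {
            "top_k" => {
                config.top_k = value
                    .parse::<usize>()
                    .map_err(|e| format!("top_k: {e}"))?;
            }
            "confidence_threshold" => {
                config.confidence_threshold = value
                    .parse::<f32>()
                    .map_err(|e| format!("confidence_threshold: {e}"))?;
            }
            // Ignore parameters that this module doesn't use.
            _ => {}
        }
    }
    Ok(config)
}
```

Parsing once at initialization and failing early on malformed values keeps the per-message path free of string handling.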

Package the model

Embedding ONNX models directly into your WASM component ensures atomic deployment and version consistency. This approach simplifies distribution and removes runtime dependencies on external model files or registries.

Tip

Embedding keeps the model and operator logic versioned together. To update a model, publish a new module version and update your graph definition to reference it. This approach eliminates model drift and ensures reproducible deployments.

ONNX model preparation requirements

Before embedding your model, make sure it meets the requirements for WASM deployment:

Keep models under 50 MB for practical WASM loading times and memory constraints.
Check that your model accepts a single tensor input in a common format (float32 or uint8).
Check that the WASM ONNX runtime backend supports every operator your model uses.
Use ONNX optimization tools to reduce model size and improve inference speed.

Steps to embed an ONNX model in a WASM module

Follow these steps to embed your ONNX model and associated resources in a WASM module:

Organize model assets: Place the .onnx model file and optional labels.txt in your source tree. Use a dedicated directory structure such as src/fixture/models/ and src/fixture/labels/ for clear organization.
Embed at compile time: Use language-specific tools to include model bytes in your binary. In Rust, use include_bytes! for binary data and include_str! for text files.
Initialize the wasi-nn graph: In your operator's init function, create a wasi-nn graph from the embedded bytes, specifying the ONNX encoding and CPU execution target.
Implement inference loop: For each incoming message, preprocess inputs to match model requirements, set input tensors, execute inference, retrieve outputs, and apply postprocessing.
Handle errors gracefully: Implement proper error handling for model loading failures, unsupported operators, and runtime inference errors.

For a complete implementation pattern, see the "snapshot" sample.
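
To illustrate the error-handling step, the following sketch separates load-time failures, which a retry on the next message can't fix, from per-message runtime failures, which the operator can log and skip. The InferenceError type and the classify_or_skip helper are hypothetical and don't mirror the wasi-nn error types; the infer closure stands in for the real inference call.

```rust
use std::fmt;

/// Illustrative error type covering the failure classes listed above.
#[derive(Debug)]
enum InferenceError {
    ModelLoad(String),
    UnsupportedOperator(String),
    Runtime(String),
}

impl fmt::Display for InferenceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::ModelLoad(msg) => write!(f, "model load failed: {msg}"),
            Self::UnsupportedOperator(op) => write!(f, "unsupported operator: {op}"),
            Self::Runtime(msg) => write!(f, "inference failed: {msg}"),
        }
    }
}

impl std::error::Error for InferenceError {}

impl InferenceError {
    /// Load-time failures affect every message, so treat them as fatal.
    fn is_fatal(&self) -> bool {
        matches!(self, Self::ModelLoad(_) | Self::UnsupportedOperator(_))
    }
}

/// Runs `infer` on one message. Fatal errors propagate to the caller;
/// per-message errors are logged and reported as `None` so the caller
/// can drop the message or forward it without enrichment.
fn classify_or_skip<F>(input: &[u8], infer: F) -> Result<Option<String>, InferenceError>
where
    F: Fn(&[u8]) -> Result<String, InferenceError>,
{
    match infer(input) {
        Ok(label) => Ok(Some(label)),
        Err(e) if e.is_fatal() => Err(e),
        Err(e) => {
            eprintln!("skipping message: {e}");
            Ok(None)
        }
    }
}
```

Whether a failed message should be dropped or passed through unchanged is a design choice that depends on what downstream consumers expect.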

Recommended WASM project structure

Organize your WASM module project with clear separation of concerns:

src/
├── lib.rs                 # Main module implementation
├── model/
│   ├── mod.rs            # Model management module
│   └── inference.rs      # Inference logic
└── fixture/
    ├── models/
    │   ├── mobilenet.onnx      # Primary model
    │   └── mobilenet_opt.onnx  # Optimized variant
    └── labels/
        ├── imagenet.txt        # ImageNet class labels
        └── custom.txt          # Custom label mappings

Example file layout from the snapshot sample

Use the following file layout from the "snapshot" sample as a reference:

Labels directory - Contains various label mapping files
Models directory - Contains ONNX model files and metadata

Minimal Rust example for embedding an ONNX model

The following Rust example shows the minimum code needed to embed an ONNX model in a WASM module. The paths are relative to the source file that contains the macro:

// src/lib.rs (example)
// Embed ONNX model and label map into the component
static MODEL: &[u8] = include_bytes!("fixture/models/mobilenet.onnx");
static LABEL_MAP: &[u8] = include_bytes!("fixture/labels/synset.txt");

fn init_model() -> Result<(), anyhow::Error> {
  // Create wasi-nn graph from embedded ONNX bytes using the CPU backend
  // Pseudocode – refer to the snapshot sample for the full implementation
  // use wasi_nn::{graph::{load, GraphEncoding, ExecutionTarget}, Graph};
  // let graph = load(&[MODEL.to_vec()], GraphEncoding::Onnx, ExecutionTarget::Cpu)?;
  // let exec_ctx = Graph::init_execution_context(&graph)?;
  Ok(())
}

Reuse the ONNX graph and execution context across messages

To avoid recreating the ONNX graph and execution context for every message, initialize them once and reuse them. The public snapshot sample uses a static LazyLock to initialize the graph and execution context once per worker:

use std::sync::LazyLock;

use crate::wasi::nn::{
    graph::{load, ExecutionTarget, Graph, GraphEncoding, GraphExecutionContext},
    tensor::{Tensor, TensorData, TensorDimensions, TensorType},
};

static mut CONTEXT: LazyLock<GraphExecutionContext> = LazyLock::new(|| {
    let graph = load(&[MODEL.to_vec()], GraphEncoding::Onnx, ExecutionTarget::Cpu).unwrap();
    Graph::init_execution_context(&graph).unwrap()
});

fn run_inference(/* input tensors, etc. */) {
    unsafe {
        // (*CONTEXT).compute()?;
    }
}
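
The initialize-once pattern can also be sketched without static mut or unsafe by wrapping the shared state in a Mutex. In this self-contained example, FakeContext is a stand-in for the real execution context, and INIT_COUNT exists only to show that setup runs a single time. Whether the generated wasi-nn types satisfy the Send bound that a shared static requires depends on your bindings, so treat this as a pattern sketch rather than a drop-in replacement.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{LazyLock, Mutex};

/// Counts how many times the expensive setup ran (for demonstration only).
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for an execution context; a real module would hold the
/// wasi-nn context here instead.
struct FakeContext {
    calls: usize,
}

/// Initialized on first access and shared afterward. The Mutex gives
/// exclusive access without `static mut` or `unsafe`.
static CONTEXT: LazyLock<Mutex<FakeContext>> = LazyLock::new(|| {
    INIT_COUNT.fetch_add(1, Ordering::SeqCst);
    Mutex::new(FakeContext { calls: 0 })
});

/// Simulates one inference call and returns the running call count.
fn run_inference_once() -> usize {
    let mut ctx = CONTEXT.lock().unwrap();
    ctx.calls += 1;
    ctx.calls
}
```

However many times run_inference_once is called, the initializer closure runs only once, so INIT_COUNT stays at 1.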

Debug and test your ONNX module locally

Before deploying to Azure IoT Operations, test your ONNX inference module locally to validate functionality and performance. To learn more, see:

Configure and deploy your ONNX module

When you're ready to use your ONNX inference module in an Azure IoT Operations data flow or connector, follow these steps:

Example: MobileNet image classification

The IoT public samples provide two samples wired into a graph for image classification:

The "format" sample provides image decode and resize functionality.
The "snapshot" sample provides ONNX inference and softmax processing.

To learn more about and run the sample that uses these modules, see Example 2: Deploy a complex graph.

Limitations of ONNX inference in WASM data flow graphs

Inference in WASM data flow graphs has the following limitations:

Model format: ONNX only. Data flow graphs don't support other formats like TFLite.
Hardware: CPU only. GPU and TPU acceleration aren't supported.
Model size: Large models and memory-intensive inference aren't supported. Use small models such as MobileNet-class architectures.
Model inputs: Only single-tensor input models are supported. Multi-input models, key-value caching, and advanced sequence or generative scenarios aren't supported.
Operator support: The ONNX backend in the WASM runtime must support every operator your model uses. If an operator isn't supported, inference fails at load or execution time.

Last updated on 2026-06-02