Near real-time scoring

Near real-time scoring applies when scoring is performed on a per-record basis, but a synchronous request cannot be satisfied fast enough because of the size and/or complexity of the model. The workflow itself may also be complex (elaborate data preprocessing algorithms, several models in the flow, etc.). The following discusses options for deploying near real-time scoring services and the trade-offs of each.

Azure Durable Functions

Azure Durable Functions are effective when the inferencing script can be implemented as a single-threaded process and a GPU is not required. This typically means the model is small and the data preprocessing flow is simple and does not require many resources. It is worth noting that the Azure Functions framework is open source, so it can be deployed anywhere.

Advantages:

  • Supports staged deployments and a Platform as a Service approach when deployed to Azure.
  • Can be built in many programming languages, including Python and C#.
  • Supports deployment in containers if needed and can run locally.

Disadvantages:

  • Doesn’t support GPUs.
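A minimal sketch of the single-threaded, per-record scoring logic such a function would host. The model weights, feature names, and `handle_request` entry point are hypothetical stand-ins; in a real deployment this logic would sit inside an HTTP- or queue-triggered Azure Function.

```python
import json

# Hypothetical "model": a simple linear scorer, loaded once at cold start.
WEIGHTS = {"age": 0.02, "income": 0.00001, "tenure": 0.1}
BIAS = -1.0


def preprocess(record: dict) -> dict:
    # Lightweight preprocessing: keep only known features, default missing ones to 0.
    return {name: float(record.get(name, 0.0)) for name in WEIGHTS}


def score(record: dict) -> float:
    features = preprocess(record)
    return BIAS + sum(WEIGHTS[k] * v for k, v in features.items())


def handle_request(body: str) -> str:
    # Entry point an Azure Function trigger would call with the request payload.
    record = json.loads(body)
    return json.dumps({"score": score(record)})
```

Because the whole flow is a single thread with no GPU dependency, it maps directly onto the Azure Functions execution model.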

For more information:

What are Durable Functions?

Azure Functions with KEDA

This is an extension of the previous option. Here we combine AKS, Azure Functions, a Service Bus queue, and (optionally) Triton Inference Server to deploy a scalable, near real-time service to the cloud. Azure Functions (not necessarily durable) run on AKS to perform inferencing, and KEDA scales the scoring nodes based on the number of incoming requests. Triton (https://developer.nvidia.com/nvidia-triton-inference-server) or DeepSpeed (https://www.deepspeed.ai/) can be used to accelerate computation on GPUs. A Service Bus queue receives requests and delivers responses.
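The request/response flow above can be sketched as follows. This is a local simulation under stated assumptions: two in-process `queue.Queue` objects stand in for the Service Bus request and response queues, the worker thread stands in for one scoring replica (the kind KEDA would add or remove based on queue depth), and `run_model` is a hypothetical placeholder for a Triton- or DeepSpeed-backed model call.

```python
import json
import queue
import threading

requests = queue.Queue()   # stands in for the Service Bus request queue
responses = queue.Queue()  # stands in for the Service Bus response queue


def run_model(payload: dict) -> dict:
    # Placeholder for a GPU-accelerated model call (e.g., via Triton).
    return {"id": payload["id"], "score": len(payload.get("text", ""))}


def worker() -> None:
    # One scoring replica: consume requests, publish responses.
    # KEDA would scale the number of such replicas by queue length.
    while True:
        message = requests.get()
        if message is None:  # shutdown signal for this local simulation
            break
        responses.put(json.dumps(run_model(json.loads(message))))


t = threading.Thread(target=worker)
t.start()
requests.put(json.dumps({"id": "r1", "text": "hello"}))
requests.put(None)
t.join()
```

Decoupling the caller from the scorer through queues is what makes this pattern near real-time rather than strictly synchronous: requests queue up under load, and autoscaling works the backlog down.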

Advantages:

  • Can be used as a custom reference architecture for the most complex scenarios.
  • Large models are supported.

Disadvantages:

  • Complex infrastructure that requires manual deployment and management.
  • Deep knowledge of technologies such as Kubernetes is required.