Azure Machine Learning Managed Online Endpoints and HTTP 429 Too Many Requests

Neil McAlister 311 Reputation points
2024-02-29T12:38:58.9366667+00:00

Hi - please can you help me understand the limits imposed by managed online endpoints?

I'm simulating concurrent calls to the endpoint to understand what happens when several callers issue requests at the same time. At the moment I can't get the endpoint to respond to more than a set number of requests at once. That number appears to be determined by the number of instances you are running per deployment, and it seems to be fixed at 2 per instance - which is low. When testing I have...

  • created a Managed Online Endpoint
  • created a single Deployment of a model using 1 instance
  • used k6 from https://k6.io/ to test performance

When running the following command using k6 to the endpoint/deployment...

k6 run --vus 2 --duration 1s .\sample_test.js

...I receive 200 responses. If I increase this to...

k6 run --vus 3 --duration 1s .\sample_test.js

...I receive a mix of 429 and 200 responses.

If I increase the number of running instances of the Deployment to 2, I can increase my --vus switch to 4 and receive 200s - but stepping up to 5, I receive the mix of 429 and 200 responses again.

Similarly, if I increase my Deployment instance count to 3, then --vus can increase to 6 and no more before the 429s reappear.

The magic number then appears to be just 2 per instance. It doesn't matter what size SKU you use for your Deployment compute - I've had the SKU set anywhere between 2 and 16 vCPUs with various RAM configurations, with no change in the numbers.

I originally tested this in the UKSouth region, and to eliminate the region as an issue I also spun up the same in the WestEurope region, with the same results.

I also tested simultaneous single calls from 3 different locations (3 different VMs rather than my local client machine), so the requests arrived from 3 different inbound IPs at the same time - with the same results.

My question therefore is: why is the number so low (2), and if this is a quota thing, what do I do to increase it? And is this documented anywhere? Thanks.


To recreate what I have done...

  • Deploy a managed online endpoint
  • Deploy a deployment of your model to that endpoint, running a single instance (instance count of 1)
  • Have some sample request data in a file (sample_data.json in my example, on a Windows machine - see the hypothetical example after this list)
  • Install k6 from https://k6.io/docs/get-started/installation/
  • Use the sample_test.js file below, replacing the values under const for...
    • apiUrl
    • bearerToken
    • modeldeploymentname
    • jsonDataPath
  • Run the k6 command
    • k6 run --vus 2 --duration 1s .\sample_test.js
      • This will report 200s
    • k6 run --vus 3 --duration 1s .\sample_test.js
      • This will report 429s and 200s
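
For reference, the shape of sample_data.json depends entirely on what your scoring script expects, so treat the following as a made-up illustration only - a score.py that reads a "data" array of feature rows might accept something like this (hypothetical values):

{
  "data": [
    [0.1, 2.3, 4.5, 6.7],
    [1.2, 3.4, 5.6, 7.8]
  ]
}

The exact payload doesn't matter for reproducing the 429s - any request body your deployment will accept is enough to drive the concurrency test.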

Note: I chose the k6 tool to recreate an issue one of our developers found - 429 errors when calling the model deployment from one of our APIs. k6 itself is not the issue, merely a simple way to reproduce the same problem experienced by other callers.

Thanks in advance - Neil

The sample_test.js file you can use

// Sample Test Call using K6 - Neil McAlister Feb 2024
// For use with the k6 executable e.g.
// k6 run --vus 2 --duration 1s c:\location\pathtothis\sample_test.js

import http from 'k6/http';
import { check } from 'k6';

// Replace these values with your actual endpoint and token
const apiUrl = 'https://yourendpointname.yourlocation.inference.ml.azure.com/score';
const bearerToken = 'YourEndpointPrimaryKeyFromTheConsumeTab';
const modeldeploymentname = 'YourDeploymentName';

// Path to your JSON file containing input data
const jsonDataPath = 'C:\\WindowsPath\\To\\YourJSONfile\\sample_data.json';

// Read input data from the JSON file
const jsonData = open(jsonDataPath);

export default function () {
  // Define the HTTP headers, including the Authorization header with the bearer token
  const headers = {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${bearerToken}`,
    'azureml-model-deployment': modeldeploymentname,
  };

  // Make the POST request
  const response = http.post(apiUrl, jsonData, { headers: headers });

  // Check if the request was successful (status code 2xx)
  check(response, {
    'Request was successful': (r) => r.status >= 200 && r.status < 300,
  });

  // Log the response details (you can remove this in a real test)
  console.log(`Response status: ${response.status}`);
  //console.log(`Response body: ${response.body}`);
}
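
If you'd rather locate the ceiling than probe it with fixed --vus values, an (untested) variation of the script above can use k6's built-in ramping stages, so the end-of-run summary shows roughly where the 429s begin - same placeholder const values as above, only the options block and the checks differ:

// Hypothetical ramping variant of sample_test.js - ramps VUs up so you can see
// at what concurrency the 429s start, instead of re-running with different --vus
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  stages: [
    { duration: '30s', target: 10 },  // ramp from 0 to 10 virtual users
    { duration: '30s', target: 10 },  // hold at 10 virtual users
  ],
};

const apiUrl = 'https://yourendpointname.yourlocation.inference.ml.azure.com/score';
const bearerToken = 'YourEndpointPrimaryKeyFromTheConsumeTab';
const modeldeploymentname = 'YourDeploymentName';
const jsonData = open('C:\\WindowsPath\\To\\YourJSONfile\\sample_data.json');

export default function () {
  const headers = {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${bearerToken}`,
    'azureml-model-deployment': modeldeploymentname,
  };

  const response = http.post(apiUrl, jsonData, { headers: headers });

  // Separate checks so the summary shows how many requests were throttled
  check(response, {
    'status is 200': (r) => r.status === 200,
    'not throttled (429)': (r) => r.status !== 429,
  });
}

Run it without the --vus/--duration switches, e.g. k6 run .\sample_test_ramp.js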

1 answer

  1. Neil McAlister 311 Reputation points
    2024-03-07T10:28:47.0633333+00:00

    We found the answer internally - we were looking in the wrong place: it's the deployment itself that throttles concurrent connections, NOT the endpoint.

    This article on troubleshooting helped find the way https://learn.microsoft.com/en-us/azure/machine-learning/how-to-troubleshoot-online-endpoints?view=azureml-api-2&tabs=cli#common-error-codes-for-managed-online-endpoints

    To quote the Status Code 429 section - "Your model is currently getting more requests than it can handle. Azure Machine Learning has implemented a system that permits a maximum of 2 * max_concurrent_requests_per_instance * instance_count requests to be processed in parallel at any given moment to guarantee smooth operation."
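
    That formula lines up exactly with the numbers I was seeing. Assuming the deployment was running with the default max_concurrent_requests_per_instance of 1 (we had never set it), the parallel-request ceiling works out as:

    2 * 1 * 1 instance  = 2 requests in parallel  (429s appear at --vus 3)
    2 * 1 * 2 instances = 4 requests in parallel  (429s appear at --vus 5)
    2 * 1 * 3 instances = 6 requests in parallel  (429s appear beyond --vus 6)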

    It points to another article on the deployment schema YAML https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-deployment-managed-online?view=azureml-api-2#requestsettings

    By increasing the max_concurrent_requests_per_instance setting we were able to push more concurrent connections through the endpoint. A sample snippet from the deployment YAML file would look something like this...

    ...
    environment: 
      conda_file: conda.yaml
      image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest
    request_settings:
      max_concurrent_requests_per_instance: 4
    ...
    

    Things to note: our PKL-format model is small - a few KB - so your results may vary depending on the size of the model and the size of the compute behind your deployment instances.

    I don't believe that the max_concurrent_requests_per_instance setting is something available to be set in the AML Studio GUI itself - only by using the CLI to deploy. It would probably be a good thing to be able to set these in the GUI.
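
    For completeness, the sort of CLI (v2, ml extension) command we used to apply the updated YAML to the existing deployment looks roughly like the following - the names here are placeholders for your real deployment, endpoint, resource group and workspace:

    az ml online-deployment update --file deployment.yaml --name YourDeploymentName --endpoint-name yourendpointname --resource-group your-resource-group --workspace-name your-workspace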

