Azure Machine Learning Managed Online Endpoints and HTTP 429 Too Many Requests

Neil McAlister 311 Reputation points
2024-02-29T12:38:58.9366667+00:00

Hi - please can you help me understand the limits imposed by managed online endpoints?

I'm simulating concurrent calls to the endpoint to understand what happens when several callers issue requests at the same time. At the moment I can't get the endpoint to respond to more than a set number of requests at once. That number appears to be determined by the number of instances you are running per deployment, and it seems to be fixed at 2 per instance - which is low. When testing I have...

  • created a Managed Online Endpoint
  • created a single Deployment of a model using 1 instance
  • used k6 from https://k6.io/ to test performance

When running the following command using k6 to the endpoint/deployment...

k6 run --vus 2 --duration 1s .\sample_test.js

...I receive 200 responses. If I increase this to...

k6 run --vus 3 --duration 1s .\sample_test.js

...I receive a mix of 429 and 200 responses.

If I increase the number of running instances of the Deployment to 2, I can increase my --vus switch to 4 and receive 200s - but stepping up to 5, I receive the mix of 429 and 200 responses again.

Similarly, if I increase my Deployment instance count to 3, then --vus can increase to 6 and no more before the 429s reappear.

The magic number then appears to be just 2 per instance. It doesn't matter what size SKU you use for your Deployment compute - I've had the SKU set anywhere between 2 and 16 vCPUs with various RAM configurations, with no change in the numbers.

I originally tested this in the UKSouth region, and to eliminate the region as an issue I also spun up the same in the WestEurope region, with the same results.

I also tested simultaneous single calls from 3 different locations (3 different VMs rather than my local client machine), so the requests arrived from 3 different inbound IPs at the same time - with the same results.

My question therefore is: why is the number so low (2), and if this is a quota thing, what do I do to increase it? And is this documented anywhere? Thanks.


To recreate what I have done...

  • Deploy a managed online endpoint
  • Deploy a deployment of your model to that endpoint, running a single instance (instance count of 1)
  • Have some sample request data in a file (sample_data.json in my example, on a Windows machine - see the hypothetical example after this list)
  • Install k6 from https://k6.io/docs/get-started/installation/
  • Use the sample_test.js file below, replacing the values under const for...
    • apiUrl
    • bearerToken
    • modeldeploymentname
    • jsonDataPath
  • Run the k6 command
    • k6 run --vus 2 --duration 1s .\sample_test.js
      • This will report 200s
    • k6 run --vus 3 --duration 1s .\sample_test.js
      • This will report 429s and 200s
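
For reference, the shape of sample_data.json depends entirely on what your scoring script expects, so treat the following as a made-up illustration only - a score.py that reads a "data" array of feature rows might accept something like this (hypothetical values):

{
  "data": [
    [0.1, 2.3, 4.5, 6.7],
    [1.2, 3.4, 5.6, 7.8]
  ]
}

The exact payload doesn't matter for reproducing the 429s - any request body your deployment will accept is enough to drive the concurrency test.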

Note: I chose the k6 tool to recreate an issue one of our developers found - 429 errors when calling the model deployment from one of our APIs. k6 itself is not the issue, merely a simple way to reproduce the same problem experienced by other callers.

Thanks in advance - Neil

The sample_test.js file you can use

// Sample Test Call using K6 - Neil McAlister Feb 2024
// For use with the k6 executable e.g.
// k6 run --vus 2 --duration 1s c:\location\pathtothis\sample_test.js

import http from 'k6/http';
import { check } from 'k6';

// Replace these values with your actual endpoint and token
const apiUrl = 'https://yourendpointname.yourlocation.inference.ml.azure.com/score';
const bearerToken = 'YourEndpointPrimaryKeyFromTheConsumeTab';
const modeldeploymentname = 'YourDeploymentName';

// Path to your JSON file containing input data
const jsonDataPath = 'C:\\WindowsPath\\To\\YourJSONfile\\sample_data.json';

// Read input data from the JSON file
const jsonData = open(jsonDataPath);

export default function () {
  // Define the HTTP headers, including the Authorization header with the bearer token
  const headers = {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${bearerToken}`,
    'azureml-model-deployment': modeldeploymentname,
  };

  // Make the POST request
  const response = http.post(apiUrl, jsonData, { headers: headers });

  // Check if the request was successful (status code 2xx)
  check(response, {
    'Request was successful': (r) => r.status >= 200 && r.status < 300,
  });

  // Log the response details (you can remove this in a real test)
  console.log(`Response status: ${response.status}`);
  //console.log(`Response body: ${response.body}`);
}
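
If you'd rather locate the ceiling than probe it with fixed --vus values, an (untested) variation of the script above can use k6's built-in ramping stages, so the end-of-run summary shows roughly where the 429s begin - same placeholder const values as above, only the options block and the checks differ:

// Hypothetical ramping variant of sample_test.js - ramps VUs up so you can see
// at what concurrency the 429s start, instead of re-running with different --vus
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  stages: [
    { duration: '30s', target: 10 },  // ramp from 0 to 10 virtual users
    { duration: '30s', target: 10 },  // hold at 10 virtual users
  ],
};

const apiUrl = 'https://yourendpointname.yourlocation.inference.ml.azure.com/score';
const bearerToken = 'YourEndpointPrimaryKeyFromTheConsumeTab';
const modeldeploymentname = 'YourDeploymentName';
const jsonData = open('C:\\WindowsPath\\To\\YourJSONfile\\sample_data.json');

export default function () {
  const headers = {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${bearerToken}`,
    'azureml-model-deployment': modeldeploymentname,
  };

  const response = http.post(apiUrl, jsonData, { headers: headers });

  // Separate checks so the summary shows how many requests were throttled
  check(response, {
    'status is 200': (r) => r.status === 200,
    'not throttled (429)': (r) => r.status !== 429,
  });
}

Run it without the --vus/--duration switches, e.g. k6 run .\sample_test_ramp.js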

1 answer

  1. Neil McAlister 311 Reputation points
    2024-03-07T10:28:47.0633333+00:00

    We found the answer internally - we were looking in the wrong place: it's the deployment itself that throttles concurrent connections, NOT the endpoint.

    This article on troubleshooting helped find the way https://learn.microsoft.com/en-us/azure/machine-learning/how-to-troubleshoot-online-endpoints?view=azureml-api-2&tabs=cli#common-error-codes-for-managed-online-endpoints

    To quote the Status Code 429 section - "Your model is currently getting more requests than it can handle. Azure Machine Learning has implemented a system that permits a maximum of 2 * max_concurrent_requests_per_instance * instance_count requests to be processed in parallel at any given moment to guarantee smooth operation."
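
    That formula lines up exactly with the numbers I was seeing. Assuming the deployment was running with the default max_concurrent_requests_per_instance of 1 (we had never set it), the parallel-request ceiling works out as:

    2 * 1 * 1 instance  = 2 requests in parallel  (429s appear at --vus 3)
    2 * 1 * 2 instances = 4 requests in parallel  (429s appear at --vus 5)
    2 * 1 * 3 instances = 6 requests in parallel  (429s appear beyond --vus 6)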

    It points to another article on the deployment schema YAML https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-deployment-managed-online?view=azureml-api-2#requestsettings

    By increasing the max_concurrent_requests_per_instance setting we were able to push more concurrent connections through the endpoint. A sample snippet from the deployment YAML file would look something like this...

    ...
    environment: 
      conda_file: conda.yaml
      image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest
    request_settings:
      max_concurrent_requests_per_instance: 4
    ...
    

    Things to note: our PKL-format model is small - a few KB - so your results may vary depending on the size of the model and the size of the compute behind your deployment instances.

    I don't believe that the max_concurrent_requests_per_instance setting is something available to be set in the AML Studio GUI itself - only by using the CLI to deploy. It would probably be a good thing to be able to set these in the GUI.
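
    For completeness, the sort of CLI (v2, ml extension) command we used to apply the updated YAML to the existing deployment looks roughly like the following - the names here are placeholders for your real deployment, endpoint, resource group and workspace:

    az ml online-deployment update --file deployment.yaml --name YourDeploymentName --endpoint-name yourendpointname --resource-group your-resource-group --workspace-name your-workspace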

