Azure batch endpoint responding with 504 when getting job status

Aidan Gallagher 5 Reputation points
2024-10-28T14:46:10.5333333+00:00

I have tasks that poll the status of my running batch jobs in AML. Recently this has become increasingly unreliable, and over the last few days almost every request has failed with the response below. It does not fail every time, but it does so 95%+ of the time, and it seems to always happen when more jobs are running (and therefore more polling requests are coming from my end). The failing lines of code are as follows (ml_client is just an instance of the Azure Python SDK MLClient):

...
job = next(
    (
        j
        for j in ml_client.batch_endpoints.list_jobs(endpoint_name)
        if j.properties.name == job_name
    ),
    None,
)
{
  "Content": {
    "error": {
      "code": "TransientError",
      "severity": null,
      "message": "Service invocation timed out. \r\nRequest: GET batch-endpoint.vienna-uksouth.svc/batch-endpoint/v1.0/subscriptions/[...]/resourceGroups/[...]/providers/Microsoft.MachineLearningServices/workspaces/[...]/batchEndpoints/[...]/jobs/ \r\n Message: Operation canceled Time waited: 00:00:10.0016964",
      "messageFormat": null,
      "messageParameters": null,
      "referenceCode": null,
      "detailsUri": null,
      "target": "GET http://batch-endpoint.vienna-uksouth.svc/batch-endpoint/v1.0/subscriptions/[...]/resourceGroups/[...]/providers/Microsoft.MachineLearningServices/workspaces/[...]/batchEndpoints/[...]/jobs/?continuationToken={\"SourceIndex\":0,\"ContinuationToken\":\"[...]"}",
      "details": [],
      "innerError": null,
      "debugInfo": null,
      "additionalInfo": null
    },
    "correlation": {
      "operation": "[...]",
      "request": "[...]"
    },
    "environment": "uksouth",
    "location": "uksouth",
    "time": "2024-10-25T08:23:02.6477619+00:00",
    "componentName": "managementfrontend",
    "statusCode": 504
  }
}

It looks like the issue is caused by a timeout somewhere, but I can't see where it is configured (it looks like something I probably don't have control over), so I'm not sure where to go from here.

Azure Machine Learning

1 answer

  1. Sina Salam 26,661 Reputation points Volunteer Moderator
    2024-10-28T20:17:01.64+00:00

    Hello Aidan Gallagher,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you are receiving 504 errors when getting job status from your Azure Machine Learning batch endpoint.

    To handle the 504 Gateway Timeout error when polling the batch endpoint, you can add retry logic with the tenacity library so that transient errors are retried, for example:

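    from tenacity import retry, stop_after_attempt, wait_fixed


    # Retry on any exception: wait 2 seconds between attempts, stop after 5 attempts.
    @retry(wait=wait_fixed(2), stop=stop_after_attempt(5))
    def get_job():
        # ml_client, endpoint_name, and job_name are defined as in the question.
        return next(
            (
                j
                for j in ml_client.batch_endpoints.list_jobs(endpoint_name)
                if j.properties.name == job_name
            ),
            None,
        )


    job = get_job()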
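
A fixed wait means that many pollers which fail together also retry together, which may keep the load on the service high. As a rough sketch of an alternative, the helper below retries with exponential backoff and random jitter, and only when the exception looks transient. It uses only the standard library. The names `call_with_backoff` and `is_transient` are hypothetical, and the check assumes the raised exception exposes a `status_code` attribute; verify that against the exceptions your SDK version actually raises.

```python
import random
import time

# HTTP statuses treated as worth retrying (an assumption; adjust as needed).
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def is_transient(exc):
    """Return True if the exception carries a retryable HTTP status code."""
    return getattr(exc, "status_code", None) in RETRYABLE_STATUS


def call_with_backoff(fn, max_attempts=5, base_delay=2.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on non-transient errors or when attempts are exhausted.
            if not is_transient(exc) or attempt == max_attempts:
                raise
            # Upper bound doubles each attempt (2s, 4s, 8s, ...), capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time up to the ceiling so that
            # concurrent pollers do not all retry at the same moment.
            sleep(random.uniform(0, ceiling))
```

To use it, pass in a plain function that performs the lookup (without the tenacity decorator), for example `call_with_backoff(get_job)`.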
    

    In addition, increase the timeout settings for your requests where you can, and poll less frequently to reduce the load on the service.

    Additionally, check the Azure Service Health dashboard for any ongoing issues in the uksouth region, and consider reaching out to Azure Support for further assistance.

    Lastly, optimizing your code to minimize unnecessary API calls can also help improve reliability and reduce timeouts.
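
One way to cut API calls, sketched under assumptions: the question's code lists every job on the endpoint once per job being polled, and the failing request in the error carries a continuation token, which suggests the listing is paged. The hypothetical helper `find_jobs` below resolves several job names from a single pass over the listing and stops iterating once all of them are found. It takes the listing function as a parameter (for example `ml_client.batch_endpoints.list_jobs`) and reads `job.properties.name` as the question's code does. Whether stopping early actually avoids fetching later pages depends on the listing being lazy, which is an assumption here.

```python
def find_jobs(list_jobs, endpoint_name, job_names):
    """Look up several jobs by name using one listing of the endpoint's jobs.

    list_jobs is any callable that takes an endpoint name and returns an
    iterable of job objects exposing `properties.name`.
    Returns a dict mapping each found name to its job object.
    """
    wanted = set(job_names)
    found = {}
    if not wanted:
        return found  # nothing to look up, so make no API call at all
    for job in list_jobs(endpoint_name):
        name = job.properties.name
        if name in wanted:
            found[name] = job
            # Stop as soon as every requested job has been seen, so a lazily
            # paged listing does not need to request further pages.
            if len(found) == len(wanted):
                break
    return found
```

A poller could then call `find_jobs` once per cycle with all the job names it is tracking, instead of once per job; names missing from the returned dict were not found in the listing.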

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close the thread by upvoting this answer and accepting it if it was helpful.

