Hello Aidan Gallagher,
Welcome to Microsoft Q&A, and thank you for posting your question here.
I understand that you are experiencing error 504 when getting job status from your Azure batch endpoint.
To address the 504 Gateway Timeout error when polling Azure Batch Endpoints, you can wrap the call in retry logic using the tenacity library so that transient errors are retried automatically, for example:
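from tenacity import retry, wait_fixed, stop_after_attempt

# ml_client, endpoint_name and job_name come from your existing code.
# Retry up to 5 times, waiting 2 seconds between attempts, on any exception.
@retry(wait=wait_fixed(2), stop=stop_after_attempt(5))
def get_job():
    # Return the first job whose name matches, or None if there is no match.
    return next(
        (
            j
            for j in ml_client.batch_endpoints.list_jobs(endpoint_name)
            if j.properties.name == job_name
        ),
        None,
    )

job = get_job()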
Then, increase the timeout settings for your requests, and reduce the frequency of your polling to lessen the load on the service.
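As a rough illustration of polling less often, the sketch below waits longer between status checks each time, up to a cap. It uses only the standard library. The function name `poll_until_done`, its parameters, and the set of terminal status strings are assumptions made for this example, so adjust them to match what your jobs actually report.

```python
import time

# Assumed terminal states; check the status values your jobs actually return.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def poll_until_done(
    fetch_status,
    initial_delay=5.0,
    max_delay=60.0,
    factor=2.0,
    max_polls=20,
    sleep=time.sleep,
):
    """Call fetch_status() until it returns a terminal state.

    The wait between polls starts at initial_delay seconds and is multiplied
    by factor after each poll, capped at max_delay. Returns the terminal
    status, or None if max_polls is reached first.
    """
    delay = initial_delay
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        sleep(delay)
        delay = min(delay * factor, max_delay)
    return None
```

Here `fetch_status` is any zero-argument callable that returns the current status string, for example a small wrapper around your existing job lookup. Combining it with the retry decorator means a single 504 triggers a retry of that one call and does not abort the whole polling loop.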
Additionally, check the Azure Service Health dashboard for any ongoing issues in the uksouth region, and consider reaching out to Azure Support for further assistance.
Lastly, optimizing your code to minimize unnecessary API calls can also help improve reliability and reduce timeouts.
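One possible way to cut down on API calls is to fetch the job by name directly, so you do not have to list every job on the endpoint and filter the results. This is a minimal sketch, assuming the job name you already hold can be looked up through `ml_client.jobs.get`; the helper name `get_job_status` is made up for this example, so please verify the behaviour against your own workspace.

```python
def get_job_status(ml_client, job_name):
    """Fetch a single job by name and return its status.

    Assumes ml_client is an azure.ai.ml MLClient and that the batch job
    can be retrieved through ml_client.jobs.get(job_name).
    """
    job = ml_client.jobs.get(job_name)  # one request, no listing
    return job.status
```

Each poll then makes one request for one job, which keeps the response small and should make a gateway timeout less likely than paging through the full job list.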
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close the thread by upvoting this reply and accepting it as the answer if it was helpful.