We are getting frequent timeouts on mlflow on machine learning
On frequent basis, our code is getting timed out after 2 minutes when trying to get status of a ML flow job.
A recent example is showing the response as:
Retrying (Retry(total=6, connect=7, read=6, redirect=7, status=7)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='francecentral.api.azureml.ms', port=443): Read timed out. (read timeout=120)")': /mlflow/v2.0/subscriptions/28377576-89a8-4e51-8209-45265b06caee/resourceGroups/rg-prod-sales-azure/providers/Microsoft.MachineLearningServices/workspaces/mlw-sales-azure-prod-lc/api/2.0/mlflow/runs/search
we are making the call from the compute:
/subscriptions/28377576-89a8-4e51-8209-45265b06caee/resourceGroups/RG-PROD-MAIN/providers/Microsoft.Compute/virtualMachines/LearnVM | |
---|---|
/subscriptions/28377576-89a8-4e51-8209-45265b06caee/resourceGroups/RG-PROD-MAIN/providers/Microsoft.Compute/virtualMachines/LearnVM |
The call fails at:
/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py
while doing: urlopen
We are using mlflow library in python and we don't understand why on frequent basis, those calls are timing out after 2 minutes while all our infrastructure is located next to each other.
We are using libraries:
azure-ai-ml version 1.26.0
azureml-mlflow version 1.59.0.post1
mlflow-skinny version 2.21.1