Collecting package metadata (repodata.json): ...working... Error occurred

Jake K 36 Reputation points
2023-06-01T13:16:53.4733333+00:00

I am receiving the same error when deploying from Azure ML to an Azure Arc cluster on-premises. The error message (or lack thereof) is not very telling. I believe I have sufficient resources on my nodes (16 GB RAM, 4 CPUs). The inferenceserver container fails with:

2023-05-31T20:38:18,907311776+00:00 | gunicorn/run | Updating conda environment from /var/azureml-app/azureml-models/credit_defaults_model/2/credit_defaults_model/conda.yaml !

Retrieving notices: ...working... done

./run: line 152: 64 Killed conda env create -n userenv -f "${CONDA_FILENAME}"

Collecting package metadata (repodata.json): ...working... Error occurred. Sleeping to send error logs.

How best can I analyze this? And where does the error log indicated in the message end up?

The container, per kubectl describe, shows:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    111

I have also created a new instance type with 3 CPUs and 6 GB of RAM (since the default one limits the container to 1 CPU and 2 GB of RAM), but I am still getting the same error.
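For reference, I defined the new instance type through the amlarc CRD, roughly like the sketch below (the name and the request values are placeholders):

    apiVersion: amlarc.azureml.com/v1alpha1
    kind: InstanceType
    metadata:
      name: custom-3cpu-6gb
    spec:
      resources:
        requests:
          cpu: "1500m"
          memory: "3Gi"
        limits:
          cpu: "3"
          memory: "6Gi"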

Thanks in advance!

Azure Machine Learning

Accepted answer
  1. Konstantinos Passadis 17,286 Reputation points
    2023-06-01T14:18:25.15+00:00

    Hello @Jake K!

    This detail:

    Reason: OOMKilled

    points to memory exhaustion.

    This typically occurs when the container exceeds the memory limit allocated to it, and the kernel's OOM killer terminates the process to free up resources. In your logs it is the conda env create step itself that is killed, and solving a conda environment (the "Collecting package metadata (repodata.json)" phase) can consume several gigabytes of memory on its own.
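    To confirm that memory is the bottleneck, you can watch the container's usage while the environment is being created. A minimal check, assuming the metrics-server is installed on your cluster (the pod name is a placeholder):

        kubectl top pod <inferenceserver-pod-name> --containers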

    If increasing the memory allocation for the container is not sufficient, you might need to consider scaling up the resources for your Azure Arc cluster. This could involve adding more nodes to the cluster or using nodes with higher memory capacity.
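    Before resizing the cluster, it may also be enough to raise the limits on the deployment itself. As a sketch, the Kubernetes online deployment YAML accepts a resources section that overrides the instance type's defaults (the values below are illustrative):

        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
          limits:
            cpu: "3"
            memory: "6Gi"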

    ALSO

    The error message you provided indicates that the failure occurs within the inferenceserver container. You can gather more information by viewing that container's logs with the Kubernetes command-line tool (kubectl):

    kubectl logs <inferenceserver-pod-name> -c inferenceserver
    

    Replace <inferenceserver-pod-name> with the actual name of the pod running the inferenceserver container.

    The container logs may provide more detailed error messages or stack traces that can help pinpoint the cause of the failure.
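    Note that with the pod in CrashLoopBackOff, the failed container has usually been restarted already, so the logs of the previous, OOM-killed instance are often the interesting ones:

        kubectl logs <inferenceserver-pod-name> -c inferenceserver --previous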

     I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards

