Collecting package metadata (repodata.json): ...working... Error occurred

Jake K 36 Reputation points
2023-06-01T13:16:53.4733333+00:00

I am receiving the same error when deploying from Azure ML to an Azure Arc cluster on-premises. The error message (or lack thereof) is not very telling. I believe I have sufficient resources on my nodes (16 GB RAM, 4 CPUs). The inferenceserver container fails with:

2023-05-31T20:38:18,907311776+00:00 | gunicorn/run | Updating conda environment from /var/azureml-app/azureml-models/credit_defaults_model/2/credit_defaults_model/conda.yaml !

Retrieving notices: ...working... done

./run: line 152: 64 Killed conda env create -n userenv -f "${CONDA_FILENAME}"

Collecting package metadata (repodata.json): ...working... Error occurred. Sleeping to send error logs.

How best can I analyze this? And where does the error log indicated in the message end up?

The container, per kubectl describe, shows:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    111

I have also created a new instance type with 3 CPUs and 6 GB of RAM (since the default one limits the container to 1 CPU and 2 GB of RAM), but I am still getting the same error.
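For reference, I defined the new instance type through the amlarc CRD, roughly like the sketch below (the name and the request values are placeholders):

    apiVersion: amlarc.azureml.com/v1alpha1
    kind: InstanceType
    metadata:
      name: custom-3cpu-6gb
    spec:
      resources:
        requests:
          cpu: "1500m"
          memory: "3Gi"
        limits:
          cpu: "3"
          memory: "6Gi"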

Thanks in advance!

Azure Machine Learning

Accepted answer
  1. Konstantinos Passadis 17,286 Reputation points
    2023-06-01T14:18:25.15+00:00

    Hello @Jake K!

    This detail:

    Reason: OOMKilled

    points to memory exhaustion.

    This typically occurs when the container exceeds the memory limit allocated to it, and the kernel's OOM killer terminates the process to free up resources. In your logs it is the conda env create step itself that is killed, and solving a conda environment (the "Collecting package metadata (repodata.json)" phase) can consume several gigabytes of memory on its own.
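    To confirm that memory is the bottleneck, you can watch the container's usage while the environment is being created. A minimal check, assuming the metrics-server is installed on your cluster (the pod name is a placeholder):

        kubectl top pod <inferenceserver-pod-name> --containers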

    If increasing the memory allocation for the container is not sufficient, you might need to consider scaling up the resources for your Azure Arc cluster. This could involve adding more nodes to the cluster or using nodes with higher memory capacity.
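    Before resizing the cluster, it may also be enough to raise the limits on the deployment itself. As a sketch, the Kubernetes online deployment YAML accepts a resources section that overrides the instance type's defaults (the values below are illustrative):

        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
          limits:
            cpu: "3"
            memory: "6Gi"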

    ALSO

    The error message you provided indicates that the failure occurs within the inferenceserver container. You can gather more information by viewing that container's logs with the Kubernetes command-line tool (kubectl):

    kubectl logs <inferenceserver-pod-name> -c inferenceserver
    

    Replace <inferenceserver-pod-name> with the actual name of the pod running the inferenceserver container.

    The container logs may provide more detailed error messages or stack traces that can help pinpoint the cause of the failure.
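    Note that with the pod in CrashLoopBackOff, the failed container has usually been restarted already, so the logs of the previous, OOM-killed instance are often the interesting ones:

        kubectl logs <inferenceserver-pod-name> -c inferenceserver --previous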

     I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards

