ML model deployment issue

Ankit Rawat 1 Reputation point
2021-04-27T11:51:48.143+00:00

I am trying to deploy an ML classification model on Azure using the GUI.

After registering/uploading the model in the portal, I deploy it to an Azure Container Instance with a custom entry_script and conda dependencies.

Entry Script

# Importing packages  
import pandas as pd  
import pickle  
import regex, json  
import numpy as np  
import sklearn  
import os  
  
from inference_schema.schema_decorators import input_schema, output_schema  
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType  
  
def init():  
    global model  
    global classes  
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'randomForest50.pkl')  
    model = pickle.load(open(model_path, "rb"))  
    classes = lambda x : ["F", "M"][x]  
  
input_sample = np.array([['Thomas', 'Anna']])  
output_sample = np.array(['m', 'F'])  
  
  
@input_schema('data', NumpyParameterType(input_sample))  
@output_schema(NumpyParameterType(output_sample))  
def run(data):  
    try:  
        namesList = json.loads(data)["data"]["names"]  
        pred = list(map(classes, model.predict(preprocessing(namesList))))  
        return str(pred[0])  
    except Exception as e:  
        error = str(e)  
        return error  
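
Since init() in the script above only depends on the AZUREML_MODEL_DIR environment variable and the pickle file, it can be exercised outside the container before deploying. The following is a minimal sketch under assumptions: `run_init_locally` is a hypothetical helper (not part of any Azure SDK), and `example_init` is a toy stand-in for the real init() that only opens the model file. The helper returns the traceback text if init() raises, or None if it succeeds.

```python
import os
import tempfile
import traceback


def run_init_locally(init_fn, model_dir):
    """Call init_fn with AZUREML_MODEL_DIR pointing at model_dir.

    Returns None if init_fn succeeds, otherwise the formatted traceback.
    """
    previous = os.environ.get("AZUREML_MODEL_DIR")
    os.environ["AZUREML_MODEL_DIR"] = model_dir
    try:
        init_fn()
        return None
    except Exception:
        return traceback.format_exc()
    finally:
        # Restore the environment so repeated runs do not interfere.
        if previous is None:
            os.environ.pop("AZUREML_MODEL_DIR", None)
        else:
            os.environ["AZUREML_MODEL_DIR"] = previous


def example_init():
    """Toy init(): open the model file from the model directory."""
    model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "randomForest50.pkl")
    with open(model_path, "rb") as f:
        f.read()


# An empty temporary folder contains no model file, so init() fails here.
with tempfile.TemporaryDirectory() as empty_dir:
    error = run_init_locally(example_init, empty_dir)

print(error)
```

In practice, `init_fn` would be the init() imported from the entry script, and `model_dir` a local folder holding the same randomForest50.pkl that was registered, ideally inside an environment built from the same conda file.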

Conda.yaml

name: prediction  
dependencies:  
- python=3.7  
- numpy  
- scikit-learn  
- pip:  
    - azureml-defaults  
    - pandas  
    - pickle4  
    - regex  
    - inference-schema[numpy-support]     

After deployment, the endpoint's deployment state becomes Unhealthy, and the logs show that the program is stuck in a loop. See the logs below:

2021-04-26T08:14:55,433967500+00:00 - rsyslog/run   
2021-04-26T08:14:55,421414500+00:00 - iot-server/run   
2021-04-26T08:14:55,540534600+00:00 - gunicorn/run   
2021-04-26T08:14:55,646209100+00:00 - nginx/run   
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...  
2021-04-26T08:14:58,234212800+00:00 - iot-server/finish 1 0  
2021-04-26T08:14:58,324505300+00:00 - Exit code 1 is normal. Not restarting iot-server.  
Starting gunicorn 19.9.0  
Listening at: http://127.0.0.1:31311 (62)  
Using worker: sync  
worker timeout is set to 300  
Booting worker with pid: 89  
SPARK_HOME not set. Skipping PySpark Initialization.  
Initializing logger  
2021-04-26 08:15:11,623 | root | INFO | Starting up app insights client  
2021-04-26 08:15:11,624 | root | INFO | Starting up request id generator  
2021-04-26 08:15:11,631 | root | INFO | Starting up app insight hooks  
2021-04-26 08:15:11,632 | root | INFO | Invoking user's init function  
worker timeout is set to 300  
Booting worker with pid: 91  
SPARK_HOME not set. Skipping PySpark Initialization.  
Initializing logger  
2021-04-26 08:15:29,014 | root | INFO | Starting up app insights client  
2021-04-26 08:15:29,014 | root | INFO | Starting up request id generator  
2021-04-26 08:15:29,014 | root | INFO | Starting up app insight hooks  
2021-04-26 08:15:29,014 | root | INFO | Invoking user's init function  
worker timeout is set to 300  
Booting worker with pid: 98  
SPARK_HOME not set. Skipping PySpark Initialization.  
...  
...  
...  
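
The restart pattern in the log above can be checked mechanically. This is a minimal sketch, assuming the log text is available as a string; `summarize_worker_restarts` is a hypothetical helper, not an Azure or gunicorn API. It counts how often gunicorn boots a worker and how often the user's init() is invoked. Several boots, each following an init() invocation, suggest that the worker is being replaced while init() runs.

```python
import re

BOOT_RE = re.compile(r"Booting worker with pid: (\d+)")
INIT_MARKER = "Invoking user's init function"


def summarize_worker_restarts(log_text):
    """Count gunicorn worker boots and init() invocations in a log dump."""
    pids = BOOT_RE.findall(log_text)
    init_calls = sum(INIT_MARKER in line for line in log_text.splitlines())
    return {
        "worker_pids": [int(p) for p in pids],
        "init_calls": init_calls,
        # More than one boot means gunicorn had to start a replacement worker.
        "restarted": len(pids) > 1,
    }


# Excerpt taken from the log lines shown above.
sample_log = """\
Booting worker with pid: 89
2021-04-26 08:15:11,632 | root | INFO | Invoking user's init function
worker timeout is set to 300
Booting worker with pid: 91
2021-04-26 08:15:29,014 | root | INFO | Invoking user's init function
worker timeout is set to 300
Booting worker with pid: 98
"""

summary = summarize_worker_restarts(sample_log)
print(summary)
```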

I also tried to deploy the model using Python, but that failed too, with the following message:

WebserviceException: WebserviceException:  
 Message: Service deployment polling reached non-successful terminal state, current service state: Failed  
Operation ID: 98e464d4-5b15-4606-936f-a2625f7bd1fd  
More information can be found using '.get_logs()'  
Error:  
{  
  "code": "AciDeploymentFailed",  
  "statusCode": 400,  
  "message": "Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function.\n\t1. Please check the logs for your container instance: d16. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.\n\t2. You can interactively debug your scoring file locally. Please refer to https://learn.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\n\t3. You can also try to run image 20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information.",  
  "details": [  
    {  
      "code": "CrashLoopBackOff",  
      "message": "Your container application crashed. This may be caused by errors in your scoring file's init() function.\n\t1. Please check the logs for your container instance: d16. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.\n\t2. You can interactively debug your scoring file locally. Please refer to https://learn.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\n\t3. You can also try to run image 20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information."  
    },  
    {  
      "code": "AciDeploymentFailed",  
      "message": "Your container application crashed. Please follow the steps to debug:\n\t1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. Please refer to https://aka.ms/debugimage#dockerlog for more information.\n\t2. If your container application crashed. This may be caused by errors in your scoring file's init() function. You can try debugging locally first. Please refer to https://aka.ms/debugimage#debug-locally for more information.\n\t3. You can also interactively debug your scoring file locally. Please refer to https://learn.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\n\t4. View the diagnostic events to check status of container, it may help you to debug the issue.\n\"RestartCount\": 3\n\"CurrentState\": {\"state\":\"Waiting\",\"startTime\":null,\"exitCode\":null,\"finishTime\":null,\"detailStatus\":\"CrashLoopBackOff: Back-off restarting failed\"}\n\"PreviousState\": {\"state\":\"Terminated\",\"startTime\":\"2021-04-27T10:46:03.903Z\",\"exitCode\":111,\"finishTime\":\"2021-04-27T10:46:07.524Z\",\"detailStatus\":\"Error\"}\n\"Events\":\n{\"count\":1,\"firstTimestamp\":\"2021-04-27T10:42:37Z\",\"lastTimestamp\":\"2021-04-27T10:42:37Z\",\"name\":\"Pulling\",\"message\":\"pulling image \\\"20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631@sha256:322ebafbe88e98b0f57104fd0afad08a5caf57cc5e7f64b3b629c3ea50f54bb3\\\"\",\"type\":\"Normal\"}\n{\"count\":1,\"firstTimestamp\":\"2021-04-27T10:44:15Z\",\"lastTimestamp\":\"2021-04-27T10:44:15Z\",\"name\":\"Pulled\",\"message\":\"Successfully pulled image 
\\\"20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631@sha256:322ebafbe88e98b0f57104fd0afad08a5caf57cc5e7f64b3b629c3ea50f54bb3\\\"\",\"type\":\"Normal\"}\n{\"count\":4,\"firstTimestamp\":\"2021-04-27T10:44:40Z\",\"lastTimestamp\":\"2021-04-27T10:46:03Z\",\"name\":\"Started\",\"message\":\"Started container\",\"type\":\"Normal\"}\n{\"count\":4,\"firstTimestamp\":\"2021-04-27T10:44:43Z\",\"lastTimestamp\":\"2021-04-27T10:46:07Z\",\"name\":\"Killing\",\"message\":\"Killing container with id 5c5ddb266c4b38b1c306367712d9bec0687e5f6979e34afea7f6b943edf7db75.\",\"type\":\"Normal\"}\n"  
    }  
  ]  
}  
 InnerException None  
 ErrorResponse   
{  
    "error": {  
        "message": "Service deployment polling reached non-successful terminal state, current service state: Failed\nOperation ID: 98e464d4-5b15-4606-936f-a2625f7bd1fd\nMore information can be found using '.get_logs()'\nError:\n{\n  \"code\": \"AciDeploymentFailed\",\n  \"statusCode\": 400,\n  \"message\": \"Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function.\\n\\t1. Please check the logs for your container instance: d16. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.\\n\\t2. You can interactively debug your scoring file locally. Please refer to https://learn.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\\n\\t3. You can also try to run image 20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information.\",\n  \"details\": [\n    {\n      \"code\": \"CrashLoopBackOff\",\n      \"message\": \"Your container application crashed. This may be caused by errors in your scoring file's init() function.\\n\\t1. Please check the logs for your container instance: d16. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.\\n\\t2. You can interactively debug your scoring file locally. Please refer to https://learn.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\\n\\t3. You can also try to run image 20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information.\"\n    },\n    {\n      \"code\": \"AciDeploymentFailed\",\n      \"message\": \"Your container application crashed. 
Please follow the steps to debug:\\n\\t1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. Please refer to https://aka.ms/debugimage#dockerlog for more information.\\n\\t2. If your container application crashed. This may be caused by errors in your scoring file's init() function. You can try debugging locally first. Please refer to https://aka.ms/debugimage#debug-locally for more information.\\n\\t3. You can also interactively debug your scoring file locally. Please refer to https://learn.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\\n\\t4. View the diagnostic events to check status of container, it may help you to debug the issue.\\n\\\"RestartCount\\\": 3\\n\\\"CurrentState\\\": {\\\"state\\\":\\\"Waiting\\\",\\\"startTime\\\":null,\\\"exitCode\\\":null,\\\"finishTime\\\":null,\\\"detailStatus\\\":\\\"CrashLoopBackOff: Back-off restarting failed\\\"}\\n\\\"PreviousState\\\": {\\\"state\\\":\\\"Terminated\\\",\\\"startTime\\\":\\\"2021-04-27T10:46:03.903Z\\\",\\\"exitCode\\\":111,\\\"finishTime\\\":\\\"2021-04-27T10:46:07.524Z\\\",\\\"detailStatus\\\":\\\"Error\\\"}\\n\\\"Events\\\":\\n{\\\"count\\\":1,\\\"firstTimestamp\\\":\\\"2021-04-27T10:42:37Z\\\",\\\"lastTimestamp\\\":\\\"2021-04-27T10:42:37Z\\\",\\\"name\\\":\\\"Pulling\\\",\\\"message\\\":\\\"pulling image \\\\\\\"20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631@sha256:322ebafbe88e98b0f57104fd0afad08a5caf57cc5e7f64b3b629c3ea50f54bb3\\\\\\\"\\\",\\\"type\\\":\\\"Normal\\\"}\\n{\\\"count\\\":1,\\\"firstTimestamp\\\":\\\"2021-04-27T10:44:15Z\\\",\\\"lastTimestamp\\\":\\\"2021-04-27T10:44:15Z\\\",\\\"name\\\":\\\"Pulled\\\",\\\"message\\\":\\\"Successfully pulled image 
\\\\\\\"20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631@sha256:322ebafbe88e98b0f57104fd0afad08a5caf57cc5e7f64b3b629c3ea50f54bb3\\\\\\\"\\\",\\\"type\\\":\\\"Normal\\\"}\\n{\\\"count\\\":4,\\\"firstTimestamp\\\":\\\"2021-04-27T10:44:40Z\\\",\\\"lastTimestamp\\\":\\\"2021-04-27T10:46:03Z\\\",\\\"name\\\":\\\"Started\\\",\\\"message\\\":\\\"Started container\\\",\\\"type\\\":\\\"Normal\\\"}\\n{\\\"count\\\":4,\\\"firstTimestamp\\\":\\\"2021-04-27T10:44:43Z\\\",\\\"lastTimestamp\\\":\\\"2021-04-27T10:46:07Z\\\",\\\"name\\\":\\\"Killing\\\",\\\"message\\\":\\\"Killing container with id 5c5ddb266c4b38b1c306367712d9bec0687e5f6979e34afea7f6b943edf7db75.\\\",\\\"type\\\":\\\"Normal\\\"}\\n\"\n    }\n  ]\n}"  
    }  
}  
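
The message above suggests calling service.get_logs(). As a rough sketch of how that output might be narrowed down, assuming `service` is the Webservice object from the azureml-core SDK (v1): the helper below is hypothetical and relies only on the object exposing get_logs(), which returns the container log as text. It returns the lines that follow the last "Invoking user's init function" entry, which is where a traceback from init() would be expected if one was logged. `FakeService` is a stand-in used purely for demonstration.

```python
INIT_MARKER = "Invoking user's init function"


def lines_after_last_init(service, marker=INIT_MARKER):
    """Return the log lines that follow the last init() invocation."""
    log_text = service.get_logs() or ""
    lines = log_text.splitlines()
    last = max((i for i, line in enumerate(lines) if marker in line), default=None)
    # No marker found: return the whole log so nothing is hidden.
    return lines if last is None else lines[last + 1:]


class FakeService:
    """Stand-in for a deployed service, used only to demonstrate the helper."""

    def get_logs(self):
        return (
            "Booting worker with pid: 89\n"
            "2021-04-26 08:15:11,632 | root | INFO | Invoking user's init function\n"
            "worker timeout is set to 300\n"
            "Booting worker with pid: 91\n"
        )


tail = lines_after_last_init(FakeService())
print("\n".join(tail))
```

With the real SDK, the object would typically be the one returned by the deployment call, or be looked up by the name chosen at deployment time.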

I have deployed the same model with the same entryScript.py and the same conda.yaml previously, and it worked fine.

I cannot figure out what the issue is here. Can anybody suggest how to solve this?

Azure Machine Learning

1 answer

  1. YutongTie-MSFT 46,326 Reputation points
    2021-04-27T19:57:27.553+00:00

    Hello,

    Thanks for reaching out to us. Based on the log, it seems your container application crashed and this may be caused by errors in your scoring file's init() function.

    You can run service.get_logs() to get log information from the unhealthy service to see what's causing it to fail. Please refer to https://aka.ms/debugimage#debug-locally for more information.

    You can interactively debug your scoring file locally. Please refer to https://learn.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information. You can also view the diagnostic events to check the status of the container; this may help you debug the issue.

    You can also try to run image 20dd0f745f704eeb89ef4d52057871a0.azurecr.io/azureml/azureml_b9e8a2e66019f74c902eacced9684631 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information.

    More information would help us find the cause.

    Regards,
    Yutong
