Debugging guide for Model Serving
This article demonstrates debugging steps for common issues you might encounter with model serving endpoints, such as endpoints that fail to initialize or start, container build failures, and errors that occur while the model is running on the endpoint.
Access and review logs
Databricks recommends reviewing build logs for debugging and troubleshooting errors in your model serving workloads. See Monitor model quality and endpoint health for information about logs and how to view them.
Check the event logs for the model in the workspace UI and look for a successful container build message. If you do not see a build message after an hour, reach out to Databricks support for assistance.
If your build is successful but you encounter other errors, see Debugging after container build succeeds. If your build fails, see Debugging after container build failure.
Installed library package versions
In your build logs you can confirm the package versions that are installed.
- For MLflow, if you do not specify a version, Model Serving uses the latest version.
- For custom GPU serving, Model Serving installs the recommended versions of `cuda` and `cuDNN` according to public PyTorch and TensorFlow documentation.
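Because an unspecified version resolves to the latest release, one way to keep the serving container reproducible is to pin exact versions at logging time. The sketch below shows this idea; the version numbers are placeholders, not recommendations, and the commented `mlflow.pyfunc.log_model` call illustrates where the pins would be passed (via its `pip_requirements` parameter):

```python
# Sketch: pin exact package versions so the serving container installs
# known-good versions instead of whatever is latest at build time.
# The version numbers below are placeholders, not recommendations.
pinned_requirements = [
    "mlflow==2.9.2",
    "torch==2.1.0",
]

# With MLflow available, the pins would be passed when logging the model, e.g.:
#   mlflow.pyfunc.log_model(
#       artifact_path="model",
#       python_model=MyModel(),
#       pip_requirements=pinned_requirements,
#   )

# Every entry pins an exact version with '=='.
assert all("==" in req for req in pinned_requirements)
```

Pinning this way also makes the installed versions easy to cross-check against the build logs.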
Debugging after container build succeeds
Even if the container builds successfully, there might be issues when you run the model or during the operation of the endpoint itself. The following subsections detail common issues and how to troubleshoot and debug them.
Missing dependency
You might get an error like `An error occurred while loading the model. No module named <module-name>.` This error might indicate that a dependency is missing from the container. Verify that you properly denoted all the dependencies that should be included in the build of the container. Pay special attention to custom libraries and ensure that the `.whl` files are included as artifacts.
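As a quick pre-deployment check, you can verify that every local wheel referenced in your requirements actually exists on disk before logging the model. This is a minimal sketch (the helper name and the sample file paths are illustrative, not part of any MLflow API):

```python
import os
import tempfile

def find_missing_wheels(requirements):
    """Return the .whl entries from a requirements list that do not exist on
    disk. A wheel missing here will also be missing from the serving
    container, producing 'No module named <module-name>' at load time."""
    return [
        req for req in requirements
        if req.endswith(".whl") and not os.path.exists(req)
    ]

# Usage: simulate a requirements list with one real wheel and one missing one.
with tempfile.TemporaryDirectory() as tmp:
    real_whl = os.path.join(tmp, "my_lib-1.0-py3-none-any.whl")
    open(real_whl, "w").close()
    reqs = [
        "pandas==2.1.0",
        real_whl,
        "/nonexistent/other_lib-0.1-py3-none-any.whl",
    ]
    # Only the nonexistent wheel is reported.
    print(find_missing_wheels(reqs))
```

Running a check like this before `log_model` catches the most common cause of the missing-module error: a wheel path that was valid in one environment but not in the one doing the logging.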
Service logs looping
If your endpoint fails to load the model, check the service logs to see if they repeat in a loop while the endpoint tries to load the model. If you see this behavior, try the following steps:
- Open a notebook and attach to an All-Purpose cluster that uses a Databricks Runtime version, not Databricks Runtime for Machine Learning.
- Load the model using MLflow and try debugging from there.
You can also load the model locally on your machine and debug from there. Download the model artifacts using the following:

```python
import os
import mlflow

os.environ["MLFLOW_TRACKING_URI"] = "databricks://PROFILE"

ARTIFACT_URI = "model_uri"
if "." in ARTIFACT_URI:
    mlflow.set_registry_uri("databricks-uc")
local_path = mlflow.artifacts.download_artifacts(ARTIFACT_URI)
print(local_path)
```

Recreate and activate the model's environment from the downloaded `conda.yaml`:

```sh
conda env create -f local_path/artifact_path/conda.yaml
conda activate mlflow-env
```

Then load the model in that environment:

```python
mlflow.pyfunc.load_model(local_path/artifact_path)
```
Model fails when requests are sent to the endpoint
You might receive an error like `Encountered an unexpected error while evaluating the model. Verify that the input is compatible with the model for inference.` when `predict()` is called on your model.
This usually indicates a code issue in the `predict()` function. Databricks recommends that you load the model from MLflow in a notebook and call it there. Doing so highlights the issues in the `predict()` function, and you can see where the failure is happening within the method.
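Before digging into `predict()` itself, it can help to rule out a simple schema mismatch between the request and the model's expected input. This is a hypothetical helper, not an MLflow API; in practice the expected column names would come from the model's logged signature:

```python
def diff_input_columns(expected_cols, request_cols):
    """Compare the columns a model expects with the columns in an incoming
    request. Returns (missing, unexpected) so a schema mismatch can be
    spotted before predict() fails on incompatible input."""
    expected, provided = set(expected_cols), set(request_cols)
    return sorted(expected - provided), sorted(provided - expected)

# Usage: a request that renamed one feature and dropped another.
missing, unexpected = diff_input_columns(
    ["age", "income", "zip_code"],
    ["age", "income_usd"],
)
print(missing)      # ['income', 'zip_code']
print(unexpected)   # ['income_usd']
```

If both lists come back empty, the failure is more likely inside the `predict()` logic (dtype handling, null values, shape assumptions) than in the request schema.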
Workspace exceeds provisioned concurrency
You might receive a `Workspace exceeded provisioned concurrency quota` error.
You can increase concurrency depending on region availability. Reach out to your Databricks account team and provide your workspace ID to request a concurrency increase.
Debugging after container build failure
This section details issues that might occur when your build fails.
OSError: [Errno 28] No space left on device
The `No space left` error can occur when too many large artifacts are logged alongside the model unnecessarily. Check in MLflow that extraneous artifacts are not logged alongside the model, and try to redeploy the slimmed-down package.
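To find out which artifacts are bloating the package, you can download the model artifacts (for example with `mlflow.artifacts.download_artifacts`, as shown earlier) and scan the directory for the largest files. This is a minimal sketch using only the standard library; the helper name is illustrative:

```python
import os
import tempfile

def largest_artifacts(root, top_n=5):
    """Walk an artifact directory and return the top_n largest files as
    (size_bytes, path) pairs, largest first, to spot artifacts that
    unnecessarily inflate the serving container."""
    sizes = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            sizes.append((os.path.getsize(path), path))
    return sorted(sizes, reverse=True)[:top_n]

# Usage: simulate an artifact directory with one large and one small file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "big.bin"), "wb") as f:
        f.write(b"\0" * 1024)
    with open(os.path.join(tmp, "notes.txt"), "w") as f:
        f.write("hi")
    # Prints the largest files first.
    for size, path in largest_artifacts(tmp):
        print(size, os.path.basename(path))
```

Anything unexpectedly large in the output is a candidate to drop from `log_model` before redeploying.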
Azure Firewall issues with serving models from Unity Catalog
You might see an error like `Build could not start due to an internal error. If you are serving a model from UC and Azure Firewall is enabled, this is not supported by default.` Reach out to your Databricks account team to help resolve the issue.
Build failure due to lack of GPU availability
You might see an error like `Build could not start due to an internal error - please contact your Databricks representative.` Reach out to your Databricks account team to help resolve the issue.