Runtime starting timeout when using custom environment in Azure ML studio

Mariyan 0 Reputation points
2024-04-16T16:14:18.34+00:00

I cannot get runtimes in Azure Machine Learning studio to work with a custom environment, even when the custom environment uses the same Docker/conda requirements as the environments provided by Microsoft (I am testing with the llm-rag:48 environment).

I am creating a compute instance myself; I am not using the automatic compute instance.

The custom environment builds the image successfully.

The problem appears when connecting the runtime to the environment: it keeps trying for 15 minutes and then times out.

I don't understand what the problem is or where it originates. I have looked at the troubleshooting page: https://learn.microsoft.com/en-us/azure/machine-learning/prompt-flow/tools-reference/troubleshoot-guidance?view=azureml-api-2#runtime-failed-with-system-error-runtime-not-ready-when-you-used-a-custom-environment

but it still doesn't explain where the problem originates. I opened the terminal on the compute instance and checked with docker commands. The Docker container is running and the image has been pulled, but the runtime seems to restart every minute. Running docker logs on the runtime container returns no output. I don't know what I am supposed to do to investigate or fix this issue.

Azure Machine Learning

1 answer

  1. YutongTie-MSFT 46,976 Reputation points
    2024-04-17T03:19:51.9533333+00:00

    @Mariyan

    Thanks for reaching out to us. If you would like, you can raise a support ticket so that the backend can be checked to see what happened. If you have no support plan, we can enable a free ticket for you.

    Generally, there are a few points you may want to consider:

    1. Validate Your Dockerfile: Ensure that your Dockerfile is correct and able to build and run locally without issues. You can build your Docker image locally and run it to see if there are any errors.
    2. Check The Image Size: Azure has a limit on the size of the Docker images, which is currently 20GB. If your image exceeds this limit, it could cause issues.
    3. Examine System Logs: You can try to fetch more detailed logs from the system to get additional clues about what might be going wrong. In Azure, you can usually access these logs through the Azure portal or the Azure CLI. Check the logs for any error messages or warnings that might suggest what the issue could be.
    4. Check Resource Allocation: Ensure that your compute instance has enough resources (CPU, memory, storage) to run the Docker container. If the container is too resource-intensive, it could cause the runtime to fail to start.
    5. Inspect Compute Instance: Go to the Azure portal and inspect your compute instance. Check its status and ensure it's in the 'Running' state. You can also try restarting it to see if that resolves the issue.
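
Points 2 to 4 can be partly checked from the terminal of the compute instance. The following is a minimal Python sketch, not an official Azure ML tool: it reads the JSON printed by `docker inspect` and reports a container's restart count, exit code, and out-of-memory flag, and it compares an image's size against the 20GB figure quoted in point 2 (that figure is taken from this answer and is not independently verified). The function names are hypothetical. Keep in mind that `RestartCount` only counts restarts performed by Docker's own restart policy, so it may stay at 0 if something outside Docker recreates the container.

```python
import json
import subprocess

# Size limit quoted in point 2 above; an assumption taken from this answer.
IMAGE_SIZE_LIMIT_BYTES = 20 * 1024**3


def docker_inspect(target: str) -> dict:
    """Run `docker inspect <target>` and return the first JSON object it prints."""
    result = subprocess.run(
        ["docker", "inspect", target],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)[0]


def summarize_container(info: dict) -> list:
    """Turn `docker inspect` output for a container into readable findings."""
    state = info.get("State", {})
    findings = []
    if info.get("RestartCount", 0) > 0:
        findings.append(f"restarted {info['RestartCount']} time(s)")
    if state.get("OOMKilled"):
        findings.append("killed for exceeding available memory (OOMKilled)")
    if state.get("ExitCode", 0) != 0:
        findings.append(f"last exit code {state['ExitCode']}")
    if state.get("Error"):
        findings.append(f"engine error: {state['Error']}")
    return findings


def image_exceeds_limit(info: dict, limit: int = IMAGE_SIZE_LIMIT_BYTES) -> bool:
    """Compare the image's `Size` field (bytes) against the assumed limit."""
    return info.get("Size", 0) > limit


def report(container: str, image: str) -> None:
    """Print a short diagnosis for one container and one image."""
    findings = summarize_container(docker_inspect(container))
    print(f"{container}: " + ("; ".join(findings) or "no restart, OOM, or error indicators"))
    if image_exceeds_limit(docker_inspect(image)):
        print(f"{image}: larger than the quoted size limit")
```

To use it, call `report("<runtime-container>", "<environment-image>")` with the names shown by `docker ps` and `docker images`; both placeholders are hypothetical. An exit code of 137 together with the OOMKilled flag would point toward point 4 (insufficient memory), while a non-zero exit code with empty logs suggests the container's entry point is failing before it writes anything.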

    Let us know how it goes. I hope this helps.

    Regards,

    Yutong

    If you found this answer helpful, please accept it to support the community. Thanks a lot.
