AzureMLCompute job failed 500: [REDACTED]: Some(true) Error while creating custom environment in azure ml

Sena Aslan 25 Reputation points
2024-07-08T14:00:04.7766667+00:00

Hello everyone,

I am trying to create a custom environment to train and deploy a CatBoost regression model with the Azure ML SDK. However, when I submit the job, it runs for a while and then fails with the error "AzureMLCompute job failed 500: [REDACTED]: Some(true)". When I checked the job logs, I couldn't find anything that would help solve the problem; in fact, the logs were empty. Can you please help me identify and solve the problem?

Here are my environment definition (conda.yaml) and the code that creates the environment.

channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy=1.21.2
  - pip=21.2.4
  - scikit-learn=1.0.2
  - scipy=1.7.1
  - pandas~=1.5.3
  - catboost
  - pip:
      - inference-schema[numpy-support]~=1.5.0
      - packaging==23.2
      - cloudpickle==2.2.1
      - mlflow==2.8.0
      - mlflow-skinny==2.8.0
      - azureml-mlflow==1.51.0
      - psutil==5.8.0
      - pyyaml==6.0.1
      - tqdm>=4.59,<4.60
      - ipykernel~=6.0
      - azureml-inference-server-http
      - azureml-core
      - azureml-dataset-runtime[fuse]
      - azureml-fsspec
name: model-env

import os
#create a source folder for the script
train_src_dir = "./pipeline_src"
os.makedirs(train_src_dir, exist_ok=True)


from azure.ai.ml.entities import Environment
#create and register this custom environment in your workspace:
custom_env_name = "model-env"
custom_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for catboost reg",
    tags={"scikit-learn": "1.0.2"},
    conda_file=os.path.join(train_src_dir, "conda.yaml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)

print(
    f"Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}")
Azure Machine Learning
An Azure machine learning service for building and deploying models.

Accepted answer
  1. YutongTie-MSFT 48,586 Reputation points
    2024-07-23T06:15:39.7566667+00:00

    Hello Sena and Alex,

    Thanks for sharing the solution. I will escalate this issue to the documentation team so that it gets documented properly. Sena, if you find Alex's answer helpful, please accept it so that more people can see it.

    I am reposting Sena's answer here so that Sena can accept it, since a question's author cannot accept their own answer.

    Thanks again for reporting the issue and posting the solution.

    For a private workspace, you only need to run this code once. There is no need to run it every time you create an environment.

    
    # Set the compute cluster used for environment image builds

    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    subscription_id = "<subscription id>"
    resource_group = "<resource group>"
    workspace = "<workspace name>"
    ml_client = MLClient(
        DefaultAzureCredential(), subscription_id, resource_group, workspace
    )

    # Get workspace info
    ws = ml_client.workspaces.get(name=workspace)

    # Update to use cpu-cluster for image builds
    ws.image_build_compute = "<compute cluster name>"

    # To switch back to using ACR to build (if ACR is not in the VNet):
    # ws.image_build_compute = ''
    ml_client.workspaces.begin_update(ws)

    # Set legacy mode of the workspace to False
    from azureml.core import Workspace
    ws = Workspace.from_config()
    ws.update(v1_legacy_mode=False)
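
Since the workspace update only needs to run once, it can be wrapped in a small idempotent helper. The following is a minimal sketch, not part of the original answer: the function name `ensure_image_build_compute` is hypothetical, and it assumes `ml_client` behaves like the `azure.ai.ml.MLClient` used above, with `begin_update` returning a poller whose `result()` blocks until the update finishes.

```python
def ensure_image_build_compute(ml_client, workspace_name, compute_name):
    """Point workspace image builds at `compute_name` unless already set.

    Hypothetical helper: `ml_client` is assumed to behave like
    azure.ai.ml.MLClient. Returns True if an update was submitted,
    False if the workspace was already configured.
    """
    ws = ml_client.workspaces.get(name=workspace_name)

    # Skip the update when the workspace already builds images on this compute.
    if getattr(ws, "image_build_compute", None) == compute_name:
        return False

    ws.image_build_compute = compute_name
    # begin_update starts a long-running operation; result() waits for it.
    ml_client.workspaces.begin_update(ws).result()
    return True
```

With a real client this would be called as `ensure_image_build_compute(ml_client, workspace, "<compute cluster name>")`; calling it a second time should return `False` without submitting another update.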
    

    Appreciated again.

    Regards,

    Yutong

    - Please accept the answer if you find it helpful, to support the community. Thanks a lot.


2 additional answers

  1. Alex Szymanczak 5 Reputation points
    2024-07-18T13:44:22.92+00:00

    Hi, I'm having the same issue, but in my case I can't disable v1_legacy_mode. Based on Sena's fix, I think the issue is that the compute used by default to prepare the images is the serverless one, which sits outside the private endpoint configuration.

    I resorted to specifying the compute and building the environment explicitly, but this appears to be a breaking change in behaviour. I'm still on the old version of the SDK, but for anyone looking for an answer, here is what worked for me:

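    from azureml.core import Environment, Workspace

    ws = Workspace.from_config()  # assumption: workspace loaded from a local config file
    my_environment = Environment('<environment>')
    compute_name = "<compute-within-vnet>"
    my_environment.build(ws, compute_name)  # line I didn't need prior to this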
    

    If possible, it would be great to make the logs for this failure more explicit to make troubleshooting easier.
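
Until the logs improve, one way to see more of what the image build is doing is to wait on the build object and stream its output. This is a rough sketch under assumptions, not something from the answer above: the helper name `build_environment_on_compute` is hypothetical, and it assumes the SDK v1 `Environment.build` call returns a build details object that offers `wait_for_completion(show_output=True)`.

```python
def build_environment_on_compute(environment, workspace, compute_name, show_output=True):
    """Build an environment image on a named compute and wait for it to finish.

    Hypothetical helper: `environment` and `workspace` are assumed to behave
    like azureml.core.Environment and azureml.core.Workspace (SDK v1).
    """
    # Start the image build on the compute inside the VNet.
    build = environment.build(workspace, compute_name)

    # Block until the build ends; with show_output=True the build log is streamed.
    build.wait_for_completion(show_output=show_output)
    return build
```

With the objects from the snippet above, this would be called as `build_environment_on_compute(my_environment, ws, compute_name)`.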

    1 person found this answer helpful.

  2. Sena Aslan 25 Reputation points
    2024-07-09T08:39:30.7266667+00:00

    I was able to solve the problem. For anyone who encounters this error, here is what you should do:

    Note: For a private workspace, you only need to run this code once. There is no need to run it every time you create an environment.

    # Set the compute cluster used for environment image builds

    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    subscription_id = "<subscription id>"
    resource_group = "<resource group>"
    workspace = "<workspace name>"

    ml_client = MLClient(
        DefaultAzureCredential(), subscription_id, resource_group, workspace
    )

    # Get workspace info
    ws = ml_client.workspaces.get(name=workspace)

    # Update to use cpu-cluster for image builds
    ws.image_build_compute = "<compute cluster name>"

    # To switch back to using ACR to build (if ACR is not in the VNet):
    # ws.image_build_compute = ''
    ml_client.workspaces.begin_update(ws)

    # Set legacy mode of the workspace to False
    from azureml.core import Workspace
    ws = Workspace.from_config()
    ws.update(v1_legacy_mode=False)
    