I am trying to create a custom environment to train a time series forecasting Prophet model with the Azure ML SDK. However, when I submit the job, it runs for a while and then throws "AzureMLCompute job failed 500: [REDACTED]: Some(true)".

Damarla, Lokesh 0 Reputation points
2024-10-14T20:28:31.1733333+00:00

My environment YAML (pdc_dev_env.yml):

name: dev_env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
  - pip
  - scikit-learn
  - scipy
  - pandas
  - pip:
      - azureml-core
      - plotly
      - kaleido
      - azure-ai-ml
      - azureml
      - inference-schema[numpy-support]==1.3.0
      - mlflow==2.8.0
      - mlflow-skinny==2.8.0
      - azureml-mlflow==1.51.0
      - psutil>=5.8,<5.9
      - tqdm>=4.59,<4.60
      - ipykernel~=6.0
      - matplotlib
      - prophet
      - azure-storage-blob

My Pipeline Python code:

from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep
from azureml.core import Environment
from azureml.core.runconfig import RunConfiguration

# Set up the workspace
ws = Workspace.from_config()

# Define your compute instance
compute_target = ComputeTarget(workspace=ws, name="<compute_instance_name>")

# Load the environment from YAML
env = Environment.from_conda_specification(
    name='predc_dev_env',
    file_path='pdc_dev_env.yml',
)

# Set up the RunConfiguration
run_config = RunConfiguration()
run_config.environment = env

# Define the PythonScriptStep for Apprhs Prophet model training
apprhs_prophet_step = PythonScriptStep(
    name="Apprhs Prophet Model Training",
    script_name="Apprhs_Prophet_Model_Training.py",
    compute_target=compute_target,
    runconfig=run_config,
    source_directory="<path_to_file>",
    allow_reuse=True
)

# Define the PythonScriptStep for Apprhs RandomForest model training (depends on the Prophet step)
apprhs_rforest_step = PythonScriptStep(
    name="Apprhs RForest Model Training",
    script_name="Apprhs_RandomForest_Model_Training.py",
    compute_target=compute_target,
    runconfig=run_config,
    source_directory="<path_to_file>",
    allow_reuse=True
)

# Define the PythonScriptStep for model selection and moving inference files (depends on the previous steps)
apprhs_model_selection_step = PythonScriptStep(
    name="Model Selection and Moving Inference Files",
    script_name="model_selection_accuracy_comparision.py",
    compute_target=compute_target,
    runconfig=run_config,
    source_directory="<path_to_file>",
    allow_reuse=True
)

# Make the steps run sequentially by passing them as a sequence
from azureml.pipeline.core import StepSequence

step_sequence = StepSequence(steps=[apprhs_prophet_step, apprhs_rforest_step, apprhs_model_selection_step])

# Define the pipeline
pipeline = Pipeline(workspace=ws, steps=step_sequence)

# Submit the pipeline with a name
experiment = Experiment(workspace=ws, name="Dev-TrainingandModelSelection-Pipeline")
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)

Note: my workspace uses a private endpoint connection integrated with a VNet/subnet, and my container registry is also behind a private endpoint in the same VNet.


1 answer

  1. Amira Bedhiafi 26,261 Reputation points
    2024-10-14T23:14:22.41+00:00

    I am not an expert, but the error message suggests an issue with the underlying compute resources or with the network-restriction configuration, especially since you mentioned that both the workspace and the container registry are behind private endpoints in a VNet.

    Start by verifying that your compute target (for example, your compute instance or cluster) is properly configured to communicate with your workspace and container registry through the private endpoints.
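
    As a quick sanity check, here is a minimal sketch (using the same v1 SDK as your pipeline script; "<compute_instance_name>" is a placeholder) that inspects the compute target's provisioning state and errors before you submit anything:

        from azureml.core import Workspace
        from azureml.core.compute import ComputeTarget

        # Assumes a config.json for the workspace is available locally
        ws = Workspace.from_config()

        # Placeholder name; use your actual compute instance or cluster
        compute_target = ComputeTarget(workspace=ws, name="<compute_instance_name>")

        # A healthy target reports "Succeeded" and no provisioning errors
        print(compute_target.provisioning_state)
        print(compute_target.provisioning_errors)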

    Make sure that the required outbound ports for Azure Machine Learning are open in your VNet, especially for access to storage (port 443), the container registry, and any external dependencies such as package repositories (for Conda or pip).

    Based on what I have seen on some forums, the issue could also be related to the environment setup, especially given the mix of Conda and pip dependencies. Some suggestions:

    • Check that prophet and its dependencies are installed correctly. It may require specific versions of pystan or cmdstanpy, so you may need to pin these explicitly (see the sketch after this list).
    • Try isolating the prophet installation by keeping its dependencies entirely under pip or entirely under conda, rather than mixing the two.
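
    For example, one way to surface dependency and registry problems early is to pre-build the environment image before submitting the pipeline, so that a Conda/pip resolution failure or a private ACR connectivity issue shows up in the build log instead of as a generic 500. This is only a sketch: it reuses your existing YAML file, and pinning prophet/pystan/cmdstanpy versions inside that file is an assumption you would adjust to whatever resolves in your setup.

        from azureml.core import Environment, Workspace

        ws = Workspace.from_config()

        # Reuse the same conda spec; pin prophet/pystan/cmdstanpy versions in
        # that file first (the exact pins to use are an assumption to verify)
        env = Environment.from_conda_specification(
            name="predc_dev_env",
            file_path="pdc_dev_env.yml",
        )

        # Pre-build the environment image in the workspace's container registry.
        # If the build fails (unreachable private ACR, unresolvable package, ...),
        # the error is reported here rather than as a 500 at job submission.
        build = env.build(workspace=ws)
        build.wait_for_completion(show_output=True)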

    If you're using a custom compute target (like a managed compute cluster), ensure that any required IP addresses (for your workspace or storage) are allowed through your VNet or firewall configuration.

    Some links to help you:

    https://learn.microsoft.com/en-us/answers/questions/1363189/azuremlcompute-job-failed-500-redacted-sometrue

    https://techcommunity.microsoft.com/t5/azure-ml/how-to-build-an-environment-when-your-azure-ml-workspace-is-behind/ba-p/3696967

    https://learn.microsoft.com/en-us/azure/machine-learning/how-to-configure-private-endpoint

    https://learn.microsoft.com/en-us/azure/machine-learning/how-to-secure-training-environments

