Getting 'something went wrong when calculating the best trial for this sweep job. Please see the trials tab ...' error when I run a sweep job in Azure ML

Karthik Shankar 0 Reputation points
2025-03-28T14:46:19.5366667+00:00

I am running a sweep job to perform hyperparameter tuning in Azure ML using Python SDK v2. Even though the job runs successfully, it does not show me the best run and instead shows this error:

something went wrong when calculating the best trial for this sweep job. Please see the trials tab for accurate best trial reporting.

However, the search space is run over the different combinations of parameter values, and the resulting job completion looks like this:

(Screenshot: the list of completed jobs after the sweep run.)

The individual trials should appear under the ... job in the fourth line, and the sweep job should then tell me what the best trial was.

I used the same code as below for another run, and that job ran fine.

Can someone please let me know where I am going wrong?

The code is as below:

Command:

from azure.ai.ml import command, Input

# Command job that runs train.py with default hyperparameter values.
# env and training_folder are assumed to be defined earlier (not shown in the post).
training_job = command(name='credit_default_train8',
                       display_name='Credit Default Job',
                       description='Credit default training job',
                       environment=env,
                       code=training_folder,
                       inputs={
                           'train_data': Input(type='uri_folder', path='azureml://datastores/workspaceblobstore/paths/train_data'),
                           'test_data': Input(type='uri_folder', path='azureml://datastores/workspaceblobstore/paths/test_data'),
                           'n_estimators': 100,
                           'learning_rate': 0.001,
                       },
                       # outputs={
                       #     'model': Output(type='uri_folder', mode='rw_mount')
                       # },
                       command='''python train.py \
                                  --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} \
                                  --n_estimators ${{inputs.n_estimators}} --learning_rate ${{inputs.learning_rate}}'''
                       )  # add --model ${{outputs.model}} back when the model output is re-enabled
from azure.ai.ml.entities import AmlCompute

cluster_name = 'cpu-cluster'

try:
    # Reuse the cluster if it already exists
    ml_client.compute.get(name=cluster_name)
    print(f'You already have a cluster with name {cluster_name}')
except Exception:
    compute_cluster = AmlCompute(name=cluster_name,
                                 description='Compute Cluster to run sweep job',
                                 min_instances=0,
                                 max_instances=4,
                                 idle_time_before_scale_down=60,
                                 size='Standard_E16s_v3')
    print(f'Creating a new cluster with name {cluster_name}')
    ml_client.compute.begin_create_or_update(compute_cluster)

from azure.ai.ml.sweep import MedianStoppingPolicy, Choice

# Stop under-performing trials once at least 5 evaluations have been reported
stop_policy = MedianStoppingPolicy(delay_evaluation=5,
                                   evaluation_interval=1)

# Replace the fixed hyperparameter values with a search space
command_job_for_sweep = training_job(
    n_estimators=Choice([100, 150, 200]),
    learning_rate=Choice([0.1, 0.001, 1])
)

sweep_job = command_job_for_sweep.sweep(
    primary_metric='f1_score',
    goal='Maximize',
    early_termination_policy=stop_policy,
    compute=cluster_name,
    sampling_algorithm='grid')

sweep_job.experiment_name = 'Sweep-job'
sweep_job.set_limits(max_concurrent_trials=4, timeout=1800, max_total_trials=10)

returned_sweep_job = ml_client.jobs.create_or_update(sweep_job)
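
For reference, one possible way to watch the submitted job from the SDK (a sketch, assuming the same ml_client as above; jobs.stream, status, and studio_url are standard SDK v2 members):

# Stream the sweep job's logs until it finishes, then print its status and studio link
ml_client.jobs.stream(returned_sweep_job.name)

completed_job = ml_client.jobs.get(returned_sweep_job.name)
print('Status:', completed_job.status)
print('Studio URL:', completed_job.studio_url)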

1 answer

  1. JAYA SHANKAR G S 4,035 Reputation points Microsoft External Staff Moderator
    2025-04-07T11:43:55.2466667+00:00

    Hello @Karthik Shankar,

    I was able to reproduce your error by passing an unknown primary metric.

    As per this documentation, you should be logging the metric that you pass to the sweep as primary_metric.

    So, go to the Metrics tab in your training job and check for f1_score.

    (Screenshot: the Metrics tab of a training job.)

    If you don't find it there, log it like below in your training script.

    mlflow.log_metric('f1_score', float(f1_score))
    

    The metric name used when logging must match the name passed as primary_metric.

    You can check the sample training script, which logs Accuracy; similarly, you can calculate the F1 score and log it.
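
    As a rough sketch (not your actual script; it assumes scikit-learn, a GradientBoostingClassifier, and placeholder file and column names), the training script could compute and log the metric like this:

    import argparse
    import mlflow
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import f1_score

    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data', type=str)
    parser.add_argument('--test_data', type=str)
    parser.add_argument('--n_estimators', type=int, default=100)
    parser.add_argument('--learning_rate', type=float, default=0.001)
    args = parser.parse_args()

    # Placeholder file and column names; adapt to your data
    train_df = pd.read_csv(f'{args.train_data}/train.csv')
    test_df = pd.read_csv(f'{args.test_data}/test.csv')
    X_train, y_train = train_df.drop(columns=['default']), train_df['default']
    X_test, y_test = test_df.drop(columns=['default']), test_df['default']

    mlflow.log_param('n_estimators', args.n_estimators)
    mlflow.log_param('learning_rate', args.learning_rate)

    model = GradientBoostingClassifier(n_estimators=args.n_estimators,
                                       learning_rate=args.learning_rate)
    model.fit(X_train, y_train)

    # The metric name must match primary_metric='f1_score' in the sweep job
    score = f1_score(y_test, model.predict(X_test))
    mlflow.log_metric('f1_score', float(score))

    The sweep can only pick a best trial from runs that actually logged a metric with the exact name f1_score.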

    Let me know if you have any queries.

    If the above answer helped you, please accept it and take the feedback survey by clicking Yes.

    Thank you

