Getting 'something went wrong when calculating the best trial for this sweep job. Please see the trials tab ...' error when I run a sweep job in Azure ML

Karthik Shankar 0 Reputation points
2025-03-28T14:46:19.5366667+00:00

I am running a sweep job to perform hyperparameter tuning in Azure ML using Python SDK v2. Even though the job runs successfully, it does not show me the best run and instead shows this error:

something went wrong when calculating the best trial for this sweep job. Please see the trials tab for accurate best trial reporting.

However, the search space is run over the different combinations of parameter values, and the resulting job completion looks like this:

(Screenshot: the list of completed jobs after the sweep run.)

The individual trials should appear under the ... job in the fourth line, and the sweep job should then tell me what the best trial was.

I used the same code as below for another run, and that job ran fine.

Can someone please let me know where I am going wrong?

The code is as below:

Command:

from azure.ai.ml import command, Input

# Command job that runs train.py with default hyperparameter values.
# env and training_folder are assumed to be defined earlier (not shown in the post).
training_job = command(name='credit_default_train8',
                       display_name='Credit Default Job',
                       description='Credit default training job',
                       environment=env,
                       code=training_folder,
                       inputs={
                           'train_data': Input(type='uri_folder', path='azureml://datastores/workspaceblobstore/paths/train_data'),
                           'test_data': Input(type='uri_folder', path='azureml://datastores/workspaceblobstore/paths/test_data'),
                           'n_estimators': 100,
                           'learning_rate': 0.001,
                       },
                       # outputs={
                       #     'model': Output(type='uri_folder', mode='rw_mount')
                       # },
                       command='''python train.py \
                                  --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} \
                                  --n_estimators ${{inputs.n_estimators}} --learning_rate ${{inputs.learning_rate}}'''
                       )  # add --model ${{outputs.model}} back when the model output is re-enabled
from azure.ai.ml.entities import AmlCompute

cluster_name = 'cpu-cluster'

try:
    # Reuse the cluster if it already exists
    ml_client.compute.get(name=cluster_name)
    print(f'You already have a cluster with name {cluster_name}')
except Exception:
    compute_cluster = AmlCompute(name=cluster_name,
                                 description='Compute Cluster to run sweep job',
                                 min_instances=0,
                                 max_instances=4,
                                 idle_time_before_scale_down=60,
                                 size='Standard_E16s_v3')
    print(f'Creating a new cluster with name {cluster_name}')
    ml_client.compute.begin_create_or_update(compute_cluster)

from azure.ai.ml.sweep import MedianStoppingPolicy, Choice

# Stop under-performing trials once at least 5 evaluations have been reported
stop_policy = MedianStoppingPolicy(delay_evaluation=5,
                                   evaluation_interval=1)

# Replace the fixed hyperparameter values with a search space
command_job_for_sweep = training_job(
    n_estimators=Choice([100, 150, 200]),
    learning_rate=Choice([0.1, 0.001, 1])
)

sweep_job = command_job_for_sweep.sweep(
    primary_metric='f1_score',
    goal='Maximize',
    early_termination_policy=stop_policy,
    compute=cluster_name,
    sampling_algorithm='grid')

sweep_job.experiment_name = 'Sweep-job'
sweep_job.set_limits(max_concurrent_trials=4, timeout=1800, max_total_trials=10)

returned_sweep_job = ml_client.jobs.create_or_update(sweep_job)
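
For reference, one possible way to watch the submitted job from the SDK (a sketch, assuming the same ml_client as above; jobs.stream, status, and studio_url are standard SDK v2 members):

# Stream the sweep job's logs until it finishes, then print its status and studio link
ml_client.jobs.stream(returned_sweep_job.name)

completed_job = ml_client.jobs.get(returned_sweep_job.name)
print('Status:', completed_job.status)
print('Studio URL:', completed_job.studio_url)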

1 answer

  1. JAYA SHANKAR G S 4,035 Reputation points Microsoft External Staff Moderator
    2025-04-07T11:43:55.2466667+00:00

    Hello @Karthik Shankar,

    I was able to reproduce your error by passing an unknown primary metric.

    As per this documentation, you should be logging the metric that you pass to the sweep as primary_metric.

    So, go to the Metrics tab in your training job and check for f1_score.

    (Screenshot: the Metrics tab of a training job.)

    If you don't find it there, log it like below in your training script.

    mlflow.log_metric('f1_score', float(f1_score))
    

    The metric name used when logging must match the name passed as primary_metric.

    You can check the sample training script, which logs Accuracy; similarly, you can calculate the F1 score and log it.
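
    As a rough sketch (not your actual script; it assumes scikit-learn, a GradientBoostingClassifier, and placeholder file and column names), the training script could compute and log the metric like this:

    import argparse
    import mlflow
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import f1_score

    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data', type=str)
    parser.add_argument('--test_data', type=str)
    parser.add_argument('--n_estimators', type=int, default=100)
    parser.add_argument('--learning_rate', type=float, default=0.001)
    args = parser.parse_args()

    # Placeholder file and column names; adapt to your data
    train_df = pd.read_csv(f'{args.train_data}/train.csv')
    test_df = pd.read_csv(f'{args.test_data}/test.csv')
    X_train, y_train = train_df.drop(columns=['default']), train_df['default']
    X_test, y_test = test_df.drop(columns=['default']), test_df['default']

    mlflow.log_param('n_estimators', args.n_estimators)
    mlflow.log_param('learning_rate', args.learning_rate)

    model = GradientBoostingClassifier(n_estimators=args.n_estimators,
                                       learning_rate=args.learning_rate)
    model.fit(X_train, y_train)

    # The metric name must match primary_metric='f1_score' in the sweep job
    score = f1_score(y_test, model.predict(X_test))
    mlflow.log_metric('f1_score', float(score))

    The sweep can only pick a best trial from runs that actually logged a metric with the exact name f1_score.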

    Let me know if you have any queries.

    If the above answer helped you, please accept it and take the feedback survey by clicking Yes.

    Thank you

