Trained YOLOv8 model on compute cluster and all metrics are flat

Abhishek Rajendra Jain 25 Reputation points
2025-03-11T13:31:54.7133333+00:00

Hi,

I have been training a YOLOv8 model on a compute cluster. After training completes, all the metrics are flat, as if the model did not learn anything. The training finished successfully after 50 epochs, but the confusion matrix shows no sign of learning, as if no data was used during training, and I don't know what went wrong.

# Run the training

from azure.ai.ml import Input, command

job = command(
    inputs=dict(
        training_data=Input(
            type="uri_folder",
            path="azureml:plandataset:2",
        ),
        model_to_train=Input(
            type="custom_model",
            path="azureml:yolov8m:2"
        )
    ),
    code="/home/azureuser/cloudfiles/code/Users/model_training/training-code",
    command="""
        sed -i "s|path:.*$|path: ${{ inputs.training_data }}|" data.yaml &&
        yolo task=detect train data=data.yaml model=${{ inputs.model_to_train }} epochs=50 batch=4 amp=True project=train-environment name=experiment
    """,
    environment="azureml:train-environment:2",
    compute="mel-compute",
    display_name="train-environment",
    experiment_name="train-environment"
)


ml_client.create_or_update(job)


Dataset folder on my local machine:

[screenshot: local dataset folder structure]

And here's how I uploaded it to the ML workspace:

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
 
# Create AzureML dataset
 
my_data = Data(
    path="dataset",
    type=AssetTypes.URI_FOLDER,
    description="Plans dataset",
    name="plandataset"
)
ml_client.data.create_or_update(my_data)

data.yaml

path: ../datasets/dataset # dataset root dir
train: images/train # train images (relative to 'path') 128 images
val: images/val # val images (relative to 'path') 128 images
test: images/test # test images (optional)

nc: 22
# I am not posting the classes name because of confidentiality


Here's how the dataset path looks in the data asset, with train, test, and valid folders for both images and labels:

[screenshot: data asset folder structure]

Here's the confusion matrix after training completed, showing that the model did not learn anything:

[screenshot: confusion matrix]

args.yaml file

task: detect
mode: train
model: /mnt/azureml/cr/j/213a1becbd584dc98dbd30862504a442/cap/data-capability/wd/INPUT_model_to_train/best.pt
data: data.yaml
epochs: 50
patience: 50
batch: 4
imgsz: 1824
save: true
save_period: -1
cache: false
device: null
workers: 8
project: train-environment
name: experiment
exist_ok: false
pretrained: true
optimizer: auto
verbose: true
seed: 0
deterministic: true
single_cls: false
rect: false
cos_lr: false
close_mosaic: 10
resume: false
amp: true
fraction: 1.0
profile: false
freeze: null
overlap_mask: true
mask_ratio: 4
dropout: 0.0
val: true
split: val
save_json: false
save_hybrid: false
conf: null
iou: 0.7
max_det: 300
half: false
dnn: false
plots: true
source: null
show: false
save_txt: false
save_conf: false
save_crop: false
show_labels: true
show_conf: true
vid_stride: 1
stream_buffer: false
line_width: null
visualize: false
augment: false
agnostic_nms: false
classes: null
retina_masks: false
boxes: true
format: torchscript
keras: false
optimize: false
int8: false
dynamic: false
simplify: false
opset: null
workspace: 4
nms: false
lr0: 0.01
lrf: 0.01
momentum: 0.937
weight_decay: 0.0005
warmup_epochs: 3.0
warmup_momentum: 0.8
warmup_bias_lr: 0.1
box: 7.5
cls: 0.5
dfl: 1.5
pose: 12.0
kobj: 1.0
label_smoothing: 0.0
nbs: 64
hsv_h: 0.015
hsv_s: 0.7
hsv_v: 0.4
degrees: 0.0
translate: 0.1
scale: 0.5
shear: 0.0
perspective: 0.0
flipud: 0.0
fliplr: 0.5
mosaic: 1.0
mixup: 0.0
copy_paste: 0.0
cfg: null
tracker: botsort.yaml
save_dir: train-environment/experiment

How do I debug this? I want to know what went wrong and why.


1 answer

  1. Vikram Singh 2,585 Reputation points Microsoft Employee Moderator
    2025-03-18T05:22:53.1133333+00:00

    Hi @Abhishek Rajendra Jain

    Thank you for your patience and for providing additional details. I understand that training the YOLOv8 model locally yielded good results, but you are facing issues when training on the compute cluster. Let's delve deeper into the possible causes and solutions.

    1. Verify the Environment and Data Integrity:
      1. Environment & Dependency Differences: Azure Compute Clusters may use different versions of dependencies (CUDA, PyTorch, etc.) compared to your local setup. YOLOv8 is sensitive to environment configurations. Action:
        • Compare your local requirements.txt with the Azure environment. Use the pip freeze command locally and cross-check with the Azure cluster’s Conda/Pip setup.
        • Ensure CUDA/cuDNN versions match (e.g., CUDA 11.x for PyTorch 1.10+).
        • Reference: Azure ML Environment Management.
      2. Annotation Checks: Validate that annotations (bounding boxes, labels) are correctly formatted and not corrupted. Overlaying a few label files on their images is a quick way to confirm the boxes line up; see the sketch after this list.
      3. Class Imbalance: Check for missing or mislabeled classes. Ensure all classes in annotations match the YAML configuration.
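      For a quick manual annotation check, a minimal sketch along these lines (the sample paths are placeholders for one of your own training images and its label file) draws the normalized YOLO boxes back onto the image:

          import cv2

          # Placeholder paths: point these at one of your own training samples
          img_path = "dataset/images/train/sample.jpg"
          lbl_path = "dataset/labels/train/sample.txt"

          img = cv2.imread(img_path)
          h, w = img.shape[:2]
          with open(lbl_path) as f:
              for line in f:
                  parts = line.split()
                  if len(parts) < 5:
                      continue  # skip blank lines
                  # YOLO labels: class x_center y_center width height (all normalized)
                  _cls, xc, yc, bw, bh = map(float, parts[:5])
                  x1, y1 = int((xc - bw / 2) * w), int((yc - bh / 2) * h)
                  x2, y2 = int((xc + bw / 2) * w), int((yc + bh / 2) * h)
                  cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
          cv2.imwrite("sample_annotated.jpg", img)

      If the drawn boxes do not line up with the objects, the problem is in the labels rather than the training setup.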
    2. Inspect Training Configuration
      1. Pretrained Weights: Confirm that the model starts from pretrained weights (e.g., yolov8n.pt). Training from scratch without proper initialization often fails:
               model = YOLO('yolov8n.pt')  # Correct
               model = YOLO('yolov8n.yaml')  # Incorrect (unless intentional)
        
      2. Hyperparameters: Validate the learning rate (lr0), batch size, and epochs. Start with default values (e.g., lr0=0.01) and adjust if gradients vanish or explode (check the training logs); a combined example follows this list.
      3. Verbose Logging for Debugging: Add verbose=True to your YOLOv8 training command:
              model.train(..., verbose=True)
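      Putting the pretrained-weights, hyperparameter, and verbose points together, a minimal sanity-check run with the Python API (the values simply mirror the args.yaml above and are illustrative, not a recommendation) might look like:

          from ultralytics import YOLO

          model = YOLO("yolov8m.pt")  # start from pretrained weights
          model.train(
              data="data.yaml",
              epochs=50,
              batch=4,
              imgsz=1824,    # matches the imgsz recorded in args.yaml
              lr0=0.01,      # default initial learning rate
              verbose=True,  # print per-epoch box/cls/dfl losses
          )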
        
    3. Debug with Minimal Examples
      1. Overfit a Small Subset: Train on 10–20 images for 5–10 epochs (the fraction training argument is a convenient way to subsample). If the loss doesn’t drop, there’s a fundamental issue (e.g., data not loading).
      2. Data Loader Inspection: Verify that the images and label files referenced by data.yaml actually resolve to files inside the job before training starts; a minimal check is sketched after this list.
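      Because the job rewrites the path: entry in data.yaml with sed, it is worth confirming that the mounted dataset actually resolves to files. A minimal check (assuming the standard images/ and labels/ folder layout) that can be run at the start of the job:

          import glob
          import os

          import yaml

          with open("data.yaml") as f:
              cfg = yaml.safe_load(f)

          root = cfg["path"]
          for split in ("train", "val"):
              img_dir = os.path.join(root, cfg[split])
              lbl_dir = img_dir.replace("images", "labels")
              n_imgs = len(glob.glob(os.path.join(img_dir, "*")))
              n_lbls = len(glob.glob(os.path.join(lbl_dir, "*.txt")))
              print(f"{split}: {img_dir} -> {n_imgs} images, {n_lbls} label files")

      If either count is zero for the mounted path, the training job is not seeing the data you expect.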
        
    4. Analyze Training Logs
      1. Loss Components: Check if box_loss, cls_loss, and dfl_loss are updating. Flat losses across all components suggest no learning.
      2. Warnings/Errors: Look for CUDA errors, NaN values, or OOM (Out-of-Memory) issues in logs.
      3. Logging and Output Directory Permissions: If the compute cluster lacks write permissions to the output directory, metrics and plots may not be saved even though training runs. Action:
        • Mount the output directory (for example as an OutputFileDatasetConfig in the v1 SDK) or write to Azure’s default ./outputs folder, which auto-uploads to the workspace. The example below accesses the default datastore with the older v1 SDK (azureml.core):
                  from azureml.core import Workspace, Dataset
                  ws = Workspace.from_config()
                  datastore = ws.get_default_datastore()
                  dataset = Dataset.File.from_files(path=(datastore, 'data/'))
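        Since the job in the question is submitted with the v2 SDK (azure.ai.ml), a rough v2-style equivalent (a sketch only; run_output is a placeholder name) is to declare an Output on the command job and point YOLO's project directory at it:

          from azure.ai.ml import Input, Output, command

          job = command(
              inputs=dict(
                  training_data=Input(type="uri_folder", path="azureml:plandataset:2"),
                  model_to_train=Input(type="custom_model", path="azureml:yolov8m:2"),
              ),
              # run_output is an arbitrary name for the mounted output folder
              outputs=dict(run_output=Output(type="uri_folder")),
              code="/home/azureuser/cloudfiles/code/Users/model_training/training-code",
              command=(
                  'sed -i "s|path:.*$|path: ${{inputs.training_data}}|" data.yaml && '
                  "yolo task=detect train data=data.yaml model=${{inputs.model_to_train}} "
                  "epochs=50 batch=4 amp=True project=${{outputs.run_output}} name=experiment"
              ),
              environment="azureml:train-environment:2",
              compute="mel-compute",
          )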
          

    I hope this provides more clarity on how to approach the problem. Please try these steps and let me know if you make any progress; we will do our best to help you out.

    Thanks

