Trained YOLOv8 model on compute cluster and all metrics are flat

Abhishek Rajendra Jain 25 Reputation points
2025-03-11T13:31:54.7133333+00:00

Hi,

I have been training a YOLOv8 model on a compute cluster. After training completes, all the metrics are flat, as if the model did not learn anything. The training finished successfully after 50 epochs, but the confusion matrix shows no sign of learning, as if no data was used during training, and I don't know what went wrong.

# Run the training

from azure.ai.ml import Input, command

job = command(
    inputs=dict(
        training_data=Input(
            type="uri_folder",
            path="azureml:plandataset:2",
        ),
        model_to_train=Input(
            type="custom_model",
            path="azureml:yolov8m:2"
        )
    ),
    code="/home/azureuser/cloudfiles/code/Users/model_training/training-code",
    command="""
        sed -i "s|path:.*$|path: ${{ inputs.training_data }}|" data.yaml &&
        yolo task=detect train data=data.yaml model=${{ inputs.model_to_train }} epochs=50 batch=4 amp=True project=train-environment name=experiment
    """,
    environment="azureml:train-environment:2",
    compute="mel-compute",
    display_name="train-environment",
    experiment_name="train-environment"
)


ml_client.create_or_update(job)


Dataset folder on my local machine:

[screenshot: local dataset folder structure]

And here's how I uploaded it to the ML workspace:

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
 
# Create AzureML dataset
 
my_data = Data(
    path="dataset",
    type=AssetTypes.URI_FOLDER,
    description="Plans dataset",
    name="plandataset"
)
ml_client.data.create_or_update(my_data)

data.yaml

path: ../datasets/dataset # dataset root dir
train: images/train # train images (relative to 'path') 128 images
val: images/val # val images (relative to 'path') 128 images
test: images/test # test images (optional)

nc: 22
# I am not posting the classes name because of confidentiality


Here's how the dataset path looks in the data asset, with train, test, and valid folders for both images and labels:

[screenshot: data asset folder structure]

Here's the confusion matrix after training completed, showing that the model did not learn anything:

[screenshot: confusion matrix]

args.yaml file

task: detect
mode: train
model: /mnt/azureml/cr/j/213a1becbd584dc98dbd30862504a442/cap/data-capability/wd/INPUT_model_to_train/best.pt
data: data.yaml
epochs: 50
patience: 50
batch: 4
imgsz: 1824
save: true
save_period: -1
cache: false
device: null
workers: 8
project: train-environment
name: experiment
exist_ok: false
pretrained: true
optimizer: auto
verbose: true
seed: 0
deterministic: true
single_cls: false
rect: false
cos_lr: false
close_mosaic: 10
resume: false
amp: true
fraction: 1.0
profile: false
freeze: null
overlap_mask: true
mask_ratio: 4
dropout: 0.0
val: true
split: val
save_json: false
save_hybrid: false
conf: null
iou: 0.7
max_det: 300
half: false
dnn: false
plots: true
source: null
show: false
save_txt: false
save_conf: false
save_crop: false
show_labels: true
show_conf: true
vid_stride: 1
stream_buffer: false
line_width: null
visualize: false
augment: false
agnostic_nms: false
classes: null
retina_masks: false
boxes: true
format: torchscript
keras: false
optimize: false
int8: false
dynamic: false
simplify: false
opset: null
workspace: 4
nms: false
lr0: 0.01
lrf: 0.01
momentum: 0.937
weight_decay: 0.0005
warmup_epochs: 3.0
warmup_momentum: 0.8
warmup_bias_lr: 0.1
box: 7.5
cls: 0.5
dfl: 1.5
pose: 12.0
kobj: 1.0
label_smoothing: 0.0
nbs: 64
hsv_h: 0.015
hsv_s: 0.7
hsv_v: 0.4
degrees: 0.0
translate: 0.1
scale: 0.5
shear: 0.0
perspective: 0.0
flipud: 0.0
fliplr: 0.5
mosaic: 1.0
mixup: 0.0
copy_paste: 0.0
cfg: null
tracker: botsort.yaml
save_dir: train-environment/experiment

How do I debug this? I want to know what went wrong and why.


1 answer

  1. Vikram Singh 2,585 Reputation points Microsoft Employee Moderator
    2025-03-18T05:22:53.1133333+00:00

    Hi @Abhishek Rajendra Jain

    Thank you for your patience and for providing additional details. I understand that training the YOLOv8 model locally yielded good results, but you are facing issues when training on the compute cluster. Let's delve deeper into the possible causes and solutions.

    1. Verify the Environment and Data Integrity:
      1. Environment & Dependency Differences: Azure Compute Clusters may use different versions of dependencies (CUDA, PyTorch, etc.) compared to your local setup. YOLOv8 is sensitive to environment configurations. Action:
        • Compare your local requirements.txt with the Azure environment. Use the pip freeze command locally and cross-check with the Azure cluster’s Conda/Pip setup.
        • Ensure CUDA/cuDNN versions match (e.g., CUDA 11.x for PyTorch 1.10+).
        • Reference: Azure ML Environment Management.
      2. Annotation Checks: Validate that annotations (bounding boxes, labels) are correctly formatted and not corrupted. Overlaying a few label files on their images is a quick way to confirm the boxes line up; see the sketch after this list.
      3. Class Imbalance: Check for missing or mislabeled classes. Ensure all classes in annotations match the YAML configuration.
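      For a quick manual annotation check, a minimal sketch along these lines (the sample paths are placeholders for one of your own training images and its label file) draws the normalized YOLO boxes back onto the image:

          import cv2

          # Placeholder paths: point these at one of your own training samples
          img_path = "dataset/images/train/sample.jpg"
          lbl_path = "dataset/labels/train/sample.txt"

          img = cv2.imread(img_path)
          h, w = img.shape[:2]
          with open(lbl_path) as f:
              for line in f:
                  parts = line.split()
                  if len(parts) < 5:
                      continue  # skip blank lines
                  # YOLO labels: class x_center y_center width height (all normalized)
                  _cls, xc, yc, bw, bh = map(float, parts[:5])
                  x1, y1 = int((xc - bw / 2) * w), int((yc - bh / 2) * h)
                  x2, y2 = int((xc + bw / 2) * w), int((yc + bh / 2) * h)
                  cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
          cv2.imwrite("sample_annotated.jpg", img)

      If the drawn boxes do not line up with the objects, the problem is in the labels rather than the training setup.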
    2. Inspect Training Configuration
      1. Pretrained Weights: Confirm that the model starts from pretrained weights (e.g., yolov8n.pt). Training from scratch without proper initialization often fails:
               model = YOLO('yolov8n.pt')  # Correct
               model = YOLO('yolov8n.yaml')  # Incorrect (unless intentional)
        
      2. Hyperparameters: Validate the learning rate (lr0), batch size, and epochs. Start with default values (e.g., lr0=0.01) and adjust if gradients vanish or explode (check the training logs); a combined example follows this list.
      3. Verbose Logging for Debugging: Add verbose=True to your YOLOv8 training command:
              model.train(..., verbose=True)
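      Putting the pretrained-weights, hyperparameter, and verbose points together, a minimal sanity-check run with the Python API (the values simply mirror the args.yaml above and are illustrative, not a recommendation) might look like:

          from ultralytics import YOLO

          model = YOLO("yolov8m.pt")  # start from pretrained weights
          model.train(
              data="data.yaml",
              epochs=50,
              batch=4,
              imgsz=1824,    # matches the imgsz recorded in args.yaml
              lr0=0.01,      # default initial learning rate
              verbose=True,  # print per-epoch box/cls/dfl losses
          )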
        
    3. Debug with Minimal Examples
      1. Overfit a Small Subset: Train on 10–20 images for 5–10 epochs (the fraction training argument is a convenient way to subsample). If the loss doesn’t drop, there’s a fundamental issue (e.g., data not loading).
      2. Data Loader Inspection: Verify that the images and label files referenced by data.yaml actually resolve to files inside the job before training starts; a minimal check is sketched after this list.
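      Because the job rewrites the path: entry in data.yaml with sed, it is worth confirming that the mounted dataset actually resolves to files. A minimal check (assuming the standard images/ and labels/ folder layout) that can be run at the start of the job:

          import glob
          import os

          import yaml

          with open("data.yaml") as f:
              cfg = yaml.safe_load(f)

          root = cfg["path"]
          for split in ("train", "val"):
              img_dir = os.path.join(root, cfg[split])
              lbl_dir = img_dir.replace("images", "labels")
              n_imgs = len(glob.glob(os.path.join(img_dir, "*")))
              n_lbls = len(glob.glob(os.path.join(lbl_dir, "*.txt")))
              print(f"{split}: {img_dir} -> {n_imgs} images, {n_lbls} label files")

      If either count is zero for the mounted path, the training job is not seeing the data you expect.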
        
    4. Analyze Training Logs
      1. Loss Components: Check if box_loss, cls_loss, and dfl_loss are updating. Flat losses across all components suggest no learning.
      2. Warnings/Errors: Look for CUDA errors, NaN values, or OOM (Out-of-Memory) issues in logs.
      3. Logging and Output Directory Permissions: If the compute cluster lacks write permissions to the output directory, metrics and plots may not be saved even though training runs. Action:
        • Mount the output directory (for example as an OutputFileDatasetConfig in the v1 SDK) or write to Azure’s default ./outputs folder, which auto-uploads to the workspace. The example below accesses the default datastore with the older v1 SDK (azureml.core):
                  from azureml.core import Workspace, Dataset
                  ws = Workspace.from_config()
                  datastore = ws.get_default_datastore()
                  dataset = Dataset.File.from_files(path=(datastore, 'data/'))
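        Since the job in the question is submitted with the v2 SDK (azure.ai.ml), a rough v2-style equivalent (a sketch only; run_output is a placeholder name) is to declare an Output on the command job and point YOLO's project directory at it:

          from azure.ai.ml import Input, Output, command

          job = command(
              inputs=dict(
                  training_data=Input(type="uri_folder", path="azureml:plandataset:2"),
                  model_to_train=Input(type="custom_model", path="azureml:yolov8m:2"),
              ),
              # run_output is an arbitrary name for the mounted output folder
              outputs=dict(run_output=Output(type="uri_folder")),
              code="/home/azureuser/cloudfiles/code/Users/model_training/training-code",
              command=(
                  'sed -i "s|path:.*$|path: ${{inputs.training_data}}|" data.yaml && '
                  "yolo task=detect train data=data.yaml model=${{inputs.model_to_train}} "
                  "epochs=50 batch=4 amp=True project=${{outputs.run_output}} name=experiment"
              ),
              environment="azureml:train-environment:2",
              compute="mel-compute",
          )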
          

    I hope this provides more clarity on how to approach the problem. Please try these steps and let me know if you make any progress; we will do our best to help you out.

    Thanks

