Fine-tuning job stuck in "Training" status for over 24 hours - Microsoft Learning Exercise

James 125 Reputation points
2025-04-23T14:08:42.9066667+00:00

Hi there! I need some help with a fine-tuning job I started for a Microsoft Learning module. I'm working through the exercise at https://microsoftlearning.github.io/mslearn-ai-studio/Instructions/05-Finetune-model.html where you fine-tune a language model.

I started a fine-tuning job yesterday evening using the small travel assistant JSONL file from the tutorial, but it's still showing as "Training started" more than 24 hours later.

I'm pretty sure something's wrong because it's a small dataset that shouldn't take this long to train. Do you know if there are any issues with the fine-tuning service right now? On my first training run, I cancelled the job after 3 hours because it was still showing "Running". Any ideas on what might be happening or how I can fix it?

Thanks for your help!

Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

Accepted answer
  1. SriLakshmi C 6,250 Reputation points Microsoft External Staff Moderator
    2025-04-23T17:02:32.4766667+00:00

    Hello @James,

    I understand that your fine-tuning job has been stuck in "Training" status for over 24 hours. There could be several reasons for this. Here are some potential causes and steps you can take:

    Review the job status in the Fine-tuning section of the Azure AI Studio portal. Fine-tuning jobs are sometimes queued because of high demand or limited resources, so refresh the portal to check for any recent updates.
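
    If the portal view does not update, the job can also be queried directly. The following is a minimal sketch, assuming the job runs on an Azure OpenAI resource and the `openai` Python package is installed. The environment variable names and the API version string are placeholders chosen for illustration, and the set of terminal states is an assumption to confirm against the service documentation.

```python
import os

# Assumed terminal states; confirm the exact set in the service documentation.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def make_client():
    """Build an Azure OpenAI client from (assumed) environment variables."""
    from openai import AzureOpenAI  # requires: pip install openai

    return AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-10-21",  # assumption: use a version your resource supports
    )


def summarize_job(client, job_id, max_events=10):
    """Return the job status and its most recent event messages."""
    job = client.fine_tuning.jobs.retrieve(job_id)
    events = client.fine_tuning.jobs.list_events(
        fine_tuning_job_id=job_id, limit=max_events
    )
    return {
        "status": job.status,
        "finished": job.status in TERMINAL_STATES,
        "events": [event.message for event in events.data],
    }
```

    Calling `summarize_job(make_client(), "<your job id>")` returns the status along with the latest event messages, which can help show whether the job is still queued or has actually begun training.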

    Although you mentioned that the dataset is small, it's important to ensure that it meets the minimum requirements for fine-tuning. If the dataset is too small or not well-structured, it might lead to unexpected behavior.
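
    A quick local check of the training file can rule out structural problems before resubmitting. This is a rough sketch, assuming the chat-style JSONL format in which each line holds a `messages` list. The minimum of 10 examples and the allowed roles are assumptions to verify against the current fine-tuning requirements, and the check is not a substitute for the service's own validation.

```python
import json

MIN_EXAMPLES = 10  # assumed minimum; confirm against the current service docs
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role set


def validate_chat_jsonl(lines, min_examples=MIN_EXAMPLES):
    """Return a list of problems found in chat-format JSONL training lines."""
    problems = []
    count = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        count += 1
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {number}: missing or empty 'messages' list")
            continue
        roles = []
        for message in messages:
            if not isinstance(message, dict):
                problems.append(f"line {number}: message is not an object")
                continue
            role = message.get("role")
            roles.append(role)
            if role not in ALLOWED_ROLES:
                problems.append(f"line {number}: unexpected role {role!r}")
            if not isinstance(message.get("content"), str):
                problems.append(f"line {number}: message content is not a string")
        if "assistant" not in roles:
            problems.append(f"line {number}: no assistant message to learn from")
    if count < min_examples:
        problems.append(f"only {count} examples; expected at least {min_examples}")
    return problems
```

    For example, `validate_chat_jsonl(open("training.jsonl", encoding="utf-8"))` (the file name is hypothetical) returns an empty list when no problems are found.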

    If the job continues to hang with no progress, consider canceling and resubmitting it. Since you've already canceled a previous long-running job, resubmitting with validated data and parameters might help rule out edge issues.

    Azure’s compute resources for fine-tuning are shared across tenants and may not be immediately available. Check the Azure Status Page to rule out regional delays or service incidents that might be impacting availability.

    You can also refer to "Check the status of your custom model" and "Troubleshooting for Azure OpenAI fine-tuning".

    I hope this helps. Let me know if you have any further questions.

    Thank you!
