Azure ML job stuck in Running status with HF trainer.train()

Luigi Montaleone 0 Reputation points
2024-10-03T09:37:49.5166667+00:00

Hi,

I'm facing a problem in Azure ML when launching a job that fine-tunes a transformer. I'm fine-tuning OpenAI's Whisper model for a speech-to-text task, and I am following this guide carefully: https://huggingface.co/blog/fine-tune-whisper

I created the .py script, and its last statement is:

trainer.train()

and I launched an Azure ML job in this way:

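from azure.ai.ml import command

job = command(
    code="./src",  # location of source code
    command="python main.py",
    display_name="whisper_ft",
    compute="gpu-compute",
    environment="pytorch_cuda_env_custom@latest",
)

# ml_client is assumed to be an authenticated azure.ai.ml.MLClient created earlier
ml_client.create_or_update(job)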

but the job stays stuck in Running status even though the training itself finishes successfully (I verified this by running the script outside of a job). The problem is probably in the trainer.train() call, since the job reaches Completed status when I comment it out.
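
One way to narrow this down, assuming the job stays in Running because the Python process does not exit after trainer.train() returns (an unverified assumption, not something confirmed here), is to print any non-daemon threads still alive at the very end of main.py. A minimal sketch using only the standard library; the helper name live_non_daemon_threads is hypothetical:

```python
import threading


def live_non_daemon_threads():
    """Return the names of live non-daemon threads other than the main thread.

    Non-daemon threads keep the interpreter alive after the main script ends,
    which is one possible (assumed) reason a process never exits.
    """
    main = threading.main_thread()
    return [
        t.name
        for t in threading.enumerate()
        if t is not main and not t.daemon and t.is_alive()
    ]


# Place this after trainer.train() at the end of main.py.
print("Non-daemon threads still alive:", live_non_daemon_threads())
```

If the printed list is not empty, the listed thread names may hint at which component is keeping the process open. An empty list would suggest the cause lies elsewhere.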

Has anyone else run into the same problem?

Azure Machine Learning