Azure ML job gets stuck in Running status with HF trainer.train()
Hi,
I'm facing a problem in Azure ML when launching a job that fine-tunes a transformer. I'm fine-tuning OpenAI's Whisper model for a speech-to-text task, and I am following this guide carefully: https://huggingface.co/blog/fine-tune-whisper
I wrote the .py script, and its last statement is:
trainer.train()
and I launched an Azure ML job in this way:
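from azure.ai.ml import command

# ml_client is assumed to be an already authenticated azure.ai.ml.MLClient
job = command(
    code="./src",  # location of source code
    command="python main.py",
    display_name="whisper_ft",
    compute="gpu-compute",
    environment="pytorch_cuda_env_custom@latest",
)
ml_client.create_or_update(job)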
but the job stays stuck in Running status, even though the training itself finishes successfully (I verified this by running the script outside of a job). The problem is probably the trainer.train() call, since if I comment it out the job reaches Completed status.
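One way to narrow this down might be to check what keeps the Python process alive after trainer.train() returns. A Python process does not exit while non-daemon threads are still running, so the sketch below lists them. It is only a diagnostic under the assumption that a leftover thread is the cause, which is not confirmed here; the helper name lingering_threads is hypothetical and not part of any library.

```python
import threading


def lingering_threads():
    """Return non-daemon threads, other than the main thread, that are still alive."""
    main = threading.main_thread()
    return [
        t for t in threading.enumerate()
        if t is not main and not t.daemon and t.is_alive()
    ]


# Hypothetical usage at the very end of main.py, right after trainer.train():
for t in lingering_threads():
    # flush=True so the line reaches the job logs even if the process never exits
    print(f"still alive: {t.name!r}", flush=True)
```

If nothing is printed, the hang is probably not caused by a non-daemon thread in the main process, and the cause would have to be looked for elsewhere (for example in child processes).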
Has anyone else run into the same problem?