Aleks Adamovic, I have checked internally and can confirm that this issue is resolved. To give more context, it happened because the job was allocated only one node resource, which caused jobs that require multiple node resources for model fine-tuning to wait for resources indefinitely.
Our team has rolled back this configuration, and the issue is now resolved.
Please check again now and let us know if you are still seeing this issue.
Thanks for your time and patience on this issue.