Sophia Li Thanks for the confirmation.I have checked internally and confirm that this issue is resolved. To give more context, it happened due to the job allocated only one node resource, causing jobs that require multiple node resources for fine-tune model training to wait for resources indefinitely.
Our team has rolled back this configuration and issue is resolved.
If the response helped, please do click Accept Answer
and Yes
for was this answer helpful.
Doing so would help other community members with similar issue identify the solution. I highly appreciate your contribution to the community.