Missing logfiles in resumed job

Scheuplein, Joshua 5 Reputation points
2024-09-25T12:00:50.98+00:00

Hello,

I am running a deep learning job in Azure ML Studio on low-priority nodes. To resume training after a node is preempted, I adapted my code so that it automatically continues with the next epoch as soon as a new node becomes available. In a first test run, this produced multiple directories in the "Outputs + logs" section of my job, each containing the "std_log.txt" logfile of a single retry (see the attached screenshot).
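
A minimal sketch of this kind of resume logic, for illustration only and not the actual training code. It assumes the last completed epoch is stored in a small JSON checkpoint file that survives the preemption; the function names, the checkpoint path, and the `train_one_epoch` callback are all hypothetical.

```python
import json
import os


def load_start_epoch(ckpt_path):
    """Return 0 on a fresh run, or the last completed epoch + 1 after a restart."""
    if not os.path.exists(ckpt_path):
        return 0
    with open(ckpt_path) as f:
        return json.load(f)["epoch"] + 1


def save_checkpoint(ckpt_path, epoch):
    """Record a completed epoch; write to a temp file and rename it so that
    a preemption in the middle of the write cannot leave a corrupt checkpoint."""
    tmp_path = ckpt_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"epoch": epoch}, f)
    os.replace(tmp_path, ckpt_path)


def train(ckpt_path, num_epochs, train_one_epoch):
    """Run the remaining epochs, checkpointing after each one.
    `train_one_epoch` is a hypothetical callback taking the epoch index."""
    start = load_start_epoch(ckpt_path)
    for epoch in range(start, num_epochs):
        train_one_epoch(epoch)
        save_checkpoint(ckpt_path, epoch)
    return start
```

With this structure, every retry calls `train` with the same checkpoint path and picks up at the first epoch that has not been completed yet. A real job would also save and restore model and optimizer state, which is omitted here.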

However, I then started another job based on the exact same implementation, and it shows only a single logfile, that of the last run. Azure ML Studio seems to overwrite the previous "std_log.txt" file instead of creating a new directory with a separate logfile for each retry. What could cause this behavior, and how can I ensure that all logfiles are always saved?

Best regards!

[Screenshot: Expected Output]

Azure Machine Learning
