PermissionDeniedError when trying to save a TensorFlow model checkpoint

Hi,
We are using Azure Machine Learning Studio to run pipelines in which computer vision models are trained with TensorFlow (v2.4.0).
Our input data (images & annotations) are stored on our Azure Blob Storage account.
The trained models are saved to the same Azure Blob Storage account.
We have several different pipelines (for different projects) that all worked perfectly fine for the past few months... up until last week.
Every pipeline (that worked before) results in the exact same error now.
Everything (image loading, preprocessing, augmentations, ...) works fine.
The training step starts and the first epoch is trained.
However, as soon as the first epoch finishes training, the error occurs.
tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:157 : Permission denied: /mnt/azureml/cr/j/0ddbaa5dfd4243c4bd18feabd6037209/cap/data-capability/wd/output_84f84eab_univision_ai/pc_ds_2021_v3/refinement-reg/v2/results/128_c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1/export/saved_model/variables/variables_temp/part-00000-of-00001.data-00000-of-00001.tempstate5255274353572806690; Read-only file system
...
Epoch 00001: val_loss improved from inf to 0.10273, saving model to /mnt/azureml/cr/j/0ddbaa5dfd4243c4bd18feabd6037209/cap/data-capability/wd/output_84f84eab_univision_ai/pc_ds_2021_v3/refinement-reg/v2/results/128_c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1/export/saved_model
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.19626808166503906 seconds
...
tensorflow.python.framework.errors_impl.PermissionDeniedError: /mnt/azureml/cr/j/0ddbaa5dfd4243c4bd18feabd6037209/cap/data-capability/wd/output_84f84eab_univision_ai/pc_ds_2021_v3/refinement-reg/v2/results/128_c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1/export/saved_model/variables/variables_temp/part-00000-of-00001.data-00000-of-00001.tempstate5255274353572806690; Read-only file system [Op:SaveV2]
We get a PermissionDeniedError while trying to save a temporary file, apparently because it's a read-only file system.
If we look at this temporary file in the Azure storage account, there is nothing that suggests it is read-only.
There's also no difference in settings between this file and other files that we were able to read/write.
I have already been able to narrow down where to look.
In the training pipeline step we have always used a ModelCheckpoint callback (if the validation result improves on the currently saved checkpoint, the model checkpoint is saved).
When we remove the ModelCheckpoint callback, the error does not occur.
This is not a solution of course, as we do need these model checkpoints.
It does show, however, that the problem is related to these model checkpoints.
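For reference, the callback is configured roughly as follows (a minimal sketch; the checkpoint path and monitored metric are assumptions based on the log output above):

```python
import tensorflow as tf

# Hypothetical output path on the mounted blob storage; the real path is
# generated by the pipeline step (see the log lines above).
checkpoint_path = "/mnt/azureml/.../export/saved_model"

# Save only when the monitored validation loss improves, matching the
# "val_loss improved from inf to 0.10273, saving model to ..." log line.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    monitor="val_loss",
    save_best_only=True,
)

# model.fit(..., callbacks=[checkpoint_cb])
```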
I am not sure what else I can try to solve this issue.
Kind regards
@romungi-MSFT
Thanks for your response!
I do see that runs before the 24th of January were working (the last run was on the 14th of January), and after that date (the first run was on the 31st of January) they no longer work.
However, I am using the Azure ML SDK, not the designer. In my code I have pinned the version of azureml-core to v1.28.0 using a pip requirements file.
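For reference, the pin looks roughly like this when building the run environment from a pip requirements file (a minimal sketch; the environment name and file path are assumptions):

```python
from azureml.core import Environment

# requirements.txt contains a line such as:
#   azureml-core==1.28.0
env = Environment.from_pip_requirements(
    name="training-env",           # hypothetical environment name
    file_path="requirements.txt",  # hypothetical path to the pinned requirements file
)
```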
I have tried several things today:
Are there any other things that we can try?
Kind regards,
Michiel
I assume you are using the default storage container of your workspace, and the Storage Blob Data Contributor role should be present on your storage account.
But could you check from the IAM tab whether this role is assigned on the storage account of your workspace?
@romungi-MSFT
I checked the IAM tab for the storage account of my workspace and the Storage Blob Data Contributor role is assigned there.
@Michiel VAN ACKER Maybe we can try one last thing: check whether reducing the number of characters in the pipeline or run name helps.
I got this pointer from a previous issue reported on the keras repo. Because nothing else changed in your pipeline, I think the length of the file name could be causing the issue. Since the data is on Azure storage, some path length limitations from Windows might be playing a part.
@romungi-MSFT
After reading the issue reported on the keras repo, I changed two things.
Unfortunately, we get the same error again.
The name of the temporary file is quite long (part-00000-of-00001.data-00000-of-00001.tempstate1567952979554311564).
However, we are not able to change it, as this name is generated internally by Keras.
Kind regards
@romungi-MSFT
Is there anything else you can think of to try?
I have sent feedback to Microsoft using the smiley icon in the Machine Learning workspace, but have not received an answer yet.
@Michiel VAN ACKER I think we have checked all possible leads at this point, but the issue seems persistent with your workspace and subscription. If you do not see any response from the feedback option, we can help you with a one-time free support case if you do not have a valid support plan. This will give you direct access to a support engineer who can review your subscription and workspace in more detail and advise on what could be going wrong. Please let me know if you need one. Thanks!!
@romungi-MSFT
I would like to take that option and use the one time free support case.
What do I need to do?
Good morning @romungi-MSFT, I have the exact same problem as in this case: a tensorflow.python.framework.errors_impl.PermissionDeniedError is raised when trying to save TensorFlow model weights.
I have pinned the version of azureml-dataset-runtime (to v1.25.0) and have checked that the appropriate role is assigned on the storage account.
Is there anything else I can do to resolve this issue?
Thank you for your time.
@Joshua Tan I think in this case downgrading azureml-core to a lower version worked for @Michiel VAN ACKER.
Could you check if it is possible to do so in your environment?
@romungi-MSFT Thank you for the quick reply! On my end, I have tried downgrading azureml-core from v1.25.0 to v1.23.0, v1.20.0, and v1.18.0, but all resulted in the same error message.
Hi @Michiel VAN ACKER, thank you for the report.
This issue is indeed caused by the new mounting solution in the new runtime. While it is much faster and more performant, it turned out that support for the rename operation on the mounted drive is missing. TensorFlow saves the model checkpoint to a temporary file and then attempts to rename it, which fails. The error message is very misleading, which is partly why it took us so long to root-cause the issue.
We are disabling the new mounting solution on our side and will re-enable it once we validate rename support. Thank you for your patience, and do not hesitate to reach out if you see any future data access issues.
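In the meantime, one possible workaround (a sketch, not an official fix; the local and mounted paths below are assumptions) is to let the ModelCheckpoint callback write to local node storage, where the rename works, and copy the result to the mounted output afterwards:

```python
import shutil

# Hypothetical paths: a local scratch directory on the compute node, and the
# mounted blob output directory on which the rename operation fails.
local_ckpt_dir = "/tmp/checkpoints/saved_model"
mounted_output_dir = "/mnt/azureml/.../export/saved_model"

# Point tf.keras.callbacks.ModelCheckpoint(filepath=local_ckpt_dir, ...) at the
# local directory so TensorFlow's temp-file-and-rename step stays on the local
# file system, then copy the finished checkpoint to the mount after training;
# a plain copy does not require rename support on the mounted drive.
shutil.copytree(local_ckpt_dir, mounted_output_dir, dirs_exist_ok=True)
```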
I just had a call with a Microsoft Support Engineer and they solved the issue.
Apparently they are rolling out a new version of the AzureML Compute Runtime.
My problem was solved by reverting to an older version of it.