Hi,
We are using Azure Machine Learning Studio to run pipelines that train computer vision models with TensorFlow (v2.4.0).
Our input data (images and annotations) is stored in our Azure Blob Storage account.
The trained models are saved to the same Azure Blob Storage account.
We have several different pipelines (for different projects) that had all worked without problems for months, up until last week.
Every pipeline that worked before now fails with the exact same error.
Everything before training (image loading, preprocessing, augmentations, ...) works fine.
The training step starts and the first epoch is trained.
However, as soon as the first epoch has finished training, the following error occurs:
tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:157 : Permission denied: /mnt/azureml/cr/j/0ddbaa5dfd4243c4bd18feabd6037209/cap/data-capability/wd/output_84f84eab_univision_ai/pc_ds_2021_v3/refinement-reg/v2/results/128_c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1/export/saved_model/variables/variables_temp/part-00000-of-00001.data-00000-of-00001.tempstate5255274353572806690; Read-only file system
...
Epoch 00001: val_loss improved from inf to 0.10273, saving model to /mnt/azureml/cr/j/0ddbaa5dfd4243c4bd18feabd6037209/cap/data-capability/wd/output_84f84eab_univision_ai/pc_ds_2021_v3/refinement-reg/v2/results/128_c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1/export/saved_model
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.19626808166503906 seconds
...
tensorflow.python.framework.errors_impl.PermissionDeniedError: /mnt/azureml/cr/j/0ddbaa5dfd4243c4bd18feabd6037209/cap/data-capability/wd/output_84f84eab_univision_ai/pc_ds_2021_v3/refinement-reg/v2/results/128_c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1/export/saved_model/variables/variables_temp/part-00000-of-00001.data-00000-of-00001.tempstate5255274353572806690; Read-only file system [Op:SaveV2]
We get a PermissionDeniedError while trying to save a temporary file, apparently because the file system is read-only.
When we look at this temporary file in the Azure storage account, nothing indicates that it is read-only.
Its settings also do not differ from those of other files that we can read and write without problems.
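As far as we understand, "Read-only file system" is an error reported by the operating system on the compute node, so it may say more about how the output path is mounted inside the job than about the blob itself. A minimal sketch for checking this from within the pipeline step is shown below. It assumes a Linux compute target with /proc/mounts available; the helper name find_mount is made up for illustration, and mount points containing spaces (escaped in /proc/mounts) are not handled.

```python
import os


def find_mount(path, mounts_text):
    """Return (mount_point, fs_type, options) for the mount containing `path`.

    `mounts_text` is the content of /proc/mounts (Linux only).
    Returns None if no mount point matches.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed lines
        mount_point, fs_type = fields[1], fields[2]
        options = fields[3].split(",")
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest (most specific) mount point; on a tie,
            # prefer the later entry, which was mounted most recently.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type, options)
    return best
```

Inside the job this could be called with the checkpoint directory and the text read from /proc/mounts. An "ro" entry in the returned options would mean the path is mounted read-only on the node, independent of the settings shown in the storage account.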
I have been able to narrow down where to look.
In the training pipeline step we have always used a ModelCheckpoint callback (if the validation loss is better than that of the currently saved checkpoint, a new checkpoint is saved).
When we remove the ModelCheckpoint callback, the error does not occur.
This is not a solution, of course, as we need these model checkpoints.
It does show, however, that the problem is related to saving them.
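To separate TensorFlow from the mount itself, a small probe like the following could be run in the same pipeline step against the output directory. It is only a sketch: probe_checkpoint_writes and the file names it creates are hypothetical, and it merely approximates the write-to-temp-then-rename pattern suggested by the variables_temp path in the error message, not what TensorFlow does internally.

```python
import errno
import os
import uuid


def _write(path, payload):
    """Write `payload` to `path` and force it to the underlying storage."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())


def probe_checkpoint_writes(directory):
    """Try the file operations a checkpoint save roughly needs.

    Returns a list of (step, outcome) tuples, where outcome is "ok" or an
    error description. The probe stops at the first failing step.
    """
    tmp_dir = os.path.join(directory, "probe_temp_" + uuid.uuid4().hex)
    tmp_file = os.path.join(tmp_dir, "part.tempstate")
    final_file = os.path.join(directory, "probe_" + uuid.uuid4().hex)
    steps = [
        ("mkdir", lambda: os.makedirs(tmp_dir)),
        ("write", lambda: _write(tmp_file, b"x" * 1024)),
        ("rename", lambda: os.rename(tmp_file, final_file)),
        ("delete", lambda: os.remove(final_file)),
        ("rmdir", lambda: os.rmdir(tmp_dir)),
    ]
    results = []
    for name, action in steps:
        try:
            action()
            results.append((name, "ok"))
        except OSError as exc:
            code = errno.errorcode.get(exc.errno, "unknown")
            results.append((name, f"{code}: {exc.strerror}"))
            break  # later steps depend on this one
    return results
```

If the probe already fails at the mkdir or write step with EROFS, the problem would lie with the mount rather than with the callback. If all steps succeed, the cause is more likely specific to how the checkpoint is written. Note that a failed probe may leave its temporary files behind.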
I am not sure what else I can try to solve this issue.
Kind regards