PermissionDeniedError when trying to save a TensorFlow model checkpoint

Hi,
We are using Azure Machine Learning Studio to run pipelines in which computer vision models are trained with TensorFlow (v2.4.0).
Our input data (images & annotations) are stored on our Azure Blob Storage account.
The trained models are saved to the same Azure Blob Storage account.
We have several different pipelines (for different projects) that all worked perfectly fine for the past few months... up until last week.
Every pipeline (that worked before) results in the exact same error now.
Everything (image loading, preprocessing, augmentations, ...) works fine.
The training step starts and the first epoch is trained.
However, as soon as the first epoch finishes training, the error occurs.
tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:157 : Permission denied: /mnt/azureml/cr/j/0ddbaa5dfd4243c4bd18feabd6037209/cap/data-capability/wd/output_84f84eab_univision_ai/pc_ds_2021_v3/refinement-reg/v2/results/128_c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1/export/saved_model/variables/variables_temp/part-00000-of-00001.data-00000-of-00001.tempstate5255274353572806690; Read-only file system
...
Epoch 00001: val_loss improved from inf to 0.10273, saving model to /mnt/azureml/cr/j/0ddbaa5dfd4243c4bd18feabd6037209/cap/data-capability/wd/output_84f84eab_univision_ai/pc_ds_2021_v3/refinement-reg/v2/results/128_c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1/export/saved_model
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.19626808166503906 seconds
...
tensorflow.python.framework.errors_impl.PermissionDeniedError: /mnt/azureml/cr/j/0ddbaa5dfd4243c4bd18feabd6037209/cap/data-capability/wd/output_84f84eab_univision_ai/pc_ds_2021_v3/refinement-reg/v2/results/128_c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1/export/saved_model/variables/variables_temp/part-00000-of-00001.data-00000-of-00001.tempstate5255274353572806690; Read-only file system [Op:SaveV2]
We get a PermissionDeniedError while trying to save a temporary file, apparently because it's a read-only file system.
If we look at this temporary file in the Azure storage account, there is nothing that suggests it is read-only.
There's also no difference in settings between this file and other files that we were able to read/write.
I have already been able to narrow down where to look.
In the training pipeline step we have always used a ModelCheckpoint callback (if the validation result improves on the currently saved checkpoint, the model checkpoint is saved).
When we remove the ModelCheckpoint callback, the error does not occur.
This is not a solution of course, as we do need these model checkpoints.
It does show, however, that the problem is related to these model checkpoints.
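For reference, the callback is configured roughly as follows (a minimal sketch; the checkpoint path and monitored metric are assumptions based on the log output above):

```python
import tensorflow as tf

# Hypothetical output path on the mounted blob storage; the real path is
# generated by the pipeline step (see the log lines above).
checkpoint_path = "/mnt/azureml/.../export/saved_model"

# Save only when the monitored validation loss improves, matching the
# "val_loss improved from inf to 0.10273, saving model to ..." log line.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    monitor="val_loss",
    save_best_only=True,
)

# model.fit(..., callbacks=[checkpoint_cb])
```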
I am not sure what else I can try to solve this issue.
Kind regards
@romungi-MSFT
Thanks for your response!
I do see that runs before the 24th of January were working (the last run was on the 14th of January), and after that date (the first run was on the 31st of January) they no longer work.
However, I am using the Azure ML SDK, not the designer. In my code I have pinned the version of azureml-core to v1.28.0 using a pip requirements file.
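For reference, the pin looks roughly like this when building the run environment from a pip requirements file (a minimal sketch; the environment name and file path are assumptions):

```python
from azureml.core import Environment

# requirements.txt contains a line such as:
#   azureml-core==1.28.0
env = Environment.from_pip_requirements(
    name="training-env",           # hypothetical environment name
    file_path="requirements.txt",  # hypothetical path to the pinned requirements file
)
```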
I have tried several things today:
Are there any other things that we can try?
Kind regards,
Michiel
I assume you are using the default storage container of your workspace, and the Storage Blob Data Contributor role should be present on your storage account.
But could you check from the IAM tab whether this role is assigned on the storage account of your workspace?
@romungi-MSFT
I checked the IAM tab for the storage account of my workspace and the Storage Blob Data Contributor role is assigned there.
@Michiel VAN ACKER Maybe we can try one last thing: check whether reducing the number of characters in the pipeline or run name helps.
I got this pointer from a previous issue reported on the keras repo. Because nothing else changed in your pipeline, I think the length of the file name could be causing the issue. Since the data is on Azure storage, some path length limitations from Windows might be playing a part.
@romungi-MSFT
After reading the issue reported on the keras repo, I changed two things.
Unfortunately, we get the same error again.
The name of the temporary file is quite long (part-00000-of-00001.data-00000-of-00001.tempstate1567952979554311564).
However, we are not able to change it, as this name is generated internally by Keras.
Kind regards
@romungi-MSFT
Is there anything else you can think of to try?
I have sent feedback to Microsoft using the smiley icon in the Machine Learning workspace, but have not received an answer yet.
@Michiel VAN ACKER I think we have checked all possible leads at this point, but the issue seems persistent with your workspace and subscription. If you do not see any response from the feedback option, we can help you with a one-time free support case if you do not have a valid support plan. This will give you direct access to a support engineer who can review your subscription and workspace in more detail and advise on what could be going wrong. Please let me know if you need one. Thanks!!
@romungi-MSFT
I would like to take that option and use the one time free support case.
What do I need to do?
Good morning @romungi-MSFT, I have the exact same problem as in this case: a tensorflow.python.framework.errors_impl.PermissionDeniedError is raised when trying to save TensorFlow model weights.
I have pinned the version of azureml-dataset-runtime (to v1.25.0) and have checked that the appropriate role is assigned on the storage account.
Is there anything else I can do to resolve this issue?
Thank you for your time.
@Joshua Tan I think in this case downgrading azureml-core to a lower version worked for @Michiel VAN ACKER.
Could you check if it is possible to do so in your environment?
@romungi-MSFT Thank you for the quick reply! On my end, I have tried downgrading azureml-core from v1.25.0 to v1.23.0, v1.20.0, and v1.18.0, but all resulted in the same error message.
Hi @Michiel VAN ACKER, thank you for the report.
This issue is indeed caused by the new mounting solution in the new runtime. While it is much faster and more performant, it turned out that support for the rename operation on the mounted drive is missing. TensorFlow saves the model checkpoint to a temporary file and then attempts to rename it, which fails. The error message is very misleading, which is partly why it took us so long to root-cause the issue.
We are disabling the new mounting solution on our side and will re-enable it once we validate rename support. Thank you for your patience, and do not hesitate to reach out if you see any future data access issues.
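In the meantime, one possible workaround (a sketch, not an official fix; the local and mounted paths below are assumptions) is to let the ModelCheckpoint callback write to local node storage, where the rename works, and copy the result to the mounted output afterwards:

```python
import shutil

# Hypothetical paths: a local scratch directory on the compute node, and the
# mounted blob output directory on which the rename operation fails.
local_ckpt_dir = "/tmp/checkpoints/saved_model"
mounted_output_dir = "/mnt/azureml/.../export/saved_model"

# Point tf.keras.callbacks.ModelCheckpoint(filepath=local_ckpt_dir, ...) at the
# local directory so TensorFlow's temp-file-and-rename step stays on the local
# file system, then copy the finished checkpoint to the mount after training;
# a plain copy does not require rename support on the mounted drive.
shutil.copytree(local_ckpt_dir, mounted_output_dir, dirs_exist_ok=True)
```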
I just had a call with a Microsoft Support Engineer and they solved the issue.
Apparently they are rolling out a new version of the AzureML Compute Runtime.
My problem was solved by reverting to an older version of it.