TensorFlow PermissionDeniedError when saving checkpoints to Azure Files from an Azure VM
Hi,
I am using the IMPALA distributed reinforcement learning architecture from DeepMind and trying to train it across separate machines. One has a GPU and runs the central learner; three others are CPU-heavy machines that each run a group of actors. In the IMPALA architecture the central learner has to share its weights with the actors, which make decisions based on the policy estimated by the central learner. To share those weights I use Azure Files.
The OS is Ubuntu 20.04. All VMs and the Azure Files share are located in the same region.
The problem:
The central learner saves a checkpoint with its weights every 30 seconds to the mounted Azure Files share. The share is mounted with the script from the "Connect" tab. After some time, and after saving a number of checkpoints successfully, it fails with this error:
Saving checkpoint: /mnt/impala-shared/model-zoo1/checkpoints/default
Traceback (most recent call last):
File "run_impala_distributed.py", line 94, in <module>
main()
File "run_impala_distributed.py", line 63, in main
agent_learner.run()
File "/home/azureuser/impala-acme-cage2/impala/core/agent_learner.py", line 150, in run
learner.step(p=True)
File "/home/azureuser/impala-acme-cage2/impala/core/learning.py", line 178, in step
if self.checkpointer.save():
File "/home/azureuser/impala-acme-cage2/impala/utils/savers.py", line 144, in save
self._checkpoint_manager.save()
File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/training/checkpoint_management.py", line 827, in save
self._record_state()
File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/training/checkpoint_management.py", line 724, in _record_state
update_checkpoint_state_internal(
File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/training/checkpoint_management.py", line 244, in update_checkpoint_state_internal
file_io.atomic_write_string_to_file(coord_checkpoint_filename,
File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 645, in atomic_write_string_to_file
rename(temp_pathname, filename, overwrite)
File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 604, in rename
rename_v2(oldname, newname, overwrite)
File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 620, in rename_v2
_pywrap_file_io.RenameFile(
tensorflow.python.framework.errors_impl.PermissionDeniedError: /mnt/impala-shared/model-zoo1/checkpoints/default/checkpoint.tmp60137726f31c46c288490f8bfb184931; Permission denied
[reverb/cc/platform/default/server.cc:84] Shutting down replay server
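For context, the traceback ends in `atomic_write_string_to_file`, which appears to write the `checkpoint` state file under a temporary name and then rename it over the real one. Below is a minimal standard-library sketch of that pattern as I understand it. The function name `atomic_write_text` and the demo paths are my own and hypothetical; this is not TensorFlow's actual code.

```python
import os
import tempfile
import uuid


def atomic_write_text(path: str, contents: str) -> None:
    """Write contents to a temp file next to path, then rename it over path."""
    tmp_path = f"{path}.tmp{uuid.uuid4().hex}"
    with open(tmp_path, "w") as f:
        f.write(contents)
    # The rename is the step that fails in the traceback above. My assumption
    # is that on an SMB-mounted share the server can refuse it while another
    # client still has the target file open.
    os.replace(tmp_path, path)


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()  # local temp dir, not the Azure Files mount
    target = os.path.join(demo_dir, "checkpoint")
    atomic_write_text(target, 'model_checkpoint_path: "ckpt-1"\n')
    atomic_write_text(target, 'model_checkpoint_path: "ckpt-2"\n')
    with open(target) as f:
        print(f.read())
```

On a local disk this overwrite always succeeds for me, which is why I suspect the behaviour is specific to the mounted share.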
At that point it had already saved 75 checkpoints without any issue. It happens on every run, but at a random point: it can save 10 checkpoints and then get permission denied, or save 200 and get permission denied on the 201st. My first guess was that an actor was restoring the checkpoint while the learner was renaming it because of the limit on the number of checkpoints to keep. But I have removed the checkpoint limit and the error still occurs. The error gives me no indication of what exactly is broken. Weight sharing is crucial for this architecture, so I must fix it, but I don't know how because I don't understand the cause.
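One stopgap I am considering is retrying the save with a short backoff. The sketch below is only an illustration: `save_with_retry` and its parameters are hypothetical names of mine, not part of TensorFlow or IMPALA, and it would only hide the symptom rather than explain it.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def save_with_retry(
    save_fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (PermissionError,),
    attempts: int = 5,
    initial_delay_s: float = 0.5,
) -> T:
    """Call save_fn, retrying with exponential backoff on the given errors."""
    delay = initial_delay_s
    for attempt in range(1, attempts + 1):
        try:
            return save_fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay)
            delay *= 2  # wait longer before the next attempt
    raise ValueError("attempts must be >= 1")
```

In my code I would pass `self._checkpoint_manager.save` as `save_fn` and `(tf.errors.PermissionDeniedError,)` as `retry_on`. I have not verified that this is safe if an actor reads the checkpoint directory between attempts.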
Kind regards
UPD: I saw a similar question (permissiondeniederror-when-trying-to-save-a-tensor.html) where @Andrei Liakhovich replied that it may be an error in how file renaming is handled. But my problem occurs at random points. I hope @romungi-MSFT or @Andrei Liakhovich can take a look at this.