TensorFlow PermissionDeniedError while running code on an Azure VM and saving checkpoints to Azure Files

Andrii Velikorodnii 1 Reputation point
2022-06-02T13:43:46.067+00:00

Hi,
I am using DeepMind's IMPALA distributed reinforcement learning architecture and trying to train it across separate machines. One machine has a GPU and runs the central learner; three others are CPU-heavy machines that each run a number of actors. In the IMPALA architecture the central learner has to share its weights with the actors, which make decisions based on the policy estimated by the central learner. To share those weights I use Azure Files.
The OS is Ubuntu 20.04. All VMs and the Azure Files instance are located in the same region.

The problem:

The central learner saves a checkpoint with its weights every 30 seconds to the mounted Azure Files storage. The Azure Files share is mounted with the code from the "Connect" tab. After some time, and after a number of checkpoints have been saved successfully, it fails with this error:

Saving checkpoint: /mnt/impala-shared/model-zoo1/checkpoints/default   
Traceback (most recent call last):  
  File "run_impala_distributed.py", line 94, in <module>  
    main()  
  File "run_impala_distributed.py", line 63, in main  
    agent_learner.run()  
  File "/home/azureuser/impala-acme-cage2/impala/core/agent_learner.py", line 150, in run  
    learner.step(p=True)  
  File "/home/azureuser/impala-acme-cage2/impala/core/learning.py", line 178, in step  
    if self.checkpointer.save():  
  File "/home/azureuser/impala-acme-cage2/impala/utils/savers.py", line 144, in save  
    self._checkpoint_manager.save()  
  File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/training/checkpoint_management.py", line 827, in save  
    self._record_state()  
  File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/training/checkpoint_management.py", line 724, in _record_state  
    update_checkpoint_state_internal(  
  File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/training/checkpoint_management.py", line 244, in update_checkpoint_state_internal  
    file_io.atomic_write_string_to_file(coord_checkpoint_filename,  
  File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 645, in atomic_write_string_to_file  
    rename(temp_pathname, filename, overwrite)  
  File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 604, in rename  
    rename_v2(oldname, newname, overwrite)  
  File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 620, in rename_v2  
    _pywrap_file_io.RenameFile(  
tensorflow.python.framework.errors_impl.PermissionDeniedError: /mnt/impala-shared/model-zoo1/checkpoints/default/checkpoint.tmp60137726f31c46c288490f8bfb184931; Permission denied  
[reverb/cc/platform/default/server.cc:84] Shutting down replay server  

But at this point it had already saved 75 checkpoints without any issue. The error occurs on every run, but at a random point: it can save 10 checkpoints and then get permission denied, or save 200 and get permission denied on the 201st. I thought it happened because an actor was restoring the checkpoint while the learner was trying to rename it due to the limit on the number of checkpoints to keep. But I have uncapped the checkpoint limit and the error still occurs. The error does not tell me what exactly is broken. Weight sharing is crucial for this architecture, so I must fix this, but I don't know how, because I don't understand what is failing.
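
Since the failure is intermittent, one possible stopgap (not a fix for the root cause) is to retry the save when it fails. The following is a minimal sketch under stated assumptions: `save_with_retry`, `retry_on`, `attempts` and `delay_s` are hypothetical names, not part of IMPALA or TensorFlow. In the learner, `save_fn` would presumably be `self._checkpoint_manager.save` and `retry_on` would be `(tf.errors.PermissionDeniedError,)`; the built-in `PermissionError` is used as the default only to keep the example self-contained.

```python
import time


def save_with_retry(save_fn, retry_on=(PermissionError,), attempts=5, delay_s=0.5):
    """Call save_fn(), retrying when one of the retry_on exceptions is raised.

    Hypothetical helper: with TensorFlow, retry_on would presumably be
    (tf.errors.PermissionDeniedError,) and save_fn the manager's save method.
    """
    for attempt in range(1, attempts + 1):
        try:
            return save_fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay_s * attempt)  # linear backoff before the next try
```

Whether a retry actually succeeds depends on the cause. If another process is briefly holding the file, a short wait may be enough; if the share itself is rejecting the rename, it will not help.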

Kind regards

UPD: I saw a somewhat similar question, permissiondeniederror-when-trying-to-save-a-tensor.html, where @Andrei Liakhovich replied that it may be an error in the handling of file renaming. But my problem occurs at random. I hope @romungi-MSFT or @Andrei Liakhovich can look into this problem.
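
For reference, the rename step named in the traceback (`atomic_write_string_to_file` followed by `rename`) appears to follow the usual write-to-temp-then-rename pattern. This is a plain-Python sketch of that pattern, not TensorFlow's code: `atomic_write_string` is a hypothetical name, and the temp-file naming (`.tmp` plus a hex UUID) is inferred from the `checkpoint.tmp6013...` path in the error message.

```python
import os
import uuid


def atomic_write_string(filename, contents):
    """Write contents to a temporary file, then rename it over filename."""
    # Temp name modeled on the path seen in the error message (assumption).
    temp_pathname = filename + ".tmp" + uuid.uuid4().hex
    with open(temp_pathname, "w") as f:
        f.write(contents)
    try:
        # Replace the target in one step; this corresponds to the rename
        # that raises "Permission denied" in the traceback.
        os.replace(temp_pathname, filename)
    except OSError:
        os.remove(temp_pathname)  # do not leave the temp file behind
        raise
```

In this pattern the rename has to overwrite the existing `checkpoint` file, so anything that has the target open at that moment (for example an actor reading it over the share) is a plausible source of an intermittent failure.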

Azure Virtual Machines