TensorFlow PermissionDeniedError while running a code on Azure VM and saving checkpoints to Azure Files


Andrii Velikorodnii 1

Hi,
I am using the IMPALA distributed reinforcement learning architecture from DeepMind and trying to train it on separate machines. One has a GPU and runs the central learner; the three others are CPU-heavy machines that each run a bunch of actors. In the IMPALA architecture the central learner has to share its weights with the actors, which make decisions based on the policy estimated by the central learner. To share those weights I use Azure Files.
The OS I am using is Ubuntu 20.04. All VMs and the Azure Files instance are located in the same region.

The problem:

The central learner saves a checkpoint with its weights every 30 seconds to the mounted Azure Files storage. The Azure Files instance is mounted with the code from the "Connect" tab. After some time, and after saving some checkpoints, it gives this error:

Saving checkpoint: /mnt/impala-shared/model-zoo1/checkpoints/default   
Traceback (most recent call last):  
  File "run_impala_distributed.py", line 94, in <module>  
    main()  
  File "run_impala_distributed.py", line 63, in main  
    agent_learner.run()  
  File "/home/azureuser/impala-acme-cage2/impala/core/agent_learner.py", line 150, in run  
    learner.step(p=True)  
  File "/home/azureuser/impala-acme-cage2/impala/core/learning.py", line 178, in step  
    if self.checkpointer.save():  
  File "/home/azureuser/impala-acme-cage2/impala/utils/savers.py", line 144, in save  
    self._checkpoint_manager.save()  
  File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/training/checkpoint_management.py", line 827, in save  
    self._record_state()  
  File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/training/checkpoint_management.py", line 724, in _record_state  
    update_checkpoint_state_internal(  
  File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/training/checkpoint_management.py", line 244, in update_checkpoint_state_internal  
    file_io.atomic_write_string_to_file(coord_checkpoint_filename,  
  File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 645, in atomic_write_string_to_file  
    rename(temp_pathname, filename, overwrite)  
  File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 604, in rename  
    rename_v2(oldname, newname, overwrite)  
  File "/home/azureuser/anaconda3/envs/azureuser/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 620, in rename_v2  
    _pywrap_file_io.RenameFile(  
tensorflow.python.framework.errors_impl.PermissionDeniedError: /mnt/impala-shared/model-zoo1/checkpoints/default/checkpoint.tmp60137726f31c46c288490f8bfb184931; Permission denied  
[reverb/cc/platform/default/server.cc:84] Shutting down replay server
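
For readers following the traceback above: the failing call is the rename at the end of an atomic write (write to a temporary file, then rename it over the target). The sketch below illustrates that general pattern with the Python standard library only. It is not TensorFlow's implementation; the function name `atomic_write_string` is hypothetical, and the `.tmp` plus hex suffix merely mirrors the temporary file name visible in the error message.

```python
import os
import uuid


def atomic_write_string(path, contents):
    """Write `contents` to a temp file next to `path`, then rename it over `path`.

    Hypothetical helper for illustration; not the TensorFlow function.
    """
    # Temp name shaped like the one in the traceback: <target>.tmp<hex>
    temp_path = f"{path}.tmp{uuid.uuid4().hex}"
    with open(temp_path, "w") as f:
        f.write(contents)
    # This rename corresponds to the step that raises "Permission denied"
    # in the traceback. On a local POSIX filesystem it replaces the target
    # in one step; a network share may refuse it, for example if another
    # client holds the target open.
    os.replace(temp_path, path)
```

If the rename fails, the temporary file is left behind, which matches the leftover `checkpoint.tmp...` path reported in the error.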

But at this point it had already saved 75 checkpoints without any issue. It happens on every run, but at a random point: it can save 10 checkpoints and then get permission denied, or save 200 and get permission denied on the 201st. I thought it happened because an actor was restoring the checkpoint while the learner was trying to rename it due to the limit on the number of checkpoints to keep. But I have uncapped the checkpoint limit and the error still occurs. The error doesn't tell me what exactly is broken. Weight sharing is crucial for this architecture, so I must fix it, but I don't know how, because I don't understand what is broken.
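
One possible stopgap while the cause is unknown is to retry the save when the error described above appears, assuming the failure is transient (for example, an actor briefly holding the file open). The following is a minimal sketch under that assumption, not a confirmed fix; `save_with_retry` and its parameters are hypothetical names, and the built-in `PermissionError` stands in for the TensorFlow exception so the example runs without TensorFlow.

```python
import time


def save_with_retry(save_fn, retry_on=(PermissionError,), attempts=5,
                    base_delay=0.5, sleep=time.sleep):
    """Call `save_fn`, retrying with exponential backoff on `retry_on` errors.

    Hypothetical helper: all names and defaults here are assumptions.
    """
    for attempt in range(1, attempts + 1):
        try:
            return save_fn()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: let the original error propagate.
                raise
            # Wait 0.5 s, 1 s, 2 s, ... before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

In the learner this might be called as `save_with_retry(self._checkpoint_manager.save, retry_on=(tf.errors.PermissionDeniedError,))`. A retry only hides the symptom: each retry writes a new checkpoint, and if the error persists across all attempts it is still raised.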

Kind regards

UPD: I saw a somewhat similar question, permissiondeniederror-when-trying-to-save-a-tensor.html, where @Andrei Liakhovich replied that it may be an error in the handling of file renaming. But my problem occurs at random. I hope @romungi-MSFT or @Andrei Liakhovich can look into this problem.

Andrei Liakhovich 11 Reputation points

2022-06-06T18:39:48.95+00:00

Could you please share the run_id for the run where the failure occurs?
This doesn't look like the same issue to me, but I can take a look.
