
The File System is unpredictable

One of the more frequent questions I answer on StackOverflow is a variation of the following. 

I’m doing XXX with a file, how can I know if the file exists?

The variations include verifying that no one else has the file open, whether the file is in use, whether the file is writable, and so on.  The answer to all of these questions is unfortunately the same.  Simply put, you can't.  The reason is that the fundamental nature of the file system prevents such predictive operations. 

The file system is a resource with multiple levels of control that is shared between all users and processes in the system.  The levels of control include, but are not limited to, file system and sharing permissions.  At any point in time, any entity on the computer may change a file system object or its controls in any number of ways.  For example:

  • The file could be deleted
  • A file could be created at a path where one previously did not exist
  • Permissions could change on the file in such a way that the current process no longer has access
  • Another process could open the file in a way that is not conducive to sharing
  • The user could remove the USB key containing the file
  • The network connection to the mapped drive could be disconnected

Or, in short:

The file system is best viewed as a multi-threaded object over which you have no reliable synchronization capabilities

Many developers, and APIs for that matter, treat the file system as though it were a static resource and assume that what's true at one point in time will remain true later.  Essentially, they use the result of one operation to predict the success or failure of another.  This ignores the possibility of the above actions interleaving between calls.  It leads to code which reads well but executes badly in scenarios where more than one entity is changing the file system.

These problems are best demonstrated by a quick sample.  Let's keep it simple and take a stab at a question I've seen a few times.  The challenge is to write a function which returns all of the text from a file if it exists and an empty string if it does not.  To simplify the problem, let's assume permissions are not an issue, paths are properly formatted, paths point to local drives and no one is randomly ripping out USB keys.  Using the System.IO.File APIs we might construct the following solution.

static string ReadTextOrEmpty(string path) {
    if (File.Exists(path)) {
        return File.ReadAllText(path); // Bug: the file may vanish before this call runs
    } else {
        return String.Empty;
    }
}

This code reads great and at a glance looks correct, but it is actually fundamentally flawed.  The reason is that it depends on the result of File.Exists remaining true for a large portion of the function.  It's being used to predict the success of the call to File.ReadAllText.  However, there is nothing stopping the file from being deleted in between these two calls.  In that case, File.ReadAllText would throw a FileNotFoundException, which is exactly what the check was trying to prevent!

This code is flawed because it's attempting to use one piece of data to make a prediction about the future state of the file system.  This is simply not possible with the way the file system is designed.  It's a shared resource with no reliable synchronization mechanism.  File.Exists would be much better named File.ExistedInTheRecentPast (the name gets much worse if you consider the impact of permissions). 
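The race is easy to demonstrate directly.  Below is a minimal sketch: the explicit File.Delete stands in for another process winning the race in the window between the check and the read (RaceDemo and DemonstrateRace are names of my own invention, not BCL APIs).

```csharp
using System;
using System.IO;

static class RaceDemo {
    // Returns true when the read fails even though File.Exists said
    // true, demonstrating the gap between the check and the use.
    public static bool DemonstrateRace() {
        var path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        File.WriteAllText(path, "hello");

        if (File.Exists(path)) {
            // Stand-in for another process deleting the file in the
            // window between the Exists check and the read.
            File.Delete(path);
            try {
                File.ReadAllText(path);
            } catch (FileNotFoundException) {
                return true; // the check predicted success, the read still failed
            }
        }
        return false;
    }

    static void Main() {
        Console.WriteLine(DemonstrateRace()
            ? "Exists returned true, yet the read failed"
            : "No race observed");
    }
}
```

In a real program the deletion happens on someone else's schedule, which is precisely why it cannot be checked for in advance.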

Knowing this, how could we write ReadTextOrEmpty in a reliable fashion?  Even though you cannot make predictions about the file system, the set of ways an operation can fail is finite.  So instead of attempting to predict the conditions under which the method will succeed, why not just execute the operation and deal with the consequences of failure?  

static string ReadTextOrEmpty(string path) {
    try {
        return File.ReadAllText(path);
    } catch (DirectoryNotFoundException) {
        return String.Empty;
    } catch (FileNotFoundException) {
        return String.Empty;
    }
}

This implementation provides the originally requested behavior.  If the file exists for the duration of the operation, it returns the text of the file; otherwise it returns an empty string. 

In general I find the above pattern is the best way to approach the file system.  Do the operations you want and deal with the consequences of failure in the form of exceptions.  Doing anything else involves an unreliable prediction, and you still must handle the resulting exceptions anyway. 
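The same pattern extends to other operations.  For example, here is a hedged sketch of a delete helper written in that style; FileUtil.TryDelete is a name I made up, not a BCL API, and the exception list covers only the common failure modes.

```csharp
using System;
using System.IO;

static class FileUtil {
    // Attempt the delete and report the outcome, rather than trying
    // to predict success with File.Exists or a permissions check.
    public static bool TryDelete(string path) {
        try {
            File.Delete(path);
            return true; // note: File.Delete on a nonexistent file also succeeds
        } catch (IOException) {
            // e.g. the file is open in another process
            return false;
        } catch (UnauthorizedAccessException) {
            // e.g. the file is read-only or permissions changed
            return false;
        }
    }
}
```

Callers branch on the return value, which reflects what actually happened, instead of on a stale snapshot of the file system.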

If this is the case, then why have File.Exists at all if the results can't be trusted?  It depends on the level of reliability you want to achieve.  In production programs I flag any File.Exists I find as a bug, because reliability is a critical component.  However, you'll see my personal PowerShell configuration scripts littered with calls to File.Exists.  Simply put, I'm a bit lazy in those scripts, because critical reliability is not important when I'm updating my personal .vimrc file.

Comments

  • Anonymous
    December 09, 2009
    I'm thinking of a performance implication of the second approach. For me, I see the if-then-else approach is cheaper than the try-catch one in terms of performance. But I think there must be a trade-off between the reliability and performance costs. It means that the first approach will be chosen if the performance cost is higher than the reliability cost. It means that it is assumed that there's a little chance that a reliability issue would occur. Thus, performance justification is preferred. The other way around for the second approach. What do you think?

  • Anonymous
    December 09, 2009
File.Exists is also useful when testing for flag files.  I.e., a lot of mainframe systems used to create empty files as a trigger for the completion of a batch process.  But this is a pet peeve of mine, as well.  

  • Anonymous
    December 09, 2009
    @Maximilian: the exception version is only slower if the exception is actually thrown.  Most of the time you don't really expect that to happen.  If you do expect it to happen, you're probably in one of those places where the correctness issue is important.

  • Anonymous
    December 10, 2009
    @Joel: yes, you're right. I meant that. I should've been clearer about it :). Thanks for the clarification. I mean I'm just trying to put both approaches in a more granular way. To see if things are put appropriately based on appropriate conditions as well: case-by-case. For example, if I knew I open a read-only file, put locally in the same folder as the application, why should I try the try-catch approach instead of the if-then-else one? Therefore, should I use the second approach for all file related operations, or at least File.ReadAllText() as above? I'm just thinking I shouldn't do that. There are cases when the file system is predictable as well. Anyway, this is just my limited opinion based on my experience. CMIIW.

  • Anonymous
    December 10, 2009
You may always combine both if you expect many cases where the file is unavailable. Stop, not may, but have to. Any file operation may fail and a good program knows how to deal with that. There should be no file operation without exception handling. My limited experience.

  • Anonymous
    December 10, 2009
    Thanks for this.  This is a very good way of looking at the file system.  I never really thought of it that way before.  I will remember "File.ExistedInTheRecentPast".  Thanks!

  • Anonymous
    December 10, 2009
@Maximilian I avoid ever optimizing code for performance reasons until I've actually proved it's a problem with the profiler.  The reason being that developers are notoriously bad at predicting what will be an actual problem.  As a consequence I've spent a good deal of time undoing performance "fixes" which actually caused performance problems.  Additionally, if you don't measure, you have no idea what you fixed.  I'm certainly not denying that exceptions are slower than a simple if block.  But if it's not in a critical path of your application the performance difference is quite possibly unimportant.  

  • Anonymous
    December 10, 2009
@Maximilian (second post).  Even putting a read-only file in the same directory as your application does not make it safe.  There is nothing preventing that file from being deleted by another process, an IO exception occurring during the processing of the file, or a network connection to that location dropping while you are reading.  True, the likelihood of these occurring could be low.  But it's a reliability trade-off that must be made based on your particular app.  

  • Anonymous
    December 11, 2009
The method you are writing should start off with a check for Exists; if no file exists, return String.Empty.  Then, if the file exists, go into the try/catch method.  This will save stack unwinding and other exception-related performance costs, based on the fact that you looked for the file first.  So you are basically flipping how you are using the .Exists method.
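The combination this comment describes might look like the following sketch.  Note that the Exists pre-check is an optimization only; it can go stale at any moment, so the catch blocks still carry the correctness of the method.

```csharp
using System;
using System.IO;

static class CombinedReader {
    public static string ReadTextOrEmpty(string path) {
        // Cheap pre-check: skip the read (and any exception cost)
        // when the file was recently absent.  The check can go stale
        // at any moment, so the handlers below remain necessary.
        if (!File.Exists(path)) {
            return String.Empty;
        }
        try {
            return File.ReadAllText(path);
        } catch (DirectoryNotFoundException) {
            return String.Empty;
        } catch (FileNotFoundException) {
            return String.Empty;
        }
    }
}
```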

  • Anonymous
    December 11, 2009
    The test and test-and-set deoptimization wouldn't be so attractive if Microsoft had followed their own best practices and made the Try* pattern available for these methods (although I think that's really not necessary for ReadAllText, there should be a File.TryOpen).
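File.TryOpen does not exist in the BCL; the following is a hedged sketch of what such a Try-pattern wrapper might look like, built on File.Open and the exceptions it documents (FileEx.TryOpen is a hypothetical name).

```csharp
using System;
using System.IO;

static class FileEx {
    // Hypothetical Try-pattern wrapper; not part of the BCL.
    public static bool TryOpen(string path, FileMode mode, out FileStream stream) {
        try {
            stream = File.Open(path, mode);
            return true;
        } catch (IOException) {
            // Covers FileNotFoundException, DirectoryNotFoundException,
            // and sharing violations, all of which derive from IOException.
            stream = null;
            return false;
        } catch (UnauthorizedAccessException) {
            // e.g. insufficient permissions or a read-only file
            stream = null;
            return false;
        }
    }
}
```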

  • Anonymous
    December 11, 2009
    I understand that exceptions are slower than if statements, but unless you are doing this in a very tight loop, I think it is a bit overboard to base your decision on this performance difference. One might even argue that the try-catch solution will perform BETTER since it only requires hitting the file system once instead of twice.