Automating Blob Uploads
In my travels I hear about a lot of cloud computing patterns. One that I hear frequently is “upload a file and process it”. The file could be financial data, pictures, video, electric meter readings, and lots more. It’s so common, and yet when I Bing (yes, Bing) for blob uploaders, they’re all manual.
Now, I’m sure that lots of people have tackled this problem for their own needs, but I thought I’d put something together and share it with the community. Both client and server source code are here.
Architecture View
Process Flow
Sorry for the eye chart. I’ll try to explain (in order of the basic system flow).
UI Thread
A FileSystemWatcher detects new files in a directory in the local file store. A tracking record is inserted into a SQL Express tracking DB.
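To make the tracking record concrete, here is a minimal sketch of what such a table might look like. It is written in Python with SQLite standing in for SQL Express, and the table name, column names, and state values are my assumptions for illustration, not the schema from the actual source.

```python
import sqlite3
import time

# Hypothetical states for a tracked file; the post does not list the real ones.
DETECTED, UPLOADED, ACKNOWLEDGED = "detected", "uploaded", "acknowledged"

def open_tracking_db(path=":memory:"):
    """Create the tracking table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tracking ("
        " file_path TEXT PRIMARY KEY,"
        " state TEXT NOT NULL,"
        " updated_at REAL NOT NULL)"
    )
    return conn

def record_detected(conn, file_path):
    """Insert a row for a new file; a repeated event for the same path is ignored."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO tracking VALUES (?, ?, ?)",
            (file_path, DETECTED, time.time()),
        )

def set_state(conn, file_path, state):
    """Move a tracked file to a new state."""
    with conn:
        conn.execute(
            "UPDATE tracking SET state = ?, updated_at = ? WHERE file_path = ?",
            (state, time.time(), file_path),
        )

def pending_uploads(conn):
    """Return files that were detected but not yet uploaded."""
    rows = conn.execute(
        "SELECT file_path FROM tracking WHERE state = ? ORDER BY updated_at",
        (DETECTED,),
    )
    return [row[0] for row in rows]
```

Keying the row on the file path means a duplicate watcher event for the same file does not create a second upload.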
Process Files Thread
If a file is big enough that the upload is attempted before the file has been fully copied into the directory, the thread waits until it can obtain a write lock on the file. The file is then uploaded to Azure blob storage using parallel blocks. The block size and the number of simultaneous blocks are configurable through the StorageClient API. Once the file is uploaded, a record is placed into the Notification Queue and the tracking record is updated.
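The parallel block idea can be sketched without any Azure dependency. The Python below uses a hypothetical in-memory store (`InMemoryBlobStore`, with `put_block` and `put_block_list` methods named after the blob service operations) rather than the real StorageClient API. The block size and parallelism defaults are tiny on purpose, so treat it as an illustration of the technique only.

```python
import base64
from concurrent.futures import ThreadPoolExecutor

class InMemoryBlobStore:
    """Hypothetical stand-in for a block blob service: stage blocks, then commit."""

    def __init__(self):
        self.staged = {}     # (blob_name, block_id) -> bytes
        self.committed = {}  # blob_name -> bytes

    def put_block(self, blob_name, block_id, data):
        self.staged[(blob_name, block_id)] = data

    def put_block_list(self, blob_name, block_ids):
        # Committing the ordered list is what stitches the blocks into one blob.
        self.committed[blob_name] = b"".join(
            self.staged[(blob_name, b)] for b in block_ids
        )

def block_id(index):
    """Fixed-width, base64-encoded id so all ids for a blob have the same length."""
    return base64.b64encode(f"{index:08d}".encode()).decode()

def upload_in_blocks(store, blob_name, data, block_size=4, max_parallel=3):
    """Upload missing blocks in parallel, commit, and return how many were sent."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    ids = [block_id(i) for i in range(len(chunks))]
    # Blocks already staged by an earlier, interrupted attempt are skipped.
    missing = [(i, c) for i, c in zip(ids, chunks)
               if (blob_name, i) not in store.staged]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(store.put_block, blob_name, i, c)
                   for i, c in missing]
        for future in futures:
            future.result()  # re-raise any upload error before committing
    store.put_block_list(blob_name, ids)
    return len(missing)
```

Because the commit happens only after every block has been staged, a failure partway through leaves no partial blob visible, and a retry only sends what is missing.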
Azure Worker
The Azure worker detects the notification, determines that the blob is “ok” in some way (application specific), acknowledges receipt (in order to free up client resources), processes the blob (or sets it aside for later), then deletes the notification message.
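The order of those steps matters, so here is a small sketch of it in Python. The queues are plain in-memory containers and `is_ok` and `process` are hypothetical callbacks for the application-specific parts. Leaving a notification in place when the blob is missing or fails the check is my assumption; the real worker may handle that case differently.

```python
from collections import deque

def handle_notification(blob_name, blobs, ack_queue, is_ok, process):
    """Run one notification through the worker steps.

    Returns True when the notification message can be deleted.
    """
    data = blobs.get(blob_name)
    if data is None or not is_ok(data):
        return False             # assumption: keep the message for a later look
    ack_queue.append(blob_name)  # acknowledge early so the client can clean up
    process(data)                # application-specific work
    return True

def run_worker(notifications, blobs, ack_queue, is_ok, process):
    """Drain a deque of notifications; unhandled ones are kept, in order."""
    leftover = deque()
    while notifications:
        name = notifications.popleft()
        if not handle_notification(name, blobs, ack_queue, is_ok, process):
            leftover.append(name)
    notifications.extend(leftover)
```

Acknowledging before the long-running processing step is what lets the client free its local copy without waiting for the work to finish.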
Process Acknowledgements Thread
Receipt of the acknowledgement message triggers deletion of the file in the upload directory and update of the tracking record. Then the acknowledgement message is deleted.
What could go wrong?
In a manual upload scenario, so many controls would get in the way. Automating the process demands that care is taken to ensure fault tolerance. Here are some things that can go wrong and how the architecture protects against them:
- Azure worker is unavailable. (Perhaps it’s being upgraded, it’s very busy, or it only runs during a specific window of time.)
By placing the data into blob storage and creating the notification record, the system is unaffected by absence of the worker. If you need more workers to process the data, they can easily be created and will immediately start taking load. This procedure can be automated or manual.
- Azure worker fails. (Hardware? Program bug?)
The last step in the process is deleting the notification message. If a worker fails before that, the message re-appears in the queue for processing by another worker. You should check the number of times the message has been de-queued to make sure the message or the data itself isn’t the problem.
- Upload fails. (Network connection?)
The upload is protected in a couple of ways. By uploading in blocks, if your upload fails in the middle you only need to upload the missing blocks. The last step of the upload commits the block list, which stitches the component blocks together. Also, the last step in the overall upload process is the update of the tracking record. If the tracking record is not updated, the upload will be attempted again at a later time.
- Client resources stretched. (Too many files in the directory.)
Files to be uploaded can come from many sources, perhaps hundreds at a time. By watching for acknowledgement of receipt from the server side, the program can recycle this resource effectively. This is the main reason the program was written as a set of cooperating threads, so that a long upload wouldn’t prevent keeping the directory cleaned up.
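The worker-failure case in the list above relies on two queue behaviours: a received message is hidden for a while and comes back if nobody deletes it, and each message carries a count of how often it has been handed out. The Python below is a toy model of that behaviour, not the Azure queue client. `VisibilityQueue`, `MAX_DEQUEUES`, and the `poison` list are names I made up for the sketch, and time is passed in explicitly so the example is deterministic.

```python
import itertools

class VisibilityQueue:
    """Toy queue: a received message is hidden for `timeout` seconds and
    reappears unless it is deleted first."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self._ids = itertools.count(1)
        self._messages = {}  # id -> {"body", "visible_at", "dequeue_count"}

    def put(self, body, now=0.0):
        msg_id = next(self._ids)
        self._messages[msg_id] = {
            "body": body, "visible_at": now, "dequeue_count": 0,
        }
        return msg_id

    def get(self, now):
        """Return (id, body, dequeue_count) for a visible message, or None."""
        for msg_id, msg in self._messages.items():
            if msg["visible_at"] <= now:
                msg["visible_at"] = now + self.timeout
                msg["dequeue_count"] += 1
                return msg_id, msg["body"], msg["dequeue_count"]
        return None

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)

MAX_DEQUEUES = 3  # hypothetical threshold; tune for your workload

def receive(queue, poison, now):
    """Return (msg_id, body) to process, or None if nothing is visible.

    Messages handed out more than MAX_DEQUEUES times are set aside in `poison`.
    """
    while True:
        got = queue.get(now)
        if got is None:
            return None
        msg_id, body, count = got
        if count > MAX_DEQUEUES:
            poison.append(body)
            queue.delete(msg_id)
            continue
        return msg_id, body
```

A worker that crashes simply never calls `delete`, so the message becomes visible again after the timeout. A message that keeps coming back ends up in the poison list instead of failing forever.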
Alternatives to the Process
Following on #4 above, if there’s lots of processing to be done on each data file (blob), but there are also lots of files to be uploaded, you can run into trouble. If you don’t have enough workers running (more workers cost more), files could back up on the client and potentially cause problems. By splitting the task load up, for example by acknowledging receipt first and setting the blob aside for later processing, you will be able to acknowledge receipt of files more quickly and clean up on the client more frequently. Just be sure you’re running enough workers to get the overall work done during your window of opportunity.
What about those queues?
Yes, you could use SQL Azure tables to manage your notifications and acknowledgements. Queues have the advantage of being highly available and highly accessible in parallel via HTTP, though this comes at a cost. If you have millions of files to process, these costs are worth considering. On the other hand, presumably your SQL Azure database will be busy with other work and you don’t want to load it down. Also, if you have lots of customers you would need to either wrap access to SQL Azure behind a web service or open its firewall to them all.
What about the FileSystemWatcher?
FSW has a buffer to hold events while they’re being processed by your code. This buffer can be expanded, but not infinitely. So you need to keep the code in your event logic to a minimum. If large numbers of files are being dropped into the upload directory, you can overwhelm the buffer. In a case like this it might make sense to set up multiple incoming directories, multiple upload programs, etc. An alternative to FSW is enumerating files, but this can be slow.
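One way to keep the event logic to a minimum is to have the handler do nothing except hand the path to a queue, and let another thread do the slow work. The Python sketch below shows that shape with the standard library only. `on_created` stands in for the watcher callback, and `catch_up` is a hypothetical startup scan that uses enumeration to pick up files that arrived while the watcher wasn’t running.

```python
import os
import queue
import threading

pending = queue.Queue()  # hand-off between the watcher callback and the worker

def on_created(path):
    """Watcher callback: only enqueue, so the event buffer drains quickly."""
    pending.put(path)

def catch_up(directory):
    """Enqueue files already present in the directory."""
    for entry in os.scandir(directory):
        if entry.is_file():
            on_created(entry.path)

def worker(handle, stop):
    """Drain the queue on a separate thread; `handle` does the slow work."""
    while not stop.is_set() or not pending.empty():
        try:
            path = pending.get(timeout=0.1)
        except queue.Empty:
            continue
        handle(path)
        pending.task_done()
```

A startup scan can enqueue a file that the watcher also reports, so whatever `handle` does should tolerate seeing the same path twice, for instance by checking the tracking record first.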
As always, I’m interested in your thoughts. Comment freely here or send a mail. Full source for both client and server is here.
Anonymous
June 11, 2011
If a file comes in while the FSW is not active, I believe the file will be missed. If that's true, then you have to be sure the code with the FSW is always running, or you need a catch-up mechanism. This is perhaps the one advantage of enumerating files. Good post, thanks.
Anonymous
June 13, 2011
@BillBak - Great point, Bill, and thank you for sharing it. To all you eagle eyes, there's another problem with this approach that I'm dealing with in a follow up. Anyone care to weigh in?