Working with Zip Files in .NET [Richard Lee]
Before getting started, I’ll introduce myself. My name is Richard Lee, and I’m a developer intern on the BCL for the summer. I’ve only been here for a few weeks, but it’s been great working here. The people, the environment, and my project are all great. Speaking of which, my project is to add general purpose .NET APIs for reading and writing Zip files, which we’re considering adding to the next version of the .NET Framework.
The most common Zip tasks are extracting to a directory and archiving a directory. For these mainline scenarios, we have static convenience methods. The following code takes all of the files in the Zip file, photos.zip, and extracts them to a folder on the file system:
ZipArchive.ExtractToDirectory("photos.zip", @"photos\summer2010");
This code does the reverse, putting all of the files in the folder into the Zip file:
ZipArchive.CreateFromDirectory(@"docs\attach", "attachment.zip");
For more sophisticated manipulations of Zip archives, there are two main classes. ZipArchive represents a Zip archive, which is a collection of entries, and ZipArchiveEntry represents a single archived file entry. The following code extracts only the text files from the given archive:
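// 'directory' is the destination folder; it is not defined in this snippet and is assumed to exist.
using (var archive = new ZipArchive("data.zip"))
{
    foreach (var entry in archive.Entries)
    {
        // Match the ".txt" extension regardless of case.
        if (entry.FullName.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
        {
            entry.ExtractToFile(Path.Combine(directory, entry.FullName));
        }
    }
}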
Zip archives can also be created on the fly. This example creates a new archive containing two entries: a readme whose contents are written directly into the archive, with no corresponding file on disk, and a file copied from the file system.
using (var archive = new ZipArchive("new.zip", ZipArchiveMode.Create))
{
    // Write the readme straight into the archive; no file on disk is needed.
    var readmeEntry = archive.CreateEntry("Readme.txt");
    using (var writer = new StreamWriter(readmeEntry.Open()))
    {
        writer.WriteLine("Included files: ");
        writer.WriteLine("data.dat");
    }
    // Add an existing file from the file system under the entry name "data.dat".
    archive.CreateEntryFromFile("data.dat", "data.dat");
}
The ZipArchive class supports three modes:
- In Read mode, data is read from the file on demand, using only a small buffer.
- In Create mode, data is written directly to the file using only a small buffer. Only one entry may be held open for writing at a time.
- In Update mode, it is possible to read from and write to existing archives, as well as rename or delete entries. This mode requires loading the entire archive into memory, so we recommend using it only with small archives, and only when this functionality is needed.
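The three modes can be tried without touching the file system. The following is a minimal sketch, assuming the Stream constructors, CreateEntry, GetEntry, Entries, Open and Delete keep the shape shown in the proposed listing later in this post. Those members exist with the same signatures in the System.IO.Compression.ZipArchive type that ships with current .NET, which is what the sketch compiles against. The class and method names (ZipModesSketch, CreateArchive, ListEntries, DeleteEntry) are made up for illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class ZipModesSketch
{
    // Create mode: write two small text entries into a fresh in-memory archive.
    public static byte[] CreateArchive()
    {
        using (var buffer = new MemoryStream())
        {
            // leaveOpen = true keeps the MemoryStream usable after the archive is disposed.
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Create, true))
            {
                // Only one entry is open for writing at a time.
                WriteText(archive.CreateEntry("Readme.txt"), "Included files: data.txt");
                WriteText(archive.CreateEntry("data.txt"), "some data");
            }
            return buffer.ToArray();
        }
    }

    // Read mode: list the entry names of an existing archive.
    public static string[] ListEntries(byte[] zipBytes)
    {
        using (var archive = new ZipArchive(new MemoryStream(zipBytes), ZipArchiveMode.Read))
        {
            return archive.Entries.Select(e => e.FullName).ToArray();
        }
    }

    // Update mode: delete one entry and return the rewritten archive.
    public static byte[] DeleteEntry(byte[] zipBytes, string entryName)
    {
        using (var buffer = new MemoryStream())
        {
            // Copy into an expandable stream; Update mode needs read, write and seek.
            buffer.Write(zipBytes, 0, zipBytes.Length);
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Update, true))
            {
                var entry = archive.GetEntry(entryName);
                if (entry != null)
                {
                    entry.Delete();
                }
            }
            // The archive is rewritten to 'buffer' when it is disposed.
            return buffer.ToArray();
        }
    }

    private static void WriteText(ZipArchiveEntry entry, string text)
    {
        using (var writer = new StreamWriter(entry.Open()))
        {
            writer.Write(text);
        }
    }

    public static void Main()
    {
        byte[] zip = CreateArchive();
        Console.WriteLine(string.Join(", ", ListEntries(zip)));
        byte[] updated = DeleteEntry(zip, "data.txt");
        Console.WriteLine(string.Join(", ", ListEntries(updated)));
    }
}
```

Because Update mode buffers the whole archive, DeleteEntry copies the bytes into an expandable MemoryStream first, so the rewritten archive can be written back when the ZipArchive is disposed.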
Below is our current thinking on what the public API listing will look like (note that this hasn’t been finalized yet).
namespace System.IO.Compression
{
    public enum ZipArchiveMode { Read, Create, Update }

    public class ZipArchive : IDisposable {
        // Constructors
        public ZipArchive(String path);
        public ZipArchive(String path, ZipArchiveMode mode);
        public ZipArchive(Stream stream);
        public ZipArchive(Stream stream, ZipArchiveMode mode);
        public ZipArchive(Stream stream, ZipArchiveMode mode, Boolean leaveOpen);
        // Properties
        public ReadOnlyCollection<ZipArchiveEntry> Entries { get; }
        public ZipArchiveMode Mode { get; }
        // Instance methods
        public ZipArchiveEntry GetEntry(String entryName);
        public ZipArchiveEntry CreateEntry(String entryName);
        public void Dispose();
        protected virtual void Dispose(Boolean disposing);
        public override String ToString();
        // Instance convenience methods
        public ZipArchiveEntry CreateEntryFromFile(String sourceFileName, String entryName);
        public void ExtractToDirectory(String destinationDirectoryName);
        // Static convenience methods
        public static void CreateFromDirectory(String sourceDirectoryName, String destinationArchive);
        public static void CreateFromDirectory(String sourceDirectoryName, String destinationArchive, Boolean includeBaseDirectory);
        public static void ExtractToDirectory(String sourceArchive, String destinationDirectoryName);
    }

    public class ZipArchiveEntry {
        // Properties
        public DateTimeOffset LastWriteTime { get; set; }
        public String FullName { get; }
        public String Name { get; }
        public Int64 Length { get; }
        public Int64 CompressedLength { get; }
        public ZipArchive Archive { get; }
        // Methods
        public Stream Open();
        public void Delete();
        public void MoveTo(String destinationEntryName);
        // Convenience methods
        public void ExtractToFile(String destinationFileName);
        public void ExtractToFile(String destinationFileName, Boolean overwrite);
        public override String ToString();
    }
}
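Although the convenience methods target files, the Stream constructors and ZipArchiveEntry.Open() in the listing are enough to work entirely in memory. The sketch below is one way that might look: it reads every ".txt" entry into a string without extracting anything to disk. It assumes those members behave as listed and compiles against the same-named members in the System.IO.Compression that ships with current .NET. ZipStreamSketch, BuildSample and ReadTextEntries are hypothetical names, and the sample entry names are invented.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class ZipStreamSketch
{
    // Builds a small archive in memory so the example needs no files on disk.
    public static MemoryStream BuildSample()
    {
        var buffer = new MemoryStream();
        using (var archive = new ZipArchive(buffer, ZipArchiveMode.Create, true))
        {
            AddText(archive, "notes/a.txt", "alpha");
            AddText(archive, "notes/b.TXT", "beta");
            AddText(archive, "image.bin", "not text");
        }
        buffer.Position = 0;
        return buffer;
    }

    // Reads every ".txt" entry (case-insensitive) into memory, keyed by FullName.
    public static Dictionary<string, string> ReadTextEntries(Stream zipStream)
    {
        var result = new Dictionary<string, string>();
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Read))
        {
            foreach (var entry in archive.Entries)
            {
                if (!entry.FullName.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
                {
                    continue;
                }
                // Open() yields the decompressed bytes; CopyTo sends them to any target stream.
                using (var source = entry.Open())
                using (var target = new MemoryStream())
                {
                    source.CopyTo(target);
                    result[entry.FullName] = Encoding.UTF8.GetString(target.ToArray());
                }
            }
        }
        return result;
    }

    private static void AddText(ZipArchive archive, string name, string text)
    {
        using (var writer = new StreamWriter(archive.CreateEntry(name).Open()))
        {
            writer.Write(text);
        }
    }

    public static void Main()
    {
        foreach (var pair in ReadTextEntries(BuildSample()))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```

The target here is a MemoryStream, but any writable Stream (a network stream, a database blob stream) could take its place in the CopyTo call.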
We would love to hear what you think of the APIs so far, and how you plan on using them.
Comments
Anonymous
June 28, 2010
I would make the API based on Stream objects instead of working with files directly. There are a lot of cases where you need to zip something in memory, e.g. for writing to a database or to a network socket.
Anonymous
June 28, 2010
It's unclear from the API whether you'd be able to add an empty folder to the zip file, which could be useful. I second being able to feed a stream into an entry in addition to a file.
Anonymous
June 28, 2010
It's not clear to me what ZipArchiveEntry::MoveTo() is used for. Is it used to reuse the object but point it at a different file in the archive? That seems messy to me, and problematic from a usage standpoint.
Anonymous
June 28, 2010
I definitely +1 the base class or interface suggestion. Zip archives are great and widely supported, but it'd be nice to have the flexibility to code implementations for other formats and have a common base. Also, I think ZipArchiveEntry.Open should take some kind of FileMode flag to specify what kind of operations are to be supported. Other than that, great! It'll be a nice addition to the framework.
Anonymous
June 28, 2010
I also agree with the stream comments. Also, will the ZipArchive class support multiple zip compression methods, or will this be a “Compressed Folder”? It would be nice if it supported multiple methods and the ability to specify them, including support for zip encryption methods. I work in health care, and this would be a great feature of the API to help comply with security regulations.
Anonymous
June 28, 2010
SharpZipLib is pretty good, but it'd be great to have structured ZIP support as part of the framework (obviously stream based, as in the comments above; the file support could just be convenience/extension methods).
Anonymous
June 28, 2010
Please, please don't forget testability. The base class / interface suggestion would be ideal. If you're set on static convenience methods, just don't lock out those of us who unit test our code heavily and use mocking frameworks. It would be nice, for once, to be able to use a BCL offering without having to wrap it in something that allows me to test my code without resorting to things like TypeMock (example: DateTime.UtcNow). Also, thanks for being transparent about the design process - good to see!
Anonymous
June 28, 2010
sevenziplib.codeplex.com Although I haven't looked into it, their examples seem to have the right "feel" to them. I do like the fact that LINQ can be used. I second (third, fourth, whatever) an open architecture so that concepts like finding the internal directory, encoding and decoding the contents of an element, and so on, are exposed. IOW, have a CompressedArchive abstract class, with traditional .zip files processed by one set of child classes, while other formats (.cab, .rar, etc.) could also be implemented (presumably by third parties) and then processed with the same set of APIs. Oh, and make sure that long (32K) path names inside the files are supported. Also, you don't say what exceptions you might throw. For example, what if the CRC (which you don't have an API to retrieve) doesn't match? And how does a CRC mismatch fit into the suggestion for an in-memory Stream retrieval of the data? As much as I like the Stream idea, I don't like the idea of getting to the end of the stream (having updated files/databases/whatever), then finding out that the CRC didn't match and perhaps all the data up to then was flawed.
Anonymous
June 28, 2010
The ExtractToFile 'convenience' methods on ZipArchiveEntry seem out of place and overly specific to me. Personally I'd rather see more generalized utility methods on the System.IO.File static class, e.g.
File.WriteAllBytes(string path, Stream stream) // reads from 'stream' and writes to the file at 'path'
File.WriteAllBytes(string path, Stream stream, FileMode mode)
These could then be used to replace zipArchiveEntry.ExtractToFile(filename) as follows:
using (Stream s = zipArchiveEntry.Open()) { File.WriteAllBytes(filename, s); }
Anonymous
June 28, 2010
This approach is too specific because there are plenty of other formats: en.wikipedia.org/.../List_of_archive_formats It would make more sense if ZipArchive were a subclass of Archive.
Anonymous
June 28, 2010
I agree with @Dominik -> having an IArchive (or even an Archive abstract base class) that ZipArchive, 7zArchive, TarBallArchive, RarArchive (and others) can all implement would be great. Also, something like a CompressedStream might be needed for archives that only have a single stream.
Anonymous
June 28, 2010
I strongly concur about the suggestions for:
- making it more Stream-based rather than File-based;
- deriving from an abstract Archive class (you can choose a better name).
Since zip files store file names as an array of bytes whereas .NET uses UTF-16 for strings, I believe you should specify how you will handle encodings and provide a way to override the defaults (e.g. force encoding of file names to utf-8, or windows-1252). Please check your compression ratio compared to other zip implementations. The current DeflateStream class might need some improvements. I'm not asking that you match 7zip's deflate or kzip, but you really should compress as well as e.g. Info-zip (with -9). Convenience APIs for creating a Package from a ZipArchive (or, better, from an abstract Archive class) would be nice. A set of PowerShell cmdlets would be nice, too.
Anonymous
June 28, 2010
The API should be extensible to support multiple formats so that we can have interoperability with third party compression utilities. I would imagine something along the lines of the Cryptography APIs that support some standard algorithms but allow others to be provided.
Anonymous
June 28, 2010
Nice work. I was wondering if you're considering support for more advanced features, such as zipping for optimal size or for optimal speed (just store, don't compress), or splitting an archive into several files. This could be achieved by an additional overload receiving an instance of a ZipArchiveSettings class where you'd specify these kinds of options. I also like the idea of making it more stream-based. Concerning the support for other archive types such as 7zip or rar, I understand that it may fall a bit out of the immediate scope of your project, but creating the previously mentioned hierarchy will reduce the need for big refactorings if this feature ever gets requested. Good luck with your project!
Anonymous
June 28, 2010
Looks like a great start on a new API for ZIP files. Thanks for posting! I would suggest that both options should exist: to use File/Folder objects, as well as Stream objects. Cheers, Trevor Sullivan
Anonymous
June 29, 2010
I agree with all who suggested additional formats, especially 7zip (LZMA?) and RAR, although I understand including RAR functionality may be difficult due to licensing issues.
Anonymous
June 29, 2010
Just wanted to note that the API as described does allow you to extract to a stream. There's no helper method for that, similar to ExtractToFile, but you can get a ZipArchiveEntry, call Open() on that to get a stream, and then use Stream.CopyTo() to extract it to another stream (or just Read() and process in some other way). Similarly, the constructors for ZipArchive also include Stream overloads. So it would seem that you can use the API entirely over your own custom streams, both for input and for output, with no on-disk files involved at all. It's just that working with files is slightly more convenient due to helpers.
Anonymous
June 29, 2010
It seems SharpZipLib is the most popular choice for working with ZIP archives. However, I find it quite lacking in feature set and API design. DotNetZip, however, is pretty sweet. It does everything I want, the way I want, and its APIs are clean and easy to use. Fast too.
Anonymous
June 30, 2010
Well, I like the idea and the API you designed - good work! For me, there are two open issues:
- Rename the Dispose method to Close, which is more BCL-like, I think.
- Also add support for .cab files. Even if you don't directly implement it for the next version, designing it and writing a prototype implementation can help you design a class hierarchy like the one requested by other commenters.
Thank you very much and all the very best for your mid and final reviews! Ooh
Anonymous
July 01, 2010
Finally! And the API looks usable. One thing: since a ZIP is conceptually very much like a folder, it would be great if it would be in netfx as well. So to be more concrete, it could offer more of the functionality that System.IO.Directory offers, for example listing files using a wildcard. Regards and keep it up. David
Anonymous
July 01, 2010
Sounds like this is something that would be good to put on Codeplex early, so that people can try it out and offer feedback before you ship it in the BCL and permanently freeze the API. One more comment: for us, speed is vital, because we zip and unzip large quantities of data. Currently we P/Invoke to an unmanaged library for zipping, and use a managed library for unzipping, just because that's the fastest combination. If the BCL zip library weren't faster than what we have now, we wouldn't get any benefit from it.
Anonymous
July 02, 2010
It would be nice if you could support PKWARE AES 256 encryption. Maybe you could create an ArchiveStorage class similar to the isolated storage class (msdn.microsoft.com/.../system.io.isolatedstorage.isolatedstoragefile_members.aspx).
Anonymous
July 04, 2010
Wanted (and great for testing now and later if more formats are added): an archiveStreamsCollection.Test() method which opens the stream and compares the header to the currently supported archive types (initially just ZIP, possibly others later). If the stream contains a ZIP, it reads it in such a way that it extracts the data but throws it away, not consuming memory or disk, while calculating the checksum for the files inside. While Test() is running, if the user wants the checksum for each file and a status indicating whether that file was determined to be broken, a callback can be subscribed that provides this information. The API should be user extendable, so that if we have a proprietary format archive, we can just call some method to add our format under the same Test(). So while MS only supports ZIP, we could use it to Test/Extract to memory both ZIP and any number of formats we have added support for through a wrapper. Nice to have: the Zip format has actually changed a bit, I believe. If you try to extract some zips from 15 years ago it might not work. I have no idea how much it changed, but if it's not too much effort to support the older zips as well, that would be nice. I know this because when I tried the various .NET zip libraries, they failed to extract/test various older zips. The native Info-Zip library supports older zips as well.
Anonymous
July 04, 2010
Interestingly, it seems like there are 2 (or probably more, but generally 2) schools of thought on zip libraries, which mirror the 2 schools of thought on zip in Windows more generally. There is the old-school camp, which comes from the days when the focus was on the structure of the archive file(s), and there is the new-school camp, which has bought into the Microsoft abstraction of a zip archive as just like a folder (as it can be viewed in Windows these days). But in Windows it is still possible to install 7-zip or WinZip or WinRar or whatever and use archives in the old-school way. So presumably you might want to design an API (or APIs) that can accommodate both schools of thought. I'll second (or third) the idea that the first focus here should be on providing support for compression within the existing streams interfaces and pattern. I would suggest having a look at CryptoStream in System.Security.Cryptography, an area of the framework that has already accomplished this for another domain. This is also analogous because the CryptoStream can work with many "providers" which give different crypto implementations, so if you mirror the pattern you can have a CompressionStream which can also take one of several providers. I'd definitely like to see support for multiple compression providers, specifically .rar and .7z, including as much of their optional functionality as can be supported, such as multiple volumes. Thanks!
Anonymous
July 05, 2010
Your proposed API looks great. Kudos for subjecting it to public scrutiny too! Areas I'd also like to see covered are: Asynchronous, Cancellable operation (already mentioned by others but I've found it helpful too). Error Handling - What happens when 1 or more files in a directory are inaccessible during an archival operation? It'd be nice to have the option of continuing without interruption, though it would also be nice to have access to the reason for failure for each file that failed (e.g., fire an Event in case of an error - this could be cancellable or not, but the event args should have the Exception(s) responsible).
Anonymous
July 05, 2010
Zip was already implemented in the J# library. Might want to check that out. It would be nice to not have to include that library for my zip projects in the future. ;)
Anonymous
July 06, 2010
Thank you for doing this! We have a web app that allows customers to pick their .jpg files and package / download them. It would be great if the API:
- Allowed the developer to set the compression ratio (it would be zero in our case).
- Allowed us to pass in a list of strings (into CreateFromDirectory) that represent the names of the files the user selected to zip and download.
Anonymous
July 07, 2010
I just have to suppose that even though it's called a "Zip" archive, we can select / use different forms of compression resulting in different types of files, e.g. rar, iso, .7, .001 and others.
Anonymous
July 08, 2010
Support for streams is a major requirement. I would recommend checking the http://dotnetzip.codeplex.com/ and sevenziplib.codeplex.com projects. Combining features from both would be ideal. I would be happier if the framework would simply use MEF to add compression engines, e.g. 7z, rar, iso and others. I'm also against calling it "ZipArchive": is it going to be for .zip files only? That's my view.
Anonymous
July 19, 2010
The previous comments touched all my pain points. I'll try to sum them up:
- base-class/interface hierarchy for supporting other archive formats
- stream support
- setting compression level
- password support
- archive splitting
- exposing checksums
- async & cancellation support
- abstracting an archive as a folder
Anonymous
July 21, 2010
I think you're making the same mistakes as the current zip libraries, like #ziplib. I would go with a fluent interface approach. The basic helpers for extracting and creating are fine, but once you get into more specific scenarios I think a fluent interface would work really well. Here at work I'm working on such a library myself, using #ziplib as the actual engine, and I think it suits it really well.