Share via


.NET Arrays, IList, Generic Algorithms, and what about STL? [Brian Grunkemeyer]

When we were designing our generic collections classes, one of the things that bothered me was how to write a generic algorithm that would work on both arrays and collections.  To drive generic programming, of course we must make arrays and generic collections as seamless as possible.  It felt that there should be a simple solution to this problem that meant you shouldn't have to write the same code twice, once taking an IList<T> and again taking a T[].  The solution that dawned on me was that arrays needed to implement our generic IList.  We made arrays in V1 implement the non-generic IList, which was rather simple due to the lack of strong typing with IList and our base class for all arrays (System.Array).  What we needed was to do the same thing in a strongly typed way for IList<T>.

There were some restrictions here though - we didn't want to support multidimensional arrays since IList<T> only provides single dimensional accesses.  Also, arrays with non-zero lower bounds are rather strange, and probably wouldn't mesh well with IList<T>, where most people may iterate from 0 to the return from the Count property on that IList.  So, instead of making System.Array implement IList<T>, we made T[] implement IList<T>.  Here, T[] means a single dimensional array with 0 as its lower bound (often called an SZArray internally, but I think Brad wanted to promote the term "vector" publically at one point in time), and the element type is T.  So Int32[] implements IList<Int32>, and String[] implements IList<String>.

What about STL?

I thought a little about C++'s Standard Template Library, and as I looked through some of the samples, I couldn't help but to notice that everything was incredibly iterator-centric.  Rarely did anyone pass around their collection or array, but instead they passed around an iterator that pointed somewhere in that collection, and there were all these gyrations to make iterators allowed you to insert items into the collections at various positions.  It occurred to me that passing the entire collection would be a little simpler. 

To my perspective there are two reasons for iterator-centric focus.  First, most code does simply want to iterate over a collection in one form (this is why VB and later C# have the foreach keyword).  But secondly, STL didn't have the option of making arrays implement an interface.  Arrays in C++ still suffer (and simultaneously somewhat benefit) from their C heritage, where they are simply a pointer to an item, and the language added in syntactic sugar to make indexing legible.  Arrays aren't objects, so you can't add functionality to all arrays in the language.  Also, you run into this strange problem where a pointer to an item can be cast to an array of that item and vice versa.  If you were writing a compiler & trying to figure out which usages of pointers were just pointers vs. which might be arrays at some point in the future, that seems very difficult.  (The great benefit of this is you can allocate your arrays on your stack or take a block of memory mapped from a file and treat it like an array - our managed arrays are GC objects and therefore must be allocated in the managed heap.  Of course, you can use pointers to value types in managed code to work around this to an extent, but it's significantly more inconvenient and you don't get automatic bounds checking, just like in C & C++.) 

I believe this is an instance where we were able to out-innovate C++ the language because of some incredibly deep type system divergences that happened early in our design process, and other language implementations may be able to mimic this as well.  While this leads to a rather different philosophical approach to writing generic algorithms, I think this is a more user-friendly direction than C++ (and our internal working group, Anders, and some CLR architects agreed).  I suspect this is one decision with many subtle consequences that may take a few years to fully appreciate, but I believe it's the right thing for our languages and I haven't been shown otherwise in the past two years. 

What about derived classes and any "array inheritance hierarchy"?

One of the cool features of our arrays is you can cast arrays to other array types, with some limitations, in a type-safe way.  The key notion you need to first understand is that the CLR has two different internal memory layouts for arrays - one for arrays of object references, and one for value types.  The value types are all stored inline, whereas the array of object references is just that - an array of pointers that the GC needs to update.  These two formats are incompatible with each other, in the sense that you'll never be able to cast an array of value types to an Object[].  But once you get beyond that, the general rule for reference types is that if Derived subclasses Base, then Derived[] subclasses Base[].  So you can assign a String[] to an Object[], and you can cast an Object[] to a String[] (it may fail of course if the Object[] wasn't allocated as a String[]).  This covariance leads into some perf hits in assignment into arrays, but that's not too interesting w.r.t. generics.

The complicated part comes when you add in IList<T>.  If Derived subclasses Base, does Derived[] implement IList<Base>?  While saying yes here seems very simple and easy to use & understand, it can lead to some degree of code bloat, in that you may need some representation of IList<Base> in memory.  However, if you don't do this, then you can't make assignments from arrays to IList<T> work without some sort of cast, which really hurts the usability of generic algorithms.  Could you imagine having to write "MySort((IList<Foo>) array)", and then deal with the possibility that the cast may fail if array was passed in from some other caller higher up the stack?  Our current beta implementation does make Derived[] implement IList<Base>, which is what people would expect and vastly reduces the complexity for users.

Hairy Implementation Details

I didn't implement this, but it's an amazingly interesting set of work in the loader, and some somewhat hacky looking C# code.  You might wonder how someone defines generic methods on only some array types, when arrays are generated types that have no corresponding source code.  With IList in V1, we had System.Array as a base class that Int32[] and Object[] subclassed, but we couldn't add the methods to Array itself, because the method implementations wouldn't make sense on anything other than an SZArray.  As an internal implementation detail that no one needs to know, we currently have an SZArrayHelper class that defines all the interesting method bodies, and we pass an array to the method as the "this" pointer.  This leads to some very strange internal code, like this:

 sealed class SZArrayHelper {
 internal int get_Count<T>() {
     //! Warning: "this" is an array, not an SZArrayHelper. See comments above
    //! or you may introduce a security hole!
     T[] _this = this as T[];
     BCLDebug.Assert(_this!= null, "this should be a T[]");
     return _this.Length;
 } }

Fortunately, the CLR team has done all the complex evilness to make the type system work the way you want, and this is a completely internal detail that you don't need to know about.  Sometimes to build an interesting type system in an efficient manner, you need to play some games that would make your computer science professor hunt you down.

The end result of our efforts is that you can call a generic algorithm that consumes IList<T> and pass it a generic collection (like List<String>) or an array (like String[]) and get efficient element accesses with no additional logic forced in your face.  You don't have to make any silly adapter classes or anything like that.

Limitations (Updated)

I had originally posted saying that we made IList<T> appear to be read-only.  The source of this issue was a lack of expressiveness for some databinding scenarios, where a consumer of an IList<T> might query the IsReadOnly property then expect a method like Add or the setter to not throw an exception (which isn't exactly a safe assumption in a general sense - say you have a subclass called NoDuplicatesList<T> (aka Set<T>) and you try to put a duplicate item in the list).  We lost the distinction between a "read-only" vs. "fixed size" list when we intentionally excluded IsFixedSize from IList<T> (I wasn't fully convinced this was a good idea, but others felt the simplicity was worth it). 

As of May 2005, we've changed our minds here.  The usefulness of arrays as IList<T>'s seems more important than the relatively minor consistency issue with the IsReadOnly method, so we'll allow you to set items in the list via the indexer, but Clear will continue to throw (because unlike most other collections, the Count property will not return 0 after calling Clear).  IsReadOnly will continue to return true.  We did lose a little bit by not including the IsFixedSize method on IList<T>, but the cost of adding IsFixedSize back to IList<T> may be too high at this point, and as I said above, may not strictly be sufficient to solve a scenario.  But we believe anyone who really needs to care about this difference can probably test whether their type is an array first, using either T[] or System.Array in their cast checks.  To those that sent in feedback, thanks for raising the issue again.

Comments