What's that awful directory name under Windows\WinSxS?

As the Visual C++ Runtimes version 8.0 is now a side-by-side component, you may have seen what looks like an unreasonably complexly named path from which parts of the CRT are loaded. "Golly, what can they possibly be thinking - creating a directory whose name is full of underscores, numbers, and dots?" The good news is, we definitely were thinking something. You may or may not have ever peeked into the %windir%\winsxs directory on your system. If you haven't, now would be a good time. First thing you'll notice is that there are a lot of those funkily-named directories. You might further notice that there seem to be several that differ only by what looks like a version number and some random-looking eight characters on the end of the name. Next you might see that some of them differ only by the second-to-last stringish thing. Lastly, note that mostly, the strings can be deciphered with a little help.

In the component world, each component has what's called an identity. This is the unique name of the component, generated by the component author and referred to in manifests and user interfaces. No two components have exactly the same set of properties; if they did, even if the file contents were different, they would be considered the same component. (Note: the CLR abuses this rule and often ships new bits under old identities - that's outside the scope of today's missive.) There's a whole set of rules around how identities work, which I may get to at some point. The key thing is that identities are basically property bags holding one string triplet - namespace, name, and value - per attribute. Those attributes in the bag without a namespace are called well-known attributes, and there are only a few of those (only Microsoft is currently allowed to define new ones...). Further, certain well-known attributes have rules around their values - the version attribute has to be a dotted-quad version, and the public key token attribute has to be a string of hex digits of nonzero but even length. Other well-known attributes like name can be whatever you like - "Foo:Bar:Bas" is ok, as is just "Q".
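To make the property-bag idea concrete, here is a minimal Python sketch, not the real implementation. The dictionary layout, the function name validate_identity, and the attribute key names "version" and "publicKeyToken" are assumptions chosen for illustration; only the two value rules (dotted-quad version, hex string of nonzero even length) come from the text above.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: (namespace, name) -> value.
# A namespace of None marks a well-known attribute.
Identity = Dict[Tuple[Optional[str], str], str]

_DOTTED_QUAD = re.compile(r"\d+\.\d+\.\d+\.\d+")
_HEX_DIGITS = re.compile(r"[0-9A-Fa-f]+")


def validate_identity(identity: Identity) -> List[str]:
    """Return a list of rule violations; an empty list means both checks passed."""
    problems = []

    version = identity.get((None, "version"))
    if version is not None and not _DOTTED_QUAD.fullmatch(version):
        problems.append("version must be a dotted quad such as 1.2.3.4")

    token = identity.get((None, "publicKeyToken"))
    if token is not None:
        # Nonzero length is implied by the '+' in the pattern.
        if not _HEX_DIGITS.fullmatch(token) or len(token) % 2 != 0:
            problems.append("public key token must be hex digits of nonzero, even length")

    # The name attribute is deliberately unchecked: "Foo:Bar:Bas" and "Q" are both fine.
    return problems
```

An identity with a name of "Q", a version of "8.0", and a token of "abc" would come back with two complaints, while one with "8.0.50727.42" and a sixteen-digit hex token would come back clean.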

Each shared component (in the winsxs directory) gets its own directory into which its payload bits are placed. Somehow, we have to generate (mostly) unique and repeatable directory names for this purpose. The requirements for directory names are reasonably simple - the path can't overall be more than MAX_PATH (260) characters, can't contain certain characters, etc. Given the naming requirement, it was impossible to use the entire identity as the name of the directory, as someone could name their component "foo\bar" and mess things up. With the extensibility requirement for identities themselves, we couldn't possibly use the entire identity, as the set of tuples could end up being far longer than MAX_PATH. Most importantly, we wanted the directory names to be readable to your average administrator or PSS representative. Finally, generation of the directory name from an identity had to be fast.

So, Mike Grier came up with the idea for a key form of identities. This key form would be a reasonably-unique one-way noncryptographic readable representation of the major defining attributes of an identity. What he ended up with was the following:

proc-arch_name_public-key-token_version_culture_hash

The placeholder strings (except for the hash) are replaced by the values from the identity for their respective properties. If the property was unset, then "none" was put in its place. In the identity model, name, processor architecture, and culture are allowed to have very laxly validated contents, so they may contain "unfriendly" characters that have to be filtered. Characters not in the group "A-Za-z0-9.\-_" are removed from the attribute value before being written into the string. Certain attributes have upper limits placed on their lengths (name is limited to 64 characters, processor architecture to 8, culture to 12), achieved by dropping characters from the middle of the filtered string and replacing them with "..". Finally, the whole string is lower-cased using a clone of the Unicode casing table that shipped in Windows XP RTM.
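The text-building steps just described can be sketched roughly as follows. This is an illustration of the description only, not the actual algorithm, which is undocumented. Several things here are assumptions: the function names clean and keyform_text, the choice to split the kept characters evenly around the "..", the use of Python's str.lower() in place of the Windows XP casing table, and the hash, which is simply passed in as an opaque string.

```python
import re

# Anything outside this set is dropped, per the description above.
_UNFRIENDLY = re.compile(r"[^A-Za-z0-9.\-_]")

# Length caps named in the text; other attributes are left uncapped in this sketch.
_LIMITS = {"proc_arch": 8, "name": 64, "culture": 12}


def clean(value, limit=None):
    """Filter one attribute value and, if needed, squeeze it to `limit` characters."""
    if not value:
        return "none"  # unset property
    filtered = _UNFRIENDLY.sub("", value)
    if limit is not None and len(filtered) > limit:
        keep = limit - 2        # leave room for the ".." marker
        head = (keep + 1) // 2  # assumption: keep equal amounts from each end
        tail = keep - head
        filtered = filtered[:head] + ".." + filtered[len(filtered) - tail:]
    return filtered


def keyform_text(proc_arch, name, public_key_token, version, culture, hash_text):
    """Join the cleaned attributes and the caller-supplied hash with underscores."""
    parts = [
        clean(proc_arch, _LIMITS["proc_arch"]),
        clean(name, _LIMITS["name"]),
        clean(public_key_token),
        clean(version),
        clean(culture, _LIMITS["culture"]),
        hash_text,  # opaque placeholder; the real hash function is not reproduced here
    ]
    # Stand-in for the XP-era casing table mentioned above.
    return "_".join(parts).lower()
```

With made-up inputs, keyform_text("x86", "Contoso.Widgets!", "0123456789abcdef", "1.0.0.0", None, "deadbeef") yields "x86_contoso.widgets_0123456789abcdef_1.0.0.0_none_deadbeef": the "!" is filtered out, the unset culture becomes "none", and everything is lower-cased.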

Voila! A string representing the identity that's filesystem friendly!

But wait... what about all those characters that got dropped? Couldn't I construct an identity whose keyform matched the keyform of another? Yes, if it weren't for the _hash value on the end of the keyform. This hash (not in the cryptographic sense) is of all the namespaces, names and values of properties in the identity. Anything that didn't appear in the keyform text would have been represented in this hash. The two identities whose names are "Foo!" and "Foo?" will generate different keyforms - while the ! and ? were dropped from the keyform, they still appear in the hash. A coworker did some experiments and determined that while it was possible to reset the hash generation function, it would involve a ridiculous amount of work.
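A small Python sketch shows why the trailing hash matters. The real hash is undocumented, so this uses CRC-32 from the standard zlib module purely as a stand-in; the function names visible_text and stand_in_hash and the dictionary layout are likewise invented for the example.

```python
import re
import zlib


def visible_text(value):
    """The readable part: unfriendly characters dropped, then lower-cased."""
    return re.sub(r"[^A-Za-z0-9.\-_]", "", value).lower()


def stand_in_hash(identity):
    """CRC-32 over every namespace, name, and value. NOT the real keyform hash.

    `identity` maps (namespace, name) -> value; namespace may be None.
    """
    crc = 0
    ordered = sorted(identity.items(), key=lambda kv: (kv[0][0] or "", kv[0][1]))
    for (namespace, name), value in ordered:
        for piece in (namespace or "", name, value):
            # The NUL separator keeps adjacent strings from running together.
            crc = zlib.crc32(piece.encode("utf-8") + b"\0", crc)
    return format(crc, "08x")


foo_bang = {(None, "name"): "Foo!"}
foo_what = {(None, "name"): "Foo?"}

# The readable text collides...
same_text = visible_text("Foo!") == visible_text("Foo?")
# ...but the hash still sees the dropped characters.
different_hash = stand_in_hash(foo_bang) != stand_in_hash(foo_what)
```

Both same_text and different_hash come out True: the two names are indistinguishable in the readable portion, yet the hash covers the full, unfiltered values and so keeps the two keyforms apart.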

The algorithm for generating the keyform overall (especially the hash) is undocumented at this time. Not because we're trying for security through obscurity, but because the keyform is merely an implementation artifact at this point. Maybe someday we'll lose our heads entirely and store the component payloads in a database of some sort, in a compound storage document, CAB file, whatever. Also, the algorithm has changed for Vista, and no "normal" use cases for knowing the algorithm exist. If you're trying to find files in the WinSxS directory, you should be using the CreateActCtx/ActivateActCtx/SearchPath set of APIs. If you're trying to write files into the WinSxS directory, you should be using MSI, which knows about installing components into the right places. If you're writing your own binder, don't - it's really hard to get right. It should be sufficient to say "the generation of this string is opaque and must be assumed to change."

But wait... what stops Evil Bob from creating a component with the exact same name and overwriting what Nice Jill had already shipped? Suffice it to say, that public key token has something to do with it - I'll explain that next time, when I talk about signing catalogs for side-by-side components.