Performance Quiz #12 -- The Cost of a Good Hash -- Some Help

Artikel
01/24/2007

I continue to be astounded by what my readers can come up with.

As usual I had a purpose for posing my last question and that purpose was to show that basically its hard to get a handle on what things cost in any kind of omnibus way. But imagine my surprise when Frank and Shuggy turned this into an interesting discussion of hashing techniques and combining hashes. To say nothing of some of the other regulars (you know who you are) that have insightful comments. Those comments are definately worth a read and thanks very much for presenting them!

But I don't know that anyone has answered my question quite squarely (except maybe Alois) and I think this makes my point that it is hard to answer this question.

But what if you were to download this file... I think you'd find it quite helpful.

What you see is a compilation of one cost metric for the entire framework. It's not the only cost metric by any means, you could imagine many others, but it is an interesting one. The cost is computed by doing a static analysis of all allocations made in the call tree of each method. It's the metric that I discussed in this article on performance signatures including the exception for methods that look like they do a one-time allocation. The cost is logarithmic, base 10, offset by 1, i.e.

if (allocation_count == 0)
cost = 0;
else
cost = 1+log(allocation_count)

And I'm only reporting one digit after the decimal because I want to give the idea that this is just a rough cost, it's not supposed to be super-precise it's just supposed to give you a rough idea of whether or not a method is very complex or not so much.

I'm trying to think of a good name for this metric, so far I'm thinking something like Allocation Complexity but I'm open to better ideas.

Anyway snoop the file then see if you can answer my question. For bonus points look at some other low level functions that might be interesting like say Compare, Equals, or GetEnumerator. It's very intersting to see that some methods that seem like they should be very low level are actually quite high level.

I think costs like this one can be very useful in making decisions about which framework features to use in which contexts.

Again this is only one kind of cost and its only approximate but I think it's interesting nonetheless.

Now would you like to try the original question again:

Q: Name five implementations of GetHashCode in the framework that do things that you wouldn't expect a hashing function to do.

Comments

Anonymous
January 24, 2007
The comment has been removed
Anonymous
January 24, 2007
Oh yeah, and System.Uri, which I mentioned in my comment yesterday as having unexpected GetHashCode behavior, scores a 3.8 and may also take a lock while establishing the completeness of the URI infoset and distinguishing remote URLs for non-absolute URIs.
Anonymous
January 24, 2007
That's exactly the kinds of things I'm talking about Eric. What do you think of the little costs file?
Anonymous
January 24, 2007
Little? ;) I love having this information. I'll probably spend way too much time querying and pivoting it, then reflecting on the methods themselves for details, to get a better understanding of the framework's (oft used) heavy methods. This knowledge, or whatever part of it I retain, may then bubble up in real coding decisions I make down the road. However, what I'd really like to do is statically analyze my own projects in a similar way, perhaps using this information to short circuit reflection-based analysis of framework methods that my code calls. Or, even better, make this information available closer to where it's needed: while we're coding. I'd love to see a weight indicator/value next to methods in Intellisense and/or in help (both for the framework and my own projects, though I can appreciate that real-time generation of the latter may not be feasible). Of course, as you pointed out, this is just one particular measure of "heavy" - so how about generating cost measurements in a few different ways, shipping the values with the framework, and integrating it with VS/Intellisense/help? Just some thoughts.
Anonymous
January 24, 2007
Did someone manage to convert the .txt file to excel format? Please post a link. And sorry Rico for hijacking your previous thread.
Anonymous
January 24, 2007
The comment has been removed
Anonymous
January 24, 2007
I didn't really expect GetHashCode for Double to get the raw bits and turn them into an integer. The performance is great, though.
Anonymous
January 24, 2007
For the lazy guys : 4.2 com.ms.lang.MulticastDelegate 4.2 com.ms.lang.Delegate 4.2 java.lang.reflect.Method 3.8 System.Security.Policy.HashMembershipCondition 3.8 System.Uri 3.5 com.ms.wfc.ui.Color 3.5 java.lang.reflect.Field 3.5 com.ms.wfc.ui.Brush 3.5 com.ms.wfc.ui.Font 3.5 com.ms.wfc.ui.Region 3.5 com.ms.wfc.ui.Pen 3.5 System.Security.Policy.CodeGroup 3.5 System.Security.Policy.NetCodeGroup 3.5 System.Security.Policy.FileCodeGroup 3.3 System.Security.PermissionSet 3.3 System.Security.NamedPermissionSet 3.1 System.Configuration.SettingElement 3 System.Configuration.ConfigurationElement 3 System.Diagnostics.ListenerElement 2.8 System.Web.Configuration.TagPrefixInfo 2.8 java.math.BigDecimal 2.4 System.Web.Configuration.BuildProvider 2.4 System.Web.Configuration.CustomError 2.4 System.Web.Configuration.ProfileGroupSettings 2.4 System.Web.Configuration.TransformerInfo 2.4 System.Web.Configuration.TagMapInfo 2.1 System.Web.Configuration.NamespaceInfo 2 System.Security.Policy.StrongNameMembershipCondition 1.9 System.Xml.Serialization.CaseInsensitiveKeyComparer.System.Collections.IEqualityComparer(System.Object) 1.7 System.Security.Policy.PublisherMembershipCondition 1.6 System.Xml.Schema.KeySequence 1.6 System.Data.SqlTypes.SqlDecimal 1.6 System.Data.SqlTypes.SqlString 1.5 java.text.CollationKey 1.5 System.Security.Policy.UrlMembershipCondition 1.5 System.Security.Policy.SiteMembershipCondition 1.5 System.Globalization.SortKey 1.3 System.Management.ManagementBaseObject 1.3 System.Net.Cookie 1 System.Security.AccessControl.GenericAce 1 System.Web.Configuration.RootProfilePropertySettingsCollection 1 System.Web.Caching.CachedVary 1 java.math.BigInteger 1 System.Security.Cryptography.X509Certificates.X509CertificateCollection 1 System.Web.UI.ControlCachedVary 1 System.Windows.Forms.DataGridView+HitTestInfo 1 System.Web.FileSecurity+DaclComparer.System.Collections.IEqualityComparer(System.Object) 1 System.Web.UI.AttributeCollection 1 System.Windows.Forms.DataGridViewAdvancedBorderStyle 1 System.Drawing.FontFamily 1 System.Configuration.ConfigurationElementCollection 1 System.Windows.Forms.DataGridViewCellStyle 1 System.Windows.Forms.TableLayoutPanelCellPosition
Anonymous
January 25, 2007
The comment has been removed
Anonymous
January 25, 2007
Coconut I interpreted that as 1000 allocations of any sort though you could weight it by the size of the allocation I suspect that is impossible to do without runtime introspection since the most likely large allocations will be arrays with dynamic sizes.
Anonymous
January 25, 2007
Oh yes - sorry for the hijaking too (but Frank's links were very informative :)
Anonymous
January 25, 2007
The comment has been removed
Anonymous
January 25, 2007
The comment has been removed
Anonymous
January 25, 2007
Forgot some code: floatBuffer[0] = value; uint iv = Buffer.GetByte(floatBuffer, 0); iv |= (uint)Buffer.GetByte(floatBuffer, 1) << 8; iv |= (uint)Buffer.GetByte(floatBuffer, 2) << 16; iv |= (uint)Buffer.GetByte(floatBuffer, 3) << 24;
Anonymous
January 25, 2007
Am trying to upload this data to "Many Eyes" but it appears they're a bit sleepy right now :( .....
Anonymous
January 25, 2007
The comment has been removed
Anonymous
January 25, 2007
It is interesting that String is not immutable after all. But given the consequences of that, why didn't StringBuilder use its own character buffer, and a String internal constructor to turn that into a regular String (releasing the buffer memory to the string and allocating a new one when needed).
Anonymous
January 25, 2007
Would be interested in anyone else visualisation of the log10 data (which you can find on this site) http://services.alphaworks.ibm.com/manyeyes/view/SMGTJEsOtha60H-_83oKE2- I cut off the cost at 6.3, but perhaps that should have been higher..?
Anonymous
January 25, 2007
Frank: The string function: string GetStringForStringBuilder(string value, int startIndex, int length, int capacity) does explain that. The usage pattern of StringBuilder does not say anything about several ToString() calls. If you empty StringBuilder from time to time by calling ToString you have to copy it every time back when the next volatile operation happens which would need quite a bit of redundant prechecks for every StringBuilder operation in the main code path. This could become quite expensive when you have to add guards to take care multithreading issues. I think for the sake of performance/simplicity you have make sacrifaces from time to time. Yours, Alois Kraus
Anonymous
January 25, 2007
Hello Alois, The main usage pattern for StringBuilder is probably several (or many) Appends, followed by a single ToString(). While it is true it would add one conditional to each Append operation to determine if the current buffer is mutable or a reference to an immutable string, it would not seem to add much over the exising Append code, and would free up memory in potentially millions of strings, since the String.m_stringLength field would no longer be needed. Regards, Frank
Anonymous
January 25, 2007
You meant the m_arrayLength? Hm not sure about that. Perhaps you still need somewhere the allocated array size when you release its memory. I have not (yet) looked into the GC logic how it does keep track of string memory allocations. Yours, Alois Kraus
Anonymous
January 26, 2007
Frank - I think you may be forgetting that in .Net it is legal to have a string containing null characters... How are you going to determine the length of such a string... If you fix that by forcing the char array to be the exact length then the stringbuilder has to do far more effort to create the strings in terms of copying them to a new array of the correct size if it was not initialized with the correct capacity
Anonymous
January 26, 2007
Shuggy: Please note that String has actually 3 members.

String Length
Char Array Capacity
First Char of Array I think Frank wanted to get rid of the Array capacity since in his model the string length is always exact. But if you allocate an array of a certain size how do you deallocate it if you do not store the array size somewhere? Either has the GC a fourth member with the array size or he does use the array capacity from string directly. But I do not know how the deallocation part is exactly handled. Yours, Alois Kraus

Anonymous
January 26, 2007
Correct. With the length only a pointer to the string characters is necessary (or embed in the object). Truly immutable strings are smaller/faster; fields can be readonly. I assumed the GC already stores other information, the true size of the string memory in bytes, in a per object header, as it is likely to do for all objects in the CLR heap. If the StringBuilder instead copied bytes to a new string memory in ToString, there would be savings by preventing live strings padded with empty space in the heap. Fastest would be to leave the empty space in the string and have the GC compact strings as well as move objects. I don't think the GC does that currently.

Regards, Frank

Anonymous
January 26, 2007
The CLR does not store the size of a string somewhere in a seperate per object header (what would be the point). The cost would be enormous given that the object header on all objects is (size and bitmask wise) identical otherwise. Strings are like arrays in that they are special types which can have arbitrary size. IIRC string and array are the only such types and must be special cased in the GC for this very reason. Only strings and arrays need this additional data and so it is (sensibly) stored inline with the object itself. On a side note having strings become 'simple' objects where they are quite simply a pointer to a char array would exhibit far worse performance in most cases since every acess to the characters in the string would involve an arbitrary pointer jump. Awful on modern cache hirearchies though I doubt you meant to imply this was a good idea. The cost of a live string with empty bytes with a simple doubling allocation in the stringbuilder will be at most N and likely average better. The cost of the copy is always N. I would guess that the overhead is lower in the general case (measuring of course :). If the majority of strings die young (as you would expect from the majority of use cases) then the avoidance of the copy will outperform (since the copy requires more memory in the shortterm upping level0 garbage collects as well) If the consumer of the class needs to avoid any wasteage on large strings they can easily set stringBuilder.Capacity = stringBuilder.Length and do it themselves (for the cost of the copy) Forcing all users to take the copy would I believe be far worse. As you say there is nothing stopping the compacting GC from shrinking a string as it copies it (it might even be faster)
Anonymous
January 26, 2007
Alois - sorry if I wasn't clear but that was exactly the point I was making. either you store two values or you force them to be the same. the consequences of the later are far reaching for the StringBuilder use case. They also prevent substrings reusing the underlying character sequence for open ended* (which is fine in a compacting GC because the GC simply ignores non live parts)
i.e. 4th character to the end
Anonymous
January 26, 2007
I love Rico’s performance quizzes in general, but the last one has something especially interesting:
Anonymous
January 26, 2007
Hi shuggy, You explained clearly why there should be no copy, thanks. In StringBuilder.ToString the buffer memory is released, not with a special string ctor but by returning the buffer. Instead the new string could at that point made exact size, without a copy or allocation, by changing the length field to match exactly, keeping capcity in StringBuilder instead. All strings could use a single length field, instead of length and capacity. There would be wasted space in the heap after ToString at that point, since more was allocated than needed, but assuming the GC looks at the string length it can compact away the wastage. Looking at StringBuilder.ToString() we see logic to signal the buffer is now immutable: public override string ToString() { string text1 = this.m_StringValue; if (this.m_currentThread != Thread.InternalGetCurrentThread()) { return string.InternalCopy(text1); } if ((2 * text1.Length) < text1.ArrayLength) { return string.InternalCopy(text1); } text1.ClearPostNullChar(); this.m_currentThread = IntPtr.Zero; return text1; } The immutable buffer signal is m_currentThread set to IntPtr.Zero. A subsequent Append detects this signal and allocates a new buffer: public StringBuilder Append(string value) { if (value != null) { string text1 = this.m_StringValue; IntPtr ptr1 = Thread.InternalGetCurrentThread(); if (this.m_currentThread != ptr1) { text1 = string.GetStringForStringBuilder(text1, text1.Capacity); } int num1 = text1.Length; int num2 = num1 + value.Length; if (this.NeedsAllocation(text1, num2)) { string text2 = this.GetNewString(text1, num2); text2.AppendInPlace(value, num1); this.ReplaceString(ptr1, text2); } else { text1.AppendInPlace(value, num1); this.ReplaceString(ptr1, text1); } } return this; } The new buffer is assigned in ReplaceString: private void ReplaceString(IntPtr tid, string value) { this.m_currentThread = tid; this.m_StringValue = (volatile string) value; } The question remains, why does every string have the overhead of a capacity field, if only StringBuilders make use of it, and that use could be eliminated. Surely there is another reason? Alois, I did not understand the substring comment. Regards, Frank
Anonymous
January 27, 2007
Rico Mariani included a very interesting file in his latest performance quiz post which shows all the
Anonymous
January 27, 2007
The comment has been removed
Anonymous
January 28, 2007
Hi Shuggy, StringBuilder is not thread-safe, but I think you say if String.AppendInPlace occurred simultaneously with ToString, with ToString shortening string.m_stringLength, the GC thread might compact away space after the string at the same time as AppendInPlace, overwriting memory for a moved object. StringBuilder choices then:

separate buffer and a copy on ToString (temporarily more space before GC compaction)
synchronization between GC and StringBuilder buffers, preventing append overwrite scenario
every string is a kind of string builder with capacity (current situation) Thanks, Frank

Anonymous
January 28, 2007
Sorry I wasn't implying StringBuilder was thread safe in general, nor thread safe in terms of serializing calls. Just thread safe in terms of memory ownership as you say (if this happened in the large object heap you mightn't get the space back as no compaction occurs there)

Dela via

Performance Quiz #12 -- The Cost of a Good Hash -- Some Help

Comments

Ytterligare resurser