LINQ Distinct and IEqualityComparer Implementation
As you may know when working with LINQ there is a method that allows one to return distinct elements from a sequence. Two prototypes are supported:
Distinct<TSource>(IEqualityComparer<TSource>)
Distinct()
Unlike most sequence operations that take some form of a Func parameter, the Distinct() method takes an instance of an IEqualityComparer Type. It is the implementation of this Type that I will be discussing which will hopefully provide some insight into why this approach has been taken.
In a future posting, available here, I will provide a implementation of an Distinct() extension method to IEnumerable, which accepts a Func returning a property value. A performance comparison will also be performed.
I am going to demonstrate the Distinct() method using the following class definition:
public class Product
{
// Key comparison elements
public string Reference { get; set; }
public ProductVersion ProductVersion { get; set; }
// Other attributes not used on comparison
public string Description { get; set; }
// Ensure have a sensible string value
public override string ToString()
{
if (this.ProductVersion == null)
{
return string.Format("Product: {0}", this.Reference ?? "Null Product");
}
else
{
return string.Format("Product: {0}, Version: {1}-{2}-{3}", this.Reference, this.ProductVersion.VersionName, this.ProductVersion.VersionNumber, this.ProductVersion.SubVersionNumber);
}
}
// Derive a unique identifier
public string UniqueIdentifier
{
get
{
if (this.ProductVersion == null)
{
return this.Reference;
}
else
{
return string.Format("{0}:{1}-{2:00}-{3:00}", this.Reference, this.ProductVersion.VersionName, this.ProductVersion.VersionNumber, this.ProductVersion.SubVersionNumber);
}
}
}
}
public class ProductVersion
{
public int VersionNumber { get; set; }
public int SubVersionNumber { get; set; }
public string VersionName { get; set; }
}
A distinct product is one defined where the Reference and ProductVersion values are unalike. The ProductVersion will also only be distinct if all its properties are unalike.
For testing and demonstration purposes the following code will be used:
// Build collection
List<Product> products = new List<Product>();
products.Add(new Product() { Reference = "AA", Description = "Bike AA", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 0, VersionName = "A" } });
products.Add(new Product() { Reference = "AA", Description = "Bike AAv2", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 1, VersionName = "A" } });
products.Add(new Product() { Reference = "AA", Description = "Bike AAv3", ProductVersion = new ProductVersion() { VersionNumber = 2, SubVersionNumber = 0, VersionName = "A" } });
products.Add(new Product() { Reference = "AB", Description = "Bike AB", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 0, VersionName = "B" } });
products.Add(new Product() { Reference = "AC", Description = "Bike ACv1", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 0, VersionName = "C" } });
products.Add(new Product() { Reference = "AC", Description = "Bike ACv2", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 1, VersionName = "C" } });
// Add some duplicates
products.Add(new Product() { Reference = "AA", Description = "Bike AAv2", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 1, VersionName = "A" } });
products.Add(new Product() { Reference = "AA", Description = "Bike AAv3", ProductVersion = new ProductVersion() { VersionNumber = 2, SubVersionNumber = 0, VersionName = "A" } });
products.Add(new Product() { Reference = "AC", Description = "Bike ACv2", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 1, VersionName = "C" } });
IEnumerable<Product> distinctProducts = products.Distinct<Product>(new ProductEqualityComparer());
foreach (Product product in distinctProducts.OrderBy(prod => prod.UniqueIdentifier)) Console.WriteLine(product.ToString());
The line creating the distinct collection is:
IEnumerable<Product> distinctProducts =
products.Distinct<Product>(new ProductEqualityComparer());
The initial attempt at an IEqualityComparer implementation may be something along the lines of:
public class ProductEqualityComparer : IEqualityComparer<Product>
{
public bool Equals(Product x, Product y)
{
// If reference same object including null then return true
if (object.ReferenceEquals(x, y))
{
return true;
}
// If one object null the return false
if (object.ReferenceEquals(x, null) || object.ReferenceEquals(y, null))
{
return false;
}
// Compare properties for equality
return (x.Reference == y.Reference && x.ProductVersion == y.ProductVersion);
}
public int GetHashCode(Product obj)
{
return obj.GetHashCode();
}
}
If you run and debugs this code you will quickly see that the Distinct() operation does not return a collection of products that are distinct from each other according to the stated criteria for uniqueness. If you debug the code you will also see that the Equals() method of the comparer class never actually gets called. So what is happening?
The important method here is GetHashCode(). What is actually happening is that the Distinct() operation is first using the comparers implementation of GetHashCode() to first derive a hash-code for the Product. Only when a product returns like hash-codes is Equals() called.
This implementation returns a hash-code based on the object instance. Thus one based on the product values is needed:
public int GetHashCode(Product obj)
{
if (object.ReferenceEquals(obj, null))
{
return 0;
}
int referencehHash =
obj.Reference == null ? 0 : obj.Reference.GetHashCode();
int versionHash =
obj.ProductVersion == null ? 0 : obj.ProductVersion.GetHashCode();
return referencehHash ^ versionHash;
}
Of course this implementation assumes a couple of things. Firstly that equality comparisons of the ProductVersion return meaningful results, and secondly that the GetHashCode() returns a value based on the ProductVersion instance values; rather than the base implementation. For completeness here is the ProductVersion implementation which overrides the == operator:
public class ProductVersion
{
public int VersionNumber { get; set; }
public int SubVersionNumber { get; set; }
public string VersionName { get; set; }
public static bool operator ==(ProductVersion x, ProductVersion y)
{
// If reference same object including null then return true
if (object.ReferenceEquals(x, y))
{
return true;
}
// If one object null the return false
if (object.ReferenceEquals(x, null) || object.ReferenceEquals(y, null))
{
return false;
}
// Compare object property values
return VersionsEqual(x, y);
}
public static bool operator !=(ProductVersion x, ProductVersion y)
{
return !(x == y);
}
public override bool Equals(object obj)
{
// If comparing to null return false
if (object.ReferenceEquals(obj, null))
{
return false;
}
// Cast and compare object property values
ProductVersion version = (ProductVersion)obj;
return VersionsEqual(this, version);
}
public override int GetHashCode()
{
int nameHash = this.VersionName == null ? 0 : this.VersionName.GetHashCode();
return nameHash ^ this.VersionNumber ^ this.SubVersionNumber;
}
private static bool VersionsEqual(ProductVersion x, ProductVersion y)
{
// Compare properties for equality
return (x.VersionNumber == y.VersionNumber && x.SubVersionNumber == y.SubVersionNumber && x.VersionName == y.VersionName);
}
}
Remember, one has to use object.ReferenceEquals() calls for null checking when overriding the == operator
The IEqualityComparer thus firstly utilizes the product hash-code to place the product into a hast table. Only if a collision occurs is is the Equals() method called to determine uniqueness. You can easily see this in action if you code a GetHashCode() method that returns constant value for all products.
Once all this has been implemented the Distinct() methods works as expected, returning:
Product: AA, Version: A-1-0
Product: AA, Version: A-1-1
Product: AA, Version: A-2-0
Product: AB, Version: B-1-0
Product: AC, Version: C-1-0
Product: AC, Version: C-1-1
In conclusion, distinct elements are found only when the hash-code are equal and the products with the same hash-code are equal based on the Equals() method call.
Hopefully one can see that this implementation is a lot more efficient than a process based on sorting the collection; and explains why the returned collection is in an unspecified order.
Written by Carl Nolan
Comments
Anonymous
December 06, 2012
That makes complete sense. I was wondering why the equals method wasn't being called. Thanks!Anonymous
June 06, 2013
"You can easily see this in action if you code a GetHashCode() method that returns constant value for all products." Why should I not just do this anyway instead of duplicating the logic in trying to create a HashCode. Surely the comparison is more expressive in the Equals() method?