T-SQL Puzzle - combining duplicates

Peace Out 1 Reputation point
2020-11-13T13:43:27.07+00:00

hello all - we have a software that detects duplicate employee records and spits out the duplicates in combination of two employees at a time in the format below. how can I tag all the combination of these employees with a unique id?

DECLARE @Employees TABLE (EmployeeID1 VARCHAR(50), EmployeeID2 VARCHAR(50))

INSERT INTO @Employees VALUES ('E1', 'E2')

INSERT INTO @Employees VALUES ('E2', 'E3')

INSERT INTO @Employees VALUES ('E4', 'E1')

SELECT * FROM @Employees

For example, all the employees above are one and the same according to the software. I need to generate a random number to tag these employees to indicate they are one and the same.

Developer technologies Transact-SQL
{count} votes

6 answers

Sort by: Most helpful
  1. David Browne - msft 3,851 Reputation points
    2020-11-16T21:34:05.197+00:00

    Here's a solution using SQL Server Graph Database Capabilities.

    What we're after here is called an Equivalence Relation:

    In mathematics, an equivalence relation is a binary relation that is reflexive, symmetric and transitive. The relation "is equal to" is the canonical example of an equivalence relation, where for any objects a, b, and c:

    a = a (reflexive property),
    if a = b then b = a (symmetric property), and
    if a = b and b = c, then a = c (transitive property).

    As a consequence of the reflexive, symmetric, and transitive properties, any equivalence relation provides a partition of the underlying set into disjoint equivalence classes. Two elements of the given set are equivalent to each other, if and only if they belong to the same equivalence class.

    https://en.wikipedia.org/wiki/Equivalence_relation

    So we're looking to compute the "equivilence class" that each Employee belongs to.

    In SQL Server a Graph edge table stores edges of a Directed Graph, and a directed graph can be used to model a binary relation. It also provides graph query extensions to TSQL, including SHORTEST_PATH which can be used to calculate the transitive closure of the relation modeled by the graph. So to make the graph model an equivalence relation, we just need to add all the inverse edges, and the identity edges.

    eg:

    --drop table if exists DuplicateOf  
    --drop table if exists Employee  
       
    CREATE TABLE Employee (EmployeeID varchar(50)) AS NODE;  
    CREATE TABLE DuplicateOf  AS EDGE;  
       
    insert into Employee(EmployeeID) values ('E1'),('E2'),('E3'),('E4'),('E5'),('E6'),('E7'),('E8')  
       
    INSERT INTO DuplicateOf VALUES   
    (  
      (SELECT $node_id FROM Employee WHERE EmployeeID = 'E1'), (SELECT $node_id FROM Employee WHERE EmployeeID = 'E2')   
    ),  
    (  
      (SELECT $node_id FROM Employee WHERE EmployeeID = 'E2'), (SELECT $node_id FROM Employee WHERE EmployeeID = 'E3')   
    ),  
    (  
      (SELECT $node_id FROM Employee WHERE EmployeeID = 'E4'), (SELECT $node_id FROM Employee WHERE EmployeeID = 'E1')   
    ),  
    (  
      (SELECT $node_id FROM Employee WHERE EmployeeID = 'E5'), (SELECT $node_id FROM Employee WHERE EmployeeID = 'E6')   
    );  
       
    --add the inverse of each edge to model a symmetric relation  
    INSERT DuplicateOf($from_id, $to_id)  
    SELECT $to_id, $from_id   
    FROM DuplicateOf;  
       
     --add add each employeeid to make model a reflexive relation  
    INSERT DuplicateOf($from_id, $to_id)  
    SELECT $node_id, $node_id   
    FROM Employee;  
          
    with q as  
    (  
        SELECT  
            e1.EmployeeID,  
            last_node =   
                LAST_VALUE(e2.EmployeeID)  
                    WITHIN GROUP (GRAPH PATH)  
        FROM  
            dbo.Employee AS e1,   
            dbo.DuplicateOf FOR PATH AS dup,  
            dbo.Employee FOR PATH AS e2   
        WHERE MATCH(SHORTEST_PATH(e1(-(dup)->e2)+))  
    )  
    SELECT  
    q.EmployeeID,  
    MainEmployeeId = min(q.last_node),  
    EquivilenceClass =    
                STRING_AGG(q.last_node, ',')  
                WITHIN GROUP (ORDER BY q.last_node)  
      
    FROM q  
    GROUP BY q.EmployeeID;  
    

    outputs

    EmployeeID    MainEmployeeId    EquivilenceClass  
    ------------- ----------------- -------------------  
    E1            E1                E1,E2,E3,E4  
    E2            E1                E1,E2,E3,E4  
    E3            E1                E1,E2,E3,E4  
    E4            E1                E1,E2,E3,E4  
    E5            E5                E5,E6  
    E6            E5                E5,E6  
    E7            E7                E7  
    E8            E8                E8  
    
    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.