Fuzzy Matching with number of occurance in sql server

Uma 421 Reputation points
2020-10-06T19:21:41.41+00:00

I am using sql server 2016.

My Source table data column looks like

DECLARE @SOURCE TABLE  
(  
SRCNAME VARCHAR(100)  
)  
  
INSERT @SOURCE  
  
   
SELECT 'Stephan sBaum' UNION ALL                       
SELECT 'Ida debSou' UNION ALL            
SELECT 'Hector nicoCarrasco' UNION ALL                  
SELECT 'LUKE RUDD' UNION ALL   
SELECT 'S.Caljouw' UNION ALL          
SELECT 'Christelle Bregnauld' UNION ALL                  
SELECT 'Mike'  

And Target table data looks like.

DECLARE @target TABLE
(
TGTNAME VARCHAR(100)
)

INSERT @target

SELECT 'Stephen' UNION ALL
SELECT 'Hacktere' UNION ALL
SELECT 'Hacktery' UNION ALL
SELECT 'Stephan' UNION ALL
SELECT 'luky rodd' UNION ALL
SELECT 'Christ' union all
select 'Mike'

i am trying to do fuzzy match and check which all values are present in target and number of occurance.

Trying to get output in 2 table

Output 1 looks like.

30564-hash1.png

and output2 looks like.

30565-hash2.png

Is there any way to achieve this kind of issue in sql server 2016

Please help me.

Thanks a lot

Transact-SQL
Transact-SQL
A Microsoft extension to the ANSI SQL language that includes procedural programming, local variables, and various support functions.
4,516 questions
0 comments No comments
{count} votes

Accepted answer
  1. MelissaMa-MSFT 24,176 Reputation points
    2020-10-07T05:20:05.21+00:00

    Hi @Uma ,

    As mentioned by Erland, we tried with SSIS's fuzzy matching component. We could have below output. It seems that it is not working with some situations like 'LUKE RUDD' and 'luky rodd'.

    30430-ssis.png

    Later I found one good function 'Jaro-Winkler Stored Procedure' mentioned in
    Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL and it could be the perfect solution for your situation.

    Firstly, you could create the function as mentioned in above link.

    Then we could use this function and define the fuzzy matching degree should be more than 0.6 as below:

    drop table if exists #src   
      
    DECLARE @SOURCE TABLE  
     (  
     SRCNAME VARCHAR(100)  
     )  
          
     INSERT @SOURCE  
          
           
     SELECT 'Stephan sBaum' UNION ALL                       
     SELECT 'Ida debSou' UNION ALL            
     SELECT 'Hector nicoCarrasco' UNION ALL                  
     SELECT 'LUKE RUDD' UNION ALL   
     SELECT 'S.Caljouw' UNION ALL          
     SELECT 'Christelle Bregnauld' UNION ALL                  
     SELECT 'Mike'  
      
    DECLARE @TARGET TABLE  
    (  
    TGTNAME VARCHAR(100)  
    )  
      
    INSERT @TARGET  
      
    SELECT 'Stephen' UNION ALL  
    SELECT 'Hacktere' UNION ALL  
    SELECT 'Hacktery' UNION ALL  
    SELECT 'Stephan' UNION ALL  
    SELECT 'luky rodd' UNION ALL  
    SELECT 'Christ' union all  
    select 'Mike'  
      
    ;with cte as (  
    select a.SRCNAME,b.TGTNAME,[dbo].[fn_calculateJaroWinkler](a.SRCNAME,b.TGTNAME) similar  from @SOURCE a  
    cross apply @TARGET b)  
      
    select * into #src   
    from cte   
    where similar>0.6  
    and left(SRCNAME,1)=LEFT(TGTNAME,1)  
      
    select a.SRCNAME SCRCOLUMN,   
    sum(case when b.SRCNAME is null then 0 else 1 end) [No Of Occurance]  
    from @SOURCE a  
    left join #src b on a.SRCNAME=b.SRCNAME  
    group by a.SRCNAME  
    order by a.SRCNAME  
      
      
    select a.SRCNAME [SRC Name], b.TGTNAME [Targert Name]  
    from @SOURCE a  
    left join #src b on a.SRCNAME=b.SRCNAME  
    

    Output1:

    SCRCOLUMN	No Of Occurance  
    Christelle Bregnauld	1  
    Hector nicoCarrasco	2  
    Ida debSou	0  
    LUKE RUDD	1  
    Mike	1  
    S.Caljouw	0  
    Stephan sBaum	2  
    

    Output2:

    SRC Name	Targert Name  
    Stephan sBaum	Stephen  
    Stephan sBaum	Stephan  
    Ida debSou	NULL  
    Hector nicoCarrasco	Hacktere  
    Hector nicoCarrasco	Hacktery  
    LUKE RUDD	luky rodd  
    S.Caljouw	NULL  
    Christelle Bregnauld	Christ  
    Mike	Mike  
    

    Best regards
    Melissa


    If the answer is helpful, please click "Accept Answer" and upvote it.
    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

    0 comments No comments

1 additional answer

Sort by: Most helpful
  1. Erland Sommarskog 98,901 Reputation points
    2020-10-06T22:09:31.53+00:00

    There are the soundex and differences functions, but they are not very powerful.

    You need specialised tools for this. SSIS has a fuzzy matching component, I believe. Don't really know how good it is. SQL Server comes with Data Quality Services, but stay away, it shipped with SQL 2012 originally and has been more or less abandoned since then. My fried Dejan Sarka tells me that it is a piece of junk. Unfortunately, I don't work in this area myself, so I don't know what products that are out there.

    0 comments No comments