Using DENSE_RANK with SOUNDEX to Detect Imperfect Duplicate Values in Character Data
The DENSE_RANK function is a window function that provides an increasing integer value based on repetition in the ordering clause of the window. The ranks can be broken down into groups using the partition by option in the over clause of the window function.
The SOUNDEX function returns an alphanumeric code indicating a consonant sound pattern (w,h,y are treated as vowels for this purpose). The pattern begins with the first letter of the character string irrespective of its being a consonant or vowel. The pattern that follows the first letter is a number each digit representing which group of consonants each detected consonant belongs to.
By building a table valued expression that includes a soundex against the target character column and running a dense rank against it you have seeds for keying columns on the basis of both a shared soundex (the dense rank) and the permutations of character values sharing a soundex value (the soundex and the key off of the source).
Consider the following:
IF OBJECT_ID ('tempdb..#t_soundexrank_example_source') IS NOT NULL DROP TABLE #t_soundexrank_example_source
CREATE TABLE #t_soundexrank_example_source (c_rownum INT IDENTITY(1,1), c_charval VARCHAR (25));
IF OBJECT_ID ('t_soundexrank_example_dest') IS NOT NULL DROP TABLE t_soundexrank_example_dest;
CREATE TABLE t_soundexrank_example_dest
(c_rownum INT IDENTITY(1,1), c_charval VARCHAR (25), c_soundex CHAR(4), c_startchar CHAR(1), c_partdrank INT, c_partrow INT);
WITH CTE1 AS
(SELECT c_rownum, c_charval, SOUNDEX(c_charval) AS c_soundex, LEFT(c_charval,1) AS c_startchar FROM #t_soundexrank_example_source)
INSERT INTO t_soundexrank_example_dest
SELECT c_charval, c_soundex, c_startchar,
DENSE_RANK() OVER (PARTITION BY c_startchar ORDER BY c_soundex) AS c_partdrank,
ROW_NUMBER() OVER (PARTITION BY c_startchar ORDER BY c_soundex) AS c_partrow
FROM CTE1
DECLARE @t_charlist TABLE (c_startchar char(1));
DECLARE @v_char CHAR(1);
DECLARE @v_idxname VARCHAR(15);
INSERT INTO @t_charlist SELECT DISTINCT (c_startchar) FROM t_soundexrank_example_dest;
WHILE EXISTS (SELECT c_startchar FROM @t_charlist)
BEGIN
SET @v_char =
(SELECT TOP 1 c_startchar FROM @t_charlist);
SET @v_idxname = 'ix_charlookup_' + @v_char;
EXEC (
'CREATE INDEX ' + @v_idxname + ' ON
#t_soundex_example_dest (c_soundex)
INCLUDE (c_partdrank, c_partrow, c_rownum)
WHERE c_startchar = @v_char;'
);
DELETE TOP(1) FROM @t_charlist;
END
;
GO
CREATE FUNCTION f_soundexref
(
@p_charlookup VARCHAR(25)
)
RETURNS TABLE
RETURN
SELECT c_rownum, c_partdrank
FROM t_soundexrank_example_dest
WHERE LEFT(SOUNDEX(@p_charlookup),1) = c_startchar
AND SOUNDEX(@p_charlookup) = c_soundex
;
By populating into the source temporary table the character valued list you wish to examine you can move through the CTE into the destination temporary table values that represent the soundex scoring along with values that provide indications of repetition with in the group defined by the first character of the strings being examined. Extended to a more robust case this would allow for combined function or procedure and indexing strategies that could be used to speed up the querying of fuzzy matched strings.
The preceding code should generate indexes for each of the starting characters. This does not take into consideration the distributions of such characters in naturally occurring datasets. Also in the code below is a function that should provide an effective utilization of these indexes to search for the DENSE_RANK and row number pairs (for mapping to the source) for all values sharing a SOUNDEX pattern.
*