Find approximate matches with fuzzy string functions
Real-world data rarely matches perfectly. Customer names might be misspelled, addresses abbreviated differently, or product descriptions entered inconsistently. Fuzzy string matching functions help you find records that are similar but not identical, enabling data quality improvements, duplicate detection, and more flexible search capabilities.
Understand string similarity concepts
Fuzzy matching algorithms measure how similar two strings are by calculating the differences between them. Two primary approaches are commonly used:
Edit distance (Levenshtein distance) counts the minimum number of single-character operations (insertions, deletions, substitutions) needed to transform one string into another. Lower values indicate more similar strings.
Similarity scores express the relationship between strings as a percentage or ratio, where higher values indicate greater similarity.
Consider these examples:
- "color" → "colour": edit distance = 1 (insert 'u')
- "database" → "databaes": edit distance = 2 (swap 'e' and 's')
- "Microsoft" → "Microsft": edit distance = 1 (delete 'o')
Note
Fuzzy matching is computationally expensive compared to exact matching. Use it strategically, typically on candidate sets that have been pre-filtered using other criteria.
Calculate edit distance with EDIT_DISTANCE
The EDIT_DISTANCE function returns the Levenshtein distance between two strings, which is the minimum number of edits required to transform one string into the other. The goal is to find strings that are similar based on a defined threshold.
The following example demonstrates how to use EDIT_DISTANCE:
SELECT
EDIT_DISTANCE('color', 'colour') AS ColorVariant, -- Returns 1
EDIT_DISTANCE('database', 'databaes') AS Typo, -- Returns 2
EDIT_DISTANCE('SQL Server', 'SQL Server') AS Exact, -- Returns 0
EDIT_DISTANCE('hello', 'world') AS Different; -- Returns 4
You can use EDIT_DISTANCE to find records that might be duplicates or matches despite slight variations:
-- Find customers with similar names to a search term
DECLARE @searchName NVARCHAR(100) = 'Jon Smith';
SELECT
CustomerID,
FirstName,
LastName,
FirstName + ' ' + LastName AS FullName,
EDIT_DISTANCE(@searchName, FirstName + ' ' + LastName) AS EditDistance
FROM SalesLT.Customer
WHERE EDIT_DISTANCE(@searchName, FirstName + ' ' + LastName) <= 3
ORDER BY EDIT_DISTANCE(@searchName, FirstName + ' ' + LastName);
Also, you can find potential duplicate products:
-- Find product pairs with similar names
SELECT
p1.ProductID AS Product1ID,
p1.Name AS Product1Name,
p2.ProductID AS Product2ID,
p2.Name AS Product2Name,
EDIT_DISTANCE(p1.Name, p2.Name) AS EditDistance
FROM SalesLT.Product AS p1
INNER JOIN SalesLT.Product AS p2
ON p1.ProductID < p2.ProductID
WHERE EDIT_DISTANCE(p1.Name, p2.Name) <= 5
ORDER BY EDIT_DISTANCE(p1.Name, p2.Name);
Tip
The maximum meaningful edit distance depends on string length. For short strings (5-10 characters), an edit distance of 1-2 indicates similarity. For longer strings, you might allow distances of 3-5.
Measure similarity with EDIT_DISTANCE_SIMILARITY
EDIT_DISTANCE_SIMILARITY returns a normalized similarity score between 0 and 100, where 100 represents identical strings. This percentage-based metric is easier to interpret than raw edit distance, especially when comparing strings of different lengths:
SELECT
EDIT_DISTANCE_SIMILARITY('color', 'colour') AS ColorSimilarity, -- ~85
EDIT_DISTANCE_SIMILARITY('database', 'databaes') AS TypoSimilarity, -- ~75
EDIT_DISTANCE_SIMILARITY('SQL', 'SQL Server') AS PartialMatch, -- ~30
EDIT_DISTANCE_SIMILARITY('hello', 'hello') AS Exact; -- 100
You can use similarity scores to find approximate matches with a threshold like the following example:
-- Find products similar to a search term (at least 70% similar)
DECLARE @searchTerm NVARCHAR(100) = 'Mountain Bike Frame';
SELECT
ProductID,
Name,
EDIT_DISTANCE_SIMILARITY(@searchTerm, Name) AS SimilarityScore
FROM SalesLT.Product
WHERE EDIT_DISTANCE_SIMILARITY(@searchTerm, Name) >= 70
ORDER BY EDIT_DISTANCE_SIMILARITY(@searchTerm, Name) DESC;
Calculate phonetic similarity with JARO_WINKLER_DISTANCE
The Jaro-Winkler algorithm is specifically designed for comparing names and short strings. It gives higher scores to strings that match from the beginning, making it particularly effective for person names where prefixes are more significant:
SELECT
JARO_WINKLER_DISTANCE('MARTHA', 'MARHTA') AS NameTypo, -- ~0.96
JARO_WINKLER_DISTANCE('JONES', 'JOHNSON') AS SimilarNames, -- ~0.83
JARO_WINKLER_DISTANCE('JOHN', 'JON') AS NameVariant, -- ~0.93
JARO_WINKLER_DISTANCE('SMITH', 'SMYTH') AS SpellingVar; -- ~0.96
The Jaro-Winkler score ranges from 0 to 1, where 1 indicates identical strings. A score above 0.9 typically indicates a strong match for names.
The following example finds customers with names similar to a search input:
-- Find customers with names similar to a search
DECLARE @searchFirst NVARCHAR(50) = 'John';
DECLARE @searchLast NVARCHAR(50) = 'Smythe';
SELECT
CustomerID,
FirstName,
LastName,
JARO_WINKLER_DISTANCE(@searchFirst, FirstName) AS FirstNameScore,
JARO_WINKLER_DISTANCE(@searchLast, LastName) AS LastNameScore,
(JARO_WINKLER_DISTANCE(@searchFirst, FirstName) +
JARO_WINKLER_DISTANCE(@searchLast, LastName)) / 2 AS CombinedScore
FROM SalesLT.Customer
WHERE JARO_WINKLER_DISTANCE(@searchFirst, FirstName) > 0.85
AND JARO_WINKLER_DISTANCE(@searchLast, LastName) > 0.85
ORDER BY CombinedScore DESC;
Note
Jaro-Winkler is optimized for short strings like names. For longer strings like addresses or descriptions, EDIT_DISTANCE_SIMILARITY often provides better results.
Performance considerations
Fuzzy matching functions examine every character in both strings, making them computationally intensive. Exact string comparison can stop as soon as characters differ, and indexed lookups use efficient B-tree traversal. In contrast, fuzzy algorithms must calculate similarity scores character by character. For a table with one million rows, an unoptimized fuzzy search might perform one million similarity calculations, each involving dozens of character comparisons.
The key to efficient fuzzy matching is reducing the candidate set before applying the expensive fuzzy functions. Use indexed columns with LIKE patterns, exact matches on related fields, or range filters to narrow results first. Only then apply fuzzy matching to the smaller candidate set.
The following examples show this progressive filtering approach:
-- Not good: Fuzzy match against entire table
SELECT * FROM LargeCustomerTable
WHERE EDIT_DISTANCE_SIMILARITY('John Smith', FullName) > 70;
-- Better: Pre-filter before fuzzy matching
SELECT * FROM LargeCustomerTable
WHERE FullName LIKE 'J%' -- First letter filter
AND EDIT_DISTANCE_SIMILARITY('John Smith', FullName) > 70;
-- Best: Use multiple pre-filters
SELECT * FROM LargeCustomerTable
WHERE FirstName LIKE 'Jo%'
AND LastName LIKE 'Sm%'
AND JARO_WINKLER_DISTANCE('John', FirstName) > 0.85
AND JARO_WINKLER_DISTANCE('Smith', LastName) > 0.85;
Important
Fuzzy string matching functions like EDIT_DISTANCE, EDIT_DISTANCE_SIMILARITY, and JARO_WINKLER_DISTANCE are available in SQL Server 2025 and later, Azure SQL Database, and SQL databases in Microsoft Fabric. Check your platform's documentation for specific feature availability.
For more information about fuzzy string matching, see String Functions.