Encoding and similarity functions

Completed

SQL Server 2025 introduces new functions for string encoding and similarity measurement. These features support secure data transformation, fuzzy matching, and natural-language comparisons directly within T-SQL.
They make it easier to detect duplicates, standardize records, or find near matches without relying on external tools or user-defined functions.

Overview of Encoding and Similarity Functions

BASE64_ENCODE

Converts a string or binary value into Base64 format. Commonly used to encode credentials, binary data, or text that needs to be safely stored or transmitted through JSON or web APIs.

BASE64_DECODE

Reverses the encoding, converting Base64-encoded text back into its original form.

Together, these two functions help manage secure, portable data exchange between SQL Server and other systems.


STRING_SIMILARITY

Returns a similarity score (between 0.0 and 1.0) comparing two strings based on character sequences.
A higher score means the two strings are more alike.

EDIT_DISTANCE

Computes the number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

EDIT_DISTANCE_SIMILARITY

Converts edit distance into a similarity score, where 1.0 represents an exact match and 0.0 means no similarity.

JARO_WINKLER_DISTANCE

Measures the distance between two strings based on transpositions and prefix matches. Commonly used for comparing names or short identifiers.

JARO_WINKLER_SIMILARITY

Returns a similarity score from 0.0 to 1.0 using the Jaro-Winkler algorithm, emphasizing similarity at the beginning of the strings.
It’s useful for matching names, company records, or user input that might contain typos.


Example 1: Encode and Decode Sensitive Data

A system administrator wants to store API keys in encoded form to prevent accidental exposure during audits or exports. The keys can be Base64-encoded before being written to a configuration table.

Query: Encode and Decode Example

DECLARE @ApiKey NVARCHAR(100) = 'AppKey-2025-SECURE';
DECLARE @Encoded NVARCHAR(MAX);
DECLARE @Decoded NVARCHAR(MAX);

-- Encode the API key
SET @Encoded = BASE64_ENCODE(@ApiKey);

-- Decode it back
SET @Decoded = BASE64_DECODE(@Encoded);

SELECT @ApiKey AS OriginalValue, @Encoded AS EncodedValue, @Decoded AS DecodedValue;

Results

OriginalValue EncodedValue DecodedValue
AppKey-2025-SECURE QXBwS2V5LTIwMjUtU0VDVVJF AppKey-2025-SECURE

This example shows how data can be safely encoded for storage or transfer and decoded when needed, all within T-SQL.


Example 2: Compare Strings for Similarity

A data quality engineer needs to detect near-duplicate entries in a customer table caused by inconsistent data entry.
Using SQL Server 2025’s similarity functions, it’s possible to identify records that differ only slightly, such as typos or extra spaces.

Sample Data

CustomerID CustomerName
1 Jonathon Smith
2 Jonathan Smith
3 J. Smith
4 John Smith

Query: Compare Names with Similarity Functions

SELECT 
    A.CustomerName AS NameA,
    B.CustomerName AS NameB,
    STRING_SIMILARITY(A.CustomerName, B.CustomerName) AS SimilarityScore,
    EDIT_DISTANCE(A.CustomerName, B.CustomerName) AS EditSteps,
    JARO_WINKLER_SIMILARITY(A.CustomerName, B.CustomerName) AS JaroScore
FROM dbo.Customers A
JOIN dbo.Customers B
    ON A.CustomerID < B.CustomerID
ORDER BY SimilarityScore DESC;

Results

NameA NameB SimilarityScore EditSteps JaroScore
Jonathon Smith Jonathan Smith 0.93 1 0.94
Jonathan Smith John Smith 0.77 4 0.79
Jonathon Smith J. Smith 0.68 6 0.70
J. Smith John Smith 0.62 5 0.66

This output shows that "Jonathon Smith" and "Jonathan Smith" are nearly identical, while "J. Smith" and "John Smith" share partial similarity.
These scores can help identify probable duplicates before merging or cleaning data.


Example 3: Combining Encoding and Similarity

In some systems, encoded identifiers need to be compared indirectly. By decoding and then applying similarity, you can still detect patterns or duplicates across encoded data.

DECLARE @Encoded1 NVARCHAR(MAX) = BASE64_ENCODE('User-713');
DECLARE @Encoded2 NVARCHAR(MAX) = BASE64_ENCODE('User-713X');

SELECT 
    @Encoded1 AS EncodedA,
    @Encoded2 AS EncodedB,
    STRING_SIMILARITY(
        BASE64_DECODE(@Encoded1),
        BASE64_DECODE(@Encoded2)
    ) AS SimilarityScore;

Result

EncodedA EncodedB SimilarityScore
VXNlci03MTM= VXNlci03MTNY 0.93

This example shows that even after decoding Base64 values, SQL Server can compute similarity scores without extra processing layers.


Summary

SQL Server 2025 extends T-SQL with built-in functions for Base64 encoding and string similarity analysis.
Developers can now encode, decode, and compare strings directly in SQL queries to enhance data quality, detect near duplicates, and handle secure text transformations.
These tools simplify tasks that previously required external scripts or CLR integration, improving both performance and maintainability for text-heavy workloads.