Share via

How to Implement Fuzzy Matching in Azure Data Factory (ADF) with Similarity and Confidence Scoring Like SSIS Fuzzy Lookup?

Priyanka Rani 0 Reputation points
2025-07-08T13:32:24.3366667+00:00

Hi everyone,

In SQL Server Integration Services (SSIS), the Fuzzy Lookup transformation allows us to match input records against a reference table using fuzzy matching logic. It also automatically generates Similarity and Confidence columns for each matched record.

I’m currently in the process of migrating a data pipeline from SSIS to Azure Data Factory (ADF) and am trying to replicate this functionality.

Key objectives:

  • Perform fuzzy matching between two datasets (e.g., customer names).

Compute a similarity score or confidence level similar to what SSIS provides.

Retain the best match per input record, if applicable.

Questions:

What’s the recommended way to implement fuzzy matching in ADF?

Is there a built-in feature or transformation in ADF that replicates SSIS’s Fuzzy Lookup behavior?

How can we compute similarity or confidence scores in ADF? I’ve tried using algorithms like Levenshtein and Jaccard, but the results didn’t match the output from SSIS Fuzzy Lookup.

Are there any proven workarounds, such as using Azure Functions, Databricks, or custom activities, to achieve this in a scalable and maintainable way?

Any guidance, best practices, or examples would be greatly appreciated!

Thanks in advance.

Azure Data Factory
Azure Data Factory

An Azure service for ingesting, preparing, and transforming data at scale.


1 answer

Sort by: Most helpful
  1. Venkat Reddy Navari 5,840 Reputation points Microsoft External Staff Moderator
    2025-07-10T18:03:46.09+00:00

    Hi Priyanka Rani You’re absolutely right this is a common scenario when migrating SSIS workloads to Azure Data Factory (ADF). Unfortunately, ADF does not have a built-in transformation that fully replicates SSIS’s Fuzzy Lookup behavior especially the automatic generation of Similarity and Confidence scores for each matched record.

    ADF Data Flows do offer some fuzzy matching capabilities through:

    • The Join Transformation with approximate string matching logic.
    • Derived column expressions using custom similarity functions (like Levenshtein or Jaccard).

    Recommended Workarounds

    1. Here are the most common alternatives to achieve reliable fuzzy matching in ADF pipelines: Use Azure Databricks or Synapse Spark: This is the most flexible and scalable option for complex fuzzy matching. You can leverage Spark libraries like FuzzyWuzzy or write custom Python/Scala UDFs to compute Similarity and Confidence scores. The results can be written back to your Data Lake or SQL database for downstream use.
    2. Call an Azure Function or Custom Activity: For smaller workloads, you can build a custom Azure Function (Python or C#) with your fuzzy logic. Trigger it from your ADF pipeline to process data and return scores. This works well if you only need to match a moderate volume of records.
    3. Precompute or Cache Reference Matches: If your reference dataset doesn’t change often, you can precompute possible matches and similarity scores. Store these results in a lookup table and join them during your ADF pipeline runs.

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Was this answer helpful?


Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.