How to configure fuzzy matching in Azure Cognitive Search with full Lucene syntax to handle errors in the middle of strings?

Nick Petzold 5 Reputation points
2023-09-11T10:35:00.73+00:00

I'm trying to build a query that returns fuzzy matches from an index, which I have simplified to the definition below:

{
  "@odata.context": "",
  "@odata.etag": "",
  "name": "index7",
  "defaultScoringProfile": null,
  "fields": [
    {
      "name": "JurisdictionCode",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchConfiguration": null,
      "synonymMaps": []
    },
    {
      "name": "Aliases",
      "type": "Collection(Edm.ComplexType)",
      "fields": [
        {
          "name": "OriginalName",
          "type": "Edm.String",
          "searchable": false,
          "filterable": false,
          "retrievable": true,
          "sortable": false,
          "facetable": false,
          "key": false,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "analyzer": null,
          "normalizer": null,
          "dimensions": null,
          "vectorSearchConfiguration": null,
          "synonymMaps": []
        },
        {
          "name": "NormalName",
          "type": "Edm.String",
          "searchable": true,
          "filterable": false,
          "retrievable": true,
          "sortable": false,
          "facetable": false,
          "key": false,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "analyzer": "normal_name_analyzer",
          "normalizer": null,
          "dimensions": null,
          "vectorSearchConfiguration": null,
          "synonymMaps": []
        }
      ]
    }
  ],
  "scoringProfiles": [],
  "corsOptions": null,
  "suggesters": [],
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "normal_name_analyzer",
      "tokenizer": "normal_name_tokenizer",
      "tokenFilters": [
        "lowercase"
      ],
      "charFilters": []
    }
  ],
  "normalizers": [],
  "tokenizers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.NGramTokenizer",
      "name": "normal_name_tokenizer",
      "minGram": 3,
      "maxGram": 3,
      "tokenChars": []
    }
  ],
  "tokenFilters": [],
  "charFilters": [],
  "encryptionKey": null,
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
    "k1": null,
    "b": null
  },
  "semantic": null,
  "vectorSearch": null
}

I am querying the index via the Python SDK using the following query shape and options:

Query: (Aliases/NormalName:newyorkknicks AND JurisdictionCode:"us_ny")

Options: {"query_type": "full", "search_mode": "all", "top": 20}
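For reference, here is a minimal sketch of how the query string and options above might be assembled in Python. The helper name `build_query` and the constant `SEARCH_OPTIONS` are illustrative only and not part of the SDK; the commented-out call assumes an already configured `SearchClient` from `azure-search-documents` named `client`, which is not shown here.

```python
def build_query(normal_name: str, jurisdiction: str) -> str:
    """Build the full-Lucene query string used in this post (hypothetical helper)."""
    return f'(Aliases/NormalName:{normal_name} AND JurisdictionCode:"{jurisdiction}")'


# Options exactly as listed above.
SEARCH_OPTIONS = {"query_type": "full", "search_mode": "all", "top": 20}

query = build_query("newyorkknicks", "us_ny")
print(query)

# Assumption: `client` is a configured azure.search.documents.SearchClient.
# results = client.search(search_text=query, **SEARCH_OPTIONS)
```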

This query returns the correct record, newyorkknicks.

If I then shorten the search term by removing characters from the start, the end, or both, the correct record is still returned, e.g.

Start removed: (Aliases/NormalName:wyorkknicks AND JurisdictionCode:"us_ny")

End removed: (Aliases/NormalName:newyorkknic AND JurisdictionCode:"us_ny")

Both removed: (Aliases/NormalName:wyorkknic AND JurisdictionCode:"us_ny")

However, whenever a character is removed from the middle of the string, no records are returned, e.g.

(Aliases/NormalName:newyrkknicks AND JurisdictionCode:"us_ny")

Just by looking at the strings, newyrkknicks appears to have a much higher NGram similarity to the indexed value newyorkknicks than wyorkknic does, so I can't see why the latter returns a match while the former does not. It seems like some sort of edge NGram behaviour might be at play here, but I haven't configured it like that (at least I don't think I have!).
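To make the comparison concrete, here is a small sketch that lists the trigrams of each query term and checks them against the trigrams of the indexed value. It assumes the tokenizer (minGram 3, maxGram 3) emits every contiguous 3-character substring and that the same analyzer is applied to the query term; the function names `trigrams` and `missing` are illustrative. Whether the service requires every query trigram to be present when search_mode is "all" is an assumption that this post does not confirm.

```python
def trigrams(text: str, n: int = 3) -> list[str]:
    """Return every contiguous n-character substring, lowercased."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Trigrams assumed to be stored for the indexed value.
indexed = set(trigrams("newyorkknicks"))


def missing(query: str) -> list[str]:
    """Query trigrams that do not occur in the indexed value."""
    return [g for g in trigrams(query) if g not in indexed]


for q in ("wyorkknicks", "newyorkknic", "wyorkknic", "newyrkknicks"):
    print(q, missing(q))
# wyorkknicks []
# newyorkknic []
# wyorkknic []
# newyrkknicks ['wyr', 'yrk']
```

The three trimmed terms consist entirely of trigrams that also occur in newyorkknicks, while the term with the missing inner character introduces two trigrams (wyr, yrk) that the indexed value does not contain. If every query trigram has to match, that difference would be consistent with the results described above.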

Does anyone have any suggestions as to what I'm doing wrong here?

Azure AI Search