How do I configure fuzzy matching in Azure Cognitive Search with the full Lucene syntax so that it tolerates errors in the middle of strings?
I'm trying to build a query that can return fuzzy matches from an index, which I'll simplify to the following:
{
  "@odata.context": "",
  "@odata.etag": "",
  "name": "index7",
  "defaultScoringProfile": null,
  "fields": [
    {
      "name": "JurisdictionCode",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchConfiguration": null,
      "synonymMaps": []
    },
    {
      "name": "Aliases",
      "type": "Collection(Edm.ComplexType)",
      "fields": [
        {
          "name": "OriginalName",
          "type": "Edm.String",
          "searchable": false,
          "filterable": false,
          "retrievable": true,
          "sortable": false,
          "facetable": false,
          "key": false,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "analyzer": null,
          "normalizer": null,
          "dimensions": null,
          "vectorSearchConfiguration": null,
          "synonymMaps": []
        },
        {
          "name": "NormalName",
          "type": "Edm.String",
          "searchable": true,
          "filterable": false,
          "retrievable": true,
          "sortable": false,
          "facetable": false,
          "key": false,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "analyzer": "normal_name_analyzer",
          "normalizer": null,
          "dimensions": null,
          "vectorSearchConfiguration": null,
          "synonymMaps": []
        }
      ]
    }
  ],
  "scoringProfiles": [],
  "corsOptions": null,
  "suggesters": [],
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "normal_name_analyzer",
      "tokenizer": "normal_name_tokenizer",
      "tokenFilters": [
        "lowercase"
      ],
      "charFilters": []
    }
  ],
  "normalizers": [],
  "tokenizers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.NGramTokenizer",
      "name": "normal_name_tokenizer",
      "minGram": 3,
      "maxGram": 3,
      "tokenChars": []
    }
  ],
  "tokenFilters": [],
  "charFilters": [],
  "encryptionKey": null,
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
    "k1": null,
    "b": null
  },
  "semantic": null,
  "vectorSearch": null
}
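For context, my mental model of normal_name_analyzer is that it splits a value into every overlapping lowercase 3-character window (the NGramTokenizer with minGram = maxGram = 3, followed by the lowercase filter). A rough pure-Python sketch of that assumption — my approximation, not the actual Lucene implementation:

```python
def trigrams(text: str) -> list[str]:
    # Approximate normal_name_analyzer: lowercase the input, then emit
    # every overlapping 3-character window (minGram = maxGram = 3).
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

print(trigrams("newyorkknicks"))
# ['new', 'ewy', 'wyo', 'yor', 'ork', 'rkk', 'kkn', 'kni', 'nic', 'ick', 'cks']
```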
I am querying the index via the Python SDK with the following query and options:
Query: (Aliases/NormalName:newyorkknicks AND JurisdictionCode:"us_ny")
Options: {"query_type": "full", "search_mode": "all", "top": 20}
This query succeeds in returning the correct record, newyorkknicks.
If I then update the query by removing characters from the start, the end, or both ends of the search term, the correct record is still identified, e.g.:
Start removed: (Aliases/NormalName:wyorkknicks AND JurisdictionCode:"us_ny")
End removed: (Aliases/NormalName:newyorkknic AND JurisdictionCode:"us_ny")
Both removed: (Aliases/NormalName:wyorkknic AND JurisdictionCode:"us_ny")
However, whenever a character is removed from the middle of the string, no records are returned, e.g.:
(Aliases/NormalName:newyrkknicks AND JurisdictionCode:"us_ny")
Just by looking at the strings, newyrkknicks has a much higher NGram similarity to the indexed value than wyorkknic, so I can't see why the latter returns a match while the former does not. It seems as though some sort of edge-NGram matching is at play, but I haven't configured it like that (at least I don't think I have!).
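To sanity-check that intuition, here's a rough comparison of each variant's trigrams against the indexed term's (using my own approximation of the 3-gram tokenizer plus lowercase filter, not the actual Lucene behaviour):

```python
def trigrams(text: str) -> set[str]:
    # Assumed analyzer behaviour: lowercase, then every overlapping
    # 3-character window (minGram = maxGram = 3).
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

indexed = trigrams("newyorkknicks")
for query in ("wyorkknic", "newyrkknicks"):
    grams = trigrams(query)
    unmatched = sorted(grams - indexed)
    print(f"{query}: {len(grams & indexed)}/{len(grams)} trigrams match, "
          f"unmatched: {unmatched}")
# wyorkknic: 7/7 trigrams match, unmatched: []
# newyrkknicks: 8/10 trigrams match, unmatched: ['wyr', 'yrk']
```

If this approximation is faithful, the middle-edited variant actually matches more trigrams in absolute terms (8 vs. 7), though it is also the only variant that produces trigrams (wyr, yrk) the indexed value never emits — every trigram of wyorkknic is already present in newyorkknicks.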
Does anyone have any suggestions as to what I'm doing wrong here?