Azure Cognitive Search Language Analyzers and Tokenizers

Anna Fischer 0 Reputation points Microsoft Employee
2023-02-06T21:13:57.59+00:00

I've run into an issue with languages that don't use many spaces (Japanese and Chinese, for example) and the wildcard character "*". I worked around it by sending the text to the analyzer first, then searching the index with spaces between the tokens and the wildcard at the end of the last token only. I want the wildcard on the last token so that search-as-you-type still works. However, making two calls to Azure Cognitive Search this way (one to the analyzer, then one to search the index) is slower. Is it possible to make a custom analyzer that mimics the behavior of a language analyzer but processes the wildcard after tokenizing the query?

Here is an example for a Chinese input:

input: "取消订阅*"

Because of the wildcard, analysis is skipped and the string is treated as a single token, so it does not match anything in the index.

As a workaround, I send this string to the zh-Hans.microsoft analyzer to get back the tokenized version, which is "取消" (token 1) and "订阅" (token 2).

Then I make a new query with spaces between the tokens and add the wildcard to the last token, like this: "取消 订阅*"

Now I match several documents in the index, which produces these suggestions:

"suggestions": [
"取消许可证",
"取消订阅",
"取消订阅",
"无法取消分配电子书",
"取消订阅订阅"
],
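The two-step workaround described above can be sketched in Python as follows. This is a minimal illustration, not the code actually used: the helper names `tokens_from_analyze_response` and `build_prefix_query` are hypothetical, and the sample response is trimmed to the token fields that the Analyze API returns.

```python
from __future__ import annotations


def tokens_from_analyze_response(response: dict) -> list[str]:
    """Pull the token strings out of an Analyze API response body."""
    return [item["token"] for item in response.get("tokens", [])]


def build_prefix_query(tokens: list[str]) -> str:
    """Join tokens with spaces and put the wildcard on the last token only."""
    if not tokens:
        return ""
    return " ".join(tokens[:-1] + [tokens[-1] + "*"])


# Hand-written sample standing in for the analyzer's reply to "取消订阅".
sample_response = {
    "tokens": [
        {"token": "取消", "startOffset": 0, "endOffset": 2, "position": 0},
        {"token": "订阅", "startOffset": 2, "endOffset": 4, "position": 1},
    ]
}

query = build_prefix_query(tokens_from_analyze_response(sample_response))
print(query)  # 取消 订阅*
```

The text sent to the analyzer should not include the trailing "*"; the wildcard is only appended after tokenization.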

 

I would prefer not to tokenize the query before searching the index, since tokenization happens again anyway when the index is searched, so the extra call just slows down the process. But because the wildcard causes the text to be treated as a single token, I have to do this to get the results I want. Since I don't know Chinese, I would have a hard time writing a custom analyzer or tokenizer, because I don't know where to split the words with spaces. I still want the wildcard on the last token so that partial words can be searched, but I don't want the entire string to be treated as a single token.

Azure AI Search

1 answer

  1. ajkuma 28,036 Reputation points Microsoft Employee Moderator
    2023-02-22T12:54:32.72+00:00

    If it fits your requirements, you can create a custom analyzer that mimics the behavior of a language analyzer but processes the wildcard after tokenizing the query. You can use the MicrosoftLanguageTokenizer and customize it to include the wildcard on the last token.

    Also, you may use the Analyze API to inspect the tokenized terms and see how the tokenizer is splitting the text. Additionally, you can use the maxTokenLength option to specify the maximum token length, which can help with tokenizing languages that don't use many spaces.
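As a rough sketch of the Analyze API call mentioned above, the following Python builds the request without sending it. The service name, index name, and key are placeholders, `build_analyze_request` is a hypothetical helper, and the `2020-06-30` API version is an assumption; adjust it to whatever version your service uses.

```python
import json
import urllib.request


def build_analyze_request(service, index, api_key, text,
                          analyzer="zh-Hans.microsoft",
                          api_version="2020-06-30"):
    """Build (but do not send) a POST request to the index's analyze endpoint."""
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           f"/analyze?api-version={api_version}")
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder values; replace with a real service, index, and key.
req = build_analyze_request("my-service", "my-index", "<api-key>", "取消订阅")
print(req.full_url)

# To actually send it and read the tokens:
# with urllib.request.urlopen(req) as resp:
#     tokens = json.load(resp)["tokens"]
```

Each entry in the returned `tokens` array carries the token text and its offsets, which makes it easy to see where the tokenizer split the input.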

     

    Just to highlight, to create a custom analyzer, you need to specify the char filters, tokenizer, and token filters that you want to use. You can use the standard tokenizer and token filters, or you can create your own custom tokenizer and token filters.

     

    To add more info on this, you may leverage the keyword tokenizer along with the lowercase token filter to search for partial terms in Azure Cognitive Search. The keyword tokenizer will preserve the entire string as a single token, while the lowercase token filter will convert all characters to lowercase. This way, you can search for partial terms while still preserving the wildcard on the last token.

    If you need to search for prefix matches, you can add an EdgeNGramTokenFilter to your custom analyzer. This will help generate additional tokens for 2-25 character combinations, including characters in the prefix. This approach will result in a larger index, but it will also result in faster response times.
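One way the pieces above might fit together is sketched below as a Python dictionary mirroring the JSON of an index definition. It is an assumption-laden fragment, not a tested configuration: the field name `title` and the names `zh_prefix_index`, `zh_prefix_search`, `zh_tokenizer`, and `front_edge_ngram` are made up for illustration. The idea is to apply the edge n-gram filter only at indexing time, so a query sent without a wildcard still goes through the search analyzer and matches prefixes.

```python
import json

CUSTOM = "#Microsoft.Azure.Search.CustomAnalyzer"

index_fragment = {
    "fields": [
        {
            "name": "title",  # hypothetical field name
            "type": "Edm.String",
            "searchable": True,
            "indexAnalyzer": "zh_prefix_index",    # emits edge n-grams
            "searchAnalyzer": "zh_prefix_search",  # tokenizes only
        }
    ],
    "analyzers": [
        {"name": "zh_prefix_index", "@odata.type": CUSTOM,
         "tokenizer": "zh_tokenizer",
         "tokenFilters": ["lowercase", "front_edge_ngram"]},
        {"name": "zh_prefix_search", "@odata.type": CUSTOM,
         "tokenizer": "zh_tokenizer",
         "tokenFilters": ["lowercase"]},
    ],
    "tokenizers": [
        {"name": "zh_tokenizer",
         "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageTokenizer",
         "language": "chineseSimplified",
         "maxTokenLength": 255},
    ],
    "tokenFilters": [
        {"name": "front_edge_ngram",
         "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
         "minGram": 2,   # the 2-25 range mentioned above
         "maxGram": 25,
         "side": "front"},
    ],
}

print(json.dumps(index_fragment, ensure_ascii=False, indent=2))
```

Because Chinese tokens are often only one or two characters long, it is worth running sample text through the Analyze API with the indexing analyzer to check which n-grams are actually produced before settling on `minGram`.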

    Check out these reference docs for more info:

    Add custom analyzers to string fields in an Azure Cognitive Search index

    Partial term search and patterns with special characters (hyphens, wildcard, regex, patterns)
