AI Search: tokenizing phone numbers

Jay Lin 20 Reputation points
2024-06-25T04:29:33.57+00:00

Hi there,

I have another question, this time about building a custom phone number analyzer.

For instance, +61 2 8364 5809 should be found when the user searches for any of:

  1. 61 2 8364 5809
  2. 61283645809
  3. 8364 5809
  4. 83645809
  5. 8364
  6. 836
  7. 5809

It should not be found when the user searches for:

  1. 809

I use a PatternCaptureTokenFilter (PreserveOriginal = true) followed by a PatternReplaceTokenFilter to strip "+", "(", ")" and spaces.

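// Capture each run of characters that is not "+", "(", ")" or whitespace,
// and keep the original token as well.
var phoneFilter = new PatternCaptureTokenFilter("phone_filter", new string[] { "([^()\\+\\s]+)" });
phoneFilter.PreserveOriginal = true;
tokenFilterList.Add(phoneFilter);
// Strip any remaining non-word characters from each token.
var phoneCleanupFilter = new PatternReplaceTokenFilter("phone_cleanup_filter", "\\W+", string.Empty);
tokenFilterList.Add(phoneCleanupFilter);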

[Image: Custom-Phone]

This analyzer meets every requirement except #4. However, as soon as I add an EdgeNGramTokenFilter after phoneFilter and phoneCleanupFilter to capture the rightmost 8 to 10 digits, every token generated above that is shorter than 8 characters is removed.

var eightEdgeGramsFilter = new EdgeNGramTokenFilter("8_10_edgegrams");
eightEdgeGramsFilter.MinGram = 8;
eightEdgeGramsFilter.MaxGram = 10;
eightEdgeGramsFilter.Side = EdgeNGramTokenFilterSide.Back;
tokenFilterList.Add(eightEdgeGramsFilter);

[Image: EdgeGrams-8-10]
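
To make the problem concrete, here is a rough simulation of a back-sided edge n-gram step in plain C#. It is only a simplified model of the behaviour described above, not the service's implementation: `BackEdgeNGrams` is a hypothetical helper, and the input tokens are assumed examples, since the real token stream depends on the tokenizer and the earlier filters.

```csharp
using System;
using System.Collections.Generic;

public static class BackEdgeNGramDemo
{
    // Hypothetical helper (not part of the Azure SDK): returns the trailing
    // substrings of `token` whose length lies between minGram and maxGram.
    // A token shorter than minGram produces no grams at all.
    public static List<string> BackEdgeNGrams(string token, int minGram, int maxGram)
    {
        var grams = new List<string>();
        for (int size = minGram; size <= maxGram && size <= token.Length; size++)
        {
            grams.Add(token.Substring(token.Length - size));
        }
        return grams;
    }

    public static void Main()
    {
        // Assumed input tokens, chosen to mirror the phone number in the question.
        foreach (var token in new[] { "61283645809", "8364", "5809" })
        {
            var grams = BackEdgeNGrams(token, 8, 10);
            Console.WriteLine($"{token} -> [{string.Join(", ", grams)}]");
        }
    }
}
```

In this model the 11-digit token yields its last 8, 9 and 10 digits, while "8364" and "5809" yield nothing, which matches the loss of the shorter tokens.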

Is there a way to preserve the original token in EdgeNGramTokenFilter? Or is there a better way to get the rightmost 8 to 10 digits?

Regards,

Jay

Azure AI Search

1 answer

  1. Jay Lin 20 Reputation points
    2024-08-20T03:22:21.23+00:00

    Instead of using the out-of-the-box EdgeNGramTokenFilter, I ended up configuring a PatternCaptureTokenFilter with patterns that capture the last 8 to 10 digits.

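    // One pattern per length; the negative lookahead (?!.*\d) only allows a match that has no digit after it.
    var tokenFilter = new PatternCaptureTokenFilter("8_10_digits_filter", new string[] { "(\\d{8})(?!.*\\d)", "(\\d{9})(?!.*\\d)", "(\\d{10})(?!.*\\d)" }) { PreserveOriginal = true };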
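
For anyone who wants to check what these three patterns capture, the sketch below runs them through .NET's `System.Text.RegularExpressions` outside the search service. This is an approximation: the service evaluates the patterns with its own regex engine, `Capture` is a hypothetical helper, and the input token is an assumed example of what reaches the filter after cleanup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LastDigitsDemo
{
    // The same three patterns as in the filter above, written as verbatim strings.
    private static readonly string[] Patterns =
    {
        @"(\d{8})(?!.*\d)",
        @"(\d{9})(?!.*\d)",
        @"(\d{10})(?!.*\d)",
    };

    // Hypothetical helper: collects capture group 1 of every match of every pattern.
    public static List<string> Capture(string token)
    {
        var results = new List<string>();
        foreach (var pattern in Patterns)
        {
            foreach (Match m in Regex.Matches(token, pattern))
            {
                results.Add(m.Groups[1].Value);
            }
        }
        return results;
    }

    public static void Main()
    {
        // Assumed token: the phone number from the question with "+" and spaces removed.
        foreach (var captured in Capture("61283645809"))
        {
            Console.WriteLine(captured);
        }
        // Prints:
        // 83645809
        // 283645809
        // 1283645809
    }
}
```

A token with fewer than 8 digits, such as "8364", matches none of the patterns, so it survives only because PreserveOriginal = true keeps the original token.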
