Why does tokenization move a letter to the preceding word?

Tc̄no John 20 Reputation points
2023-04-28T15:31:35.61+00:00

Why would the tokenization of French text move a specific letter onto the number that precedes it?

A number in a French full-text catalog cannot be found because of a puzzling tokenization behavior.

The following text is in a French full-text catalog, but the number 1123477 cannot be found with CONTAINS.

Félix Thecat 1123477 Félix123 felixthecat

The tokenization of this string returns the tokens "1123477f" and "elix123" instead of "1123477" and "felix123":

SELECT * FROM sys.dm_fts_parser(N'"Félix Thecat 1123477 Félix123 felixthecat"', 1036, NULL, 0);
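To check what the index itself stores, rather than only what the parser reports, a small reproduction may help. This is a minimal sketch under stated assumptions: the table dbo.FtsRepro, the key PK_FtsRepro, and the catalog FtsReproCatalog are hypothetical names that do not come from the original setup. Full-text population is asynchronous, so the last two queries may return nothing until it has finished.

```sql
-- Hypothetical repro objects; all names are illustrative only.
CREATE TABLE dbo.FtsRepro
(
    Id      INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_FtsRepro PRIMARY KEY,
    Content NVARCHAR(400) NOT NULL
);
GO

CREATE FULLTEXT CATALOG FtsReproCatalog;
GO

-- Index the column with the French word breaker (LCID 1036).
CREATE FULLTEXT INDEX ON dbo.FtsRepro (Content LANGUAGE 1036)
    KEY INDEX PK_FtsRepro
    ON FtsReproCatalog
    WITH CHANGE_TRACKING AUTO;
GO

INSERT INTO dbo.FtsRepro (Content)
VALUES (N'Félix Thecat 1123477 Félix123 felixthecat');
GO

-- Run once population has completed: list the keywords stored in the index.
SELECT display_term, document_count
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID(N'dbo.FtsRepro'))
ORDER BY display_term;

-- The search that is reported to miss the row.
SELECT Id, Content
FROM dbo.FtsRepro
WHERE CONTAINS(Content, N'1123477');
```

If the keyword list shows "1123477f" rather than "1123477", that would match the parser output above and explain why the CONTAINS search finds no row.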

It only happens when the word following the number starts with a capital F followed by a letter with a diacritic, or when it is a capital F alone.

It does not happen when the number starts with 0.

It does not happen with other languages.
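The three observations above can be checked side by side with sys.dm_fts_parser. The sketch below is only an illustration: the variant with 0123477 and the choice of English (LCID 1033) as the comparison language are assumptions made here, not inputs taken from the original post. Running sys.dm_fts_parser requires membership in the sysadmin fixed server role.

```sql
-- Reported case: French word breaker (LCID 1036), accent-insensitive.
SELECT N'fr, 1123477' AS test_case, occurrence, display_term
FROM sys.dm_fts_parser(N'"Félix Thecat 1123477 Félix123 felixthecat"', 1036, NULL, 0);

-- Assumed variant: same text, but the number starts with 0.
SELECT N'fr, 0123477' AS test_case, occurrence, display_term
FROM sys.dm_fts_parser(N'"Félix Thecat 0123477 Félix123 felixthecat"', 1036, NULL, 0);

-- Assumed comparison language: English word breaker (LCID 1033).
SELECT N'en, 1123477' AS test_case, occurrence, display_term
FROM sys.dm_fts_parser(N'"Félix Thecat 1123477 Félix123 felixthecat"', 1033, NULL, 0);
```

Comparing the display_term values across the three result sets shows whether the "f" is attached to the number only in the first case.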

SQL Server

Accepted answer
  1. Erland Sommarskog 111.5K Reputation points MVP
    2023-04-28T20:42:24.1966667+00:00

    That has to be a bug.

    You can report bugs here: https://feedback.azure.com/d365community/forum/04fe6ee0-3b25-ec11-b6e6-000d3a4f0da0

    However, reporting bugs there only lets Microsoft know about the issue. They are under no obligation to fix it, and if they do fix it, they do so at their own pace. If this is a blocking issue for you and you need a fix more urgently, you should open a support case. Be prepared for the support person to ask about the business impact of the problem.

1 additional answer

  1. AniyaTang-MSFT 12,446 Reputation points Microsoft Vendor
    2023-05-01T02:22:46.4733333+00:00

    Hi @Tc̄no John

    As Erland said, this should be a bug.

    You can submit the issue at https://feedback.azure.com/d365community/forum/04fe6ee0-3b25-ec11-b6e6-000d3a4f0da0.
    If many customers report the same issue, the product team may consider fixing it in the next version. Your feedback is valuable and helps us improve our products and the level of service we provide.

    Best regards,

    Aniya
