T-SQL Fulltext Search v. Irish Names with Apostrophes

Giuseppe Carafa 96 Reputation points
2020-10-12T15:05:11.763+00:00

I've created a couple of Full Text Indexes indexing certain bits of text.

The software is used in Ireland and the text in the full text indexes will therefore inevitably contain Irish names such as O'Connell, O'Brien and whatnot.

It looks to me that the full text index indexes these as whole words, i.e. O'Connell is seen as a single word.

As a result, a search for just Connell will not find the text that contains O'Connell

Have advised customer that they can search for Connell OR O'Connell but we received very little love in response.

Is there anything we can do here? We're not allowing leading wildcards as we want our database to perform.

Developer technologies | Transact-SQL
{count} votes

Accepted answer
  1. Erland Sommarskog 121.9K Reputation points MVP Volunteer Moderator
    2020-10-12T21:01:53.537+00:00

    I did some testing and with English as the language O'Brien is seen as a unit and "Brien" alone does not appear when I look in sys.dm_fts_index_keywords_by_document.

    If I set the language to Swedish or 0x0, I get hits when searching on Brien alone. But I can see that would have other repercussions for you.

    I assume that you have control over what is actually passed to SQL Server. That is, you could inject something to the search term? An idea is that you regularly run

    SELECT substring(display_term, 3, len(display_term) 
    FROM sys.dm_fts_index_keywords_by_document(db_id(),
                                  object_id('YourTable')) 
    WHERE display_term LIKE 'o''%'
    

    You save that list in your application, and when the user enters a search condition, you parse the string and an OR condition on behalf of the user.

    1 person found this answer helpful.
    0 comments No comments

2 additional answers

Sort by: Most helpful
  1. MelissaMa-MSFT 24,221 Reputation points
    2020-10-13T01:54:51.927+00:00

    Hi @Giuseppe Carafa ,

    A full-text search considers a word to be a string of characters without spaces or punctuation. The occurrence of a non-alphanumeric character can "break" a word during a search. Because the SQL Server full-text search is a word-based engine, punctuation generally is not considered and is ignored when searching the index.

    Sorry that I don't have FT installed locally and have not tested this.

    I did some searching and please try below options and check whether one of them is helpful to you. Thanks.

    One: To use ...WHERE name = 'O''Connel' or ...WHERE name LIKE 'o''%'.

    Two: To switch the language of the full text index to English (and re-index), or create your own custom word-breaker to allow apostrophes.
    Refer:Configure & manage word breakers & stemmers for search (SQL Server)

    Three: To add a new column with the "fullName" stripped of its apostrophes and created a FTS index on it. It was able to find both the O'Briens and the Obriens using a contains (fullSearch, '"obrien*"').

    Four: To change the language of the relevant column to Neutral (LCID=0):

    alter fulltext index on table_name disable  
    alter fulltext index on table_name drop (fts_column_name)  
    alter fulltext index on table_name add (fts_column_name LANGUAGE 0 )  
    alter fulltext index on table_name enable  
    

    Five: To try with SOUNDEX (Transact-SQL).

    Best regards
    Melissa


    If the answer is helpful, please click "Accept Answer" and upvote it.
    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

    1 person found this answer helpful.
    0 comments No comments

  2. Giuseppe Carafa 96 Reputation points
    2020-10-14T10:52:46.353+00:00

    If I could accept both suggested answers I would. Both may be useful to some readers, depending on their specific situation.

    Sadly neither can be implemented on our site, but as luck would have it we finally managed to convince the users that having to search for something like

    Connell OR O'Connell

    is not the end of the world. All's well that ends well, I guess.

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.