SQL prefix wildcard search on very large table

Phyxsius 26 Reputation points
2021-01-12T09:31:55.233+00:00

Hi,

I am looking for the best practice for finding text with a prefix wildcard within a very large table.
To give you an idea, the product table has approx. 2 million products. It has clustered as well as non-clustered indexes.

Example of possible values:

  1. Fountain
  2. Fountaincap
  3. Fountain in my garden
  4. Frozenfountain
  5. Dirt in my fountain water
  6. Frozenfountain2
  7. Fountai with a typo
  8. There is no better Fountain

The following statement takes too much time, more than 30 seconds:

SELECT * FROM Products WHERE ProductName LIKE '%Fountain%'

I was thinking of using the Full-Text Search service, but prefix wildcards are not allowed there:

SELECT * FROM Products WHERE CONTAINS(ProductName,'Fountain*')

So, any ideas for making search queries much faster would be much appreciated.
All ideas are welcome, if possible without additional software, to prevent dependencies.

Thank you in advance

Developer technologies | Transact-SQL

6 answers

  1. Olaf Helper 47,436 Reputation points
    2021-01-12T13:08:32.06+00:00

Several years ago I implemented a search index for a customer, an N-Gram Search Index. It works well and fast even with large data (in my case more than 13 million addresses), but requires a lot of additional data (space), see

    https://social.technet.microsoft.com/wiki/contents/articles/33419.sql-server-implementation-of-n-gram-search-index.aspx
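A minimal sketch of the idea, not the article's actual implementation: split each product name into fixed-length fragments (here trigrams), store them in an indexed side table, and use an index seek on a fragment of the search term to narrow the candidate set before verifying with LIKE. All table and column names below (ProductTrigrams, Products, ProductID, ProductName) are illustrative assumptions.

```sql
-- Hypothetical trigram side table; names are examples, not from the article.
CREATE TABLE ProductTrigrams
(
    Trigram   char(3) NOT NULL,
    ProductID int     NOT NULL,
    CONSTRAINT PK_ProductTrigrams PRIMARY KEY (Trigram, ProductID)
);

-- Populate with every 3-character substring of each product name,
-- using a numbers table to generate substring start positions.
INSERT INTO ProductTrigrams (Trigram, ProductID)
SELECT DISTINCT SUBSTRING(LOWER(p.ProductName), n.n, 3), p.ProductID
FROM Products AS p
JOIN (SELECT TOP (1000)
             ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b) AS n
  ON n.n <= LEN(p.ProductName) - 2;

-- '%Fountain%' then becomes: seek on a trigram of the search term to get
-- a small candidate set, then verify the full pattern with LIKE.
SELECT p.*
FROM Products AS p
WHERE p.ProductID IN (SELECT ProductID
                      FROM ProductTrigrams
                      WHERE Trigram = 'fou')
  AND p.ProductName LIKE '%Fountain%';
```

The seek on PK_ProductTrigrams replaces the full scan; the LIKE only runs against the few rows that contain the trigram. The trade-off, as noted above, is the storage for the fragment table and the cost of maintaining it on writes.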


  2. Erland Sommarskog 121.6K Reputation points MVP Volunteer Moderator
    2021-01-12T23:07:55.547+00:00

Olaf's idea seems to be the best. I only gave it a quick look, but it seems to be the same idea I once wrote about for a book quite a few years ago.

What I would like to add is that if you are using LIKE, you can reduce the execution time if you can use varchar and an SQL collation. With varchar, an SQL collation only has 255 characters, and the comparison rules are a lot simpler than the full Unicode monty that you get with a Windows collation.

Also, an index on the column helps if there are a lot of other big columns in the table, so there is less data to scan.

But it will still take several seconds. If you build your index as Olaf suggests, you can cut it to sub-second.
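To illustrate the two points above with a hedged sketch (the index name and the specific SQL collation are just examples; the benefit of the collation applies when the column itself is varchar):

```sql
-- A varchar column with an SQL (not Windows) collation uses simpler,
-- single-byte comparison rules, which makes LIKE scans cheaper:
ALTER TABLE Products
ALTER COLUMN ProductName varchar(200)
      COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL;

-- A narrow nonclustered index on the column means the unavoidable scan
-- reads only this index, not the whole wide table:
CREATE NONCLUSTERED INDEX IX_Products_ProductName
    ON Products (ProductName);

-- The query itself stays the same; it still scans, but scans less data:
SELECT ProductID, ProductName
FROM Products
WHERE ProductName LIKE '%Fountain%';
```

Changing the column type and collation can affect other queries and existing constraints, so test this on a copy first.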


  3. natig gurbanov 1 Reputation point
    2021-01-12T14:16:01.343+00:00

  4. Viorel 122.6K Reputation points
    2021-01-12T16:02:47.26+00:00

    If you want to research your own hand-made approaches, then maybe create a helper table (@sequences) like this:

    declare @table table ( id int, txt varchar(max) )
    declare @sequences table ( id int, seq varchar(max), rev bit )
    
    insert @table values
    ( 1, 'Fountain                   ' ),
    ( 2, 'Fountaincap                ' ),
    ( 3, 'Fountain in my garden      ' ),
    ( 4, 'Frozenfountain             ' ),
    ( 5, 'Dirt in my fountain water  ' ),
    ( 6, 'Frozenfountain2            ' ),
    ( 7, 'Fountai with a typo        ' ),
    ( 8, 'There is no better Fountain' )
    
    ;
    with W1 as 
    ( 
        select id, trim([value]) as word
        from 
        @table
        cross apply string_split( trim(txt), ' ')
    ),
    W2 as
    (
        select id, word as seq, 0 as rev
        from W1
        union 
        select id, reverse(word), 1
        from W1
    ),
    S as
    (
        select id, seq, rev
        from W2
        union all
        select id, right(seq, len(seq)-1), rev
        from S
        where len(seq) > 1
    )
    insert @sequences
    select * 
    from S
    option (maxrecursion 0)
    
    select *
    from @sequences
    order by id, rev
    

This sample uses local tables, but the data can be stored in a real table and indexed for 'xxx%' searches. Then, in order to find '%Fountain%', consider this query:

    declare @to_find varchar(max) = 'Fountain'
    
    select distinct t.*
    from @table t
    inner join @sequences s on s.id = t.id
    where (s.rev = 0 and s.seq like @to_find + '%')
    or (s.rev = 1 and s.seq like reverse(@to_find) + '%')
    

which can probably be optimized further.

The approach is less suitable if the content changes frequently.


  5. Tom Phillips 17,771 Reputation points
    2021-01-12T18:56:52.173+00:00

A table with 2 million rows is not really that many. Trying to pre-optimize this query may not result in any performance increase. If you have enough RAM, the whole table will be cached.

    What is the data type of ProductName? What version of SQL Server are you using?

