SQL prefix wildcard search on very large table

Phyxsius 26 Reputation points
2021-01-12T09:31:55.233+00:00

Hi,

I am looking for the best practice for finding text with a prefix wildcard within a very large table.
To give you an idea, the product table has approx. 2 million products. It has clustered as well as non-clustered indexes.

Example of possible values:

  1. Fountain
  2. Fountaincap
  3. Fountain in my garden
  4. Frozenfountain
  5. Dirt in my fountain water
  6. Frozenfountain2
  7. Fountai with a typo
  8. There is no better Fountain

The following statement takes too much time, more than 30 seconds:

SELECT * FROM Products WHERE ProductName LIKE '%Fountain%'

I was thinking of using the Full-Text Search service, but prefix wildcards are not allowed there:

SELECT * FROM Products WHERE CONTAINS(ProductName,'Fountain*')

So, any ideas for making search queries much faster would be much appreciated.
All ideas are welcome, if possible without additional software, to prevent dependencies.

Thank you in advance

Developer technologies | Transact-SQL

6 answers

  1. Olaf Helper 47,436 Reputation points
    2021-01-12T13:08:32.06+00:00

Several years ago I implemented a search index for a customer, an N-Gram Search Index. It works well and fast even with large data (in my case more than 13 million addresses), but requires a lot of additional data (space), see

    https://social.technet.microsoft.com/wiki/contents/articles/33419.sql-server-implementation-of-n-gram-search-index.aspx
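A minimal sketch of the idea, not the article's actual implementation: split each product name into fixed-length fragments (here trigrams), store them in an indexed side table, and use an index seek on a fragment of the search term to narrow the candidate set before verifying with LIKE. All table and column names below (ProductTrigrams, Products, ProductID, ProductName) are illustrative assumptions.

```sql
-- Hypothetical trigram side table; names are examples, not from the article.
CREATE TABLE ProductTrigrams
(
    Trigram   char(3) NOT NULL,
    ProductID int     NOT NULL,
    CONSTRAINT PK_ProductTrigrams PRIMARY KEY (Trigram, ProductID)
);

-- Populate with every 3-character substring of each product name,
-- using a numbers table to generate substring start positions.
INSERT INTO ProductTrigrams (Trigram, ProductID)
SELECT DISTINCT SUBSTRING(LOWER(p.ProductName), n.n, 3), p.ProductID
FROM Products AS p
JOIN (SELECT TOP (1000)
             ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b) AS n
  ON n.n <= LEN(p.ProductName) - 2;

-- '%Fountain%' then becomes: seek on a trigram of the search term to get
-- a small candidate set, then verify the full pattern with LIKE.
SELECT p.*
FROM Products AS p
WHERE p.ProductID IN (SELECT ProductID
                      FROM ProductTrigrams
                      WHERE Trigram = 'fou')
  AND p.ProductName LIKE '%Fountain%';
```

The seek on PK_ProductTrigrams replaces the full scan; the LIKE only runs against the few rows that contain the trigram. The trade-off, as noted above, is the storage for the fragment table and the cost of maintaining it on writes.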


  2. Erland Sommarskog 121.6K Reputation points MVP Volunteer Moderator
    2021-01-12T23:07:55.547+00:00

Olaf's idea seems to be the best. I only gave it a quick look, but it seems to be the same idea I once wrote about for a book quite a few years ago.

What I would like to add is that if you are using LIKE, you can reduce the execution time if you can use varchar and an SQL collation. With varchar, an SQL collation only has 255 characters, and the comparison rules are a lot simpler than the full Unicode monty that you get with a Windows collation.

Also, an index on the column helps if there are a lot of other big columns in the table, so there is less data to scan.

But it will still take several seconds. If you build your index as Olaf suggests, you can cut it to sub-second.
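To illustrate the two points above with a hedged sketch (the index name and the specific SQL collation are just examples; the benefit of the collation applies when the column itself is varchar):

```sql
-- A varchar column with an SQL (not Windows) collation uses simpler,
-- single-byte comparison rules, which makes LIKE scans cheaper:
ALTER TABLE Products
ALTER COLUMN ProductName varchar(200)
      COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL;

-- A narrow nonclustered index on the column means the unavoidable scan
-- reads only this index, not the whole wide table:
CREATE NONCLUSTERED INDEX IX_Products_ProductName
    ON Products (ProductName);

-- The query itself stays the same; it still scans, but scans less data:
SELECT ProductID, ProductName
FROM Products
WHERE ProductName LIKE '%Fountain%';
```

Changing the column type and collation can affect other queries and existing constraints, so test this on a copy first.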


  3. natig gurbanov 1 Reputation point
    2021-01-12T14:16:01.343+00:00

  4. Viorel 122.6K Reputation points
    2021-01-12T16:02:47.26+00:00

    If you want to research your own hand-made approaches, then maybe create a helper table (@sequences) like this:

    declare @table table ( id int, txt varchar(max) )
    declare @sequences table ( id int, seq varchar(max), rev bit )
    
    insert @table values
    ( 1, 'Fountain                   ' ),
    ( 2, 'Fountaincap                ' ),
    ( 3, 'Fountain in my garden      ' ),
    ( 4, 'Frozenfountain             ' ),
    ( 5, 'Dirt in my fountain water  ' ),
    ( 6, 'Frozenfountain2            ' ),
    ( 7, 'Fountai with a typo        ' ),
    ( 8, 'There is no better Fountain' )
    
    ;
    with W1 as 
    ( 
        select id, trim([value]) as word
        from 
        @table
        cross apply string_split( trim(txt), ' ')
    ),
    W2 as
    (
        select id, word as seq, 0 as rev
        from W1
        union 
        select id, reverse(word), 1
        from W1
    ),
    S as
    (
        select id, seq, rev
        from W2
        union all
        select id, right(seq, len(seq)-1), rev
        from S
        where len(seq) > 1
    )
    insert @sequences
    select * 
    from S
    option (maxrecursion 0)
    
    select *
    from @sequences
    order by id, rev
    

This sample uses local tables, but the data can be stored in a real table and indexed for 'xxx%' searches. Then, in order to find '%Fountain%', consider this query:

    declare @to_find varchar(max) = 'Fountain'
    
    select distinct t.*
    from @table t
    inner join @sequences s on s.id = t.id
    where (s.rev = 0 and s.seq like @to_find + '%')
    or (s.rev = 1 and s.seq like reverse(@to_find) + '%')
    

which can probably be optimized further.

The approach is less suitable if the content changes frequently.


  5. Tom Phillips 17,771 Reputation points
    2021-01-12T18:56:52.173+00:00

A table with 2 million rows is not really that many. Trying to pre-optimize this query may not result in any performance increase. If you have enough RAM, the whole table will be cached.

    What is the data type of ProductName? What version of SQL Server are you using?

