Weird behavior in FTS

areller 1 Reputation point
2021-06-02T20:02:49.733+00:00

I have a table with a column (FullPath) that can represent either a URL (e.g. https://example.com/abc/123) or a path in the file system (/home/user/file.txt)

When I use CONTAINS to perform full text search on that column, everything works great, until I start searching using dashes (-).

I know that dashes are known to cause issues with FTS (https://learn.microsoft.com/en-US/troubleshoot/sql/admin/dashes-ignored-search-full-text-msidxs-queries), but I've noticed a strange behavior, and I wanted to know if that strange behavior is deliberate and whether I can take advantage of it or not.

I've noticed that when I have a - in my search term, it would consistently detect records that are URLs (begin with something://), but not so much if it's a file system path

For example, if my search term is foo-bar,

This will be detected, https://example.com/foo-bar/file.txt

But this will not, /home/user/foo-bar/file.txt

It looks like all records that begin with the pattern of something:// will work nicely when the search term has -.

Is this behavior documented anywhere?

If so, would it be viable to take advantage of it and prepend something like file:// to all of the file system paths?

SQL Server
SQL Server
A family of Microsoft relational database management and analysis systems for e-commerce, line-of-business, and data warehousing solutions.
14,494 questions
Transact-SQL
Transact-SQL
A Microsoft extension to the ANSI SQL language that includes procedural programming, local variables, and various support functions.
4,703 questions
0 comments No comments
{count} votes

5 answers

Sort by: Most helpful
  1. Erland Sommarskog 120.2K Reputation points MVP
    2021-06-08T19:43:18.67+00:00

    If we overlook the haphazard behaviour which appears to be some special, there is actually a solution, if the aim is to parse paths of various sorts. And that is to use language 0 for the full-text index. 0 is language-neutral, and I have used this when building full-text index for Source Code.

    This scripts finds all foo-bar.

    DROP TABLE ullig
    CREATE TABLE ullig(id int NOT NULL, text nvarchar(800) NOT NULL,
                       CONSTRAINT pk_ullig PRIMARY KEY(id))
    
    INSERT ullig(id, text) 
       VALUES (1, 'https://example.com/foo-bar/file.txt'),
              (2, '/home/user/foo-bar/file.txt'),
              (3, '/home/user/foobar/file.txt'),
              (4, '/home/user/brand-new/file.txt'),
              (5, '/home/user/hunter/file.txt'),
              (12, 'home user foo-bar file.txt'),
              (13, 'home user foobar file.txt'),
              (14, 'home user brand-new file.txt'),
              (15, 'home user hunter file.txt')
    
    
    CREATE FULLTEXT INDEX ON ullig(text LANGUAGE 0)
       KEY INDEX pk_ullig
       ON katta
    go
    SELECT * FROM ullig
    WHERE CONTAINS(text, '"foo-bar"')
    

    By the way, the special may only be for English, language 1033. I also tried language 1053, i.e. Swedish, and it also found me all instances of for-bar.

    1 person found this answer helpful.
    0 comments No comments

  2. Erland Sommarskog 120.2K Reputation points MVP
    2021-06-02T21:53:20.957+00:00

    The hyphen has nothing to do with it. I can't say exactly what is going on, but I tried this:

    SELECT *FROM sys.fulltext_catalogs
    
    CREATE FULLTEXT CATALOG katta
    
    
    CREATE TABLE ullig(id int NOT NULL, text nvarchar(800) NOT NULL,
                       CONSTRAINT pk_ullig PRIMARY KEY(id))
    
    INSERT ullig(id, text) 
       VALUES (1, 'https://example.com/foo-bar/file.txt'),
              (2, '/home/user/foo-bar/file.txt')
    
    
    CREATE FULLTEXT INDEX ON ullig(text LANGUAGE 1033)
       KEY INDEX pk_ullig
       ON katta
    

    Then I ran for instance:

    SELECT * FROM ullig
    WHERE CONTAINS(text, '"bar"')
    

    And that only hits id = 1.

    This query shows which keywords the parser has identified and in how many documents.

    SELECT * FROM sys.dm_fts_index_keywords(db_id(), object_id('ullig'))
    

  3. EchoLiu-MSFT 14,616 Reputation points
    2021-06-03T08:18:26.16+00:00

    Hi @areller

    In a Chinese full-text search, due to the lack of the noise.chs file, the full-text search using different keywords will have different search results. But when it was made up, the problem was solved. So I guess whether it is a similar reason.

    Regards
    Echo

    0 comments No comments

  4. Ronen Ariely 15,201 Reputation points
    2021-06-07T05:54:52.49+00:00

    Hi,

    Just joining the discussion with some insights which are mostly guess and not something I am sure of

    I know that dashes are known to cause issues with FTS

    (1) The document which you present explain something else. It explain that when the dashes are used in the value that you search for then it is ignored but your case is that the dashes are in the values you search in. The document explicitly speak on "Using the dash (-) in a proximity search".

    By "search for" I mean the parameter in the CONTAINS; By "search in" I mean the value in the table

    This row seems to be returned too: (16, '/home/user/foo-bar/file.tx').

    (2) Notice that this will be the same in other known file's extensions. For example using ".zip" vs using ".zi". Seems like when we have a know file extension then it does not returned (not sure why).

    Here some more data

    truncate table ullig
    GO
    INSERT ullig(id, text) 
       VALUES (1, 'https://example.com/foo-bar/file.txt'),
              (2, '/home/user/foo-bar/file.txt'),
              (3, '/home/user/foobar/file.txt'),
              (4, 'home user foo-bar file.txt'),
              (5, 'home user foobar file.txt'),
              (6, '/home/user/foo-bar/file.tx'),
              (7, '/home/user/foo-bar/file.zip'),
              (8, '/home/user/foo-bar/file.zi'),
              (9, 'https://example.com/foo-bar/file')
    GO
    INSERT ullig(id, text) 
       VALUES (11, 'https://example.com/foo-bar/file.txt'),
              (12, 'ronen://home/user/foo-bar/file.txt'),
              (13, 'ronen://home/user/foobar/file.txt'),
              (14, 'ronen://home user foo-bar file.txt'),
              (15, 'ronen://home user foobar file.txt'),
              (16, 'ronen://home/user/foo-bar/file.tx'),
              (17, 'ronen://home/user/foo-bar/file.zip'),
              (18, 'ronen://home/user/foo-bar/file.zi'),
              (19, 'https://example.com/foo-bar/file')
    GO
    

    seems in first glance like the behavior is like this:

    when the text looks like a relative path which point to a real file (file with known extension) then "foo-bar" is considered as one word "foo bar", and therefore it is NOT returned

    but for example, when the text looks like full path (start with "*://") or the extension of the last word does not fit a known file name, then the phrase "foo bar" considered as two words and it returned

    0 comments No comments

  5. hilary cotter 11 Reputation points
    2021-06-07T21:33:32.533+00:00

    This will explain why you see this:

    select * from sys.dm_fts_parser('https://example.com/foo-bar/file.txt', 1033, 0, 0)  
    
    select * from sys.dm_fts_parser('/home/user/foo-bar/file.txt', 1033, 0, 0)  
    

    The parser breaks the url into tokens but treats foo-bar as a single token. When it is a file path, it breaks foo-bar at the hyphen. Why this is, I can't explain, but it was a decision made by Microsoft perhaps at a client request.

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.