I'm trying to index HTML content. For this I have built a custom analyzer:
{
  "name": "html",
  "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
  "tokenizer": "standard_v2",
  "tokenFilters": [
    "lowercase"
  ],
  "charFilters": [
    "html_strip"
  ]
}
and assigned it to my HTML field
{
  "name": "htmlTest",
  "type": "Edm.String",
  "facetable": false,
  "filterable": false,
  "key": false,
  "retrievable": true,
  "searchable": true,
  "sortable": false,
  "analyzer": "html",
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "synonymMaps": [],
  "fields": []
}
When I POST this to the Analyze Text endpoint
{
  "text": "<p><strong>bold</strong> <i>italic</i> normal</p>",
  "analyzer": "html"
}
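For reference, this is a minimal sketch of how I make that call (the service name, index name, api-version and key below are placeholders, not my actual values):

import requests

SERVICE = "my-search-service"   # placeholder
INDEX = "my-index"              # placeholder
API_VERSION = "2020-06-30"      # adjust to the api-version you target
ADMIN_KEY = "<admin-key>"       # placeholder

url = f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/analyze?api-version={API_VERSION}"
body = {
    "text": "<p><strong>bold</strong> <i>italic</i> normal</p>",
    "analyzer": "html",
}

resp = requests.post(url, json=body, headers={"api-key": ADMIN_KEY})
resp.raise_for_status()

# Print each token with the offsets the analyzer reports
for token in resp.json()["tokens"]:
    print(token["token"], token["startOffset"], token["endOffset"])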
I can see that it extracts the contents correctly, but I'm not sure about the endOffset values. This is what I get back:
"tokens": [
{
"token": "bold",
"startOffset": 11,
"endOffset": 24,
"position": 0
},
{
"token": "italic",
"startOffset": 28,
"endOffset": 38,
"position": 1
},
{
"token": "normal",
"startOffset": 39,
"endOffset": 45,
"position": 2
}
]
If you look at the first token, its endOffset is 24, which is not what I expected: it spans the closing </strong> tag as well. This causes problems when I request hit highlighting, because the highlight wraps that closing tag too:
<p><strong><em>bold</strong></em> <i>italic</i> normal</p>
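For reference, the highlight query is roughly the following sketch (again, the service name, index name, api-version and key are placeholders; the <em> tags are just the defaults made explicit):

import requests

SERVICE = "my-search-service"   # placeholder
INDEX = "my-index"              # placeholder
API_VERSION = "2020-06-30"      # adjust to the api-version you target
QUERY_KEY = "<query-key>"       # placeholder

url = f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs/search?api-version={API_VERSION}"
body = {
    "search": "bold",
    "highlight": "htmlTest",      # field(s) to return highlights for
    "highlightPreTag": "<em>",    # these are the defaults, shown explicitly
    "highlightPostTag": "</em>",
}

resp = requests.post(url, json=body, headers={"api-key": QUERY_KEY})
resp.raise_for_status()

# Each hit carries its highlight fragments under "@search.highlights"
for doc in resp.json()["value"]:
    print(doc.get("@search.highlights", {}).get("htmlTest"))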
Is there any way I can improve my analyzer?