I'm trying to index HTML content. For this I have built a custom analyzer:
{
  "name": "html",
  "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
  "tokenizer": "standard_v2",
  "tokenFilters": [
    "lowercase"
  ],
  "charFilters": [
    "html_strip"
  ]
}
and assigned it to my HTML field
{
  "name": "htmlTest",
  "type": "Edm.String",
  "facetable": false,
  "filterable": false,
  "key": false,
  "retrievable": true,
  "searchable": true,
  "sortable": false,
  "analyzer": "html",
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "synonymMaps": [],
  "fields": []
}
When I POST this to the Analyze Text endpoint
{
  "text": "<p><strong>bold</strong> <i>italic</i> normal</p>",
  "analyzer": "html"
}
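For reference, this is a minimal sketch of how I make that call (the service name, index name, api-version and key below are placeholders, not my actual values):

import requests

SERVICE = "my-search-service"   # placeholder
INDEX = "my-index"              # placeholder
API_VERSION = "2020-06-30"      # adjust to the api-version you target
ADMIN_KEY = "<admin-key>"       # placeholder

url = f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/analyze?api-version={API_VERSION}"
body = {
    "text": "<p><strong>bold</strong> <i>italic</i> normal</p>",
    "analyzer": "html",
}

resp = requests.post(url, json=body, headers={"api-key": ADMIN_KEY})
resp.raise_for_status()

# Print each token with the offsets the analyzer reports
for token in resp.json()["tokens"]:
    print(token["token"], token["startOffset"], token["endOffset"])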
I can see that it extracts the contents correctly, but I'm not sure about the endOffset values. This is what I get back:
"tokens": [
{
"token": "bold",
"startOffset": 11,
"endOffset": 24,
"position": 0
},
{
"token": "italic",
"startOffset": 28,
"endOffset": 38,
"position": 1
},
{
"token": "normal",
"startOffset": 39,
"endOffset": 45,
"position": 2
}
]
If you look at the first token, its endOffset is 24, which is not what I expected: it spans the closing </strong> tag as well. This causes problems when I request hit highlighting, because the highlight wraps that closing tag too:
<p><strong><em>bold</strong></em> <i>italic</i> normal</p>
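For reference, the highlight query is roughly the following sketch (again, the service name, index name, api-version and key are placeholders; the <em> tags are just the defaults made explicit):

import requests

SERVICE = "my-search-service"   # placeholder
INDEX = "my-index"              # placeholder
API_VERSION = "2020-06-30"      # adjust to the api-version you target
QUERY_KEY = "<query-key>"       # placeholder

url = f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs/search?api-version={API_VERSION}"
body = {
    "search": "bold",
    "highlight": "htmlTest",      # field(s) to return highlights for
    "highlightPreTag": "<em>",    # these are the defaults, shown explicitly
    "highlightPostTag": "</em>",
}

resp = requests.post(url, json=body, headers={"api-key": QUERY_KEY})
resp.raise_for_status()

# Each hit carries its highlight fragments under "@search.highlights"
for doc in resp.json()["value"]:
    print(doc.get("@search.highlights", {}).get("htmlTest"))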
Is there any way I can improve my analyzer?