Azure Search - CharFilter Not Replacing Text from JSON Metadata in Index

Jesús López 5 Reputation points
2024-05-08T11:43:20.6633333+00:00

I am using Azure Search to index content from SharePoint, and some text is not being removed from JSON metadata even though I use the html_strip, mapping, and pattern_replace character filters. Here is a detailed breakdown of my setup and the issue:

Index Configuration:

The data is fetched from SharePoint and contains fields such as edSubProcess, which are stored as JSON strings with HTML content.

The mappings and patterns in the filters shown here are not the final ones. They are placeholders I used to test whether any text gets replaced at all. I have tried both plain text and regular expressions.

Here is the index definition:

{
    "name" : "sharepoint-car-index",
    "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false },
        { "name": "edAreaSection", "type": "Edm.String", "key": false, "searchable": true, "filterable": true, "sortable": true, "facetable": true, "analyzer": "custom_cgl_analyzer" },
        { "name": "edSubProcess", "type": "Edm.String", "retrievable": true, "key": false, "searchable": true, "filterable": true, "sortable": false, "facetable": false, "analyzer": "custom_cgl_analyzer"},
        { "name": "edProcessBlock", "type": "Edm.String", "key": false, "searchable": true, "filterable": true, "sortable": false, "facetable": false, "analyzer": "custom_cgl_analyzer" }
    ],
    "analyzers": [
        {
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "name": "custom_cgl_analyzer",
            "charFilters": [ "md_mapper", "md_replacer", "html_strip" ],
            "tokenizer": "standard_v2"
        }
    ],
    "charFilters": [
        {
            "name": "md_mapper",
            "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
            "mappings": [
                "004=>YYY",
                "de=>try",
                "\\u0020=>"
            ]
        },
        {
            "name": "md_replacer",
            "@odata.type": "#Microsoft.Azure.Search.PatternReplaceCharFilter",
            "pattern": "00[0-9]",
            "replacement": "ZZZ"
        }
    ]
}
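
To make the intent of these test filters concrete, here is a rough Python emulation of what the md_mapper and md_replacer definitions would be expected to produce for a given string. This is only a sketch: it is not Azure's implementation, the function names are hypothetical, greedy longest-match behavior for the mapping filter is an assumption, and html_strip is left out.

```python
import re

# Values copied from the md_mapper and md_replacer definitions in the index.
MAPPINGS = {"004": "YYY", "de": "try", " ": ""}  # "\\u0020=>" maps a space to nothing
PATTERN = re.compile(r"00[0-9]")
REPLACEMENT = "ZZZ"


def apply_mappings(text, mappings):
    """Replace mapped substrings left to right (assumed: longest key wins)."""
    keys = sorted(mappings, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mappings[key])
                i += len(key)
                break
        else:
            # No mapping starts at this position, so keep the character.
            out.append(text[i])
            i += 1
    return "".join(out)


def expected_filtered(text):
    """Apply the mapping first, then the pattern replacement, as in the analyzer."""
    return PATTERN.sub(REPLACEMENT, apply_mappings(text, MAPPINGS))


print(expected_filtered("Bloque 004 - Traffic"))  # BloqueYYY-Traffic
```

With these placeholder rules, the stored edProcessBlock value above would be expected to come back visibly changed if the filters affected the stored text.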

Result


"edAreaSection": "{\r\n  \"Label\": \"Logistic\",\r\n  \"TermGuid\": \"f04123b7-2e9b-34hh-b189-3cce3e20be7f\",\r\n  \"WssId\": 42\r\n}"

"edSubProcess": "{\r\n  \"Label\": \"004.03. Traffic\",\r\n  \"TermGuid\": \"e8d9fc6d-769c-4735-be6f-abb5039b827c\",\r\n  \"WssId\": 146\r\n}"
      
"edProcessBlock": "Bloque 004 - Traffic"

Problem:

Despite the filters, nothing is stripped from the metadata: the values are still stored in Azure Search with all elements intact.

I need to clean up the JSON so that I can define the field as a complex type and map the string to extract the Label field.

Objective:

"edSubProcess": "004.03. Traffic"
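
The transformation in this objective can be sketched in Python as a step applied to each value before it reaches the index. This is a minimal illustration under assumptions, not part of the current setup: the helper name extract_label is hypothetical, and it assumes the field holds either a JSON object with a Label key or an ordinary string.

```python
import json


def extract_label(raw):
    """Return the Label from a JSON-encoded term value, or the input unchanged."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Not JSON (for example "Bloque 004 - Traffic"), so leave it as is.
        return raw
    if isinstance(parsed, dict) and "Label" in parsed:
        return parsed["Label"]
    return raw


# Sample value copied from the edSubProcess result above.
sample = (
    '{\r\n  "Label": "004.03. Traffic",\r\n'
    '  "TermGuid": "e8d9fc6d-769c-4735-be6f-abb5039b827c",\r\n'
    '  "WssId": 146\r\n}'
)

print(extract_label(sample))  # 004.03. Traffic
```

Plain values such as the edProcessBlock string pass through unchanged, so the same step could be applied to all three metadata fields.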

Questions:

Are character filters appropriate for removing text from within strings?

Are there any known limitations or considerations when applying the filters to fields that store JSON as a string?

What alternative approaches can I consider to strip HTML/MD content from JSON strings before indexing in Azure Search?

Any insights or recommendations on how to resolve this issue would be greatly appreciated!

I configured my Azure Search index to use character filters in the hope that they would remove unwanted characters from the strings in metadata fields such as edSubProcess. Specifically, I expected the filters to process the JSON string values and remove the matched text, leaving clean text behind.

Can someone help me clean this metadata?

Thanks!

Azure AI Search
Azure AI services