Document cracking for HTML files

Hub=-For-=Dev 80 Reputation points
2025-11-26T12:56:05.1266667+00:00

I am currently working on a POC with Azure AI Search, extracting the textual content of an HTML file and populating it into a search index.

However, the content extracted by the indexer during the document cracking stage is poorly formatted: it contains many \n and \t characters and long runs of blank space.


Q1) Apart from creating a custom web API and invoking it from a skillset, is there any other preferred or easier method to deal with this situation?

Q2) Since skillsets run after the document cracking stage, is there any way to perform data cleansing before document cracking?

Thanks in advance.

Azure AI Search

Answer accepted by question author
  Sina Salam 26,661 Reputation points Volunteer Moderator
    2025-11-27T15:44:34.53+00:00

    Hello Hub=-For-=Dev,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand the issue you are seeing with document cracking for HTML files. Regarding your questions, here are three practicable approaches; pick the one that fits your needs:

    Option A: Index/analyzer approach - good if you only need better search behavior (not a cleaned stored content value)

    Reason: Create a custom analyzer with char filters (e.g., pattern_replace and html_strip) on the index field so that tokens ignore \n, \t, and extra whitespace. This improves search matching and ranking, but it does not change the stored content value returned by search/document retrieval. Use this when you only need queries to behave well.
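
    As a minimal sketch of this approach (assuming the azure-search-documents Python SDK; exact model and keyword names such as tokenizer_name can vary between SDK versions, and the index, field, and analyzer names here are made up):

    ```python
    # Sketch: a custom analyzer that strips HTML tags and collapses whitespace
    # at tokenization time. Assumes azure-search-documents >= 11.4; names are
    # hypothetical and keyword spellings may differ in older SDK versions.
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents.indexes import SearchIndexClient
    from azure.search.documents.indexes.models import (
        CharFilterName,
        CustomAnalyzer,
        LexicalTokenizerName,
        PatternReplaceCharFilter,
        SearchableField,
        SearchFieldDataType,
        SearchIndex,
        SimpleField,
    )

    # Collapse any run of whitespace (\n, \t, spaces) into a single space.
    collapse_ws = PatternReplaceCharFilter(
        name="collapse_whitespace", pattern=r"\s+", replacement=" "
    )

    clean_html_analyzer = CustomAnalyzer(
        name="clean_html",
        tokenizer_name=LexicalTokenizerName.STANDARD,
        # html_strip removes tags; the custom filter then normalizes whitespace.
        char_filters=[CharFilterName.HTML_STRIP, collapse_ws.name],
    )

    index = SearchIndex(
        name="html-poc-index",  # hypothetical index name
        fields=[
            SimpleField(name="id", type=SearchFieldDataType.String, key=True),
            SearchableField(name="content", analyzer_name=clean_html_analyzer.name),
        ],
        analyzers=[clean_html_analyzer],
        char_filters=[collapse_ws],
    )

    client = SearchIndexClient("https://<your-service>.search.windows.net",
                               AzureKeyCredential("<admin-key>"))
    client.create_or_update_index(index)
    ```

    Keep in mind this only changes how the field is tokenized for matching and ranking; the document you retrieve still contains the raw extracted text.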

    Option B: In-pipeline clean (recommended for both search + returning cleaned content) - DocumentExtractionSkill + small custom skill that normalizes whitespace

    The idea is to let the indexer perform document cracking as usual (so you still get Document Extraction behavior such as HTML tag stripping), then run a simple custom skill (an Azure Function or web API) in the skillset. The skill receives /document/content (or the DocumentExtractionSkill output), runs a deterministic regex that collapses runs of \r, \n, \t, and multiple spaces into single spaces and trims leading/trailing whitespace, and then writes the cleaned string to an output field mapped into your index.

    This is a good tradeoff because skills run after cracking (that's the pipeline design): use DocumentExtractionSkill (or the indexer's default extraction) to obtain text, then normalize it with a small custom skill. This produces a cleaned field you can both search and display. It's explicit, robust, and under your control.
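
    Here is a minimal sketch of such a custom skill as an Azure Function (Python v1 programming model). It implements the custom Web API skill contract; the "text" and "cleanedText" field names are assumptions and must match the inputs/outputs you declare in the skillset:

    ```python
    # Sketch: an Azure Function implementing the custom Web API skill contract.
    # Reads each record's "text" input, collapses whitespace, and returns the
    # result as "cleanedText". Field names must match the skillset definition.
    import json
    import re

    import azure.functions as func

    WHITESPACE = re.compile(r"\s+")  # matches runs of \r, \n, \t, and spaces


    def normalize(text: str) -> str:
        # Collapse every whitespace run to a single space and trim the ends.
        return WHITESPACE.sub(" ", text).strip()


    def main(req: func.HttpRequest) -> func.HttpResponse:
        body = req.get_json()
        results = []
        for record in body.get("values", []):
            text = (record.get("data") or {}).get("text") or ""
            results.append({
                "recordId": record["recordId"],
                "data": {"cleanedText": normalize(text)},
                "errors": None,
                "warnings": None,
            })
        return func.HttpResponse(json.dumps({"values": results}),
                                 mimetype="application/json")
    ```

    In the skillset you would map /document/content to the skill's text input and use an output field mapping to write cleanedText into an index field.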

    Option C: Pre-ingest cleaning (if you can change the ingestion flow) - best if you control the source pipeline. Clean the HTML (strip tags and normalize whitespace) before the indexer runs: for example, when you upload to Blob Storage, run a pre-processor that writes an additional metadata field or a parallel blob containing the cleaned content. Then configure the indexer to read that cleaned blob/metadata field (via a field mapping) instead of the raw file content.
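
    A minimal pre-processor sketch under that assumption (the connection string, container names, and the .html-to-.txt naming convention are all hypothetical; it uses the beautifulsoup4 and azure-storage-blob packages):

    ```python
    # Sketch: pre-ingest cleaning before the indexer ever sees the document.
    # Strips HTML tags, normalizes whitespace, and uploads the cleaned text
    # as a parallel blob that the indexer reads instead of the raw HTML.
    import re

    from azure.storage.blob import BlobServiceClient
    from bs4 import BeautifulSoup


    def clean_html(html: str) -> str:
        # get_text(" ") drops tags; the regex collapses leftover whitespace.
        text = BeautifulSoup(html, "html.parser").get_text(" ")
        return re.sub(r"\s+", " ", text).strip()


    service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    source = service.get_container_client("raw-html")       # hypothetical
    cleaned = service.get_container_client("cleaned-text")  # hypothetical

    for blob in source.list_blobs():
        if not blob.name.endswith(".html"):
            continue
        html = source.download_blob(blob.name).readall().decode("utf-8")
        # e.g. "page.html" -> "page.txt" in the cleaned container
        cleaned.upload_blob(blob.name[:-5] + ".txt", clean_html(html),
                            overwrite=True)
    ```

    You would then point the indexer's data source at the cleaned container so document cracking starts from already-clean text.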

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close out the thread by upvoting and accepting this answer if it was helpful.

