Document cracking for HTML files

Hub=-For-=Dev 80 Reputation points
2025-11-26T12:56:05.1266667+00:00

I am currently working on a POC with Azure AI Search, extracting the textual content of an HTML file and populating it into a search index.

However, the content extracted by the indexer during the document cracking stage is poorly formatted: it contains many \n and \t characters and long runs of blank space.


Q1) Apart from creating a custom web API and invoking it from a skillset, is there any other preferred or easier method to deal with this situation?

Q2) Since skillsets run after the document cracking stage, is there any way to perform data cleansing before document cracking?

Thanks in advance.

Azure AI Search

Answer accepted by question author
  Sina Salam 26,661 Reputation points Volunteer Moderator
    2025-11-27T15:44:34.53+00:00

    Hello Hub=-For-=Dev,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand the issue you are seeing with document cracking for HTML files. Regarding your questions, here are three practicable approaches; pick the one that fits your needs:

    Option A: Index/analyzer approach - good if you only need better search behavior (not a cleaned stored content value)

    Reason: Create a custom analyzer with char filters (e.g., pattern_replace and html_strip) on the index field so that tokens ignore \n, \t, and extra whitespace. This improves search matching and ranking, but it does not change the stored content value returned by search/document retrieval. Use this when you only need queries to behave well.
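
    As a minimal sketch of this approach (assuming the azure-search-documents Python SDK; exact model and keyword names such as tokenizer_name can vary between SDK versions, and the index, field, and analyzer names here are made up):

    ```python
    # Sketch: a custom analyzer that strips HTML tags and collapses whitespace
    # at tokenization time. Assumes azure-search-documents >= 11.4; names are
    # hypothetical and keyword spellings may differ in older SDK versions.
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents.indexes import SearchIndexClient
    from azure.search.documents.indexes.models import (
        CharFilterName,
        CustomAnalyzer,
        LexicalTokenizerName,
        PatternReplaceCharFilter,
        SearchableField,
        SearchFieldDataType,
        SearchIndex,
        SimpleField,
    )

    # Collapse any run of whitespace (\n, \t, spaces) into a single space.
    collapse_ws = PatternReplaceCharFilter(
        name="collapse_whitespace", pattern=r"\s+", replacement=" "
    )

    clean_html_analyzer = CustomAnalyzer(
        name="clean_html",
        tokenizer_name=LexicalTokenizerName.STANDARD,
        # html_strip removes tags; the custom filter then normalizes whitespace.
        char_filters=[CharFilterName.HTML_STRIP, collapse_ws.name],
    )

    index = SearchIndex(
        name="html-poc-index",  # hypothetical index name
        fields=[
            SimpleField(name="id", type=SearchFieldDataType.String, key=True),
            SearchableField(name="content", analyzer_name=clean_html_analyzer.name),
        ],
        analyzers=[clean_html_analyzer],
        char_filters=[collapse_ws],
    )

    client = SearchIndexClient("https://<your-service>.search.windows.net",
                               AzureKeyCredential("<admin-key>"))
    client.create_or_update_index(index)
    ```

    Keep in mind this only changes how the field is tokenized for matching and ranking; the document you retrieve still contains the raw extracted text.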

    Option B: In-pipeline clean (recommended for both search + returning cleaned content) - DocumentExtractionSkill + small custom skill that normalizes whitespace

    The idea is to let the indexer perform document cracking as usual (so you still get Document Extraction behavior such as HTML tag stripping), then run a simple custom skill (an Azure Function or web API) in the skillset. The skill receives /document/content (or the DocumentExtractionSkill output), runs a deterministic regex that collapses runs of \r, \n, \t, and multiple spaces into single spaces and trims leading/trailing whitespace, and then writes the cleaned string to an output field mapped into your index.

    This is a good tradeoff because skills run after cracking (that's the pipeline design): use DocumentExtractionSkill (or the indexer's default extraction) to obtain text, then normalize it with a small custom skill. This produces a cleaned field you can both search and display. It's explicit, robust, and under your control.
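
    Here is a minimal sketch of such a custom skill as an Azure Function (Python v1 programming model). It implements the custom Web API skill contract; the "text" and "cleanedText" field names are assumptions and must match the inputs/outputs you declare in the skillset:

    ```python
    # Sketch: an Azure Function implementing the custom Web API skill contract.
    # Reads each record's "text" input, collapses whitespace, and returns the
    # result as "cleanedText". Field names must match the skillset definition.
    import json
    import re

    import azure.functions as func

    WHITESPACE = re.compile(r"\s+")  # matches runs of \r, \n, \t, and spaces


    def normalize(text: str) -> str:
        # Collapse every whitespace run to a single space and trim the ends.
        return WHITESPACE.sub(" ", text).strip()


    def main(req: func.HttpRequest) -> func.HttpResponse:
        body = req.get_json()
        results = []
        for record in body.get("values", []):
            text = (record.get("data") or {}).get("text") or ""
            results.append({
                "recordId": record["recordId"],
                "data": {"cleanedText": normalize(text)},
                "errors": None,
                "warnings": None,
            })
        return func.HttpResponse(json.dumps({"values": results}),
                                 mimetype="application/json")
    ```

    In the skillset you would map /document/content to the skill's text input and use an output field mapping to write cleanedText into an index field.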

    Option C: Pre-ingest cleaning (if you can change the ingestion flow) - best if you control the source pipeline. Clean the HTML (strip tags and normalize whitespace) before the indexer runs: for example, when you upload to Blob Storage, run a pre-processor that writes an additional metadata field or a parallel blob containing the cleaned content. Then configure the indexer to read that cleaned blob/metadata field (via a field mapping) instead of the raw file content.
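
    A minimal pre-processor sketch under that assumption (the connection string, container names, and the .html-to-.txt naming convention are all hypothetical; it uses the beautifulsoup4 and azure-storage-blob packages):

    ```python
    # Sketch: pre-ingest cleaning before the indexer ever sees the document.
    # Strips HTML tags, normalizes whitespace, and uploads the cleaned text
    # as a parallel blob that the indexer reads instead of the raw HTML.
    import re

    from azure.storage.blob import BlobServiceClient
    from bs4 import BeautifulSoup


    def clean_html(html: str) -> str:
        # get_text(" ") drops tags; the regex collapses leftover whitespace.
        text = BeautifulSoup(html, "html.parser").get_text(" ")
        return re.sub(r"\s+", " ", text).strip()


    service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    source = service.get_container_client("raw-html")       # hypothetical
    cleaned = service.get_container_client("cleaned-text")  # hypothetical

    for blob in source.list_blobs():
        if not blob.name.endswith(".html"):
            continue
        html = source.download_blob(blob.name).readall().decode("utf-8")
        # e.g. "page.html" -> "page.txt" in the cleaned container
        cleaned.upload_blob(blob.name[:-5] + ".txt", clean_html(html),
                            overwrite=True)
    ```

    You would then point the indexer's data source at the cleaned container so document cracking starts from already-clean text.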

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close out the thread by upvoting and accepting this answer if it was helpful.

