Hello Hub-For-Dev,
Welcome to Microsoft Q&A, and thank you for posting your question here.
I understand that you are doing document cracking for HTML files and want to clean up the \n/\t characters and extra whitespace left in the extracted content. Regarding your questions:
Here are three practical approaches; pick the one that fits your needs:
Option A: Index/analyzer approach - good if you only need better search behavior (not a cleaned stored content value)
Create a custom analyzer with char filters (e.g., pattern_replace and html_strip) on the index field so that tokens ignore \n/\t and extra whitespace. This improves search matching and ranking, but it does not change the stored content value returned by search or document retrieval. Use this when you only need queries to behave well; a sketch of the index definition follows.
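For reference, here is a minimal sketch of such an index definition sent through the REST API. The service URL, key, index name (html-docs), and the analyzer/char filter names are placeholder assumptions, not values from your setup:

```python
import requests

SERVICE = "https://<your-service>.search.windows.net"  # placeholder service URL
API_KEY = "<admin-api-key>"                            # placeholder admin key
INDEX_NAME = "html-docs"                               # hypothetical index name

# html_strip (predefined) removes tags; the pattern_replace char filter
# collapses runs of \r/\n/\t/spaces into one space before tokenization.
# This affects matching and ranking only, never the stored field value.
index_definition = {
    "name": INDEX_NAME,
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {
            "name": "content",
            "type": "Edm.String",
            "searchable": True,
            "analyzer": "html_clean_analyzer",
        },
    ],
    "charFilters": [
        {
            "@odata.type": "#Microsoft.Azure.Search.PatternReplaceCharFilter",
            "name": "collapse_whitespace",
            "pattern": "\\s+",
            "replacement": " ",
        }
    ],
    "analyzers": [
        {
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "name": "html_clean_analyzer",
            "tokenizer": "standard_v2",
            "charFilters": ["html_strip", "collapse_whitespace"],
        }
    ],
}

resp = requests.put(
    f"{SERVICE}/indexes/{INDEX_NAME}?api-version=2023-11-01",
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json=index_definition,
)
resp.raise_for_status()
```

Keep in mind that changing a field's analyzer is a breaking change, so you would need to rebuild (or drop and recreate) the index.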
Option B: In-pipeline cleaning (recommended if you need both search and a cleaned returned content value) - DocumentExtractionSkill plus a small custom skill that normalizes whitespace
The idea is that you let the indexer perform document cracking (so you still keep Document Extraction behavior such as HTML tag stripping), then run a simple custom skill (an Azure Function or Web API) in the skillset. The skill receives /document/content (or the DocumentExtractionSkill output), runs a deterministic regex that collapses runs of \r/\n/\t and multiple spaces into single spaces and trims leading/trailing whitespace, and then writes that cleaned string to an output field mapped into your index. A sketch of such a skill follows.
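Here is a minimal sketch of that custom skill as a Python Azure Function (v2 programming model). The route and the text/cleanedText field names are placeholders you would mirror in the skillset definition; the values/recordId envelope is the standard custom Web API skill contract:

```python
import json
import re

import azure.functions as func

app = func.FunctionApp()

# Collapse any run of whitespace (\r, \n, \t, repeated spaces) to one space.
_WHITESPACE = re.compile(r"\s+")

def normalize(text: str) -> str:
    return _WHITESPACE.sub(" ", text).strip()

@app.route(route="clean-text", auth_level=func.AuthLevel.FUNCTION)
def clean_text(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    results = []
    for record in body.get("values", []):
        text = (record.get("data") or {}).get("text") or ""
        results.append({
            "recordId": record["recordId"],           # must be echoed back
            "data": {"cleanedText": normalize(text)},
            "errors": [],
            "warnings": [],
        })
    return func.HttpResponse(
        json.dumps({"values": results}),
        mimetype="application/json",
    )
```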
This is a good tradeoff because skills run after cracking (that is the pipeline design): use DocumentExtractionSkill (or the indexer's default extraction) to obtain text, then normalize it with the small custom skill. This produces a cleaned field you can both search and display. It's explicit, robust, and under your control; the wiring is sketched below.
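To wire it up, you would PUT something like the following to the skillsets and indexers REST endpoints (same pattern as the index call in Option A). The skillset name, function URL, and cleaned_content field are hypothetical:

```python
# Skillset: call the function above once per cracked document.
skillset = {
    "name": "clean-html-skillset",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
            "name": "whitespace-normalizer",
            "uri": "https://<your-function>.azurewebsites.net/api/clean-text?code=<key>",
            "context": "/document",
            "inputs": [
                {"name": "text", "source": "/document/content"}
            ],
            "outputs": [
                {"name": "cleanedText", "targetName": "cleanedText"}
            ],
        }
    ],
}

# Indexer: map the skill output into a searchable, retrievable index field.
indexer_output_mappings = {
    "outputFieldMappings": [
        {
            "sourceFieldName": "/document/cleanedText",
            "targetFieldName": "cleaned_content",  # hypothetical index field
        }
    ]
}
```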
Option C: Pre-ingest cleaning (if you can change the ingestion flow) - best if you control the source pipeline. Clean the HTML (strip tags and normalize whitespace) before the indexer runs: for example, when you upload to Blob Storage, run a pre-processor that writes an additional metadata field or a parallel blob with the cleaned content. Then configure the indexer to read that cleaned blob or metadata field (via a field mapping) instead of the raw file content; a sketch of such a pre-processor follows.
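A minimal sketch of that pre-processor, assuming a hypothetical cleaned-docs container that your indexer's data source points at. It strips tags with the standard library and uploads the cleaned text with azure-storage-blob:

```python
import re
from html.parser import HTMLParser

from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

class _TextExtractor(HTMLParser):
    """Collects text content while dropping all HTML tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def clean_html(raw_html: str) -> str:
    parser = _TextExtractor()
    parser.feed(raw_html)
    # Collapse \r/\n/\t and repeated spaces into single spaces, then trim.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

def upload_cleaned(conn_str: str, name: str, raw_html: str) -> None:
    svc = BlobServiceClient.from_connection_string(conn_str)
    # Write the cleaned text to the parallel container the indexer reads from.
    blob = svc.get_blob_client(container="cleaned-docs", blob=f"{name}.txt")
    blob.upload_blob(clean_html(raw_html), overwrite=True)
```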
I hope this is helpful! Do not hesitate to let me know if you have any other questions or need clarification.
Please don't forget to close the thread by upvoting and accepting this as an answer if it helped.