Thanks for the follow-up and for sharing the workaround.
Based on your description, the issue appears to be related to the indexer limits of Azure Cognitive Search. As mentioned in this doc, indexers limit how much text they extract from a single document, and that limit depends on the pricing tier. If a document is truncated, a warning appears in the indexer status response.
For the Basic tier, the limit is 64,000 characters. Your original document has 1.8M characters, far above that limit, so, as you pointed out, the indexer most likely could not extract all of the text from the document.
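To make the comparison concrete, here is a small Python sketch that checks a character count against the per-tier limits quoted at the end of this answer. The tier keys and the function names (`will_be_truncated`, `smallest_fitting_tier`) are made up for illustration and are not part of any Azure SDK; the numbers are copied from the doc as of this answer and may change.

```python
from typing import Optional

# Per-document extraction limits in characters, copied from the limits
# quoted in this answer. The tier keys are hypothetical labels.
EXTRACTION_LIMITS = {
    "free": 32_000,
    "basic": 64_000,
    "standard": 4_000_000,
    "standard_s2": 8_000_000,
    "standard_s3": 16_000_000,
}


def will_be_truncated(char_count: int, tier: str) -> bool:
    """Return True if a document of char_count characters exceeds the tier limit."""
    return char_count > EXTRACTION_LIMITS[tier]


def smallest_fitting_tier(char_count: int) -> Optional[str]:
    """Return the first tier (in ascending limit order) that fits, or None."""
    for tier, limit in EXTRACTION_LIMITS.items():
        if char_count <= limit:
            return tier
    return None


if __name__ == "__main__":
    doc_chars = 1_800_000  # size of the document from the question
    print(will_be_truncated(doc_chars, "basic"))
    print(smallest_fitting_tier(doc_chars))
```

For the 1.8M-character document, the Basic check reports truncation and Standard is the first tier whose limit is large enough.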
To avoid this issue, you can break documents with large amounts of text into multiple smaller documents (the workaround you already found), or you can use a higher pricing tier, such as the Standard tier, which has a limit of 4 million characters.
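If you want to automate the splitting, the following is a minimal Python sketch, not an official approach. It assumes the content is already available as plain text and that each piece will be uploaded as its own document; the function name `split_text` and the default of 64,000 characters (the Basic limit) are choices made for this example. It prefers to cut at whitespace so that words are not split.

```python
def split_text(text: str, max_chars: int = 64_000) -> list[str]:
    """Split text into pieces of at most max_chars characters.

    Cuts at the last space or newline inside each window when one exists,
    otherwise falls back to a hard cut at max_chars.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for the last whitespace character inside the window.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the current chunk
        chunks.append(text[start:end])
        start = end
    return chunks


if __name__ == "__main__":
    sample = "lorem ipsum " * 200_000  # 2.4M characters of sample text
    parts = split_text(sample)
    print(len(parts), max(len(p) for p in parts))
```

Joining the pieces back together reproduces the original text, so nothing is dropped. Keep in mind that the limit applies to the text the indexer extracts, so leaving some headroom below the tier limit may be safer than targeting it exactly.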
To index the whole file as a single document, you would need to upgrade to a tier that supports larger documents. You can check the Service Limits in Azure Cognitive Search doc for the limits of each tier.
Reference (limits as stated in this Azure doc at the time of submitting this answer):
Indexers limit how much text can be extracted from any one document. This limit depends on the pricing tier: 32,000 characters for Free tier, 64,000 for Basic, 4 million for Standard, 8 million for Standard S2, and 16 million for Standard S3. Text that was truncated won't be indexed. To avoid this warning, try breaking apart documents with large amounts of text into multiple, smaller documents.
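To confirm whether truncation actually happened, you can inspect the indexer status response for warnings. The sketch below works on an already-parsed JSON response (a Python dict) and makes two assumptions that you should verify against your own output: that warnings are listed under `lastResult.warnings` with a `message` field, and that the truncation warning text contains the word "truncated". The function name `truncation_warnings` is hypothetical.

```python
def truncation_warnings(status: dict) -> list[str]:
    """Return warning messages that mention truncation.

    Assumes the status payload looks like
    {"lastResult": {"warnings": [{"message": "..."}]}}.
    """
    last_result = status.get("lastResult") or {}
    warnings = last_result.get("warnings") or []
    messages = [(w.get("message") or "") for w in warnings]
    return [m for m in messages if "truncated" in m.lower()]


if __name__ == "__main__":
    # Hand-written sample payload for illustration, not real service output.
    sample_status = {
        "lastResult": {
            "status": "success",
            "warnings": [
                {"key": "doc-1", "message": "Truncated extracted text to 64000 characters."},
                {"key": "doc-2", "message": "Some unrelated warning."},
            ],
        }
    }
    for message in truncation_warnings(sample_status):
        print(message)
```

If the list is empty after a run, the indexer did not report truncation for that execution.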