Document Intelligence prebuilt models...getting raw + structured text

Question

Robert V 20

Using Document Intelligence prebuilt models to process receipts, is it possible to obtain a raw text output of the entire receipt alongside the structured data extracted by the model? Currently, the OCR identifies some text not included in prebuilt fields, highlighted in light yellow in Document Intelligence Studio, but without bounding boxes. The text I need is in those sections and is trivially identified by regular expression, if only I had the text.

At the moment, I have to run the receipt through both OCR/Read and the receipt model, but this is probably wasteful since there is likely redundant processing, and is definitely more costly. Is there a way to obtain a raw text dump and the structured data from the prebuilt models at once?

SriLakshmi C 6,010 Reputation points Microsoft External Staff Moderator

2025-04-07T11:47:38.2166667+00:00

@Robert V

Following up to see whether the suggestion below was helpful. If you have any further questions, do let us know.

Accepted answer

Answer 1

Hi,

Thanks for reaching out to Microsoft Q&A.

You are right in your observation: when using Azure Document Intelligence (formerly Form Recognizer), and especially prebuilt models such as prebuilt-receipt, the raw OCR output is not returned alongside the structured data by default.

However, here is the practical breakdown of your options:

Behavior of Prebuilt Models (prebuilt-receipt)

The prebuilt models do run OCR under the hood.
They extract structured fields (merchant name, total, tax, items).

Additional text present on the receipt but not part of the structured schema may not be exposed in the response (this is what you are seeing as highlighted in yellow in the studio).

Issue: This "extra" text is only partially exposed and does not include bounding boxes or full OCR data. It is not included in the API response unless you explicitly request it.

Why using both Read & Receipt model is wasteful

You are absolutely right. The two models overlap:

Read model: gives you raw OCR text and layout (bounding boxes, lines, words).

prebuilt-receipt model: gives structured receipt fields but, by default, does not expose the full raw OCR text.

Running both means the OCR step is performed twice, at a higher cost.

Best Practice: Use prebuilt-receipt with includeTextDetails=true

When you call the prebuilt-receipt API, set the parameter includeTextDetails=true.

This gives you:

The structured fields (MerchantName, Items, Total, etc.).

And also all raw OCR text, including bounding boxes, line text, words, and positions.

This is what you need to extract that "yellow highlighted" text using your own regex.
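As a rough illustration of where the parameter goes, here is a minimal sketch that only builds the request URL. It assumes the Form Recognizer v2.1 REST path for the receipt model; the resource endpoint and the helper name build_receipt_analyze_url are hypothetical, and newer API versions use a different path and may not accept this parameter, so check the reference for the version you call.

```python
from urllib.parse import urlencode


def build_receipt_analyze_url(endpoint: str, include_text_details: bool = True) -> str:
    """Build the analyze URL for the prebuilt receipt model.

    Assumption: the v2.1 REST path. The endpoint is a placeholder for
    your own resource endpoint.
    """
    base = endpoint.rstrip("/") + "/formrecognizer/v2.1/prebuilt/receipt/analyze"
    # The service expects the lowercase strings "true" / "false".
    query = urlencode({"includeTextDetails": str(include_text_details).lower()})
    return f"{base}?{query}"


# Hypothetical resource endpoint, for illustration only.
url = build_receipt_analyze_url("https://example-resource.cognitiveservices.azure.com/")
print(url)
```

The receipt itself is then sent in a POST to this URL together with your resource key, and the result is fetched from the operation URL that the service returns.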

Output Structure with includeTextDetails

You can expect in the response:

analyzeResult.readResults -> Full raw OCR text, by page, with bounding boxes and lines.
analyzeResult.documentResults -> Structured receipt data.

So, you do not need to run Read separately. The prebuilt-receipt model with includeTextDetails=true gives you everything in one shot.
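To show how the two parts of the response could be used together, here is a small sketch that joins the OCR lines into one text dump, runs a regular expression over it, and reads a structured field from the same result. The response dictionary is hand-written sample data that only mimics the readResults / documentResults shape described above; the receipt contents, the "LOYALTY ID" pattern, and the helper names are made up for illustration. Note that this shape belongs to the older v2.x API; as far as I know, v3.0 and later return the text under content and pages and the fields under documents, so the key names would need adjusting there.

```python
import re

# Hand-written sample that mimics the response shape; not real service output.
sample_response = {
    "analyzeResult": {
        "readResults": [
            {
                "page": 1,
                "lines": [
                    {"text": "CONTOSO MARKET", "boundingBox": [10, 10, 200, 10, 200, 30, 10, 30]},
                    {"text": "TOTAL 12.50", "boundingBox": [10, 40, 200, 40, 200, 60, 10, 60]},
                    {"text": "LOYALTY ID: 48213", "boundingBox": [10, 70, 200, 70, 200, 90, 10, 90]},
                ],
            }
        ],
        "documentResults": [
            {
                "docType": "prebuilt:receipt",
                "fields": {
                    "MerchantName": {"type": "string", "text": "CONTOSO MARKET", "valueString": "Contoso Market"},
                    "Total": {"type": "number", "text": "12.50", "valueNumber": 12.5},
                },
            }
        ],
    }
}


def raw_text(result: dict) -> str:
    """Join every OCR line on every page into one newline-separated string."""
    lines = []
    for page in result["analyzeResult"].get("readResults", []):
        for line in page.get("lines", []):
            lines.append(line["text"])
    return "\n".join(lines)


def structured_fields(result: dict) -> dict:
    """Return the fields of the first recognized document, or {} if none."""
    docs = result["analyzeResult"].get("documentResults", [])
    return docs[0]["fields"] if docs else {}


text = raw_text(sample_response)
fields = structured_fields(sample_response)

# Regex over the raw dump picks up text that is not part of the receipt schema.
match = re.search(r"LOYALTY ID:\s*(\d+)", text)
loyalty_id = match.group(1) if match else None

print(fields["Total"]["valueNumber"], loyalty_id)
```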

Answer 2

Robert V 20

Thank you! I'll give it a try and will report back.
