Azure Document Intelligence and Content Understanding

Question

omer 0

Hello,

Our customer has dozens of Excel and PDF files. These files come in various formats, and the formats may change over time. For example, some files provide data in a standard tabular structure, others use pivot-style Excel layouts, and some follow more complex or semi-structured formats.

We need to extract information from these files and ingest it into normalized tables. Therefore, our requirement is to automatically infer the structure of each file, extract the required values, and load them into Databricks tables.

There are dozens of different templates today, and new templates may emerge over time. Given this level of variability, what would be the recommended pipeline, tech stack and architecture? Should I prefer Document Intelligence or Content Understanding? Are these technologies reliable enough for understanding the file format and extracting value properly?

Anshika Varshney 13,310 Reputation points Microsoft External Staff Moderator

2026-01-14T12:05:40.01+00:00
Hi omer,

Thank you for reaching out on the Microsoft Q&A.

Azure Document Intelligence extracts and interprets content from documents using AI models. Here are the key points and steps for using it:

Document Intelligence is designed to analyze structured and unstructured documents, extract text, tables, and semantic meaning, and enable downstream workflows like classification or summarization.

Common Scenarios

Processing invoices, receipts, contracts, or forms.

Extracting entities and relationships for automation.

Applying custom models for domain-specific documents.

How to Use It

Choose the Right Model

Prebuilt models for invoices, receipts, ID documents.

Custom models for specialized layouts or language.

Set Up the Endpoint

Use the documentIntelligence endpoint in your Azure resource.

Ensure correct region and authentication (API key or Microsoft Entra ID).

Send the Request

Use REST API or SDK (Python, C#, Java).

Include the document file or URL in the request body.

Specify the model ID (e.g., prebuilt-invoice or your custom model).

Handle the Response

Parse JSON output for extracted fields, tables, and confidence scores.

Apply post-processing logic for validation.
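
The response-handling step above can be sketched as follows. This is a minimal sketch, assuming the analyze result has already been fetched and loaded as a Python dict shaped like the service's JSON output (`analyzeResult`, then `documents`, then `fields`, each field carrying `content` and `confidence`). The sample payload, the field names, the helper name `split_by_confidence`, and the 0.8 threshold are illustrative assumptions, not values from the service or from this thread.

```python
from typing import Any


def split_by_confidence(result: dict[str, Any], threshold: float = 0.8):
    """Return (accepted, needs_review) field dicts from an analyze result."""
    accepted: dict[str, str] = {}
    needs_review: dict[str, str] = {}
    for document in result.get("analyzeResult", {}).get("documents", []):
        for name, field in document.get("fields", {}).items():
            value = field.get("content", "")
            # A missing confidence is treated as 0.0 so the field gets reviewed.
            confident = field.get("confidence", 0.0) >= threshold
            (accepted if confident else needs_review)[name] = value
    return accepted, needs_review


# Hypothetical payload for illustration only.
sample = {
    "analyzeResult": {
        "documents": [
            {
                "docType": "hypothetical-lab-report",
                "fields": {
                    "product": {"content": "Sample A", "confidence": 0.97},
                    "batch": {"content": "B-001", "confidence": 0.55},
                },
            }
        ]
    }
}

accepted, needs_review = split_by_confidence(sample)
print(accepted)
print(needs_review)
```

Fields below the threshold are kept rather than dropped, so a validation step or a human reviewer can decide what to do with them.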

Troubleshooting Tips

If results are incomplete or confidence is low:

Check document quality (resolution, clarity).

Validate model training data for custom models.

Confirm you’re using the latest SDK version.

For errors like “no content detected”:

Ensure the file format is supported by the model you are calling (for example PDF, JPEG, or PNG).

Verify the request payload and headers.

References

https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview

Troubleshooting Document Extraction Issues with Azure Document Intelligence

Understanding Document Size Limits in Azure Document Intelligence

Choosing the Right Foundry Tool for Document Processing

Azure AI Services Overview

Please let me know if you have any remaining questions or need additional details. I'll be glad to provide further clarification or guidance.

Thank you!
Ömer Faruk Özsakarya 121 Reputation points

2026-01-14T20:07:15.3233333+00:00

Thank you, Anshika.

Do we need to build a separate custom model for each template type?

When a business user uploads a new document (PDF or Excel), how do we determine which custom model should be applied? Should we attempt to run multiple models and select the one that yields the highest extraction confidence or the most complete set of fields?

From a technology perspective, which approach would you recommend? Is Azure Document Intelligence sufficient on its own, or should we consider using both Document Intelligence and Azure Content Understanding together?
Anshika Varshney 13,310 Reputation points Microsoft External Staff Moderator

2026-01-14T20:41:26.88+00:00
Hi Ömer Faruk Özsakarya,

Yes, if your templates differ significantly in layout or field structure, creating a separate custom model for each template type is recommended. Azure Document Intelligence models are optimized for consistent layouts, so mixing very different templates in one model can reduce accuracy.

Determining Which Model to Apply for New Documents: Common strategies include:

Classification First: Use a lightweight classifier (e.g., based on document metadata or a simple layout analysis) to route the document to the right model.

Confidence-Based Selection: If classification isn’t feasible, you can run multiple models and pick the one with the highest confidence or most complete extraction. This is more resource-intensive but works when templates are similar.
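
The confidence-based selection strategy can be sketched as below. This is a minimal sketch, assuming each candidate model's output has already been reduced to a mapping of field name to `(value, confidence)`. The `ModelResult` type, the model IDs, the sample values, and the ranking rule (completeness first, then mean confidence) are assumptions made for illustration, not a method the service prescribes.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    model_id: str                          # hypothetical custom model name
    fields: dict[str, tuple[str, float]]   # field name -> (value, confidence)


def score(result: ModelResult, required: list[str]) -> tuple[float, float]:
    """Rank by share of required fields found, then by their mean confidence."""
    found = [
        result.fields[name]
        for name in required
        if name in result.fields and result.fields[name][0]
    ]
    completeness = len(found) / len(required)
    mean_conf = sum(conf for _, conf in found) / len(found) if found else 0.0
    return (completeness, mean_conf)


def pick_best(results: list[ModelResult], required: list[str]) -> ModelResult:
    """Return the candidate with the best (completeness, confidence) score."""
    return max(results, key=lambda r: score(r, required))


required = ["product", "test", "result", "batch"]
results = [
    ModelResult("template-a-model", {
        "product": ("Sample A", 0.90), "test": ("ISO 3960", 0.80),
        "result": ("1.2", 0.85), "batch": ("B-001", 0.90),
    }),
    ModelResult("template-b-model", {
        "product": ("Sample A", 0.99), "test": ("ISO 3960", 0.99),
        "result": ("1.2", 0.99),
    }),
]

best = pick_best(results, required)
print(best.model_id)
```

Ranking completeness ahead of confidence keeps a model that is very confident about a partial extraction from beating one that found every required field. Whether that trade-off is right depends on the data.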

Technology Recommendation

Azure Document Intelligence Alone: For most structured extraction scenarios, Document Intelligence is sufficient. It provides prebuilt models and custom training for PDFs, images, and Office files.

Combining with Azure Content Understanding: If you need semantic enrichment (e.g., categorization, summarization, or contextual insights beyond field extraction), pairing both services can add value. Document Intelligence handles structured data extraction, while Content Understanding adds meaning and context.

For more details, check the official docs:

https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-4.0.0

https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/

I hope this helps. Do let me know if you have any further queries.

Thank you!
omer 0 Reputation points

2026-01-15T09:13:55.8366667+00:00
Thank you very much for the detailed explanation. I do not understand the role of Content Understanding. For instance, I have the sample templates below.

Template 1: column names are written. Some cells contain multiple values.

Template 2: some column names are not written. For instance, IS 660 and Panel are names of laboratory tests, but this is not written in the Excel sheet.

I can create two custom models in Azure Document Intelligence to extract product, test, result, and batch values from Excel sheets and load them into a normalized table in Databricks.

However, I would like to clarify the following points:

Multi-value cell handling: Can a Document Intelligence custom model interpret multiple values within a single cell and output them as multiple records? For example, if a cell contains MgCl2·6H2O, can the model correctly decompose this and produce three logical records? (first screenshot)

Domain-specific semantic understanding: Can a Document Intelligence custom model recognize that values such as IS = 660, AOCS Td 1a-64, ISO 6685, ISO 3960, Panel, and GC-MS represent laboratory test names, rather than treating them as free-text strings? (second screenshot)

Role of Azure Content Understanding: How does Azure Content Understanding add value in this scenario? Specifically, how does it complement Document Intelligence?

Thanks
Anshika Varshney 13,310 Reputation points Microsoft External Staff Moderator

2026-01-15T20:11:21.24+00:00
Thanks for sharing the detailed context and examples! Let me address each point:

Multi-value cell handling

Azure Document Intelligence custom models are designed to extract structured fields, but they do not automatically split multiple values within a single cell into separate records. If a cell contains something like MgCl2·6H2O, the model will treat it as one string. To achieve decomposition into multiple logical records, you would need post-processing logic in your pipeline (e.g., using Python or Databricks transformations) after extraction.
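
The post-processing step described above can be sketched in plain Python. This is a minimal sketch, assuming the extracted row arrives as a dict and that values inside a cell are separated by semicolons or line breaks. The delimiter rule, the column names, the sample row, and the helper name `explode_cell` are assumptions for illustration; the real splitting rule depends on how the cells in the actual templates are laid out.

```python
import re

# Assumed delimiters between values inside one cell: semicolons or newlines.
DELIMITERS = re.compile(r"[;\n]+")


def explode_cell(row: dict[str, str], column: str) -> list[dict[str, str]]:
    """Return one record per value found in row[column]."""
    parts = [p.strip() for p in DELIMITERS.split(row[column]) if p.strip()]
    # Keep the original row unchanged when the cell holds nothing to split.
    return [{**row, column: part} for part in parts] or [row]


# Hypothetical extracted row for illustration only.
row = {"product": "Sample A", "batch": "B-001", "analyte": "MgCl2·6H2O; NaCl\nKCl"}

records = explode_cell(row, "analyte")
for record in records:
    print(record)
```

On Databricks, the same idea is commonly expressed with `split` and `explode` from `pyspark.sql.functions` applied to the extracted column.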

Domain-specific semantic understanding

Custom models in Document Intelligence primarily learn patterns from labeled training data. They do not inherently “understand” domain semantics like recognizing IS 660 or GC-MS as lab tests unless you include these examples in your labeled dataset. By providing sufficient annotated samples, the model can learn to classify these as test names rather than generic text. For deeper semantic interpretation, you might consider integrating Azure AI Language or custom NLP models alongside Document Intelligence.
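
Where labeled training data is thin, one simple alternative is a rule-based tagger applied after extraction. This is not the labeled-data approach described above, and it is not a feature of Document Intelligence. The sketch below uses the test names mentioned in this thread as examples; the patterns, the keyword set, and the function name `is_test_name` are assumptions and would need to be extended for a real test catalog.

```python
import re

# Assumed patterns, derived only from the examples quoted in this thread.
TEST_NAME_PATTERNS = [
    re.compile(r"ISO\s+\d+"),             # e.g. ISO 6685, ISO 3960
    re.compile(r"IS\s*=?\s*\d+"),         # e.g. IS 660, IS = 660
    re.compile(r"AOCS\s+\w+\s+\w+-\d+"),  # e.g. AOCS Td 1a-64
]
TEST_NAME_KEYWORDS = {"Panel", "GC-MS"}


def is_test_name(value: str) -> bool:
    """Return True when the extracted string looks like a laboratory test name."""
    text = value.strip()
    if text in TEST_NAME_KEYWORDS:
        return True
    return any(pattern.fullmatch(text) for pattern in TEST_NAME_PATTERNS)


for candidate in ["IS = 660", "AOCS Td 1a-64", "ISO 6685", "GC-MS", "128354349", "mg"]:
    print(candidate, is_test_name(candidate))
```

Rules like these are transparent and cheap to run, but they only cover naming conventions that have been seen before, so they work best as a fallback alongside a trained model.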

Role of Azure Content Understanding

Content Understanding complements Document Intelligence by adding higher-level semantic enrichment and classification. While Document Intelligence focuses on extracting structured data from documents, Content Understanding can:

Classify documents by type or purpose.

Apply domain-specific taxonomies.

Enable search and retrieval based on meaning rather than raw text.

In your scenario, Content Understanding could help identify that a document relates to “Lab Test Reports” and tag entities like “Test Name” or “Panel,” improving downstream analytics and search.

In Short:

Document Intelligence extracts fields but won’t split multi-value cells automatically; use post-processing for that.

Domain-specific recognition depends on your training data; include examples for better accuracy.

Content Understanding adds semantic classification and context, making your solution more intelligent when combined with extraction.

Here are the official Microsoft documentation links:

Custom Model Training in Azure Document Intelligence

Azure Content Understanding Overview

Please let me know if the issue persists after these checks. If you have any remaining questions or need additional details, I’ll be glad to provide further clarification or guidance. If the above steps resolve your issue, kindly confirm.

Thank you!
omer 0 Reputation points

2026-01-15T20:54:33.71+00:00
Hello,

Thank you for the detailed explanations in the previous responses.

My question is:

Can Azure Content Understanding be used as a single, standalone solution to analyze PDF and Excel files and natively understand domain semantics such as:

Recognizing that IS 660 or AOCS Td 1a-64 are test methods

Identifying MgCl₂·6H₂O as a parameter/analyte

Understanding mg as a unit of measure

Interpreting 128354349 as a TS Code or business identifier

Without relying on:

Azure Document Intelligence custom models for structured extraction

In other words:

Is Content Understanding designed to both extract structure and perform domain-level semantic interpretation on complex PDF and Excel files (as shown in the end-to-end diagram below)?

Or is the recommended architecture still to use Document Intelligence for deterministic extraction, with Content Understanding layered optionally for enrichment, classification, or search?

Thank you in advance.
Anshika Varshney 13,310 Reputation points Microsoft External Staff Moderator

2026-01-19T19:57:55.2966667+00:00
Hi omer,

No, Azure Content Understanding is not designed to be a single standalone replacement for Azure Document Intelligence for deterministic structure extraction.

Azure Content Understanding focuses on semantic understanding, enrichment, classification, and reasoning over content, once the text and structure are available. It does not natively perform precise, deterministic extraction of tables, key‑value pairs, or structured fields from complex PDFs or Excel files in the way Azure Document Intelligence does.

For examples like:

Identifying IS 660 or AOCS Td 1a-64 as test methods

Recognizing MgCl₂·6H₂O as an analyte

Interpreting mg as a unit of measure

Understanding 128354349 as a TS code or business identifier

Content Understanding can interpret and reason about these concepts semantically, but it relies on upstream extraction for reliable text, layout, tables, and structure.

Use Azure Document Intelligence for deterministic extraction (layout, tables, forms, structured fields) from PDFs and Excels

Layer Azure Content Understanding optionally on top for domain reasoning, enrichment, classification, summarization, or search

I hope this helps. Do let me know if you have any further queries.

Thank you!
Anshika Varshney 13,310 Reputation points Microsoft External Staff Moderator

2026-01-20T20:26:15.84+00:00

Hi omer,
Did you get a chance to review the response?

Thank you!
Anshika Varshney 13,310 Reputation points Microsoft External Staff Moderator

2026-01-22T19:23:36.03+00:00

Hi omer,

We haven’t heard from you since the last response and are checking back to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, please reply with more details and we will try to help.

Thank you!

Answer 1

For your requirement of extracting information from various Excel and PDF files with changing formats, both Azure Document Intelligence and Content Understanding can be beneficial, but they serve slightly different purposes.

Recommended Approach:

Azure Document Intelligence: This service is designed to extract structured data from documents, including standard forms, invoices, and contracts. It can handle various document formats and provides high-accuracy extraction, making it suitable for your needs where you have different templates. It supports predefined schemas for many templated document types and allows for custom model training to adapt to new templates as they emerge.
Azure Content Understanding: This service is more suited for complex document processing and can handle unstructured documents or those with a large number of variations. It offers richer field extraction and inference capabilities, which can be beneficial if your documents vary significantly in structure. It also supports multimodal inputs, which means it can process images, audio, and video alongside traditional documents.

Pipeline and Architecture:

Data Ingestion: Use Azure Data Factory to automate the ingestion of files into your processing pipeline.
Document Processing: Implement Azure Document Intelligence for initial extraction of structured data from known templates. For new or complex templates, leverage Azure Content Understanding to infer structure and extract data.
Data Normalization: Post-extraction, use Azure Databricks to normalize the extracted data and load it into your target tables.
Continuous Learning: As new templates emerge, you can retrain your Document Intelligence models or adjust your Content Understanding configurations to improve accuracy over time.
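
The routing and normalization steps in this pipeline can be sketched as below. This is a minimal sketch in plain Python, assuming the template has already been identified upstream. The two extractor functions are placeholders standing in for calls to a Document Intelligence custom model and a Content Understanding analyzer; they are not real SDK calls, and their return values, the template IDs, and the target column names are hypothetical.

```python
from typing import Optional

TARGET_COLUMNS = ("product", "test", "result", "batch")  # assumed target schema
KNOWN_TEMPLATES = {"template_1", "template_2"}           # hypothetical template IDs


def extract_known_template(content: bytes) -> list[dict[str, str]]:
    # Placeholder for a Document Intelligence custom model call.
    return [{"product": "P1", "test": "ISO 3960", "result": "1.2", "batch": "B1"}]


def extract_unknown_template(content: bytes) -> list[dict[str, str]]:
    # Placeholder for a Content Understanding analyzer call.
    return [{"Product": "P2", "Test": "GC-MS", "Result": "detected", "Batch": "B2"}]


def normalize(row: dict[str, str]) -> dict[str, Optional[str]]:
    """Map a raw extracted row onto the target columns, filling gaps with None."""
    lowered = {key.strip().lower(): value for key, value in row.items()}
    return {column: lowered.get(column) for column in TARGET_COLUMNS}


def process(template_id: Optional[str], content: bytes) -> list[dict[str, Optional[str]]]:
    """Route a file to an extractor, then normalize every extracted row."""
    if template_id in KNOWN_TEMPLATES:
        rows = extract_known_template(content)
    else:
        rows = extract_unknown_template(content)
    return [normalize(row) for row in rows]


print(process("template_1", b""))
print(process(None, b""))
```

Keeping normalization separate from extraction means a new template only needs a new extractor or analyzer configuration, while the load into the target tables stays unchanged.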

Reliability:

Both technologies are reliable for understanding file formats and extracting values, especially when configured properly. Document Intelligence is proven for high-accuracy extraction, while Content Understanding provides flexibility for handling complex variations in document formats.

In summary, for a dynamic environment with various templates, a combination of both Azure Document Intelligence and Content Understanding would likely provide the best results, allowing for both structured and unstructured data extraction as needed.
