Content Understanding Capabilities (preview)

02/26/2025

Important

Azure AI Content Understanding is available in preview. Public preview releases provide early access to features that are in active development.
Features, approaches, and processes can change or have limited capabilities before General Availability (GA).
For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Content Understanding provides an advanced approach to processing and interpreting vast amounts of unstructured data. It offers various capabilities that accelerate time-to-value, reducing the time required to derive meaningful insights. By generating outputs that seamlessly integrate into analytical workflows and Retrieval-Augmented Generation (RAG) applications, it enhances data-driven decision-making and boosts overall productivity.

Overview of Key Capabilities in Content Understanding

Multimodal Data Ingestion

Content Understanding delivers a unified solution for processing diverse data types - documents, text, images, audio, and video - through an intelligent pipeline that transforms unstructured content into structured, analyzable formats. This consolidated approach eliminates the complexity of managing separate Azure resources for speech, vision, and document processing.

The service employs a customizable dual-pipeline architecture that combines content extraction and field extraction capabilities. Content extraction provides foundational structuring of raw data, while field extraction applies schema-based analysis to derive specific insights. This integrated approach streamlines workflows, reduces operational overhead, and enables sophisticated analysis across multiple modalities through a single, cohesive interface.

Content Extraction

Content extraction in Content Understanding is a powerful feature that transforms unstructured data into structured data, powering advanced AI processing capabilities. The structured data enables efficient downstream processing while maintaining contextual relationships in the source content.

Content extraction provides foundational data that grounds the generative capabilities of Field Extraction, offering essential context about the input content. Users find content extraction invaluable for converting diverse data formats into a structured format. This capability excels in scenarios requiring:

Document digitization, indexing, and retrieval by structure
Audio/video transcription
Metadata generation at scale

Content Understanding enhances its core extraction capabilities through optional add-on features that provide deeper content analysis. These add-ons can extract ancillary elements like layout information, speaker roles, and face grouping. While some add-ons can incur added costs, they can be selectively enabled based on your specific requirements to optimize both functionality and cost-efficiency. The modular nature of these add-on features allows for customized processing pipelines tailored to your use case.

The following sections detail the content extraction capabilities and optional add-on features available for each supported modality.

Document

Content extraction:
• `Optical Character Recognition (OCR)`: Extracts printed and handwritten text from documents in various file formats, converting it into structured data.

Add-on capabilities:
• `Layout`: Extracts layout information such as paragraphs, sections, and tables.
• `Barcode`: Identifies and decodes all barcodes in the documents.
• `Formula`: Recognizes mathematical equations in the documents.

Audio

Content extraction:
• `Transcription`: Converts conversational audio into searchable and analyzable text-based transcripts in WebVTT format. Customizable fields can be generated from transcription data. Sentence-level and word-level timestamps are available upon request.
• `Diarization`: Distinguishes between speakers in a conversation, attributing parts of the transcript to specific speakers.
• `Language detection`: Automatically detects the language spoken in the audio to be processed.

Add-on capabilities:
• `Speaker role detection`: Identifies speaker roles based on diarization results and replaces generic labels like "Speaker 1" with specific role names, such as "Agent" or "Customer."
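The audio transcripts described above are delivered in WebVTT, a plain-text cue format, so they can be read with ordinary string handling. The following is a minimal sketch of that idea, not the service's own tooling. The sample transcript, the use of WebVTT voice tags (`<v Speaker 1>`) to carry speaker labels, and the helper names `parse_vtt` and `apply_roles` are all assumptions made for illustration; the actual layout of the service output may differ. `apply_roles` only mimics the effect of speaker role detection on generic labels.

```python
import re
from dataclasses import dataclass

# Hypothetical sample transcript; real service output may be laid out differently.
SAMPLE_VTT = """WEBVTT

00:00:00.080 --> 00:00:02.500
<v Speaker 1>Thank you for calling, how can I help?

00:00:02.900 --> 00:00:05.000
<v Speaker 2>My order has not arrived yet.
"""

@dataclass
class Cue:
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    speaker: str  # empty string when the cue carries no voice tag
    text: str

# Timing line in hh:mm:ss.ttt form (this sketch does not handle the mm:ss.ttt short form).
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)
# WebVTT voice tag at the start of the cue payload, for example "<v Speaker 1>".
VOICE = re.compile(r"^<v ([^>]+)>(.*)$")

def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_vtt(vtt: str) -> list[Cue]:
    """Parse cue blocks separated by blank lines; blocks without a timing line are skipped."""
    cues = []
    for block in vtt.strip().split("\n\n"):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if not match:
                continue
            groups = match.groups()
            payload = " ".join(lines[i + 1:]).strip()
            voice = VOICE.match(payload)
            if voice:
                speaker, text = voice.group(1), voice.group(2).strip()
            else:
                speaker, text = "", payload
            cues.append(Cue(_seconds(*groups[:4]), _seconds(*groups[4:]), speaker, text))
            break
    return cues

def apply_roles(cues: list[Cue], roles: dict[str, str]) -> list[Cue]:
    """Replace generic speaker labels with role names, leaving unknown labels untouched."""
    return [Cue(c.start, c.end, roles.get(c.speaker, c.speaker), c.text) for c in cues]

cues = apply_roles(parse_vtt(SAMPLE_VTT), {"Speaker 1": "Agent", "Speaker 2": "Customer"})
for cue in cues:
    print(f"[{cue.start:.2f}-{cue.end:.2f}] {cue.speaker}: {cue.text}")
```

Keeping start and end times as seconds makes it straightforward to align cues with other time-based outputs, such as video segments.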

Video

Content extraction:
• `Transcription`: Converts speech to structured, searchable text via Azure AI Speech, allowing users to specify recognition languages.
• `Shot Detection`: Identifies segments of the video aligned with shot boundaries where possible, allowing for precise editing and repackaging of content with breaks exactly on shot boundaries.
• `Key Frame Extraction`: Extracts key frames from videos to represent each shot completely, ensuring each shot has enough key frames to enable Field Extraction to work effectively.

Add-on capabilities:
• `Face Grouping`: Groups faces appearing in a video to extract one representative face image for each person and provides the segments where each person is present. The grouped face data is available as metadata and can be used to generate customized metadata fields. This feature is limited access and involves face identification and grouping; customers need to register for access at Face Recognition.

Field Extraction

Field extraction in Content Understanding uses generative AI models, guided by schemas that you define, to extract, infer, or abstract information from various data types into structured outputs. Because each schema field is defined with a natural language description, there is no need for complex prompt engineering, which makes it easier for users to create standardized outputs.
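To make the idea of a schema with natural language field descriptions concrete, here is a minimal sketch written as a plain Python dictionary. The key names (`fields`, `type`, `method`, `description`, `enum`), the field names, and the `validate_schema` helper are illustrative assumptions, not the service's confirmed request format. The three method values mirror the extract, generate, and classify methods described in this article.

```python
import json

# Generation methods named in this article.
ALLOWED_METHODS = {"extract", "generate", "classify"}

# Hypothetical schema for customer service call recordings.
call_schema = {
    "name": "CallInsights",
    "fields": {
        "Summary": {
            "type": "string",
            "method": "generate",
            "description": "A two-sentence summary of the conversation.",
        },
        "Sentiment": {
            "type": "string",
            "method": "classify",
            "description": "Overall sentiment of the conversation.",
            "enum": ["positive", "neutral", "negative"],
        },
    },
}

def validate_schema(schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the schema looks consistent."""
    problems = []
    for name, spec in schema.get("fields", {}).items():
        method = spec.get("method")
        if method not in ALLOWED_METHODS:
            problems.append(f"{name}: unknown method {method!r}")
        if not spec.get("description", "").strip():
            problems.append(f"{name}: missing description")
        if method == "classify" and not spec.get("enum"):
            problems.append(f"{name}: classify fields need a list of categories")
    return problems

print(validate_schema(call_schema))
print(json.dumps(call_schema, indent=2))
```

The description strings do the work that a hand-written prompt would otherwise do, so a local check like this one mainly guards against empty descriptions and misspelled method names before a schema is submitted.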

Field extraction is optimized for scenarios requiring:

Consistent metadata extraction across content types
Workflow automation with structured output
Compliance monitoring and validation

The value lies in its ability to handle multiple content types (text, audio, video, images) while maintaining accuracy and scalability through AI-powered schema extraction and confidence scoring.

Each modality supports specific generation approaches optimized for that content type. The following lists describe the generation methods available for each modality.

Supported generation methods
• Extract: In documents, users can extract field values from input content, such as dates from receipts or item details from invoices.

Illustration of Document extraction method workflow.

Supported generation methods
• Generate: In images, users can derive values from the input content, such as generating titles, descriptions, and summaries for figures and charts.
• Classify: In images, users can categorize elements from the input content, such as identifying different types of charts like histograms, bar graphs, etc.

Illustration of Image Generation and Classification workflow.

Supported generation methods
• Generate: In audio, users can derive values from the input content, such as conversation summaries and topics.
• Classify: In audio, users can categorize values from the input content, such as determining the sentiment of a conversation (positive, neutral, or negative).

Illustration of Audio Generation and Classification workflow.

Supported generation methods
• Generate: In video, users can derive values from the input content, such as summaries of video segments and product characteristics.
• Classify: In video, users can categorize values from the input content, such as determining the sentiment of conversations (positive, neutral, or negative).

Illustration of Video Generation and Classification workflow.

Follow our quickstart guide to build your first schema.

Grounding and Confidence Scores

Content Understanding grounds the results of field and content extraction in the input content, so that outputs stay aligned with what the source actually contains. It also provides confidence scores for the extracted data, which automation and validation processes can use to decide how much to rely on each result.
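One common way to use confidence scores in automation is to accept high-confidence values automatically and route the rest to a person for review. The sketch below illustrates that pattern only. The `ExtractedField` shape, the sample values, the assumption that scores fall between 0.0 and 1.0, and the 0.8 threshold are all hypothetical choices, not values defined by the service.

```python
from dataclasses import dataclass

@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be in the range 0.0 to 1.0

def split_by_confidence(fields, threshold=0.8):
    """Separate fields into (accepted, needs_review) using an assumed threshold."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review

# Hypothetical extraction results for a single invoice.
results = [
    ExtractedField("InvoiceDate", "2025-02-26", 0.97),
    ExtractedField("Total", "1,204.50", 0.62),
]

accepted, needs_review = split_by_confidence(results)
print("accepted:", [f.name for f in accepted])
print("needs review:", [f.name for f in needs_review])
```

The right threshold depends on the cost of an error in your workflow, so it is usually tuned against a sample of manually checked results.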

Analyzers

Analyzers are the core processing units in Content Understanding that define how your content should be processed and what insights should be extracted. Think of an analyzer as a custom pipeline that combines:

Content extraction configurations - determining what foundational elements to extract.
Field extraction schemas - specifying what insights to generate from the content.

Key benefits of analyzers include:

Consistency: Analyzers ensure uniform processing across all content by applying the same extraction rules and schemas, delivering reliable and predictable results.
Scalability: Once configured, analyzers can handle large volumes of content through API integration, making them ideal for production scenarios.
Reusability: A single analyzer can be reused across multiple workflows and applications, reducing development overhead.
Customization: Start with prebuilt templates, then customize the resulting analyzers to match your specific business requirements and use cases.

For example, you might create an analyzer for processing customer service calls that combines audio transcription (content extraction) with sentiment analysis and topic classification (field extraction). This analyzer can then consistently process thousands of calls, providing structured insights for your customer experience analytics.
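The customer service example can be sketched as a small data structure that pairs a content extraction configuration with a field extraction schema. Everything in this sketch is hypothetical: the `AnalyzerDefinition` class, its attribute names, the configuration keys, the analyzer ID, and the file names are illustrations of the concept rather than the service's API.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class AnalyzerDefinition:
    """Hypothetical container pairing the two halves of an analyzer."""
    analyzer_id: str
    content_extraction: dict  # which foundational elements to extract
    field_schema: dict        # which insights to generate from the content

call_center = AnalyzerDefinition(
    analyzer_id="call-center-insights",
    content_extraction={
        "transcription": True,
        "diarization": True,
        "speaker_role_detection": True,
    },
    field_schema={
        "Sentiment": {
            "method": "classify",
            "description": "Overall sentiment of the call.",
            "enum": ["positive", "neutral", "negative"],
        },
        "Topics": {
            "method": "generate",
            "description": "Main topics discussed in the call.",
        },
    },
)

def plan_jobs(analyzer: AnalyzerDefinition, recordings: list[str]) -> list[dict]:
    """Pair every recording with the same analyzer so all calls are processed identically."""
    return [{"analyzer_id": analyzer.analyzer_id, "input": r} for r in recordings]

jobs = plan_jobs(call_center, ["call-001.wav", "call-002.wav"])
for job in jobs:
    print(job)
print(sorted(asdict(call_center)["field_schema"]))
```

Because every job refers to the same definition, a change to the schema or the extraction settings applies uniformly to all content processed afterward, which is the consistency and reusability benefit described above.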

To get started, you can follow our guide for building your first analyzer.

Best Practices

For guidance on optimizing your Content Understanding implementations, including schema design tips, see our detailed Best practices guide. This guide helps you maximize the value of Content Understanding while avoiding common pitfalls.

Input requirements

For detailed information on supported input document formats, refer to our Service quotas and limits page.

Supported languages and regions

For a detailed list of supported languages and regions, visit our Language and region support page.

Data privacy and security

Developers using Content Understanding should review Microsoft's policies on customer data. For more information, visit our Data, protection, and privacy page.

Next steps

Try processing your document content using Content Understanding in Azure.
Learn to analyze content analyzer templates.
Review code sample: analyzer templates.
Take a look at our glossary.
