Automate document classification in Azure

Azure Functions
Azure OpenAI Service
Azure AI services
Azure AI Search
Azure AI Document Intelligence

This article describes an architecture that you can use to process various documents. The architecture uses the durable functions feature of Azure Functions to implement pipelines. The pipelines process documents via Azure AI Document Intelligence for document splitting, named entity recognition (NER), and classification. Document content and metadata are then used for natural language processing (NLP) based on retrieval-augmented generation (RAG).

Architecture

Diagram that shows an architecture to identify, classify, and search documents.

Download a Visio file of this architecture.

Workflow

  1. A user uploads a document file to a web app. The file contains multiple embedded documents of various types, such as PDF or multiple-page Tagged Image File Format (TIFF) files. The document file is stored in Azure Blob Storage (1a). To initiate pipeline processing, the web app adds a command message to an Azure Service Bus queue (1b).

  2. The command message triggers the durable functions orchestration. The message contains metadata that identifies the Blob Storage location of the document file to be processed. Each durable functions instance processes only one document file. (A sketch of this orchestration appears after the workflow steps.)

  3. The analyze activity function calls the Document Intelligence Analyze Document API and passes the storage location of the document file to be processed. The analyze function reads and identifies each document within the document file. It returns the name, type, page ranges, and content of each embedded document to the orchestration.

  4. The metadata store activity function saves the document type, location, and page range information for each document in an Azure Cosmos DB store.

  5. The embedding activity function uses Semantic Kernel to chunk each document and create embeddings for each chunk. Embeddings and associated content are sent to Azure AI Search and stored in a vector-enabled index. A correlation ID is also added to the search document so that the search results can be matched with the corresponding document metadata from Azure Cosmos DB.

  6. Semantic Kernel retrieves embeddings from the AI Search vector store for NLP.

  7. Users can chat with their data by using NLP. The conversation is grounded in data retrieved from the vector store. Correlation IDs included in the search result set let users look up the corresponding document records in Azure Cosmos DB. The records include links to the original document file in Blob Storage.
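The following sketch shows one way to express steps 2 through 5 with the durable functions Python programming model. It's a minimal shape for the orchestration, not a finished implementation: the queue, function, and field names (incoming-documents, process_document_file, blob_url, and so on) are illustrative assumptions, and the activity bodies are elided.

```python
import json

import azure.durable_functions as df
import azure.functions as func

app = df.DFApp()

# Step 2: a Service Bus command message starts one orchestration per document file.
@app.service_bus_queue_trigger(arg_name="msg", queue_name="incoming-documents",
                               connection="ServiceBusConnection")
@app.durable_client_input(client_name="client")
async def start_pipeline(msg: func.ServiceBusMessage,
                         client: df.DurableOrchestrationClient):
    command = json.loads(msg.get_body().decode("utf-8"))
    await client.start_new("process_document_file", client_input=command)

@app.orchestration_trigger(context_name="context")
def process_document_file(context: df.DurableOrchestrationContext):
    command = context.get_input()  # identifies the Blob Storage location of the file

    # Step 3: identify the embedded documents (name, type, page ranges, content).
    documents = yield context.call_activity("analyze_document_file",
                                            command["blob_url"])

    for document in documents:
        # Step 4: save type, location, and page range information to Azure Cosmos DB.
        yield context.call_activity("store_metadata", document)
        # Step 5: chunk the document, create embeddings, and index them in AI Search.
        yield context.call_activity("embed_and_index", document)

@app.activity_trigger(input_name="blobUrl")
def analyze_document_file(blobUrl: str) -> list:
    # Call the Document Intelligence Analyze Document API with the blob location,
    # then map the response to one record per embedded document.
    ...
```

If the documents in a file can be processed independently, fanning out with context.task_all instead of the sequential loop lets the activities run in parallel.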

Components

  • Durable functions is a feature of Azure Functions that you can use to write stateful functions in a serverless compute environment. In this architecture, a message in a Service Bus queue triggers a durable functions instance. This instance then initiates and orchestrates the document-processing pipeline.

  • Azure Cosmos DB is a globally distributed, multimodel database that you can use to scale throughput and storage capacity across any number of geographic regions. Comprehensive service-level agreements (SLAs) guarantee throughput, latency, availability, and consistency. This architecture uses Azure Cosmos DB as the metadata store for the document classification information.

  • Azure Storage is a set of massively scalable and secure cloud services for data, apps, and workloads. It includes Blob Storage, Azure Files, Azure Table Storage, and Azure Queue Storage. This architecture uses Blob Storage to store the document files that the user uploads and that the durable functions pipeline processes.

  • Service Bus is a fully managed enterprise message broker with message queues and publish-subscribe topics. This architecture uses Service Bus to trigger durable functions instances.

  • Azure App Service provides a framework to build, deploy, and scale web apps. The Web Apps feature of App Service is an HTTP-based service for hosting web applications, REST APIs, and mobile back ends. You can develop in .NET, .NET Core, Java, Ruby, Node.js, PHP, or Python, and applications run and scale in both Windows-based and Linux-based environments. In this architecture, users interact with the document-processing system through an App Service-hosted web app.

  • Document Intelligence is a service that you can use to extract insights from your documents, forms, and images. This architecture uses Document Intelligence to analyze the document files and extract the embedded documents along with content and metadata information.

  • AI Search provides a rich search experience for private, diverse content in web, mobile, and enterprise applications. This architecture uses AI Search vector storage to index embeddings of the extracted document content and metadata information so that users can search and retrieve documents by using NLP.

  • Semantic Kernel is a framework that you can use to integrate large language models (LLMs) into your applications. This architecture uses Semantic Kernel to create embeddings for the document content and metadata information, which are stored in AI Search.

  • Azure OpenAI Service provides access to OpenAI's models. This architecture uses Azure OpenAI to provide a natural language interface for users to interact with the document-processing system. (A sketch of this query path appears after the component list.)
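As a concrete illustration of steps 6 and 7, the following Python sketch embeds a question, runs a vector query against AI Search, and grounds a chat completion in the retrieved chunks. It uses the Azure AI Search and Azure OpenAI SDKs directly rather than Semantic Kernel so that each call is visible; the endpoint, key, deployment, index, and field names are placeholders.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

search = SearchClient(endpoint="https://<search>.search.windows.net",
                      index_name="documents-index",
                      credential=AzureKeyCredential("<search-key>"))
aoai = AzureOpenAI(azure_endpoint="https://<aoai>.openai.azure.com",
                   api_key="<aoai-key>", api_version="2024-06-01")

def answer(question: str) -> str:
    # Embed the question with the same model that was used at indexing time.
    vector = aoai.embeddings.create(model="text-embedding-3-small",
                                    input=question).data[0].embedding

    # Step 6: retrieve the nearest chunks from the vector-enabled index.
    results = list(search.search(
        search_text=question,
        vector_queries=[VectorizedQuery(vector=vector, k_nearest_neighbors=5,
                                        fields="content_vector")],
        select=["content", "correlation_id"]))

    # The correlation_id values in the results can be used to look up the
    # document records, and the original blob links, in Azure Cosmos DB.
    excerpts = "\n\n".join(hit["content"] for hit in results)

    # Step 7: ground the chat completion in the retrieved content.
    response = aoai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer using only these document excerpts:\n\n" + excerpts},
            {"role": "user", "content": question},
        ])
    return response.choices[0].message.content
```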

Alternatives

  • To facilitate global distribution, this solution stores metadata in Azure Cosmos DB. Azure SQL Database is another persistent storage option for document metadata and information.

  • To trigger durable functions instances, you can use other messaging platforms, including Azure Event Grid.

  • Semantic Kernel is one of several options for creating embeddings. You can also use Azure Machine Learning or Azure AI services to create embeddings.

  • To provide a natural language interface for users, you can use other language models within Azure AI Foundry. The platform supports various models from different providers, including Mistral, Meta, Cohere, and Hugging Face.

Scenario details

In this architecture, the pipelines identify the documents in a document file, classify them by type, and store information to use in subsequent processing.

Many companies need to manage and process documents that they scan in bulk and that contain several different document types, such as PDFs or multiple-page TIFF images. These documents might originate from outside the organization, and the receiving company doesn't control the format.

Because of these constraints, organizations must build their own document-parsing solutions that can include custom technology and manual processes. For example, someone might manually separate individual document types and add classification qualifiers for each document.

Many of these custom solutions are based on the state machine workflow pattern. The solutions use database systems to persist workflow state and use polling services that check for the states that they need to process. Maintaining and enhancing these solutions can increase complexity and effort.

Organizations need reliable, scalable, and resilient solutions to identify and classify their document types. This solution can process millions of documents each day with full observability into the success or failure of the processing pipeline.

NLP allows users to interact with the system in a conversational manner. Users can ask questions about the documents and receive answers based on the content of the documents.

Potential use cases

You can use this solution to:

  • Report titles. Many government agencies and municipalities manage paper records that don't have a digital form. An effective automated solution can generate a file that contains all the documents that you need to satisfy a document request.

  • Manage maintenance records. You might need to scan and send paper records, such as aircraft, locomotive, and machinery maintenance records, to outside organizations.

  • Process permits. City and county permitting departments maintain paper documents that they generate for permit inspection reporting. You can take a picture of several inspection documents and automatically identify, classify, and search across these records.

  • Analyze planograms. Retail and consumer goods companies manage inventory and compliance through store shelf planogram analysis. You can take a picture of a store shelf and extract label information from varying products to automatically identify, classify, and quantify the product information.

Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that you can use to improve the quality of a workload. For more information, see Well-Architected Framework.

Reliability

Reliability helps ensure that your application can meet the commitments that you make to your customers. For more information, see Design review checklist for Reliability.

A reliable workload has both resiliency and availability. Resiliency is the ability of the system to recover from failures and continue to function. The goal of resiliency is to return the application to a fully functioning state after a failure occurs. Availability measures whether your users can access your workload when they need to.

To ensure reliability and availability of Azure OpenAI endpoints, consider using a generative AI gateway in front of multiple Azure OpenAI deployments or instances. The gateway's back-end load balancer supports round-robin, weighted, and priority-based load balancing. This feature gives you flexibility to define an Azure OpenAI load distribution strategy that meets your specific requirements.
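Gateways such as Azure API Management provide this behavior natively, but the selection logic itself is straightforward. The following Python sketch illustrates priority-based failover combined with weighted distribution within a priority tier; the back-end URLs, priorities, and weights are invented for the example.

```python
import random

# Hypothetical back-end pool: lower priority numbers are tried first, and
# traffic within a tier is split according to weight.
BACKENDS = [
    {"url": "https://aoai-eastus.openai.azure.com",  "priority": 1, "weight": 70},
    {"url": "https://aoai-westus.openai.azure.com",  "priority": 1, "weight": 30},
    {"url": "https://aoai-sweden.openai.azure.com",  "priority": 2, "weight": 100},
]

def pick_backend(unavailable: frozenset = frozenset()) -> str:
    """Pick the highest-priority tier that has a healthy back end, then choose
    within that tier by weighted random selection."""
    for priority in sorted({b["priority"] for b in BACKENDS}):
        tier = [b for b in BACKENDS
                if b["priority"] == priority and b["url"] not in unavailable]
        if tier:
            return random.choices([b["url"] for b in tier],
                                  weights=[b["weight"] for b in tier])[0]
    raise RuntimeError("No Azure OpenAI back end is available")
```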

For more information about reliability in solution components, see SLA information for Azure online services.

Cost Optimization

Cost Optimization focuses on ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Design review checklist for Cost Optimization.

The most significant costs for this architecture are the Azure OpenAI model token usage, Document Intelligence image processing, and index capacity requirements in AI Search.

To optimize costs, monitor token consumption for your Azure OpenAI deployments, submit only the pages that you need to Document Intelligence, and choose an AI Search tier and replica count that match your index size and query load.

Performance Efficiency

Performance Efficiency refers to your workload's ability to scale to meet user demands efficiently. For more information, see Design review checklist for Performance Efficiency.

This solution can expose performance bottlenecks when you process high volumes of data. To ensure proper performance efficiency for your solution, make sure that you understand and plan for Azure Functions scaling options, AI services autoscaling, and Azure Cosmos DB partitioning.
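For example, the lookup in step 7 goes from a correlation ID in a search result to a metadata record in Azure Cosmos DB. Partitioning the metadata container on that ID keeps those lookups efficient, single-partition point reads. A minimal sketch with the azure-cosmos SDK follows; the account, database, container, and property names are assumptions for illustration.

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(url="https://<account>.documents.azure.com", credential="<key>")
database = client.create_database_if_not_exists("document-pipeline")

# Partition on the correlation ID so that the search-result-to-metadata lookup
# in step 7 resolves within a single partition.
container = database.create_container_if_not_exists(
    id="document-metadata",
    partition_key=PartitionKey(path="/correlationId"),
)
```

The right partition key ultimately depends on your query patterns; if you also query by document type or upload date at high volume, model those access paths before you settle on a key.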

Azure OpenAI provisioned throughput units (PTUs) provide guaranteed performance and availability, along with global deployments. These deployments use the Azure global infrastructure to dynamically route customer traffic to the datacenter that has the best availability for the customer's inference requests.

Contributors

Microsoft maintains this article.

