Automate document identification, classification, and search by using Durable Functions

Azure Functions
Azure App Service
Azure AI services
Azure AI Search
Azure Kubernetes Service (AKS)

This article describes an architecture for processing document files that contain multiple documents of various types. It uses the Durable Functions extension of Azure Functions to implement the pipelines that process the files.

Architecture

Diagram of the architecture for identifying, classifying, and searching documents.

Workflow

  1. The user provides a document file that the web app uploads. The file contains multiple documents of various types. It can, for instance, be a PDF or multipage TIFF file.

    1. The document file is stored in Azure Blob Storage.
    2. The web app adds a command message to a storage queue to initiate pipeline processing.
  2. The command message triggers a Durable Functions orchestration. The message contains metadata that identifies the Blob Storage location of the document file to be processed. Each orchestration instance processes only one document file.

  3. The Scan activity function calls the Computer Vision Read API, passing in the location in storage of the document to be processed. Optical character recognition (OCR) results are returned to the orchestration to be used by subsequent activities.

  4. The Classify activity function calls the document classifier service that's hosted in an Azure Kubernetes Service (AKS) cluster. This service uses regular expression pattern matching to identify the starting page of each known document and to calculate how many document types are contained in the document file. The types and page ranges of the documents are calculated and returned to the orchestration.

    Note

    Azure doesn’t offer a service that can classify multiple document types in a single file. This solution uses a non-Azure service that's hosted in AKS.

  5. The Metadata Store activity function saves the document type and page range information in an Azure Cosmos DB store.

  6. The Indexing activity function creates a new search document in the Azure Cognitive Search service for each identified document type. It uses the Azure Cognitive Search libraries for .NET to include the full OCR results and document information in the search document. A correlation ID is also added to the search document so that the search results can be matched with the corresponding document metadata from Azure Cosmos DB.

  7. End users can search for documents by contents and metadata. Correlation IDs in the search result set can be used to look up document records that are in Azure Cosmos DB. The records include links to the original document file in Blob Storage.
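
The workflow steps above can be sketched as a single orchestrator that chains the four activities. This is a minimal illustration under stated assumptions, not the solution's actual code: the activity bodies, the in-memory `metadata_store` and `search_index`, the `blob_url` message field, the correlation ID format, and the `LocalContext` and `run_locally` helpers are all hypothetical stand-ins so that the example runs without any Azure resources.

```python
from typing import Any, Callable, Dict, Generator, List

# In-memory stand-ins (assumptions) for Azure Cosmos DB and the search index.
metadata_store: Dict[str, Dict[str, Any]] = {}
search_index: List[Dict[str, Any]] = []


def document_pipeline(context) -> Generator[Any, Any, Dict[str, Any]]:
    """Orchestrator sketch: Scan -> Classify -> MetadataStore -> Indexing."""
    blob_url = context.get_input()["blob_url"]  # location from the command message

    ocr_pages = yield context.call_activity("Scan", blob_url)
    documents = yield context.call_activity("Classify", ocr_pages)

    # Derive correlation IDs deterministically so that a replayed
    # orchestration produces the same values.
    for index, doc in enumerate(documents):
        doc["correlation_id"] = f"{context.instance_id}-{index}"
        doc["source_blob"] = blob_url

    yield context.call_activity("MetadataStore", documents)
    yield context.call_activity(
        "Indexing", {"documents": documents, "ocr_pages": ocr_pages}
    )
    return {"source_blob": blob_url, "document_count": len(documents)}


# Hypothetical activity bodies. Real ones would call the OCR service, the
# classifier service, Azure Cosmos DB, and the search service.
def scan(blob_url: str) -> List[str]:
    return ["INVOICE 42", "line items", "PURCHASE ORDER 7"]


def classify(ocr_pages: List[str]) -> List[Dict[str, Any]]:
    return [
        {"type": "invoice", "first_page": 1, "last_page": 2},
        {"type": "purchase_order", "first_page": 3, "last_page": 3},
    ]


def store_metadata(documents: List[Dict[str, Any]]) -> None:
    for doc in documents:
        metadata_store[doc["correlation_id"]] = doc


def index_documents(payload: Dict[str, Any]) -> None:
    pages = payload["ocr_pages"]
    for doc in payload["documents"]:
        text = " ".join(pages[doc["first_page"] - 1 : doc["last_page"]])
        search_index.append(
            {"correlation_id": doc["correlation_id"], "content": text}
        )


ACTIVITIES: Dict[str, Callable[[Any], Any]] = {
    "Scan": scan,
    "Classify": classify,
    "MetadataStore": store_metadata,
    "Indexing": index_documents,
}


class LocalContext:
    """Local stand-in for the orchestration context; runs activities inline."""

    def __init__(self, instance_id: str, payload: Dict[str, Any]) -> None:
        self.instance_id = instance_id
        self._payload = payload

    def get_input(self) -> Dict[str, Any]:
        return self._payload

    def call_activity(self, name: str, input_: Any = None) -> Any:
        return ACTIVITIES[name](input_)


def run_locally(orchestrator, context: LocalContext) -> Any:
    """Drive the generator, feeding each activity result back into it."""
    gen = orchestrator(context)
    try:
        result = next(gen)
        while True:
            result = gen.send(result)
    except StopIteration as stop:
        return stop.value
```

The orchestrator is written as a generator that yields each activity call, which mirrors the style of the Python programming model for Durable Functions. In a deployed function app, the Durable Functions runtime would schedule the activities and checkpoint their results instead of running them inline. After a run, a search hit's `correlation_id` can be used as the key into `metadata_store`, which corresponds to the lookup described in the last workflow step.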

Components

  • Durable Functions is an extension of Azure Functions that makes it possible for you to write stateful functions in a serverless compute environment. In this application, it's used for managing document ingestion and workflow orchestration. It lets you define stateful workflows by writing orchestrator functions that adhere to the Azure Functions programming model. Behind the scenes, the extension manages state, checkpoints, and restarts, leaving you free to focus on the business logic.
  • Azure Cosmos DB is a globally distributed, multi-model database that makes it possible for your solutions to scale throughput and storage capacity across any number of geographic regions. Comprehensive service level agreements (SLAs) guarantee throughput, latency, availability, and consistency.
  • Azure Storage is a set of massively scalable and secure cloud services for data, apps, and workloads. It includes Blob Storage, Azure Files, Azure Table Storage, and Azure Queue Storage.
  • Azure App Service provides a framework for building, deploying, and scaling web apps. The Web Apps feature is an HTTP-based service for hosting web applications, REST APIs, and mobile back ends. With Web Apps, you can develop in .NET, .NET Core, Java, Ruby, Node.js, PHP, or Python. Applications easily run and scale in Windows and Linux-based environments.
  • Azure Cognitive Services provides intelligent algorithms to see, hear, speak, understand, and interpret your user needs by using natural methods of communication.
  • Azure Cognitive Search provides a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications.
  • AKS is a highly available, secure, and fully managed Kubernetes service. AKS makes it easy to deploy and manage containerized applications.

Alternatives

Scenario details

This article describes an architecture that uses Durable Functions to implement automated pipelines for processing document files that contain multiple documents of various types. The pipelines identify the documents in a document file, classify them by type, and store information that can be used in subsequent processing.
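
The classification step in the workflow relies on regular expression pattern matching to find the starting page of each known document. The following is a minimal sketch of that idea, assuming the OCR output is available as one text string per page. The patterns, the type names, and the `classify_pages` and `count_document_types` functions are hypothetical and are not taken from the classifier service itself.

```python
import re
from typing import Dict, List, Optional, Pattern

# Hypothetical patterns: each one matches text that is expected to appear
# on the first page of a known document type.
START_PATTERNS: Dict[str, Pattern[str]] = {
    "invoice": re.compile(r"\bINVOICE\b", re.IGNORECASE),
    "purchase_order": re.compile(r"\bPURCHASE\s+ORDER\b", re.IGNORECASE),
}


def detect_start(page_text: str) -> Optional[str]:
    """Return the document type whose start pattern matches the page, if any."""
    for doc_type, pattern in START_PATTERNS.items():
        if pattern.search(page_text):
            return doc_type
    return None


def classify_pages(pages: List[str]) -> List[dict]:
    """Split per-page OCR text into typed documents with 1-based page ranges."""
    documents: List[dict] = []
    for page_number, text in enumerate(pages, start=1):
        doc_type = detect_start(text)
        if doc_type is not None:
            # A start pattern matched: open a new document on this page.
            documents.append(
                {"type": doc_type, "first_page": page_number, "last_page": page_number}
            )
        elif documents:
            # No match: treat the page as a continuation of the current document.
            documents[-1]["last_page"] = page_number
        # Pages that precede the first recognized start page are skipped.
    return documents


def count_document_types(documents: List[dict]) -> int:
    """Count the distinct document types found in the file."""
    return len({doc["type"] for doc in documents})
```

For example, a five-page file whose first page contains "Invoice" and whose third page contains "Purchase Order" would yield an invoice spanning pages 1 to 2 and a purchase order spanning pages 3 to 5. A production classifier would likely need more robust rules, such as handling pages that match more than one pattern.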

Many companies need to manage and process document files that contain documents that have been scanned in bulk and that can contain several different document types. Typically, the document files are PDFs or multipage TIFF images. These files usually originate from outside the organization, and the receiving company doesn't control the content.

Given these constraints, organizations have been forced to build their own document parsing solutions that can include custom technology and manual processes. A solution can include human intervention for splitting out individual document types into their own files and adding classification qualifiers for each document.

Many of these custom solutions are based on the state machine workflow pattern and use database systems for persisting workflow state, with polling services that check for the states that they're responsible for processing. Maintaining and enhancing such solutions can be difficult and time-consuming.

Organizations are looking for reliable, scalable, and resilient solutions for processing and managing document identification and classification for the types of files that they use. This includes processing millions of documents per day with full observability into the success or failure of the processing pipeline.

Potential use cases

This solution applies to many areas:

  • Title reporting. Many government agencies and municipalities manage paper records that haven't been migrated to digital form. An effective automated solution can generate a file that contains all the documents that are required to satisfy a document request.
  • Maintenance records. Aircraft, locomotive, and machinery maintenance records still exist in paper form and must be scanned and sent to outside organizations.
  • Permit processing. City and county permitting departments still maintain paper documents that are generated for permit inspection reporting. The ability to take a picture of several inspection documents and automatically identify, classify, and search across these records can be highly beneficial.

Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that can be used to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.

Reliability

Reliability ensures that your application can meet the commitments that you make to your customers. For more information, see Overview of the reliability pillar.

A reliable workload is one that's both resilient and available. Resiliency is the ability of the system to recover from failures and continue to function. The goal of resiliency is to return the application to a fully functioning state after a failure occurs. Availability is a measure of whether your users can access your workload when they need to.

For reliability information about solution components, see the following resources:

Cost optimization

Cost optimization is about reducing unnecessary expenses and improving operational efficiencies. For more information, see Overview of the cost optimization pillar.

The most significant costs for this architecture will potentially come from the storage of image files in the storage account, Cognitive Services image processing, and index capacity requirements in the Azure Cognitive Search service.

You can optimize costs by right-sizing the storage account with reserved capacity and lifecycle policies, by planning Azure Cognitive Search regional deployments and scheduling operational scale-up, and by using the commitment tier pricing that's available for the Computer Vision – OCR service to keep costs predictable.

Here are some guidelines for optimizing costs:

  • Use the pay-as-you-go strategy for your architecture and scale out as needed rather than investing in large-scale resources at the start.
  • Consider opportunity costs in your architecture, and the balance between first-mover advantage and fast follow. Use the pricing calculator to estimate the initial cost and operational costs.
  • Establish policies, budgets, and controls that set cost limits for your solution.

Performance efficiency

Performance efficiency is the ability of your workload to scale in an efficient manner to meet the demands that users place on it. For more information, see Performance efficiency pillar overview.

Periods when this solution processes high volumes can expose performance bottlenecks. Make sure that you understand and plan for the scaling options for Azure Functions, Cognitive Services autoscaling, and Azure Cosmos DB partitioning to ensure proper performance efficiency for your solution.

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal authors:

Next steps

Introductory articles:

Product documentation: