How to Batch Process Uploaded Invoices with a Prebuilt Model?

Hadi Abou-Ghaida 0 Reputation points
2024-08-08T23:25:02.3933333+00:00

A user may upload up to 100 invoices at a time in our software. Currently, we are going through them one at a time, which takes too long.

How do we batch upload to improve the time taken?

Batch upload is preferable if it can dramatically reduce the time taken. Alternatively, please advise whether invoices can be processed concurrently.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

4 answers

  1. Konstantinos Passadis 19,586 Reputation points MVP
    2024-08-08T23:45:38.9966667+00:00

    Hello

    The best way is to use the Document Intelligence API

    https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/document-intelligence/concept-invoice.md

    https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-batch-analysis?view=doc-intel-4.0.0

    The batch analysis API allows you to bulk process multiple documents using one asynchronous request. Rather than having to submit documents individually and track multiple request IDs, you can analyze a collection of invoices, a series of loan documents, or a group of custom model training documents simultaneously.

    • To utilize batch analysis, you need an Azure Blob storage account with specific containers for both your source documents and the processed outputs.
    • Upon completion, the batch operation result lists all of the individual documents processed with their status, such as succeeded, skipped, or failed.
    • The Batch API preview version is available via pay-as-you-go pricing.
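
As a rough sketch of what submitting such a batch might look like, the snippet below only builds the HTTP request and does not send it. The `:analyzeBatch` route, the `api-version` value, and the body field names (`azureBlobSource`, `resultContainerUrl`, `resultPrefix`, `overwriteExisting`) are assumptions based on the preview documentation linked above, so check them against that page before use. The endpoint, key, and container URLs are placeholders, and the `details` list passed to `summarize_statuses` is a hypothetical shape for the per-document results.

```python
import json
import urllib.request

# Assumed preview API version; check the batch analysis documentation for the current one.
API_VERSION = "2024-07-31-preview"


def build_batch_request(endpoint, key, model_id, source_container_url,
                        result_container_url, prefix=None):
    """Build (but do not send) a POST request for the assumed :analyzeBatch route."""
    url = (
        f"{endpoint.rstrip('/')}/documentintelligence/documentModels/"
        f"{model_id}:analyzeBatch?api-version={API_VERSION}"
    )
    source = {"containerUrl": source_container_url}
    if prefix:
        # Optionally restrict the batch to blobs under one folder of the source container
        source["prefix"] = prefix
    body = {
        "azureBlobSource": source,
        "resultContainerUrl": result_container_url,
        "resultPrefix": "results/",
        "overwriteExisting": True,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"},
        method="POST",
    )


def summarize_statuses(details):
    """Count per-document statuses (e.g. succeeded, skipped, failed)."""
    counts = {}
    for item in details:
        counts[item["status"]] = counts.get(item["status"], 0) + 1
    return counts


if __name__ == "__main__":
    request = build_batch_request(
        "https://<your-resource>.cognitiveservices.azure.com",  # placeholder endpoint
        "<your-key>",                                           # placeholder key
        "prebuilt-invoice",
        "https://<account>.blob.core.windows.net/invoices",     # placeholder source container
        "https://<account>.blob.core.windows.net/results",      # placeholder result container
    )
    print(request.get_method(), request.full_url)
    print(summarize_statuses([{"status": "succeeded"}, {"status": "failed"}]))
```

Passing the request to `urllib.request.urlopen` would submit the batch. Because the operation is asynchronous, the application would then poll for completion and read the outputs from the result container.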

    --

    I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards


  2. Konstantinos Passadis 19,586 Reputation points MVP
    2024-08-09T00:16:23.9533333+00:00

    Hello @Hadi Abou-Ghaida

    Thank you for your input

    Yes, using the batch analysis feature with the prebuilt Invoice model will reduce the time taken to process large numbers of invoices substantially. It achieves this through concurrent processing, leveraging Azure's cloud infrastructure to handle multiple documents at the same time, rather than sequentially. The exact time savings will depend on the concurrency level supported by the API and the configuration of your Azure resources.

    The Document Intelligence API's Batch Analysis feature is designed specifically to handle multiple documents in a single asynchronous request. This means that instead of processing each invoice sequentially (which, as you mentioned, can take 5-6 seconds per invoice), the batch processing allows for the simultaneous handling of multiple invoices. This can significantly reduce the overall processing time for large sets of documents.

    The batch analysis doesn't process the documents strictly one after the other (sequentially). Instead, it leverages Azure's capabilities to process documents in parallel, effectively reducing the time it takes to complete the batch as a whole. So, instead of taking up to an hour for hundreds of invoices, the time will be reduced depending on the number of documents that can be processed concurrently by the API.
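
For the concurrency alternative raised in the question, a minimal client-side sketch is shown here. It assumes each invoice is analyzed by an independent, blocking call; `analyze_one` is a hypothetical stand-in that sleeps instead of calling the service, and the `max_workers` value of 8 is arbitrary and would need tuning against the request limits of your resource. `estimated_seconds` is only back-of-the-envelope arithmetic that treats documents as processed in equal-length waves.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def analyze_one(invoice_url):
    """Hypothetical stand-in for one analysis call; sleeps instead of calling the service."""
    time.sleep(0.05)
    return {"url": invoice_url, "status": "succeeded"}


def analyze_concurrently(invoice_urls, max_workers=8):
    """Run analyze_one over all URLs on a thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_one, invoice_urls))


def estimated_seconds(n_docs, seconds_per_doc, concurrency):
    """Rough wall-clock estimate: documents are processed in waves of `concurrency`."""
    waves = -(-n_docs // concurrency)  # ceiling division
    return waves * seconds_per_doc


if __name__ == "__main__":
    urls = [f"https://example.com/invoice{i}.pdf" for i in range(20)]
    results = analyze_concurrently(urls)
    print(len(results), "invoices analyzed")
    print(estimated_seconds(100, 6, 1), "s sequential vs",
          estimated_seconds(100, 6, 10), "s with 10 in flight")
```

Threads suit this case because the work is dominated by waiting on network responses rather than by local computation. Under the wave model, 100 invoices at 6 seconds each drop from about 600 seconds to about 60 seconds with 10 requests in flight, provided the service accepts that many at once.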

    --

    I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards


  3. Konstantinos Passadis 19,586 Reputation points MVP
    2024-08-09T01:00:56.0233333+00:00

    Hello

    Have a look at the table here:

    https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-model-overview?view=doc-intel-4.0.0

    In Azure AI Document Intelligence, three of the prebuilt models are for general document analysis:

    • Read
    • General document
    • Layout

    So you can use the Read prebuilt model, especially if you have PDFs.

    Or you can train your own model, which in most cases is the better path!

    --

    I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards


  4. Konstantinos Passadis 19,586 Reputation points MVP
    2024-08-09T13:53:05.35+00:00

    Hello @Hadi Abou-Ghaida

    This is sample code altered to analyze a batch of invoices.

    The original code is: https://github.com/Azure-Samples/document-intelligence-code-samples/blob/main/Python(v4.0)/Prebuilt_model/sample_analyze_invoices.py

    import os

    from azure.core.credentials import AzureKeyCredential
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import AnalyzeResult, AnalyzeDocumentRequest


    def analyze_invoices(invoice_urls):
        endpoint = os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"]
        key = os.environ["DOCUMENTINTELLIGENCE_API_KEY"]

        document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

        # begin_analyze_document takes one document per call, not a list,
        # so start one analysis per URL and keep the returned pollers.
        # Every analysis is started before any result is awaited.
        pollers = [
            document_intelligence_client.begin_analyze_document(
                "prebuilt-invoice",
                AnalyzeDocumentRequest(url_source=url),
            )
            for url in invoice_urls
        ]

        # Collect the results in the order the invoices were submitted
        for idx, poller in enumerate(pollers):
            result: AnalyzeResult = poller.result()
            print(f"--------Analyzing invoice #{idx + 1}--------")
            for invoice in result.documents or []:
                if invoice.fields:
                    vendor_name = invoice.fields.get("VendorName")
                    if vendor_name:
                        print(f"Vendor Name: {vendor_name.get('content')} has confidence: {vendor_name.get('confidence')}")
                    # Repeat similar blocks for other fields as in your original code...
                    # (Omitted here for brevity)


    if __name__ == "__main__":
        from azure.core.exceptions import HttpResponseError
        from dotenv import find_dotenv, load_dotenv

        try:
            load_dotenv(find_dotenv())

            # List of URLs for the invoices you want to process in a batch
            invoice_urls = [
                "https://github.com/Azure-Samples/cognitive-services-REST-api-samples/raw/master/curl/form-recognizer/rest-api/invoice.pdf",
                "https://another-url-for-invoice.com/invoice2.pdf",
                # Add more URLs as needed
            ]

            analyze_invoices(invoice_urls)
        except HttpResponseError as error:
            # Check by error code when the service returned a structured error
            if error.error is not None:
                if error.error.code == "InvalidImage":
                    print(f"Received an invalid image error: {error.error}")
                if error.error.code == "InvalidRequest":
                    print(f"Received an invalid request error: {error.error}")
                raise
            # Otherwise fall back to inspecting the message text
            if "Invalid request".casefold() in error.message.casefold():
                print(f"Uh-oh! Seems there was an invalid request: {error}")
            raise

    My suggestion is to give it a try!

    --

    I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards
