How to read data from a local pdf using Document intelligence studio (layout)

Question


Aditya Kommu 0

Hello All,

I am trying to read data from a PDF. The code below uses formurl, but I want to read the data from a local file. How can I do that? [Screenshot 2024-04-28 at 10.26.39 PM]

dupammi 8,615 Microsoft External Staff

Hi @Aditya Kommu

Thank you for reaching out to us with your query about reading the data from a local pdf using Document intelligence studio (layout). I'd be happy to help you with that.

Here is example code that uses the Azure AI Document Intelligence SDK (formerly Form Recognizer) to analyze a document from a local path.

from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
import base64

# Set the endpoint and key
endpoint = "YOUR_ENDPOINT"
key = "YOUR_KEY"


def analyze_layout_local_file(file_path):
    # Read the local file as bytes and base64-encode it for the request body
    with open(file_path, "rb") as f:
        base64_encoded_pdf = base64.b64encode(f.read()).decode("utf-8")

    analyze_request = {
        "base64Source": base64_encoded_pdf
    }

    document_intelligence_client = DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )

    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout", analyze_request=analyze_request
    )

    result = poller.result()

    if result.styles and any([style.is_handwritten for style in result.styles]):
        print("Document contains handwritten content")
    else:
        print("Document does not contain handwritten content")

    for page in result.pages:
        print(f"----Analyzing layout from page #{page.page_number}----")
        print(
            f"Page has width: {page.width} and height: {page.height}, measured with unit: {page.unit}"
        )

        if page.lines:
            for line in page.lines:
                print(f"Content: {line.content}")  # Print content of the line

# Call the function to analyze the layout of the locally downloaded file
analyze_layout_local_file("YOUR_FULL_LOCAL_PATH_TO_PDF_FILE")

The sample code above uses Azure AI Document Intelligence to analyze the layout of a local PDF file and print the layout information. The analyze_layout_local_file function creates a DocumentIntelligenceClient (from the azure.ai.documentintelligence module), calls its begin_analyze_document method to start the analysis, and then calls result() on the returned poller to wait for and retrieve the analysis results.

Once the analysis is complete, the function iterates over the pages in the PDF file and prints the layout information for each page. Specifically, it prints the page number, width, height, and unit of measurement for each page, as well as the content of each line in the PDF file.


I hope the provided information helps in the debugging of your code. Thank you.

Aditya Kommu 0 Reputation points

2024-04-29T14:39:01.5733333+00:00

Thank you, but this is not extracting the data in a meaningful manner. I mean:
1) tables should be mapped correctly to keys and values

2) any signatures should be returned

Any help along those lines?
dupammi 8,615 Reputation points Microsoft External Staff

2024-04-29T16:09:03.1966667+00:00

Hi @Aditya Kommu

Thank you for the update. I understand that your initial query regarding analyzing local files has been resolved, and you are now interested in analyzing documents that contain tables, signatures, and other elements.

To detect signatures in documents, you need to train a Custom template model. Please check that the model you are using is a Custom template model.

I hope this helps.

Thank you.
Aditya Kommu 0 Reputation points

2024-04-30T00:56:49.1466667+00:00

I am actually looking for table extraction in a proper manner. When extracting the current page, all data is displayed line by line. I need a proper mapping for tables.
dupammi 8,615 Reputation points Microsoft External Staff

2024-04-30T01:18:51.7633333+00:00

Hi @Aditya Kommu

Thank you for clarifying your requirements further. It appears there's a shift in focus from simply reading data from local PDFs to more advanced extraction, particularly tables, with proper mapping and handling of signatures.

To achieve table extraction in a structured manner, you'll likely need a more customized approach. As suggested earlier, the Custom template model in Document Intelligence Studio could be the way forward for both table extraction and signature handling. By training the model with relevant examples, you can teach it to accurately identify and extract tabular data and handle signatures.

I hope you understand. Thank you.
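
As a side note on the table question: the result returned by the prebuilt-layout model in the earlier sample also has a tables collection, where each table lists its cells with row_index, column_index and content. The sketch below shows one way such cells could be arranged into rows and then mapped to keys and values. The helper names table_to_grid and grid_to_records are hypothetical, the sample table is a stand-in built with SimpleNamespace so the code runs without a service call, and treating the first row as the header is an assumption that will not hold for every document.

```python
from types import SimpleNamespace


def table_to_grid(table):
    """Arrange a table's cells into a row-by-column grid of strings."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


def grid_to_records(grid):
    """Assume the first row holds headers and map each later row to a dict."""
    if not grid:
        return []
    headers = grid[0]
    return [dict(zip(headers, row)) for row in grid[1:]]


# Stand-in for one entry of result.tables (hypothetical data, not real service output).
sample_table = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="Item"),
        SimpleNamespace(row_index=0, column_index=1, content="Price"),
        SimpleNamespace(row_index=1, column_index=0, content="Pen"),
        SimpleNamespace(row_index=1, column_index=1, content="2.50"),
    ],
)

records = grid_to_records(table_to_grid(sample_table))
print(records)
```

With a real analysis result, the same helpers could be applied inside a loop such as "for table in result.tables or []". Merged cells (row or column spans) are not handled in this sketch; a spanning cell only fills its top-left position.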
Campbell, Tehron 10 Reputation points

2024-06-24T12:56:07.17+00:00

@dupammi Are you able to show this solution for C#? Thank you
Carlos Jimenez Uribe-echeverría 0 Reputation points

2024-06-25T22:30:19.2433333+00:00

@dupammi thanks for the above explanation and accompanying code. It's all clear, but I was left with one question: is there any reason why you base64-encoded the PDF bytes in the line base64.b64encode(f.read()).decode("utf-8")? I ran a test with a sample PDF of mine, and the results (result.content) were identical when processing the PDF without that encoding, like this:

from azure.ai.documentintelligence.models import AnalyzeDocumentRequest

with open(file_path, "rb") as f:
    source_bytes = f.read()

poller = di_client.begin_analyze_document(
    model_id="prebuilt-layout",
    analyze_request=AnalyzeDocumentRequest(bytes_source=source_bytes),
)

But, of course, I may be missing something. Thanks in advance.
dupammi 8,615 Reputation points Microsoft External Staff

2024-06-26T01:59:20.36+00:00

@Carlos Jimenez Uribe-echeverría

You are correct that the base64 encoding is not strictly necessary for the Document Intelligence API to process the PDF. The approach you mentioned, where the PDF is read directly as bytes and passed to the begin_analyze_document method, is indeed valid and might even be simpler and more efficient. Thank you.
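
One way to see why both approaches gave identical results: a base64 string is only a text representation of the same bytes, so decoding it yields the original file content again. The sketch below illustrates this with stand-in bytes instead of a real PDF. How the SDK serializes bytes_source internally is not confirmed in this thread, so treat that part as an assumption.

```python
import base64

# Stand-in for the bytes returned by f.read() on a real PDF (hypothetical content).
pdf_bytes = b"%PDF-1.7 sample bytes"

# The step used in the original answer: bytes -> base64 text.
encoded = base64.b64encode(pdf_bytes).decode("utf-8")

# Decoding the text recovers exactly the same bytes.
decoded = base64.b64decode(encoded)

print(decoded == pdf_bytes)          # same document either way
print(len(pdf_bytes), len(encoded))  # the encoded form is longer than the raw bytes
```

The encoded text is roughly one third larger than the raw bytes, which is one reason passing the bytes directly can be the simpler choice.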
priyanka k 0 Reputation points

2025-02-18T05:34:18.3666667+00:00

how to declare file path
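
On declaring the file path: the sample's function takes the path as a plain string, so any path that Python's open() accepts should work. A minimal sketch using pathlib follows. The helper name resolve_pdf_path and the example paths in the comments are hypothetical and only there for illustration.

```python
from pathlib import Path


def resolve_pdf_path(path_str):
    """Return an absolute Path to an existing file, or raise FileNotFoundError."""
    path = Path(path_str).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"No file found at {path}")
    return path


# Example declarations (hypothetical paths, adjust to your machine):
#   Windows: file_path = resolve_pdf_path(r"C:\Users\me\Documents\sample.pdf")
#   macOS/Linux: file_path = resolve_pdf_path("~/Documents/sample.pdf")
```

On Windows, the raw-string prefix r"..." keeps backslashes from being read as escape sequences. The returned Path can be passed to open(file_path, "rb") as in the sample above.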

Answer 1

Ram 20

@dupammi Can you post this solution for C#? Thank you
