How to read data from a local pdf using Document intelligence studio (layout)
Hello All,
I am trying to read data from a PDF. The code I am working from uses formUrl, but I want to read the data from a local file. How do I do that?
Azure AI Document Intelligence
-
dupammi • 8,540 Reputation points • Microsoft Vendor
2024-04-29T07:01:14.99+00:00 Thank you for reaching out to us with your query about reading the data from a local pdf using Document intelligence studio (layout). I'd be happy to help you with that.
Here is example code that uses the Azure AI Document Intelligence SDK (formerly Form Recognizer) to analyze a document from a local path.
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
import base64

# Set the endpoint and key
endpoint = "YOUR_ENDPOINT"
key = "YOUR_KEY"


def analyze_layout_local_file(file_path):
    # Read the local PDF and base64-encode its bytes
    with open(file_path, "rb") as f:
        base64_encoded_pdf = base64.b64encode(f.read()).decode("utf-8")

    analyze_request = {"base64Source": base64_encoded_pdf}

    document_intelligence_client = DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout", analyze_request=analyze_request
    )
    result = poller.result()

    if result.styles and any([style.is_handwritten for style in result.styles]):
        print("Document contains handwritten content")
    else:
        print("Document does not contain handwritten content")

    for page in result.pages:
        print(f"----Analyzing layout from page #{page.page_number}----")
        print(
            f"Page has width: {page.width} and height: {page.height}, measured with unit: {page.unit}"
        )
        if page.lines:
            for line in page.lines:
                print(f"Content: {line.content}")  # Print content of the line


# Call the function to analyze the layout of the locally downloaded file
analyze_layout_local_file("YOUR_FULL_LOCAL_PATH_TO_PDF_FILE")
The sample code above uses Azure AI Document Intelligence to analyze the layout of a local PDF file and print the layout information. The analyze_layout_local_file function uses the DocumentIntelligenceClient class from the azure.ai.documentintelligence module. The begin_analyze_document method of DocumentIntelligenceClient starts the analysis, and the result method of the returned poller (an LROPoller, since this is the synchronous client) retrieves the analysis results.
Once the analysis is complete, the function reports whether the document contains handwritten content, then iterates over the pages and prints each page's number, width, height, and unit of measurement, followed by the content of each line on that page.
I hope the provided information helps in the debugging of your code. Thank you.
-
Aditya Kommu • 0 Reputation points
2024-04-29T14:39:01.5733333+00:00 Thank you, but this is not extracting the data in a meaningful manner. I mean:
1) tables should be mapped correctly to keys and values
2) any signatures should be reported
Any help along those lines?
-
dupammi • 8,540 Reputation points • Microsoft Vendor
2024-04-29T16:09:03.1966667+00:00 Thank you for the update. I understand that your initial query regarding analyzing local files has been resolved, and you are now interested in analyzing documents that contain tables, signatures, and other elements.
To detect signatures in documents, you need to train a custom template model. Please check that the model you are using is a custom template model.
I hope this helps.
Thank you.
-
Aditya Kommu • 0 Reputation points
2024-04-30T00:56:49.1466667+00:00 I am actually looking for proper table extraction. When extracting the current page, all data is displayed line by line. I need a proper mapping for tables.
-
dupammi • 8,540 Reputation points • Microsoft Vendor
2024-04-30T01:18:51.7633333+00:00 Thank you for clarifying your requirements further. It appears there's a shift in focus from simply reading data from local PDFs to more advanced extraction, particularly tables, with proper mapping and handling of signatures.
To achieve table extraction in a structured manner, you'll likely need a more customized approach. As suggested earlier, using the Custom template model in Document Intelligence Studio could be the way forward for both table extraction and signature handling. By training the model with relevant examples, you can teach it to accurately identify and extract tabular data and handle signatures.
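Before training a custom model, it may be worth checking the tables collection that the prebuilt-layout result already returns: each table carries row_count, column_count, and a list of cells with row_index, column_index, and content. The sketch below shows one possible way to turn such a table into rows and key/value records. The helper names table_to_grid and grid_to_records are hypothetical, the sample table is a stand-in built with SimpleNamespace rather than real service output, and treating the first row as the header is an assumption that will not hold for every document.

```python
from types import SimpleNamespace


def table_to_grid(table):
    """Place each cell's content at its (row_index, column_index) position."""
    grid = [[""] * table.column_count for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


def grid_to_records(grid):
    """Map each data row to a dict keyed by the first row (assumed to be headers)."""
    if not grid:
        return []
    headers, *rows = grid
    return [dict(zip(headers, row)) for row in rows]


# Stand-in for one entry of result.tables (not real service output).
sample_table = SimpleNamespace(
    row_count=3,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="Item"),
        SimpleNamespace(row_index=0, column_index=1, content="Qty"),
        SimpleNamespace(row_index=1, column_index=0, content="Pen"),
        SimpleNamespace(row_index=1, column_index=1, content="3"),
        SimpleNamespace(row_index=2, column_index=0, content="Ink"),
        SimpleNamespace(row_index=2, column_index=1, content="1"),
    ],
)

grid = table_to_grid(sample_table)
records = grid_to_records(grid)
for record in records:
    print(record)

# With a real analysis result, the same helpers would be applied per table:
# for table in result.tables or []:
#     print(grid_to_records(table_to_grid(table)))
```

This sketch ignores merged cells (row and column spans), so tables with spanning headers would need extra handling.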
I hope you understand. Thank you.
-
Campbell, Tehron • 5 Reputation points
2024-06-24T12:56:07.17+00:00 @dupammi Are you able to show this solution for C#? Thank you
-
Carlos Jimenez Uribe • 0 Reputation points
2024-06-25T22:30:19.2433333+00:00 @dupammi thanks for the above explanation and accompanying code. It's all clear, but I was left with one doubt: is there any reason why you base64-encoded the PDF bytes in the line base64.b64encode(f.read()).decode("utf-8")? I ran a test with a sample PDF of mine, and the results (result.content) were identical when processing the PDF without that encoding, like this:
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest

# di_client is a DocumentIntelligenceClient; file_path is the local PDF path
with open(file_path, "rb") as f:
    source_bytes = f.read()

poller = di_client.begin_analyze_document(
    model_id="prebuilt-layout",
    analyze_request=AnalyzeDocumentRequest(bytes_source=source_bytes)
)
But, of course, I may be missing something. Thanks in advance.
-
dupammi • 8,540 Reputation points • Microsoft Vendor
2024-06-26T01:59:20.36+00:00 You are correct that the base64 encoding is not strictly necessary for the Document Intelligence API to process the PDF. The approach you mentioned, where the PDF is read directly as bytes and passed to the begin_analyze_document method, is indeed valid and might even be simpler and more efficient. Thank you.
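As a small illustration of why the two approaches can give identical results: base64 is a lossless text encoding of the same bytes, so decoding it recovers the original file content exactly. The sketch below uses stand-in bytes rather than a real PDF, and it assumes (rather than demonstrates) that the bytes end up base64-encoded somewhere on the way to the service in both cases.

```python
import base64

# Stand-in bytes; in practice this would come from open(path, "rb").read().
pdf_bytes = b"%PDF-1.7\n% stand-in content, not a real PDF\n"

# Manual encoding, as in the first sample in this thread.
encoded = base64.b64encode(pdf_bytes).decode("utf-8")

# Decoding recovers exactly the original bytes.
decoded = base64.b64decode(encoded)

print(decoded == pdf_bytes)
print(len(pdf_bytes), len(encoded))
```

The encoded string is longer than the raw bytes, which is one reason passing the bytes directly can be the simpler choice.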