How to create ocr.json files programmatically to train a custom classification model?

Bogdan Pechounov 60 Reputation points
2024-03-05T21:27:29.35+00:00

Similar to this question, I want to create .ocr.json files programmatically to train a custom classifier, but there is no result.GetRawResponse method in the Python SDK. How can this be done in Python? I couldn't find the code in the samples.

import os
from dotenv import load_dotenv

load_dotenv()

endpoint = os.environ["AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"]
key = os.environ["AZURE_DOCUMENT_INTELLIGENCE_KEY"]


def analyse_document(path, model_id="prebuilt-layout"):
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.formrecognizer import DocumentAnalysisClient

    document_analysis_client = DocumentAnalysisClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )

    # Make sure your document's type is included in the list of document types the custom model can analyze
    with open(path, "rb") as f:
        poller = document_analysis_client.begin_analyze_document(
            model_id=model_id, document=f
        )
    result = poller.result()
    return result


import json

def save_as_json(result, input_file):
    output_file = f"{input_file}.ocr.json"
    with open(output_file, 'w', encoding='utf-8') as f:
        analyse_result = result.to_dict()
        j = {
            "status": "succeeded",
            "createdDateTime": "2024-02-24T16:53:24Z",
            "lastUpdatedDateTime": "2024-02-24T16:53:29Z",
            "analyseResult": analyse_result
        } # doesn't work
        json.dump(j, f, ensure_ascii=False, indent=2)


import os

for subdir, dirs, files in os.walk("data"):
    for file in files:
        #print os.path.join(subdir, file)
        filepath = subdir + os.sep + file

        if filepath.endswith(".pdf"):
            input_file = filepath
            print(input_file)
            result = analyse_document(input_file)
            save_as_json(result, input_file)
Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  1. dupammi 8,460 Reputation points Microsoft Vendor
    2024-03-07T13:41:47.5533333+00:00

    Hi @Bogdan Pechounov

    I'm glad that you were able to resolve your issue and thank you for posting your solution so that others experiencing the same thing can easily reference this!

    Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll repost your solution in case you'd like to accept the answer.

    The raw response from the REST API differs from the result object the SDK returns. To get it, there is a callback method; otherwise, we can call the REST API directly with the requests module.

    For the implementation details, please refer to the response from @Bogdan Pechounov.

    I hope this helps. Thank you.




1 additional answer

  1. Bogdan Pechounov 60 Reputation points
    2024-03-07T12:45:46.43+00:00

    Thank you for your time.

    The raw response from the REST API differs from the result object the SDK returns. To get it, there is a callback method; otherwise, we can call the REST API directly with the requests module:

    import time

    import requests


    def analyse_document(
        endpoint: str,
        key: str,
        file_path: str,
        model_id="prebuilt-layout",
        api_version="2023-07-31"
    ):
        url = f'{endpoint}/formrecognizer/documentModels/{model_id}:analyze?api-version={api_version}'
    
        headers = {
            'Ocp-Apim-Subscription-Key': key,
            'Content-Type': 'application/pdf'
        }
    
        with open(file_path, 'rb') as data:
            x = requests.post(
                url,
                headers=headers,
                data=data
            )
        # Fail early on HTTP errors (e.g. a wrong key or endpoint)
        x.raise_for_status()
        return x
    
    
    def get_analyse_result(
        endpoint: str,
        key: str,
        requestId: str,
        model_id="prebuilt-layout",
        api_version="2023-07-31"
    ):
        url = f"{endpoint}/formrecognizer/documentModels/{model_id}/analyzeResults/{requestId}?api-version={api_version}"
    
        headers = {
            'Ocp-Apim-Subscription-Key': key,
        }
        
        x = requests.get(
            url,
            headers=headers
        )
        return x
    
    
    def analyse_document2(
        file_path: str
    ):
        x = analyse_document(
            endpoint=endpoint,
            key=key,
            file_path=file_path
        )
        # The 202 response also carries an Operation-Location header with the full polling URL
        request_id = x.headers['apim-request-id']
        
        while True:
            analyse_result = get_analyse_result(
                endpoint=endpoint,
                key=key,
                requestId=request_id
            )
            # Keep polling while the operation is queued ("notStarted") or "running"
            status = analyse_result.json()["status"]
            if status not in ("notStarted", "running"):
                break
    
            time.sleep(1)
        return analyse_result
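
The code above stops at fetching the raw response. A minimal sketch of the remaining step, writing that response to an .ocr.json file, could look like the following. It assumes the body returned by the analyzeResults call (with its "status" and "analyzeResult" keys) is what the training files should contain, and it reuses the f"{input_file}.ocr.json" naming from the question. The helper name save_ocr_json is hypothetical.

```python
import json
from pathlib import Path


def save_ocr_json(raw_response: dict, input_file: str) -> Path:
    """Write the raw analyze response next to the document as <input_file>.ocr.json.

    raw_response is assumed to be the parsed JSON body of the REST call,
    e.g. analyse_document2(path).json().
    """
    status = raw_response.get("status")
    if status != "succeeded":
        # Do not write a file for failed or unfinished analyses
        raise ValueError(f"analysis did not succeed (status={status!r})")

    output_file = Path(f"{input_file}.ocr.json")
    with output_file.open("w", encoding="utf-8") as f:
        # Dump the response unchanged, so the keys keep the REST API spelling
        json.dump(raw_response, f, ensure_ascii=False, indent=2)
    return output_file
```

For example, save_ocr_json(analyse_document2("data/a.pdf").json(), "data/a.pdf") would write data/a.pdf.ocr.json. Dumping the response as-is avoids rebuilding the envelope by hand, which is where the attempt in the question differed from the REST output (it used the key "analyseResult" and the SDK's to_dict() output).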
    
