How to create ocr.json files programmatically to train a custom classification model?

Question

Bogdan Pechounov 65 Reputation points

Similar to this question, I want to generate .ocr.json files programmatically to train a custom classifier, but the Python SDK has no equivalent of the result.GetRawResponse method. How can this be done in Python? I couldn't find any code for it in the samples.

import os
from dotenv import load_dotenv

load_dotenv()

endpoint = os.environ["AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"]
key = os.environ["AZURE_DOCUMENT_INTELLIGENCE_KEY"]


def analyse_document(path, model_id="prebuilt-layout"):
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.formrecognizer import DocumentAnalysisClient

    document_analysis_client = DocumentAnalysisClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )

    # Make sure your document's type is included in the list of document types the custom model can analyze
    with open(path, "rb") as f:
        poller = document_analysis_client.begin_analyze_document(
            model_id=model_id, document=f
        )
    result = poller.result()
    return result


import json

def save_as_json(result, input_file):
    output_file = f"{input_file}.ocr.json"
    with open(output_file, 'w', encoding='utf-8') as f:
        analyse_result = result.to_dict()
        j = {
            "status": "succeeded",
            "createdDateTime": "2024-02-24T16:53:24Z",
            "lastUpdatedDateTime": "2024-02-24T16:53:29Z",
            "analyseResult": analyse_result
        } # doesn't work
        json.dump(j, f, ensure_ascii=False, indent=2)


import os

for subdir, dirs, files in os.walk("data"):
    for file in files:
        #print os.path.join(subdir, file)
        filepath = subdir + os.sep + file

        if filepath.endswith(".pdf"):
            input_file = filepath
            print(input_file)
            result = analyse_document(input_file)
            save_as_json(result, input_file)

dupammi 8,615 Reputation points Microsoft External Staff

2024-03-06T13:16:45.7133333+00:00

Hi @Bogdan Pechounov

Thank you for using the Microsoft Q&A forum.

Regarding the use of the GetRawResponse method, it is no longer available in the latest version of the Azure Python SDK.

Instead, as shown in your code snippet, you can use the to_dict() method to obtain the OCR JSON response as a dictionary object. You can then use the json.dumps() method to convert the dictionary object to a JSON string.

I hope you understand. Thank you.

romungi-MSFT 48,906 Reputation points Microsoft Employee Moderator

2024-03-06T13:38:50.81+00:00

@Bogdan Pechounov I would also like to add that training a classifier model through the SDK or API requires the layout model response, as stated in the documentation:

The layout results should be in the format of the API response when calling layout directly. The SDK object model is different, make sure that the layout results are the API results and not the SDK response.
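One way to tell the two formats apart is sketched below. It assumes that the REST response wraps the result in an envelope with the keys status, createdDateTime, lastUpdatedDateTime and analyzeResult, with camelCase keys inside the result, while the SDK's to_dict() output uses snake_case keys. The helper name looks_like_rest_layout and the sample dictionaries are hypothetical and only illustrate the check.

```python
# Top-level keys of the REST "get analyze result" response envelope.
REST_ENVELOPE_KEYS = {"status", "createdDateTime", "lastUpdatedDateTime", "analyzeResult"}


def looks_like_rest_layout(payload: dict) -> bool:
    """Return True if payload has the REST envelope and camelCase result keys."""
    if not REST_ENVELOPE_KEYS.issubset(payload):
        return False
    result = payload["analyzeResult"]
    # REST uses camelCase (apiVersion, modelId); the SDK's to_dict() output
    # uses snake_case (api_version, model_id), so it fails this check.
    return isinstance(result, dict) and "apiVersion" in result and "modelId" in result


# Hypothetical, heavily trimmed payloads for illustration only.
rest_style = {
    "status": "succeeded",
    "createdDateTime": "2024-02-24T16:53:24Z",
    "lastUpdatedDateTime": "2024-02-24T16:53:29Z",
    "analyzeResult": {"apiVersion": "2023-07-31", "modelId": "prebuilt-layout", "content": "", "pages": []},
}
sdk_style = {
    **rest_style,
    "analyzeResult": {"api_version": "2023-07-31", "model_id": "prebuilt-layout", "content": "", "pages": []},
}

print(looks_like_rest_layout(rest_style))
print(looks_like_rest_layout(sdk_style))
```

A check like this only inspects a few keys, so it can flag an SDK-shaped file early but does not prove that a file is valid training input.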

Bogdan Pechounov 65 Reputation points

2024-03-06T13:51:24.5+00:00

@romungi-MSFT I see, thank you. Is there any sample code using the REST API (without the SDK)?

dupammi 8,615 Reputation points Microsoft External Staff

2024-03-07T10:09:39.3166667+00:00

Hi @Bogdan Pechounov,

Below is the sample code I tried, using the REST API of the prebuilt layout model.

Please use it as a reference and adapt it to your other use cases.

For more details, please refer to Document Classifiers - Classify Document.

I hope you understand. Thank you.

Answer 1

Hi @Bogdan Pechounov

I'm glad that you were able to resolve your issue and thank you for posting your solution so that others experiencing the same thing can easily reference this!

Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll repost your solution in case you'd like to accept the answer.

To get the raw_response from the REST API (which is different from the SDK), there is a callback method. Otherwise, we can use the requests module.

For the implementation details, please refer to the response from @Bogdan Pechounov.

I hope this helps. Thank you.

Please do not forget to click "Accept Answer" and "Yes" for "Was this answer helpful" wherever the information provided helps you. This can be beneficial to other community members.

Answer 2

Thank you for your time.

To get the raw_response from the REST API (which is different from the SDK), there is a callback method. Otherwise, we can use the requests module.

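import os
import time

import requests

# Same environment variables as in the question.
endpoint = os.environ["AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"].rstrip("/")
key = os.environ["AZURE_DOCUMENT_INTELLIGENCE_KEY"]


def analyse_document(
    endpoint: str,
    key: str,
    file_path: str,
    model_id: str = "prebuilt-layout",
    api_version: str = "2023-07-31"
):
    """Submit a PDF for analysis; the service processes it asynchronously."""
    url = f"{endpoint}/formrecognizer/documentModels/{model_id}:analyze?api-version={api_version}"

    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/pdf"
    }

    with open(file_path, "rb") as data:
        response = requests.post(url, headers=headers, data=data)
    response.raise_for_status()  # fail early on a bad key, endpoint or file
    return response


def get_analyse_result(
    endpoint: str,
    key: str,
    requestId: str,
    model_id: str = "prebuilt-layout",
    api_version: str = "2023-07-31"
):
    """Fetch the current state of an analysis as the raw REST response."""
    url = f"{endpoint}/formrecognizer/documentModels/{model_id}/analyzeResults/{requestId}?api-version={api_version}"

    headers = {
        "Ocp-Apim-Subscription-Key": key,
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return response


def analyse_document2(
    file_path: str
):
    """Submit a document and poll until the analysis reaches a final status."""
    response = analyse_document(
        endpoint=endpoint,
        key=key,
        file_path=file_path
    )
    # The request ID doubles as the result ID here; the Operation-Location
    # response header also carries the full URL of the result.
    request_id = response.headers["apim-request-id"]

    while True:
        analyse_result = get_analyse_result(
            endpoint=endpoint,
            key=key,
            requestId=request_id
        )
        # Keep polling while the operation is queued ("notStarted") or "running".
        status = analyse_result.json()["status"]
        if status not in ("notStarted", "running"):
            break

        time.sleep(1)
    return analyse_result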
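
The snippet above returns the raw response but does not write it to disk. A minimal sketch of the remaining step follows, assuming the `<input file>.ocr.json` naming from the question and that the whole REST payload (envelope included) is what should be saved. The helpers poll_until_done and save_ocr_json are hypothetical names, and the timeout is an added safeguard that the original loop does not have.

```python
import json
import time
from typing import Callable


def poll_until_done(fetch: Callable[[], dict], interval: float = 1.0, timeout: float = 120.0) -> dict:
    """Call fetch() until the status is final, or raise once the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch()
        # "notStarted" and "running" mean the service is still working.
        if payload.get("status") not in ("notStarted", "running"):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("analysis did not finish in time")
        time.sleep(interval)


def save_ocr_json(payload: dict, input_file: str) -> str:
    """Write a succeeded REST payload next to the input file and return the path."""
    if payload.get("status") != "succeeded":
        raise ValueError(f"analysis ended with status {payload.get('status')!r}")
    output_file = f"{input_file}.ocr.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)
    return output_file
```

With the functions from this answer, fetch could be something like `lambda: get_analyse_result(endpoint=endpoint, key=key, requestId=request_id).json()`, and the returned payload can then be passed to save_ocr_json together with the PDF path.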
