Problem with custom extraction models and language

Question

Sky - Einar Brandsson 0

I'm using a custom extraction model in Document Intelligence to find certain elements in a PDF.

It works in the sense that it finds all the elements.

But the text is in a language that is not supported by custom models yet.

Is there a way to tell Document Intelligence to use a specific OCR engine that supports my language?
I feel that the content of the fields I'm looking for should not affect how the model finds them.

Setting the locale in AnalyzeDocumentAsync(WaitUntil.Completed, extractFromDocument.Model, documentContent, null, [LOCALE], null); does not work. The special characters are not recognized.

santoshkc 15,325 Reputation points Microsoft External Staff Moderator

2024-09-16T15:39:40+00:00

Hi @Sky - Einar Brandsson,

Thank you for reaching out to Microsoft Q&A forum!

Unfortunately, Azure Document Intelligence does not currently support using a different OCR engine. The OCR engine used by Document Intelligence is optimized for the languages it supports, and it is not possible to switch to a different OCR engine within this service.

To work around this issue, you could preprocess the PDF to extract the text using an OCR engine that supports your language. After extracting the text, you can then use this text as input to your custom model in Document Intelligence. There are various OCR engines available that support a wide range of languages, and you might find one that meets your needs.

Regarding the issue with special characters not being recognized when setting the locale in AnalyzeDocumentAsync, it's possible that the current OCR engine does not support the specific character set used by your language. Using a different OCR tool to preprocess the text might resolve this issue before feeding the data into your custom model.

Please look into: Language support: custom models.

I hope this helps!

Sky - Einar Brandsson 0 Reputation points

2024-09-16T20:58:13.5533333+00:00

@santoshkc Thanks for the reply.
I can get the text from these PDFs before I call my model.
Do you have documentation on how I can do this:
"after extracting the text, you can then use this text as input to your custom model in Document Intelligence"
I've only seen methods where I send in a PDF and Document Intelligence uses OCR to get the text.

santoshkc 15,325 Reputation points Microsoft External Staff Moderator

2024-09-17T12:16:13.82+00:00

Hi @Sky - Einar Brandsson,

Thank you for your follow up query.

To use extracted text as input for your custom model in Azure Document Intelligence, start by preprocessing the PDF with an OCR engine that supports your language. After extracting the text, structure it according to the format your custom model expects. While Document Intelligence typically relies on its own OCR, you can manually handle the text extraction beforehand. Once the text is ready, you can input it into your custom extraction model, ensuring that the model focuses on finding the elements you’ve trained it to recognize.

Please refer to: Build and train a custom extraction model.

I hope you understand. Thank you.

Sky - Einar Brandsson 0 Reputation points

2024-09-17T16:41:50.92+00:00

@santoshkc I'm sorry but I'm unable to understand how I would use the model.
"Once the text is ready, you can input it into your custom extraction model, ensuring that the model focuses on finding the elements you’ve trained it to recognize."

Could you elaborate in more detail?
Can you provide a code example using C# where the text is the input to the AnalyzeDocumentAsync function and not the PDF file?

Answer 1

Hi @Sky - Einar Brandsson,

Unfortunately, Azure Document Intelligence does not currently support passing raw text directly as input to a custom extraction model. It expects a document (e.g., a PDF or image) as input and handles the OCR internally. However, you can attempt a workaround: preprocess the text and inject it back into the document in a supported language, or with characters the service recognizes, before passing the document to Document Intelligence.
Here is sample C# code that analyzes a document from a URI with a custom model:

/*
  This code sample shows Custom Model operations with the Azure Form Recognizer client library. 

  To learn more, please visit the documentation - Quickstart: Document Intelligence (formerly Form Recognizer) SDKs
  https://learn.microsoft.com/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api?pivots=programming-language-csharp
*/

using System;
using System.Collections.Generic;
using Azure;
using Azure.AI.FormRecognizer.DocumentAnalysis;

/*
  Remember to remove the key from your code when you're done, and never post it publicly. For production, use
  secure methods to store and access your credentials. For more information, see 
  https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-security?tabs=command-line%2Ccsharp#environment-variables-and-application-configuration
*/

string endpoint = "<endpoint>";
string apiKey = "<apiKey>";
AzureKeyCredential credential = new AzureKeyCredential(apiKey);
DocumentAnalysisClient client = new DocumentAnalysisClient(new Uri(endpoint), credential);

string modelId = "<modelId>";
Uri fileUri = new Uri("<fileUri>");

AnalyzeDocumentOperation operation = await client.AnalyzeDocumentFromUriAsync(WaitUntil.Completed, modelId, fileUri);
AnalyzeResult result = operation.Value;

Console.WriteLine($"Document was analyzed with model with ID: {result.ModelId}");

foreach (AnalyzedDocument document in result.Documents)
{
    Console.WriteLine($"Document of type: {document.DocumentType}");

    foreach (KeyValuePair<string, DocumentField> fieldKvp in document.Fields)
    {
        string fieldName = fieldKvp.Key;
        DocumentField field = fieldKvp.Value;

        Console.WriteLine($"Field '{fieldName}': ");

        Console.WriteLine($"  Content: '{field.Content}'");
        Console.WriteLine($"  Confidence: '{field.Confidence}'");
    }
}

// Iterate over lines and selection marks on each page
foreach (DocumentPage page in result.Pages)
{
    Console.WriteLine($"Lines found on page {page.PageNumber}");
    foreach (var line in page.Lines)
    {
        Console.WriteLine($"  {line.Content}");
    }

    Console.WriteLine($"Selection marks found on page {page.PageNumber}");
    foreach (var selectionMark in page.SelectionMarks)
    {
        Console.WriteLine($"  Selection mark is '{selectionMark.State}' with confidence {selectionMark.Confidence}");
    }
}

// Iterate over the document tables
for (int i = 0; i < result.Tables.Count; i++)
{
    Console.WriteLine($"Table {i + 1}");
    foreach (var cell in result.Tables[i].Cells)
    {
        Console.WriteLine($"  Cell[{cell.RowIndex}][{cell.ColumnIndex}] has content '{cell.Content}' with kind '{cell.Kind}'");
    }
}
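
The sample above prints field.Content exactly as the service's OCR read it, so unsupported characters will still come out wrong. Since you said you can already get the correct text from the PDF yourself, one possible post-processing step is to look up each field's content in your own text and take the closest match. This is a different technique from the "inject the text back into the document" workaround, and it is not a Document Intelligence feature. Below is a minimal sketch using plain Levenshtein edit distance. The names FieldTextRepair and ClosestMatch are hypothetical, and it assumes the OCR only substituted characters (the match has the same length as the OCR output).

```csharp
using System;

public static class FieldTextRepair
{
    // Classic dynamic-programming Levenshtein edit distance between two strings.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // The row just computed becomes the previous row for the next iteration.
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }

    // Slides a window of ocrContent.Length over trustedText (text you extracted
    // yourself) and returns the window with the smallest edit distance.
    // Assumption: OCR errors are substitutions only, so lengths are equal.
    public static string ClosestMatch(string ocrContent, string trustedText)
    {
        if (string.IsNullOrEmpty(ocrContent) || string.IsNullOrEmpty(trustedText)
            || trustedText.Length < ocrContent.Length)
        {
            return ocrContent; // nothing to compare against, keep the OCR output
        }

        string best = ocrContent;
        int bestDistance = int.MaxValue;
        for (int start = 0; start + ocrContent.Length <= trustedText.Length; start++)
        {
            string window = trustedText.Substring(start, ocrContent.Length);
            int distance = Levenshtein(ocrContent, window);
            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = window;
            }
        }
        return best;
    }
}
```

Inside the field loop of the sample you could then call FieldTextRepair.ClosestMatch(field.Content, extractedText), where extractedText is a hypothetical variable holding the text you extracted from the same PDF. For example, ClosestMatch("Malmo", "City: Malmö") returns "Malmö". Very short fields can match the wrong place in a long text, so you may want to restrict the search to the relevant page or reject matches whose distance is too large.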

I hope you understand! Thank you.
