Custom Classification Model Builder (java) unable to access training documents. Is the .ocr.json file required for each file in the container for training?

Michael Wei 0 Reputation points
2025-05-29T13:22:44.44+00:00

I am using the sample buildClassifier.java from GitHub and have uploaded my training documents to separate folders in an Azure storage container. Managed identity is enabled and has the Storage Blob Data Reader role, networking is set to public access, and I generated the SAS token and URL and verified the URL in a browser. Yet when I run the program, I get "Model Training Failure: TrainingContentMissing: Training data is missing: Could not find any training data at the given path". Is this because the files need to be in .ocr.json form rather than .pdf? If so, how do I do that?

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. Sina Salam 22,031 Reputation points Volunteer Moderator
    2025-05-29T16:59:08.05+00:00

    Hello Michael Wei,

    Welcome to Microsoft Q&A, and thank you for posting your question here.

    I understand that you are trying to train a custom classification model in Azure Document Intelligence using the Java SDK sample (buildClassifier.java) and are getting this error:

    "Model Training Failure: TrainingContentMissing: Training data is missing: Could not find any training data at the given path"

    Yes, the error arises because custom classification training with the Java SDK expects a .ocr.json file for each document. These files are not created for you during training, so you must do one of the following:

    • Use Document Intelligence Studio (formerly Form Recognizer Studio) to label and export the project, which generates the .ocr.json files.
    • Analyze your PDFs with the prebuilt layout model and save each result as a .ocr.json file.
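
As a rough sketch of the second option, the Java below posts one local PDF to the prebuilt layout model over REST and writes the finished response next to it as a .ocr.json file. It uses only the JDK's java.net.http client (Java 11+). The route, the 2023-07-31 API version, the Ocp-Apim-Subscription-Key header, and saving the whole response body unchanged are assumptions based on the v3.1 REST API and should be checked against your own resource. The class and method names are hypothetical, and the status check is a crude regex rather than a real JSON parser.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LayoutToOcrJson {

    // Assumption: v3.1 GA API version; adjust to whatever your resource supports.
    static final String API_VERSION = "2023-07-31";

    // Matches the first "status" field in the polling response.
    private static final Pattern STATUS = Pattern.compile("\"status\"\\s*:\\s*\"(\\w+)\"");

    // Builds the analyze URL for the prebuilt layout model (assumed v3.1 route).
    static String analyzeUrl(String endpoint) {
        String base = endpoint.endsWith("/")
                ? endpoint.substring(0, endpoint.length() - 1) : endpoint;
        return base + "/formrecognizer/documentModels/prebuilt-layout:analyze?api-version="
                + API_VERSION;
    }

    // Returns the first status value found in the JSON text, or "unknown".
    static String statusOf(String json) {
        Matcher m = STATUS.matcher(json);
        return m.find() ? m.group(1) : "unknown";
    }

    // Submits the PDF, polls the operation, and writes the final response to 'out'.
    static void run(String endpoint, String key, Path pdf, Path out)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest submit = HttpRequest.newBuilder(URI.create(analyzeUrl(endpoint)))
                .header("Ocp-Apim-Subscription-Key", key)
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(pdf))
                .build();
        HttpResponse<String> accepted = client.send(submit, HttpResponse.BodyHandlers.ofString());
        String operation = accepted.headers().firstValue("Operation-Location")
                .orElseThrow(() -> new IOException(
                        "No Operation-Location header; HTTP " + accepted.statusCode()));

        HttpRequest poll = HttpRequest.newBuilder(URI.create(operation))
                .header("Ocp-Apim-Subscription-Key", key)
                .GET()
                .build();
        for (int attempt = 0; attempt < 60; attempt++) {
            String body = client.send(poll, HttpResponse.BodyHandlers.ofString()).body();
            String status = statusOf(body);
            if (status.equals("succeeded")) {
                Files.writeString(out, body);
                return;
            }
            if (status.equals("failed") || status.equals("unknown")) {
                throw new IOException("Analysis " + status + ": " + body);
            }
            Thread.sleep(2000); // wait before polling again
        }
        throw new IOException("Timed out waiting for layout analysis");
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: LayoutToOcrJson <endpoint> <key> <file.pdf>");
            return;
        }
        Path pdf = Path.of(args[2]);
        String name = pdf.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String stem = dot > 0 ? name.substring(0, dot) : name;
        // Naming follows this answer: invoice1.pdf -> invoice1.ocr.json
        run(args[0], args[1], pdf, pdf.resolveSibling(stem + ".ocr.json"));
    }
}
```

After running it for each PDF, upload the resulting .ocr.json files into the same class folder as their source documents.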

    If your PDF is named invoice1.pdf, you need an invoice1.ocr.json file alongside it in the same class folder. Once these files are in place, the model should train correctly.

    To train a custom classifier with the Java SDK (as in buildClassifier.java), the container must be laid out like this:

      container/
      ├── ClassA/
      │   ├── file1.pdf
      │   └── file1.ocr.json
      └── ClassB/
          ├── file2.pdf
          └── file2.ocr.json
    
    • A .ocr.json file is required for each document.
    • A .labels.json file is not required unless you are labeling form fields.
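
Before uploading, the layout above can be checked on a local copy of the container. The following is a minimal sketch using only the JDK (Java 11+); the class and method names are hypothetical. It treats every non-.json file as a training document and assumes the naming shown in this answer (file1.pdf pairs with file1.ocr.json). If your exported project instead appends the suffix to the full file name (file1.pdf.ocr.json), pass true for appendToFullName.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OcrPairCheck {

    // Companion name for a document.
    // appendToFullName = false: file1.pdf -> file1.ocr.json (naming used in this answer)
    // appendToFullName = true:  file1.pdf -> file1.pdf.ocr.json
    static String ocrNameFor(String fileName, boolean appendToFullName) {
        if (appendToFullName) {
            return fileName + ".ocr.json";
        }
        int dot = fileName.lastIndexOf('.');
        String stem = dot > 0 ? fileName.substring(0, dot) : fileName;
        return stem + ".ocr.json";
    }

    // Walks every folder under root and returns the documents with no companion file.
    static List<Path> findUnpaired(Path root, boolean appendToFullName) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<Path> missing = new ArrayList<>();
        for (Path file : files) {
            String name = file.getFileName().toString();
            if (name.endsWith(".json")) {
                continue; // .ocr.json and .labels.json are companions, not documents
            }
            Path companion = file.resolveSibling(ocrNameFor(name, appendToFullName));
            if (!Files.exists(companion)) {
                missing.add(file);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        for (Path doc : findUnpaired(root, false)) {
            System.out.println("Missing .ocr.json for " + doc);
        }
    }
}
```

Point it at the folder that mirrors the container root; it prints nothing when every document has its pair.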

    See the documentation on training a classifier with labeled documents for more details.

    Beyond your script, double-check the container permissions and URL. This checklist may also help:

    • Folder names must match the class labels
    • Each .pdf must be paired with a .ocr.json file
    • The paired file names must match exactly
    • Use the latest Java SDK for Document Intelligence
    • Public access or a VNet must be configured with the proper identity
    • The SAS token must include read permission and must not be expired
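
The SAS item in the checklist can be checked offline by reading the token's query parameters. This is a minimal sketch with hypothetical class and method names, using only the JDK (Java 11+). It relies on the standard SAS fields sp (permissions) and se (expiry). It also flags a missing list permission, on the assumption that the service has to enumerate blobs under the folder prefix; treat that check as optional if your setup differs.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SasCheck {

    // Splits the query string of a SAS URL into decoded key/value pairs.
    static Map<String, String> queryParams(String sasUrl) {
        Map<String, String> params = new HashMap<>();
        String query = URI.create(sasUrl).getRawQuery();
        if (query == null) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue;
            }
            params.put(pair.substring(0, eq),
                    URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8));
        }
        return params;
    }

    // Returns a list of human-readable problems; an empty list means nothing was flagged.
    static List<String> problems(String sasUrl, Instant now) {
        List<String> out = new ArrayList<>();
        Map<String, String> params = queryParams(sasUrl);

        String sp = params.getOrDefault("sp", "");
        if (!sp.contains("r")) {
            out.add("sp lacks read (r)");
        }
        if (!sp.contains("l")) {
            out.add("sp lacks list (l)"); // assumption: listing is needed for prefix lookups
        }

        String se = params.get("se");
        if (se == null) {
            out.add("no se (expiry) parameter");
        } else {
            try {
                if (!Instant.parse(se).isAfter(now)) {
                    out.add("token expired at " + se);
                }
            } catch (DateTimeParseException e) {
                out.add("could not parse se: " + se);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: SasCheck <container SAS URL>");
            return;
        }
        problems(args[0], Instant.now()).forEach(System.out::println);
    }
}
```

A token that passes these checks can still fail for other reasons (wrong container, firewall rules), so this only rules out the most common SAS mistakes.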

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or need clarification.


    Please don't forget to close the thread by upvoting this answer and accepting it if it is helpful.

