Hello Michael Wei,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are trying to train a custom classification model in Azure Document Intelligence using the Java SDK sample (buildClassifier.java) and you are getting this error:
"Model Training Failure: TrainingContentMissing: Could not find any training data at the given path."
The error arises because the Java SDK requires .ocr.json files for each document during custom classification training. These files are not generated automatically; you must either:
- Use Form Recognizer Studio to label and export the project (this auto-generates the .ocr.json files).
- Or use the prebuilt layout model to analyze your PDFs and save the results manually as .ocr.json files.
If your .pdf file is named invoice1.pdf, you need an invoice1.ocr.json alongside it in the class-labeled folder. Once you add these, the model should train correctly.
To clarify the expected file layout: to train a custom classifier with the Java SDK (as in buildClassifier.java), the folder structure must be:
container/
├── ClassA/
│ ├── file1.pdf
│ ├── file1.ocr.json
├── ClassB/
│ ├── file2.pdf
│ ├── file2.ocr.json
- .ocr.json is required for each file.
- .labels.json is not required unless you are doing form labeling.
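Before uploading, it can help to confirm that every PDF in a class folder has its matching OCR file. The following is a minimal sketch, not part of the SDK sample: the class and method names (TrainingLayoutCheck, expectedOcrName, missingOcrFiles) are hypothetical, and the naming rule (invoice1.pdf pairs with invoice1.ocr.json) is taken from this answer. If your exported project names the files differently, for example invoice1.pdf.ocr.json, adjust expectedOcrName accordingly.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TrainingLayoutCheck {

    // Naming rule assumed from this answer: invoice1.pdf -> invoice1.ocr.json.
    public static String expectedOcrName(String pdfName) {
        int dot = pdfName.lastIndexOf('.');
        String stem = dot > 0 ? pdfName.substring(0, dot) : pdfName;
        return stem + ".ocr.json";
    }

    // Returns the PDFs from one class folder that have no matching .ocr.json file.
    // The comparison is case-sensitive because file names must match exactly.
    public static List<String> missingOcrFiles(List<String> fileNames) {
        Set<String> present = new HashSet<>(fileNames);
        List<String> missing = new ArrayList<>();
        for (String name : fileNames) {
            boolean isPdf = name.toLowerCase(Locale.ROOT).endsWith(".pdf");
            if (isPdf && !present.contains(expectedOcrName(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // File names as they would appear inside one class folder, e.g. ClassA/.
        List<String> classA = List.of("file1.pdf", "file1.ocr.json", "file3.pdf");
        System.out.println(missingOcrFiles(classA)); // prints [file3.pdf]
    }
}
```

Run it against the file listing of each class folder; any PDF it reports would be skipped or cause the TrainingContentMissing error if no valid pairs remain.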
See the documentation on training a classifier with labeled documents for more details.
Aside from your script, double-check the container permissions and URL. The checklist below might also be useful:
- Folder names must match the class labels
- Each document must be a pair: .pdf + .ocr.json
- File names must match exactly
- Use the latest Java SDK for Document Intelligence
- Public access or a VNet must be configured with the proper identity
- The SAS must have read permission and must not be expired
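The SAS item in the checklist can be checked locally by reading the token's query parameters: sp holds the permission letters and se holds the expiry time. The sketch below is an illustration only; SasUrlCheck and its methods are hypothetical names, the URL is a placeholder, and it assumes se is a full date-time with an offset (such as 2030-01-01T00:00:00Z). The list permission (l) is checked as well on the assumption that the service needs to enumerate the blobs in the container.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.OffsetDateTime;
import java.util.HashMap;
import java.util.Map;

class SasUrlCheck {

    // Splits the query string of a SAS URL into decoded key/value pairs.
    public static Map<String, String> queryParams(String sasUrl) {
        Map<String, String> params = new HashMap<>();
        String query = URI.create(sasUrl).getRawQuery();
        if (query == null) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue;
            }
            String key = URLDecoder.decode(pair.substring(0, eq), StandardCharsets.UTF_8);
            String value = URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            params.put(key, value);
        }
        return params;
    }

    // True if the sp parameter contains the given permission letter, e.g. 'r' for read.
    public static boolean hasPermission(String sasUrl, char permission) {
        return queryParams(sasUrl).getOrDefault("sp", "").indexOf(permission) >= 0;
    }

    // True if the se parameter lies before 'now'. A URL without se returns false,
    // because the expiry may then come from a stored access policy.
    public static boolean isExpired(String sasUrl, Instant now) {
        String expiry = queryParams(sasUrl).get("se");
        if (expiry == null) {
            return false;
        }
        return OffsetDateTime.parse(expiry).toInstant().isBefore(now);
    }

    public static void main(String[] args) {
        // Placeholder URL; substitute the container SAS URL passed to the training call.
        String url = "https://example.blob.core.windows.net/training"
                + "?sp=rl&se=2030-01-01T00%3A00%3A00Z&sig=placeholder";
        System.out.println("read: " + hasPermission(url, 'r'));
        System.out.println("list: " + hasPermission(url, 'l'));
        System.out.println("expired: " + isExpired(url, Instant.now()));
    }
}
```

This only inspects the URL text; it does not verify the signature or the network rules, so a token that passes here can still be rejected by the storage account.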
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.