Custom Classification Model Builder (java) unable to access training documents. Is the .ocr.json file required for each file in the container for training?

Question

Michael Wei 0

I am using the sample buildClassifier.java found on GitHub and have uploaded my training documents into separate folders in a container on Azure. Managed identity is enabled and has been granted the Storage Blob Data Reader role, networking is set to public access, and I generated the SAS token and URL and verified them with the browser test. Yet when I run the program, I get "Model Training Failure: TrainingContentMissing: Training data is missing: Could not find any training data at the given path", despite the SAS URL being correct. Is this because the files need to be in .ocr.json form rather than .pdf? If so, how do I do that?

Answer 1

Sina Salam 22,031 Volunteer Moderator

Hello Michael Wei,

Welcome to Microsoft Q&A, and thank you for posting your question here.

I understand that you are attempting to train a custom classification model in Azure Document Intelligence using the Java SDK sample (buildClassifier.java) and that you are getting this error:

"Model Training Failure: TrainingContentMissing: Could not find any training data at the given path."

Yes, the error arises because the Java SDK requires an .ocr.json file for each document during custom classification training. These files are not auto-generated. You must either:

Use Form Recognizer Studio to label and export the project (this auto-generates the .ocr.json files).
Or use the prebuilt layout model to analyze your PDFs and save the results manually as .ocr.json files.

If your .pdf file is named invoice1.pdf, you need an invoice1.ocr.json alongside it in the class-labeled folder. Once you add these, the model should train correctly.

To confirm your file layout: to train a custom classifier with the Java SDK (as in buildClassifier.java), the folder structure must be:

  container/
  ├── ClassA/
  │   ├── file1.pdf
  │   ├── file1.ocr.json
  ├── ClassB/
  │   ├── file2.pdf
  │   ├── file2.ocr.json

.ocr.json is required for each file.
.labels.json is not required unless doing form labeling.
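
As a quick local sanity check before uploading, the pairing rule above can be sketched in plain Java. This is only an illustration: the class name TrainingPairCheck, the method findUnpairedPdfs, and the default local_data folder are hypothetical, and it assumes the local directory mirrors the container layout (one subfolder per class).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class TrainingPairCheck {

    // Returns every PDF found up to two levels below root (root/<class>/<file>.pdf)
    // that has no sibling file named <file>.ocr.json.
    public static List<Path> findUnpairedPdfs(Path root) throws IOException {
        List<Path> missing = new ArrayList<>();
        try (Stream<Path> files = Files.walk(root, 2)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                 .forEach(pdf -> {
                     String name = pdf.getFileName().toString();
                     // Strip only the trailing ".pdf" so names like "a.pdf.pdf" are handled
                     String base = name.substring(0, name.length() - ".pdf".length());
                     if (!Files.exists(pdf.resolveSibling(base + ".ocr.json"))) {
                         missing.add(pdf);
                     }
                 });
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // "local_data" is a placeholder default, not a required name
        Path root = Path.of(args.length > 0 ? args[0] : "local_data");
        if (!Files.isDirectory(root)) {
            System.out.println("Not a directory: " + root);
            return;
        }
        findUnpairedPdfs(root).forEach(p -> System.out.println("Missing .ocr.json for " + p));
    }
}
```

An empty result only means every PDF has a partner file; it says nothing about whether the content of each .ocr.json is in the format the training service expects.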

See the documentation on training a classifier with labeled documents for more details.

Also, aside from your script, double-check the container permissions and URL. The checklist below might be useful:

Folder names must be the same as the class labels
Each file pair should consist of a .pdf and an .ocr.json
File names must match exactly
Use the latest Java SDK for Document Intelligence
Public access or VNet must be configured with the proper identity
The SAS must have read permission and must not be expired
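
The SAS items in the checklist can be checked offline by reading the token's query parameters. The sketch below is a hedged illustration, not part of the original sample: the class name SasCheck is hypothetical, it relies on the standard SAS parameters sp (permissions) and se (expiry), and it assumes se is a full ISO 8601 timestamp such as 2030-01-01T00:00:00Z. Whether training also needs the list permission ('l') on the container is worth verifying against the Azure documentation.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.OffsetDateTime;
import java.util.HashMap;
import java.util.Map;

public class SasCheck {

    // Splits the query string of a SAS URL into decoded key/value pairs.
    public static Map<String, String> queryParams(String sasUrl) {
        Map<String, String> params = new HashMap<>();
        String query = URI.create(sasUrl).getRawQuery();
        if (query == null) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip malformed or valueless entries
            }
            params.put(pair.substring(0, eq),
                       URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8));
        }
        return params;
    }

    // True if the "sp" parameter contains the given permission letter, e.g. 'r' for read.
    public static boolean hasPermission(String sasUrl, char permission) {
        return queryParams(sasUrl).getOrDefault("sp", "").indexOf(permission) >= 0;
    }

    // True if the "se" expiry time is at or before the given instant.
    // Returns false when the URL carries no "se" (expiry may come from a stored access policy).
    public static boolean isExpired(String sasUrl, Instant now) {
        String expiry = queryParams(sasUrl).get("se");
        if (expiry == null) {
            return false;
        }
        return !OffsetDateTime.parse(expiry).toInstant().isAfter(now);
    }
}
```

For example, SasCheck.hasPermission(url, 'r') && !SasCheck.isExpired(url, Instant.now()) is a cheap pre-flight test before calling the training API. It does not validate the signature or the network configuration.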

I hope this is helpful! Do not hesitate to let me know if you have any other questions or need clarification.

Please don't forget to close the thread by upvoting this and accepting it as an answer if it is helpful.

Michael Wei 0 Reputation points

2025-05-29T17:23:35.1966667+00:00

Hi Sina, thank you for the prompt response and for the help. The overall goal of this project is to automate the process of building, training and testing a custom classification model in Java. So, regarding the ocr.json format, is there a way to programmatically generate these files and then upload them to Azure Blob Storage?
Sina Salam 22,031 Reputation points Volunteer Moderator

2025-05-29T17:43:25.0866667+00:00

Hi Michael Wei,

Thanks for your feedback.

You can programmatically generate .ocr.json files for Azure Document Intelligence's custom classifier training using Java. The steps are:

Step 1: Generate OCR Results via Document Intelligence SDK

Step 2: Process Training Files & Upload to Azure Storage

Then, follow all requirements as listed in the answer.

If you need more details on a structured approach and an AI-generated code sample, I can share some with you here.
Michael Wei 0 Reputation points

2025-05-29T17:46:49.36+00:00

Yes, more details and code samples would be much appreciated. Thank you Sina

Sina Salam 22,031 Volunteer Moderator

To programmatically generate .ocr.json files for Azure Document Intelligence's custom classifier training using Java, follow this structured approach:

The important considerations are as listed in the main answer.

Step 1: Generate OCR Results via Document Intelligence SDK

Use the prebuilt-layout model to analyze PDFs and save the OCR results in the required .ocr.json format.

import com.azure.ai.documentintelligence.DocumentIntelligenceClient;
import com.azure.ai.documentintelligence.DocumentIntelligenceClientBuilder;
import com.azure.ai.documentintelligence.models.AnalyzeDocumentOptions;
import com.azure.ai.documentintelligence.models.AnalyzeResult;
import com.azure.core.credential.AzureKeyCredential;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.file.Files;
import java.nio.file.Path;

public class OcrGenerator {

    public static void generateOcrForPdf(
        String endpoint,
        String apiKey,
        Path pdfPath,
        Path outputJsonPath
    ) throws Exception {

        // Initialize client
        DocumentIntelligenceClient client = new DocumentIntelligenceClientBuilder()
            .credential(new AzureKeyCredential(apiKey))
            .endpoint(endpoint)
            .buildClient();

        // Read PDF bytes
        byte[] pdfBytes = Files.readAllBytes(pdfPath);

        // Analyze with the layout model. Analysis is a long-running operation,
        // so the client returns a poller and getFinalResult() blocks until it finishes.
        // (This call shape assumes azure-ai-documentintelligence 1.0.0; beta releases differ.)
        AnalyzeResult ocrResult = client.beginAnalyzeDocument(
            "prebuilt-layout",
            new AnalyzeDocumentOptions(pdfBytes)
        ).getFinalResult();

        // Serialize the SDK model to JSON
        ObjectMapper mapper = new ObjectMapper();
        mapper.writeValue(outputJsonPath.toFile(), ocrResult);
    }
}

Step 2: Process Training Files & Upload to Azure Storage

Automate the workflow:

Iterate through your local training directory
Generate missing .ocr.json files
Upload pairs to Azure Blob Storage

import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TrainingDataProcessor {

    public static void processAndUpload(
        String diEndpoint,
        String diKey,
        String storageConnStr,
        String containerName,
        Path localRootPath
    ) throws Exception {
        // Initialize the container client
        BlobContainerClient blobContainer = new BlobServiceClientBuilder()
            .connectionString(storageConnStr)
            .buildClient()
            .getBlobContainerClient(containerName);

        // Iterate class folders (e.g., "ClassA", "ClassB")
        for (Path classDir : list(localRootPath)) {
            if (!Files.isDirectory(classDir)) {
                continue;
            }
            String className = classDir.getFileName().toString();

            // Process each PDF
            for (Path pdfPath : list(classDir)) {
                String fileName = pdfPath.getFileName().toString();
                if (!fileName.endsWith(".pdf")) {
                    continue;
                }
                // Replace only the trailing ".pdf" with ".ocr.json"
                String baseName = fileName.substring(0, fileName.length() - ".pdf".length());
                Path jsonPath = pdfPath.resolveSibling(baseName + ".ocr.json");
                try {
                    // Generate OCR if missing
                    if (!Files.exists(jsonPath)) {
                        OcrGenerator.generateOcrForPdf(diEndpoint, diKey, pdfPath, jsonPath);
                    }

                    // Upload the pair to blob storage
                    uploadToBlob(blobContainer, className, pdfPath);
                    uploadToBlob(blobContainer, className, jsonPath);
                } catch (Exception e) {
                    // Log the failure and continue with the remaining files
                    e.printStackTrace();
                }
            }
        }
    }

    // Lists a directory and closes the underlying stream
    private static List<Path> list(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.collect(Collectors.toList());
        }
    }

    private static void uploadToBlob(
        BlobContainerClient container,
        String className,
        Path filePath
    ) {
        String blobName = className + "/" + filePath.getFileName().toString();
        BlobClient blobClient = container.getBlobClient(blobName);
        blobClient.uploadFromFile(filePath.toString(), true); // Overwrite if exists
    }
}

Key Requirements & Notes

File Pair Structure:

   /training_data
     ├── ClassA
     │   ├── doc1.pdf
     │   ├── doc1.ocr.json  # Auto-generated
     ├── ClassB
     │   ├── doc2.pdf
     │   └── doc2.ocr.json

Permissions:
- Azure Blob Storage: Contributor or Storage Blob Data Contributor role
- Document Intelligence: DI API access key
Dependencies (Maven):

   <dependency>
     <groupId>com.azure</groupId>
     <artifactId>azure-ai-documentintelligence</artifactId>
     <version>1.0.0</version> <!-- Check latest -->
   </dependency>
   <dependency>
     <groupId>com.azure</groupId>
     <artifactId>azure-storage-blob</artifactId>
     <version>12.25.0</version> <!-- Check latest -->
   </dependency>
   <dependency>
     <groupId>com.fasterxml.jackson.core</groupId>
     <artifactId>jackson-databind</artifactId>
     <version>2.15.0</version>
   </dependency>

Cost Optimization:
- Cache generated .ocr.json files locally to avoid reprocessing
- Use Azure's free tier (5,000 pages/month) for initial testing

Execution Flow

public static void main(String[] args) throws Exception {
    TrainingDataProcessor.processAndUpload(
        "https://<your-di-resource>.cognitiveservices.azure.com/",
        "di-api-key-here",
        "DefaultEndpointsProtocol=...", // Azure storage conn string
        "training-container",
        Paths.get("local_data/")  // Local training data root
    );
    
    // Now run your buildClassifier.java
    BuildClassifier.main(new String[]{
        "<storage-container-url-with-sas-token>"
    });
}

NOTE: The code sample above was generated by DeepSeek and edited by Copilot. Minor errors may occur.

By automating OCR generation and uploads, you enable end-to-end pipeline execution without manual Studio intervention. This approach maintains compatibility with Azure's classifier training requirements while fitting into Java-based automation workflows.

Michael Wei 0 Reputation points

2025-05-29T20:34:15.1266667+00:00

Hi Sina, thank you for your prompt response. The issue with serializing the result of the prebuilt layout model is that the generated ocr.json file differs from the file that Azure expects for training, and so it causes an error. I also verified this on the portal: the generated (serialized) ocr.json file was a different size from the ocr.json file that Azure generated automatically when I ran the custom classification model on the same document in the portal. I was looking online and saw on this forum: https://learn.microsoft.com/en-us/answers/questions/1379003/formsrecognizer-how-to-create-ocr-json-files-progr that someone else had the same issue and was able to resolve it. However, their method no longer seems to work: the getRawResponse() method they used to get a BinaryData object no longer exists, and .getFinalResult does not seem to return the raw JSON anymore. Do you know if there is a workaround for this?
Sina Salam 22,031 Reputation points Volunteer Moderator

2025-05-29T23:15:35.2433333+00:00

Hi Michael Wei,

You're absolutely right. This is a known issue when trying to programmatically generate ocr.json files for Azure Form Recognizer's custom classification model training. The JSON structure returned by the SDK (e.g., from the beginAnalyzeDocument method using the prebuilt layout model) differs from the structure expected by the training pipeline in Form Recognizer Studio.

If you are really keen to automate this, could you share a sample of the AnalyzeResult JSON you're currently getting and, if possible, a sample of the expected ocr.json format from the Studio? That way, I can tailor the transformation logic precisely.

I found an alternative answer to a similar question here: https://learn.microsoft.com/en-us/answers/questions/1379003/formsrecognizer-how-to-create-ocr-json-files-progr
santoshkc 15,245 Reputation points Microsoft External Staff Moderator

2025-05-30T11:48:18.19+00:00

Hi @Sina Salam,

Just checking, was the above information helpful in clarifying the issue with generating ocr.json files for custom classification? If you have any further queries, do let us know.
Sina Salam 22,031 Reputation points Volunteer Moderator

2025-05-30T12:05:46.1766667+00:00
Hello all,

I provided a generally correct, structured approach, with examples, for generating OCR data via the SDK and uploading it.

@Michael Wei, please check my last comment above. In addition, you can:

Use the REST API to retrieve and store the raw OCR result, which aligns with the structure needed for training. The Java SDK currently lacks this functionality directly, so REST is the safest automation route.

@santoshkc you should have directed the follow-up to Michael Wei. Success.
Michael Wei 0 Reputation points

2025-05-30T13:21:16.3533333+00:00

Hello Sina and @santoshkc, thank you for your quick responses. As I said in my last message, the alternative answer on the forum that both you and I have linked is no longer a valid solution. The GetRawResponse() method seems to no longer exist and was only available in older versions of Azure Document Intelligence, and to my knowledge it was the only way to mimic the ocr files that the portal generates automatically. So I was wondering what the new method of getting the raw response is. Could you provide more details about the REST API and how it would allow me to get the raw ocr json file? In particular, I would appreciate information from the Azure documentation rather than from AI, as most AI services seem to provide outdated solutions. Thank you for your patience.
Sina Salam 22,031 Reputation points Volunteer Moderator

2025-05-30T16:53:49.3233333+00:00
Hello Michael Wei,

Thank you, as always, for your feedback.

Since the Java SDK cannot provide the expected .ocr.json structure, you must use the REST API to get the raw AnalyzeResult in a training-compatible structure.

My recommendations are to:

Switch to the REST API: don't use the Java SDK's AnalyzeResult for .ocr.json training files. It won't work unless transformed.

Store the exact raw response from the polling operation in a .ocr.json file without any modifications.

Automate this using an HTTP library in Java (e.g., Apache HttpClient, OkHttp).
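
A minimal sketch of these recommendations follows, using the JDK's built-in java.net.http client (Java 11+) instead of the libraries named above. Several things here are assumptions rather than details from this thread: the class name RawLayoutOcr, the api-version 2024-11-30 and the /documentintelligence/ path (older resources may need a different path and version), key-based authentication via the Ocp-Apim-Subscription-Key header, and the simple regular expression used to read the top-level status field.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawLayoutOcr {

    // Assumption: v4.0 GA API version; adjust to match your resource.
    static final String API_VERSION = "2024-11-30";
    private static final Pattern STATUS = Pattern.compile("\"status\"\\s*:\\s*\"(\\w+)\"");

    // Builds the analyze URL for the prebuilt layout model.
    public static String analyzeUrl(String endpoint) {
        String base = endpoint.endsWith("/") ? endpoint.substring(0, endpoint.length() - 1) : endpoint;
        return base + "/documentintelligence/documentModels/prebuilt-layout:analyze?api-version=" + API_VERSION;
    }

    // Reads the first "status" value in the polling response without a JSON library.
    public static String extractStatus(String json) {
        Matcher m = STATUS.matcher(json);
        return m.find() ? m.group(1) : "unknown";
    }

    // Submits the PDF, polls the operation, and writes the raw polling response to 'out'.
    public static void writeOcrJson(String endpoint, String key, Path pdf, Path out)
            throws IOException, InterruptedException {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest submit = HttpRequest.newBuilder(URI.create(analyzeUrl(endpoint)))
            .header("Ocp-Apim-Subscription-Key", key)
            .header("Content-Type", "application/pdf")
            .POST(HttpRequest.BodyPublishers.ofFile(pdf))
            .build();
        HttpResponse<String> accepted = http.send(submit, HttpResponse.BodyHandlers.ofString());
        if (accepted.statusCode() != 202) {
            throw new IOException("Analyze request failed: " + accepted.statusCode() + " " + accepted.body());
        }
        // The service returns the polling URL in the Operation-Location header
        String operationUrl = accepted.headers().firstValue("Operation-Location")
            .orElseThrow(() -> new IOException("No Operation-Location header in response"));

        HttpRequest poll = HttpRequest.newBuilder(URI.create(operationUrl))
            .header("Ocp-Apim-Subscription-Key", key)
            .GET()
            .build();
        for (int attempt = 0; attempt < 60; attempt++) {
            String body = http.send(poll, HttpResponse.BodyHandlers.ofString()).body();
            String status = extractStatus(body);
            if (status.equals("succeeded")) {
                Files.writeString(out, body); // raw response, stored without modification
                return;
            }
            if (status.equals("failed")) {
                throw new IOException("Analysis failed: " + body);
            }
            Thread.sleep(2000); // wait before polling again
        }
        throw new IOException("Timed out waiting for analysis of " + pdf);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 4) {
            System.out.println("usage: RawLayoutOcr <endpoint> <key> <pdf> <out.ocr.json>");
            return;
        }
        writeOcrJson(args[0], args[1], Path.of(args[2]), Path.of(args[3]));
    }
}
```

The response body is written exactly as received, with no parsing or re-serialization, which is the point of the second recommendation. A production version would likely parse the status with a real JSON library and add retry handling for throttled requests.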
Michael Wei 0 Reputation points

2025-05-30T19:54:26.72+00:00

Hello Sina. Thank you for your suggestion. Using the REST API allows me to extract the raw response from the operation, and I am now able to programmatically automate the entire process of building a custom classifier. Thank you very much for your patience and help.
