BuildDocumentClassifierRequest from python SDK resulting in TrainingContentMissing: Training data is missing: Could not find any training data at the given path.

Question

Rony Tayoun 20

Trying to train a new classifier from python SDK, my doc_types are as follows:
{'0201': {'azureBlobSource': {'containerUrl': '...', 'prefix': "examples-de-chaque-document/examples-class1/"}}, ....

I have checked the containerUrl works (sp and sr included)

I get an output:
Model training failure

TrainingContentMissing: Training data is missing: Could not find any training data at the given path.

Help !

from azure.ai.documentintelligence.models import BuildDocumentClassifierRequest
import uuid

# Generate a unique classifier ID
classifier_id = f"top-level-classifier-{uuid.uuid4()}"

build_request = BuildDocumentClassifierRequest(
    classifier_id=classifier_id,  # mandatory
    description="Top-level classifier for Excel codes",
    doc_types=doc_types,
    allow_overwrite=True
)

# Start training
poller = admin_client.begin_build_classifier(build_request)
print(f"Training started asynchronously! Classifier ID: {classifier_id}")

Jerald Felix 9,835 Reputation points

2025-10-27T02:04:42.03+00:00
Hello Rony Tayoun,

Thanks for sharing your code and the error details—this is a frequent hiccup when training classifiers with Azure AI Document Intelligence (formerly Form Recognizer), especially using the Python SDK with Azure Blob sources. The "TrainingContentMissing" error means the service can't locate or access the documents at the specified paths in your blob container, even if the container URL (with SAS token) validates fine. As a Microsoft Certified Trainer and Azure AI specialist, I've troubleshot this in several training sessions, and it's often tied to path resolution, file formats, or access scopes rather than the SDK syntax itself. I'll break down the causes and fixes based on the latest SDK v1.1.0+ and service behaviors as of October 2025.

Understanding the Error

The BuildDocumentClassifierRequest expects doc_types to map each class (e.g., '0201') to an AzureBlobSource with a valid containerUrl (including SAS for read access) and a prefix that points to a subfolder containing at least 5 sample documents per class for training. The service scans the prefix recursively but fails if no eligible files are found—your setup with "examples-de-chaque-document/examples-class1/" looks close, but paths are case-sensitive, and unsupported formats (e.g., encrypted PDFs) or empty folders trigger this exact message. The poller starting asynchronously is good, but the underlying validation during begin_build_classifier checks accessibility immediately. From recent Microsoft Q&A and docs, this error spiked after the June 2025 SDK update, which tightened blob path parsing for security.

Key Requirements for Training Data

Before diving into code fixes, verify your blob setup:

File Count and Variety: Each doc_type needs 5+ documents (e.g., PDFs, images, or Word files; max 50MB each). They should represent real variations of the class (e.g., different layouts for "class1").

Supported Formats: Stick to PDF, PNG, JPEG, BMP, TIFF, or HEIC. The service ignores non-extractable files.

Blob Structure: Files must be directly under or nested in the prefix path. No need for a manifest file— the service auto-detects.

SAS Permissions: Your containerUrl should have sr=c (container scope), sp=rl (read+list), and no expiry issues. Test access via Azure Storage Explorer.
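The SAS requirements above can be checked offline before involving the service. The following is a minimal sketch using only the Python standard library. `check_container_sas` is a hypothetical helper name, and its checks (read and list in `sp`, `sr=c`, an unexpired `se`, a path that names only the container) simply mirror the list above; they are not the service's own validation logic.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def check_container_sas(url, now=None):
    """Return a list of problems found in a container SAS URL (empty if none)."""
    now = now or datetime.now(timezone.utc)
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    problems = []

    # sp holds the permission letters; read (r) and list (l) are expected here.
    permissions = query.get("sp", [""])[0]
    for flag, label in (("r", "read"), ("l", "list")):
        if flag not in permissions:
            problems.append(f"sp={permissions!r} lacks {label} permission")

    # sr=c means the token is scoped to the whole container.
    resource = query.get("sr", [""])[0]
    if resource != "c":
        problems.append(f"sr={resource!r} is not container scope (expected 'c')")

    # se is the expiry time, written in ISO 8601 (UTC).
    expiry = query.get("se", [""])[0]
    if expiry:
        expires_at = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        if expires_at.tzinfo is None:
            expires_at = expires_at.replace(tzinfo=timezone.utc)
        if expires_at <= now:
            problems.append(f"token expired at {expiry}")

    # The path should name only the container, with no folder appended.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 1:
        problems.append(f"path {parts.path!r} should contain only the container name")

    return problems
```

An empty list means the URL passes these basic checks. It does not prove the signature is valid, so listing blobs with the same URL remains the more reliable test.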

Step-by-Step Fixes for Your Code

Your BuildDocumentClassifierRequest is mostly correct, but let's refine it and add diagnostics. Update your doc_types definition like this (assuming multiple classes):

from azure.ai.documentintelligence import DocumentIntelligenceAdministrationClient
from azure.ai.documentintelligence.models import (
    AzureBlobContentSource,
    BuildDocumentClassifierRequest,
    DocTypeInfo
)
from azure.core.credentials import AzureKeyCredential
import uuid

# Your endpoint and key (from Azure portal)
endpoint = "https://your-region.api.cognitive.microsoft.com/"
key = "your-api-key"

admin_client = DocumentIntelligenceAdministrationClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

# Define doc_types properly - ensure each is a dict with AzureBlobContentSource
doc_types = {
    '0201': {
        'azureBlobSource': AzureBlobContentSource(
            container_url="https://yourstorageaccount.blob.core.windows.net/yourcontainer?sas_token_here",
            prefix="examples-de-chaque-document/examples-class1/"  # Ensure this path exists and has files
        )
    },
    # Add other classes similarly, e.g.,
    '0202': {
        'azureBlobSource': AzureBlobContentSource(
            container_url="https://yourstorageaccount.blob.core.windows.net/yourcontainer?sas_token_here",
            prefix="examples-de-chaque-document/examples-class2/"
        )
    }
    # At least two doc_types required for classification
}

# Optional: Add model_id if building on a prebuilt classifier
classifier_id = f"top-level-classifier-{uuid.uuid4()}"

build_request = BuildDocumentClassifierRequest(
    classifier_id=classifier_id,
    description="Top-level classifier for Excel codes",
    doc_types=doc_types,  # Now using AzureBlobContentSource objects
    build_mode="template",  # Or "neural" for advanced; default is fine
    allow_overwrite=True
)

try:
    poller = admin_client.begin_build_classifier(
        document_model_build_request=build_request,
        content_type="application/json"
    )
    print(f"Training started! Classifier ID: {classifier_id}")

    # Poll for result with details
    result = poller.result()
    if result.status == "failed":
        print(f"Error details: {result.errors}")
    else:
        print(f"Classifier built successfully: {result.model_id}")

except Exception as e:
    print(f"Build failed: {e}")

Changes Explained:

Use AzureBlobContentSource explicitly instead of a plain dict— this ensures proper serialization in the SDK.

Add build_mode for clarity (neural is better for varied docs).

Wrap in try-except to capture full errors, including path-specific issues from result.errors.

Ensure at least two doc_types for a classifier (one won't train).

Troubleshooting Steps

Validate Blob Paths Manually:

In the Azure portal, go to Storage accounts > Your container and browse to the prefix (e.g., examples-de-chaque-document/examples-class1/). Confirm that 5+ files are listed there and that they are not publicly accessible without the SAS.

Test SAS: Paste the full containerUrl into a browser or Azure Storage Explorer—if it lists files, it's good; if "Resource not found," regenerate the SAS with broader permissions.

Check Prefix Accuracy:

Blobs are case-sensitive: If your folder is "Examples-Class1", update the prefix accordingly.

Avoid trailing slashes if not needed; try without: prefix="examples-de-chaque-document/examples-class1".

If files are nested deeper, include the full path (e.g., "root/folder/subfolder/").

Run a Pre-Scan:

Use the Storage SDK to list blobs:
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string("your_connection_string")
container_client = blob_service_client.get_container_client("yourcontainer")
blobs = container_client.list_blobs(name_starts_with="examples-de-chaque-document/examples-class1/")
print(f"Found {len(list(blobs))} files")
If zero, that's your issue—upload/fix paths.
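A raw blob count can be misleading, since folder placeholder blobs and unsupported file types are counted too. The sketch below works on a plain list of blob names, so it runs without any Azure dependency. `summarize_training_blobs` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the extension set is an assumption derived from the formats listed earlier in this answer, not an official list.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumed extensions for the formats named above (PDF, PNG, JPEG, BMP, TIFF, HEIC).
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".heic"}


def summarize_training_blobs(blob_names, prefix):
    """Split blob names under a prefix into usable files and skipped extensions."""
    usable = []
    skipped = Counter()
    for name in blob_names:
        # Blob names are case-sensitive, so the prefix must match exactly.
        if not name.startswith(prefix):
            continue
        extension = PurePosixPath(name).suffix.lower()
        if extension in SUPPORTED_EXTENSIONS:
            usable.append(name)
        else:
            # Folder placeholders and extensionless blobs land under "(none)".
            skipped[extension or "(none)"] += 1
    return usable, dict(skipped)
```

The names could come from the listing above, for example `[b.name for b in container_client.list_blobs(name_starts_with=prefix)]`. Fewer than five usable names for a class would be worth fixing before starting a build.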

SDK and Service Version:

Update SDK: pip install --upgrade azure-ai-documentintelligence.

Ensure your Document Intelligence resource is in a supported region (e.g., East US) and not in free tier (limited training quota).

Poll longer: Training can take 10-30 mins; use poller.wait(timeout=3600) for updates.

If Still Failing:

Check quotas in Azure portal under Document Intelligence > Usages and quotas—increase if at limit.

Try local files first: Use begin_build_document_classifier with a local folder path instead of blob to isolate if it's a storage issue.

Support: Raise a ticket via Azure portal > Help + support > Technical > AI + Machine Learning > Document Intelligence, including your classifier_id, error, and a sample SAS URL (redacted).

Best Practices for Classifier Training

Start small: Train with 5-10 docs per class, then iterate.

Use labeled data: For better accuracy, pair with the labeling tool in the portal after initial build.

Monitor via portal: Track progress in Document Intelligence > Custom models > Classifiers.

Scale with neural mode for complex docs like invoices or forms.

This should resolve the missing training data error and get your classifier training. Kindly accept the answer.

Best Regards,

Jerald Felix

Rony Tayoun 20

Hi Jerald,

Thank you for your swift response,
I restarted from scratch and this time, I am trying an example I have in my studio.
I have 2 classes '2058-a' and '2058-b'
I have debugged as you mentioned and I am currently here:

I made sure that the containerURL works by using it to print the existing files.



container_client = ContainerClient.from_container_url(container_sas_url)
blobs = container_client.list_blobs(name_starts_with="examples-de-chaque-document/2058-a/")
print(f"Found {len(list(blobs))} files")

container_client = ContainerClient.from_container_url(container_sas_url)
blobs = container_client.list_blobs(name_starts_with="examples-de-chaque-document/2058-b/")
print(f"Found {len(list(blobs))} files")

The files I am using here are the same files I use in the Document Intelligence Studio (6 PDFs for each class).

doc_types = {
    '2058a': {
        'azureBlobSource': AzureBlobContentSource(
            container_url=container_sas_url,
            prefix="examples-de-chaque-document/2058-a/"  # Ensure this path exists and has files
        )
    },

    # Add other classes similarly, e.g.,
    '2058b': {
        'azureBlobSource': AzureBlobContentSource(
            container_url=container_sas_url,
            prefix="examples-de-chaque-document/2058-b/"
        )
    }
    # At least two doc_types required for classification

}

In the next section, I try to call begin_build_classifier. Note that in the latest SDK, BuildDocumentClassifierRequest does not accept the build_mode parameter you suggested, and begin_build_classifier accepts only one parameter, body. So the code is as follows:



# Optional: Add model_id if building on a prebuilt classifier
classifier_id = f"new-test-{uuid.uuid4()}"

build_request = BuildDocumentClassifierRequest(
    classifier_id=classifier_id,
    description="Example classifier",
    doc_types=doc_types,  # Now using AzureBlobContentSource objects
    allow_overwrite=True,
)

try:
    poller = admin_client.begin_build_classifier(
        build_request
    )
    print(f"Training started! Classifier ID: {classifier_id}")

    # Poll for result with details
    result = poller.result()
    if result.status == "failed":
        print(f"Error details: {result.errors}")
    else:
        print(f"Classifier built successfully: {result.model_id}")

except Exception as e:
    print(f"Build failed: {e}")

Training started! Classifier ID: new-test-62c4dfb9-2765-46f2-86fc-609cc8603672
Build failed: (InvalidRequest) Invalid request.
Code: InvalidRequest
Message: Invalid request.
Exception Details: (TrainingContentMissing) Training data is missing: Could not find any training data at the given path.
    Code: TrainingContentMissing
    Message: Training data is missing: Could not find any training data at the given path.

It is still failing.

Nikhil Jha (Accenture International Limited) 4,150 Reputation points Microsoft External Staff Moderator

2025-10-27T10:26:19.3766667+00:00

Hello Rony Tayoun,

Good Day.

I understand you're encountering a TrainingContentMissing error when trying to build a document classifier using the v4.0 Python SDK, even though your SAS token for the containerUrl is valid.

Allow me some time to look into the issue and get back to you.
Nikhil Jha (Accenture International Limited) 4,150 Reputation points Microsoft External Staff Moderator

2025-10-30T07:21:18.3266667+00:00

Hello Rony Tayoun,
Following up to check if you had a chance to see my response.
Please let me know if you are still facing any issue.
Tayoun, Rony (Oerlikon) 0 Reputation points

2025-10-30T09:40:31.95+00:00
Hi Nikhil,

You gave me contradictory information in your answers.

First answer you mentioned:
The root cause of the TrainingContentMissing error is that the v4.0 AzureBlobContentSource model does not have a prefix parameter. I see you are still passing a prefix parameter, just like in your original dictionary.

Second answer you mentioned:
The AzureBlobContentSource model in the azure-ai-documentintelligence (v4.0) SDK requires two separate arguments:

container_url: The SAS URL for the container root only.

prefix: A string for the "folder" path within that container.

In any case, I tried both approaches and it still does not work. My workaround is to upload all files to one folder, dynamically create the JSONL file containing the filenames for each class, and retrain manually using the portal.

So retraining the classifier from the SDK still does not work. I would appreciate it if someone could open a support ticket (I can no longer do so with my developer support plan).

Nikhil Jha (Accenture International Limited) 4,150 Microsoft External Staff Moderator

Hello Rony Tayoun,

You are absolutely right, and I sincerely apologize for the confusion my information had caused.

You've confirmed that even with the correct syntax from my second answer, the azureBlobSource (folder) method still fails. Let's build on your workaround of using a manually created JSONL file in the portal. The issue suggests a permissions or service-side problem with the azureBlobSource method. This method requires the service to use List permissions on your SAS token to scan the folder and find the files. This step can be sensitive and is clearly failing, even if your SAS token looks correct.

Let's Automate Your Workaround in the SDK:

Step 1:
Create a JSONL file. The paths inside it should be relative to the root of your container.
For example, classifier_file_list.jsonl:

{"file": "examples-de-chaque-document/2058-a/doc5.pdf", "docType": "2058a"}
{"file": "examples-de-chaque-document/2058-a/doc6.pdf", "docType": "2058a"}
{"file": "examples-de-chaque-document/2058-b/doc7.pdf", "docType": "2058b"}
{"file": "examples-de-chaque-document/2058-b/doc8.pdf", "docType": "2058b"}

Step 2:
Upload all your PDF documents AND the classifier_file_list.jsonl file to your Azure Storage container.

Step 3:
Now, your SDK code becomes much simpler.
You only need one BuildDocumentClassifierRequest that points to that single JSONL file.

from azure.ai.documentintelligence.models import (
    BuildDocumentClassifierRequest,
    AzureBlobFileListContentSource
)
import uuid

# SAS URL for the container root
# This token only needs Read and List permissions
container_sas_url = f"https://{account_name}.blob.core.windows.net/{container_name}?{sas_token_only}"

file_list_name = "classifier_file_list.jsonl" 

# Generate a unique classifier ID
classifier_id = f"jsonl-classifier-{uuid.uuid4()}"

# Build the request using AzureBlobFileListContentSource
build_request = BuildDocumentClassifierRequest(
    classifier_id=classifier_id,
    description="Classifier built from JSONL file list",
    
    # This is the new part:
    # Use azure_blob_file_list_source instead of doc_types
    azure_blob_file_list_source=AzureBlobFileListContentSource(
        container_url=container_sas_url,
        file_list=file_list_name
    ),
    
    allow_overwrite=True
)

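try:
    # admin_client is the DocumentIntelligenceAdministrationClient created earlier
    poller = admin_client.begin_build_classifier(build_request)
    print(f"Training started! Classifier ID: {classifier_id}")
    result = poller.result()
    # The build result identifies the classifier by classifier_id
    print(f"Classifier built successfully: {result.classifier_id}")

except Exception as e:
    print(f"Build failed: {e}")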

This approach gives the service an explicit file list, removing all ambiguity from folder paths and List permissions.
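One point is worth verifying before relying on a single combined list. In the REST contract as commonly documented, a file-list source is attached to each document type (an `azureBlobFileListSource` object with `containerUrl` and `fileList`), with one JSONL file per class whose lines carry only a `file` key. Treat that as an assumption to check against the current API reference. The sketch below shows that per-class variant using only the standard library; the helper names and the `.jsonl` file names are hypothetical.

```python
import json
from collections import defaultdict


def build_file_lists(labeled_files):
    """Group (blob_path, doc_type) pairs into one JSONL document per class."""
    grouped = defaultdict(list)
    for path, doc_type in labeled_files:
        grouped[doc_type].append(path)
    # One JSON object per line, each naming a blob path relative to the container root.
    return {
        doc_type: "\n".join(json.dumps({"file": path}) for path in sorted(paths)) + "\n"
        for doc_type, paths in grouped.items()
    }


def build_file_list_doc_types(container_sas_url, file_list_names):
    """Build a REST-shaped docTypes mapping that points each class at its own list."""
    return {
        doc_type: {
            "azureBlobFileListSource": {
                "containerUrl": container_sas_url,
                "fileList": file_list_name,
            }
        }
        for doc_type, file_list_name in file_list_names.items()
    }
```

Each JSONL string would be uploaded to the container under the name given in `file_list_names` (for example `2058a.jsonl`), and the resulting mapping used as the `docTypes` part of the build request.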

Apologies again for the difficult troubleshooting path. I hope this provides a stable, automatable solution for you.

Nikhil Jha (Accenture International Limited) 4,150 Reputation points Microsoft External Staff Moderator

2025-11-06T04:39:54.4733333+00:00

Hello Rony Tayoun,
Following up to check if you had a chance to review my response.

Answer 1

Hello Rony Tayoun,

Thank you for providing such a detailed follow-up, including your new code and the persistent error message. Your thorough testing helps us pinpoint the exact problem. Based on the code you've shared, your issue stems from a very specific and critical breaking change in the new v4.0 SDK (azure-ai-documentintelligence) that you are using.

The root cause of the TrainingContentMissing error is that the v4.0 AzureBlobContentSource model does not have a prefix parameter. I see you are still passing a prefix parameter, just like in your original dictionary.

The Python SDK is simply ignoring this unknown prefix argument. As a result, it is using your container_sas_url (which points to the root of your container) and looking for your training files there. Since your files are not at the root—they are in the "folder" specified by your prefix—the service correctly reports that it cannot find any training data at the given path.

The solution is to remove the prefix parameter and instead append the folder path directly to the container_url string before the SAS token.

Recommended Steps:

1. Your SAS URL must point directly to the specific "folder" containing the files for that class.

container_sas_url = "[YOUR_BASE_SAS_URL_WITH_TOKEN]"
# e.g., "https://myaccount.blob.core.windows.net/mycontainer?sv=..."

# Append the prefix (folder path) to the container name
url_2058a = "https://[STORAGE_NAME].blob.core.windows.net/[CONTAINER_NAME]/examples-de-chaque-document/2058-a/?[SAS_TOKEN]"
url_2058b = "https://[STORAGE_NAME].blob.core.windows.net/[CONTAINER_NAME]/examples-de-chaque-document/2058-b/?[SAS_TOKEN]"

Note: You must generate a SAS token at the container level, not the blob level, for this to work. Also, check that there is a trailing slash '/' after the folder name, before the '?'.

2. Now, build your doc_types object using these new, complete URLs and no prefix parameter.

from azure.ai.documentintelligence.models import AzureBlobContentSource

doc_types = {
    '2058a': {
        'azureBlobSource': AzureBlobContentSource(
            container_url=url_2058a # Use the full path with the folder
            # NO 'prefix' parameter here
        )
    },
    '2058b': {
        'azureBlobSource': AzureBlobContentSource(
            container_url=url_2058b
        )
    }
}
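Assembling these URLs by hand makes it easy to misplace the `?` or double a slash. The sketch below inserts a folder path ahead of the SAS query string using only the standard library. `with_folder` is a hypothetical helper name, and whether the service accepts a folder inside `container_url` is not confirmed anywhere in this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def with_folder(container_sas_url, folder):
    """Return the SAS URL with `folder` appended to the path, before the query."""
    parts = urlsplit(container_sas_url)
    base_path = parts.path.rstrip("/")
    folder_path = folder.strip("/")
    # Keep exactly one slash between segments and one trailing slash before '?'.
    new_path = f"{base_path}/{folder_path}/"
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))
```

For example, `with_folder(container_sas_url, "examples-de-chaque-document/2058-a")` leaves the SAS token untouched and changes only the path.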

Run Your Existing Training Code

Your BuildDocumentClassifierRequest and begin_build_classifier code is already correct. You do not need to change it. Simply run it again using the corrected doc_types object from Step 2, and the service will now find your files.

For more information, please refer to the official Microsoft documentation:

AzureBlobContentSource (v4.0 SDK): Note this model only has container_url.

ClassifierDocumentTypeDetails (v3 SDK - for comparison): This is the old model that used container_url and prefix separately. This shows the change.

Please let us know if this helps. If yes, kindly "Accept the answer" and/or upvote, so it will be beneficial to others in the community as well.

Nikhil Jha (Accenture International Limited) 4,150 Reputation points Microsoft External Staff Moderator

2025-10-29T05:09:46.8133333+00:00

Hello Rony Tayoun,

The AzureBlobContentSource model in the azure-ai-documentintelligence (v4.0) SDK requires two separate arguments:

container_url: The SAS URL for the container root only.

prefix: A string for the "folder" path within that container.

Your new error occurs because the service is receiving a container_url that already has the folder path (.../2058-a/) in it, which it rejects.

Let's correct this by splitting the URL and the prefix, as the SDK model intends.

1. Your sas_token_only generation is correct. The container_sas_url should only point to the container.

sas_token_only = generate_container_sas(...)
# This URL should point ONLY to the container, not the subfolders
container_sas_url = f"https://{account_name}.blob.core.windows.net/{container_name}?{sas_token_only}"

2. Now, pass the container_sas_url and the prefix string as separate, named arguments to the AzureBlobContentSource model.

from azure.ai.documentintelligence.models import AzureBlobContentSource

prefix_2058a = "examples-de-chaque-document/2058-a/"
prefix_2058b = "examples-de-chaque-document/2058-b/"

doc_types = {
    '2058a': {
        'azureBlobSource': AzureBlobContentSource(
            container_url=container_sas_url,  # The SAS URL for the container root
            prefix=prefix_2058a  # The specific folder path
        )
    },
    '2058b': {
        'azureBlobSource': AzureBlobContentSource(
            container_url=container_sas_url,  # The SAME SAS URL for the container
            prefix=prefix_2058b  # The specific folder path
        )
    }
}

print(f"Correctly formatted doc_types: {doc_types}")

3. Your existing training code is perfect. When you run it now with this corrected doc_types object, the service will receive the container URL and the prefix in the separate fields it expects, and it should find your training files.
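To avoid repeating this structure for every class, the mapping can be generated from a class-to-folder dictionary. This is a minimal sketch with hypothetical helper names. It emits plain dictionaries using the REST field names from the original question (`azureBlobSource`, `containerUrl`, `prefix`), and it assumes a prefix with no leading slash and one trailing slash is wanted, so that a folder named `2058-a` cannot also match `2058-ab/`.

```python
def normalize_prefix(prefix):
    """Strip surrounding slashes, then add a single trailing slash."""
    cleaned = prefix.strip("/")
    return f"{cleaned}/" if cleaned else ""


def build_prefix_doc_types(container_sas_url, class_prefixes):
    """Build a REST-shaped docTypes mapping from {doc_type: folder_prefix}."""
    return {
        doc_type: {
            "azureBlobSource": {
                "containerUrl": container_sas_url,
                "prefix": normalize_prefix(prefix),
            }
        }
        for doc_type, prefix in class_prefixes.items()
    }
```

Printing the result before submitting makes it easy to compare each prefix, character by character, against the blob names returned by a listing call.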
