Training custom classifier using the REST API gives error: InvalidContentSourceFormat

Tommy Holm Jakobsen 0 Reputation points
2024-05-17T06:54:34.0633333+00:00

I'm trying to train a custom classifier using the REST API, but I'm getting the following error and can't figure out what I'm doing wrong. I've triple-checked the configuration and tried several variations...

InvalidContentSourceFormat
Invalid content source: Could not read build content.

Here's the full response:

$ GET https://westeurope.api.cognitive.microsoft.com/documentintelligence/operations/31450441458_cfd60439-37fd-4653-95f3-254fb4355723?api-version=2024-02-29-preview

{
  "operationId": "31450441458_cfd60439-37fd-4653-95f3-254fb4355723",
  "kind": "documentClassifierBuild",
  "status": "failed",
  "createdDateTime": "2024-05-17T06:15:41Z",
  "lastUpdatedDateTime": "2024-05-17T06:15:42Z",
  "resourceLocation": "https://westeurope.api.cognitive.microsoft.com/documentintelligence/documentClassifiers/test-01?api-version=2024-02-29-preview",
  "percentCompleted": 100,
  "error": {
    "code": "InvalidArgument",
    "message": "Invalid argument.",
    "details": [
      {
        "code": "InvalidContentSourceFormat",
        "message": "Invalid content source: Could not read build content."
      }
    ]
  },
  "apiVersion": "2024-02-29-preview"
}
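The top-level `error.code` in that response is generic; the specific cause sits in `error.details`. As a minimal Python sketch (the helper name `flatten_errors` is hypothetical, and the JSON is a trimmed copy of the response above), the nested codes can be pulled out like this when polling the operation:

```python
import json

# Trimmed copy of the operation response shown above.
RESPONSE = """
{
  "operationId": "31450441458_cfd60439-37fd-4653-95f3-254fb4355723",
  "kind": "documentClassifierBuild",
  "status": "failed",
  "error": {
    "code": "InvalidArgument",
    "message": "Invalid argument.",
    "details": [
      {
        "code": "InvalidContentSourceFormat",
        "message": "Invalid content source: Could not read build content."
      }
    ]
  }
}
"""


def flatten_errors(operation: dict) -> list[tuple[str, str]]:
    """Return (code, message) pairs for the top-level error and its details."""
    error = operation.get("error")
    if not error:
        return []
    pairs = [(error.get("code", ""), error.get("message", ""))]
    for detail in error.get("details", []):
        pairs.append((detail.get("code", ""), detail.get("message", "")))
    return pairs


operation = json.loads(RESPONSE)
errors = flatten_errors(operation)
for code, message in errors:
    print(f"{code}: {message}")
```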

The following request was used to start training a new model:

$ POST https://westeurope.api.cognitive.microsoft.com/documentintelligence/documentClassifiers:build?api-version=2024-02-29-preview

{
  "ClassifierId": "test-01",
  "DocTypes": {
    "CreditNote": {
      "AzureBlobFileListSource": {
        "ContainerUrl": "https://<storageaccount>.blob.core.windows.net/<container>",
        "FileList": "CreditNote.jsonl"
      }
    },
    "Invoice": {
      "AzureBlobFileListSource": {
        "ContainerUrl": "https://<storageaccount>.blob.core.windows.net/<container>",
        "FileList": "Invoice.jsonl"
      }
    },
    "Salary": {
      "AzureBlobFileListSource": {
        "ContainerUrl": "https://<storageaccount>.blob.core.windows.net/<container>",
        "FileList": "Salary.jsonl"
      }
    },
    "Settlement": {
      "AzureBlobFileListSource": {
        "ContainerUrl": "https://<storageaccount>.blob.core.windows.net/<container>",
        "FileList": "Settlement.jsonl"
      }
    }
  }
}
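For comparison, the published REST reference spells these properties in camelCase (`classifierId`, `docTypes`, `azureBlobFileListSource`, `containerUrl`, `fileList`). Whether the service treats the PascalCase names above any differently is not established here; the sketch below only rebuilds the same body in the documented casing. The function name `build_request_body` is hypothetical, and the container URL is the same placeholder as in the request above.

```python
import json


def build_request_body(classifier_id: str, container_url: str, doc_types: list[str]) -> dict:
    """Build a classifier build request with one JSONL file list per document type."""
    return {
        "classifierId": classifier_id,
        "docTypes": {
            name: {
                "azureBlobFileListSource": {
                    "containerUrl": container_url,
                    # Assumes each file list is named after its document type,
                    # as in the container listing in this post.
                    "fileList": f"{name}.jsonl",
                }
            }
            for name in doc_types
        },
    }


body = build_request_body(
    "test-01",
    "https://<storageaccount>.blob.core.windows.net/<container>",
    ["CreditNote", "Invoice", "Salary", "Settlement"],
)
print(json.dumps(body, indent=2))
```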

The Document Intelligence resource has a system-assigned managed identity with the "Storage Blob Data Contributor" role on the storage account. I've also tried SAS tokens (and verified their access manually), with the same result.

Here's the content of the container:

$ tree .
.
├── CreditNote
│   ├── 698618.pdf
│   ├── 698683.pdf
│   ├── 699183.pdf
│   ├── 699335.pdf
│   └── 699444.pdf
├── CreditNote.jsonl
├── Invoice
│   ├── 700001.pdf
│   ├── 700007.pdf
│   ├── 700021.pdf
│   ├── 700030.pdf
│   └── 700073.pdf
├── Invoice.jsonl
├── Salary
│   ├── 696517.pdf
│   ├── 696841.pdf
│   ├── 698397.pdf
│   ├── 699050.pdf
│   └── 699055.pdf
├── Salary.jsonl
├── Settlement
│   ├── 699952.pdf
│   ├── 700115.pdf
│   ├── 700139.pdf
│   ├── 700145.pdf
│   └── 700147.pdf
└── Settlement.jsonl

And the contents of the jsonl files:

$ ls | grep .jsonl | while read p; do echo $"\n$p" && cat $p; done

CreditNote.jsonl
{"File":"CreditNote/698618.pdf"}
{"File":"CreditNote/699444.pdf"}
{"File":"CreditNote/699183.pdf"}
{"File":"CreditNote/698683.pdf"}
{"File":"CreditNote/699335.pdf"}
Invoice.jsonl
{"File":"Invoice/700001.pdf"}
{"File":"Invoice/700073.pdf"}
{"File":"Invoice/700007.pdf"}
{"File":"Invoice/700030.pdf"}
{"File":"Invoice/700021.pdf"}
Salary.jsonl
{"File":"Salary/696517.pdf"}
{"File":"Salary/699050.pdf"}
{"File":"Salary/698397.pdf"}
{"File":"Salary/696841.pdf"}
{"File":"Salary/699055.pdf"}
Settlement.jsonl
{"File":"Settlement/700147.pdf"}
{"File":"Settlement/700145.pdf"}
{"File":"Settlement/700139.pdf"}
{"File":"Settlement/700115.pdf"}
{"File":"Settlement/699952.pdf"}
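Since the error says the build content could not be read, one thing worth ruling out is the file lists themselves. The sketch below checks a JSONL file list line by line: each line must parse as a JSON object and carry the key used in the documentation's examples, lowercase `file`. The lists above use `File`; whether the service is case-sensitive about this key is an assumption, not something confirmed in this post. It also flags a UTF-8 byte order mark, which some editors add silently. The function name `check_file_list` is hypothetical.

```python
import json

# Key used in the documented file list examples. Treating the key as
# case-sensitive is an assumption, not confirmed behavior of the service.
EXPECTED_KEY = "file"
UTF8_BOM = b"\xef\xbb\xbf"


def check_file_list(raw: bytes) -> list[str]:
    """Return human-readable problems found in the bytes of a JSONL file list."""
    problems = []
    if raw.startswith(UTF8_BOM):
        problems.append("file starts with a UTF-8 byte order mark")
        raw = raw[len(UTF8_BOM):]
    for number, line in enumerate(raw.decode("utf-8").splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: not valid JSON ({exc.msg})")
            continue
        if not isinstance(entry, dict):
            problems.append(f"line {number}: expected a JSON object")
        elif EXPECTED_KEY not in entry:
            problems.append(
                f"line {number}: missing key {EXPECTED_KEY!r}, found {sorted(entry)}"
            )
    return problems


# First line copies the casing used in this post, second uses the documented casing.
sample = b'{"File":"CreditNote/698618.pdf"}\n{"file":"CreditNote/699444.pdf"}\n'
problems = check_file_list(sample)
for problem in problems:
    print(problem)
```

Running the same check over the bytes of each downloaded `*.jsonl` blob would show whether any of the four lists deviates from the documented shape.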

Do you have any ideas about what I can try next to resolve this very unspecific error?

Any help will be much appreciated. Thank you.

Azure AI Document Intelligence