Training custom classifier using the REST API gives error: InvalidContentSourceFormat
I'm trying to train a custom classifier using the REST API, but I'm getting the following error and can't figure out what I'm doing wrong. I've triple-checked the configuration and tried several variations...
InvalidContentSourceFormat
Invalid content source: Could not read build content.
Here's the full response:
$ GET https://westeurope.api.cognitive.microsoft.com/documentintelligence/operations/31450441458_cfd60439-37fd-4653-95f3-254fb4355723?api-version=2024-02-29-preview
{
"operationId": "31450441458_cfd60439-37fd-4653-95f3-254fb4355723",
"kind": "documentClassifierBuild",
"status": "failed",
"createdDateTime": "2024-05-17T06:15:41Z",
"lastUpdatedDateTime": "2024-05-17T06:15:42Z",
"resourceLocation": "https://westeurope.api.cognitive.microsoft.com/documentintelligence/documentClassifiers/test-01?api-version=2024-02-29-preview",
"percentCompleted": 100,
"error": {
"code": "InvalidArgument",
"message": "Invalid argument.",
"details": [
{
"code": "InvalidContentSourceFormat",
"message": "Invalid content source: Could not read build content."
}
]
},
"apiVersion": "2024-02-29-preview"
}
The following request was used to start training a new model:
$ POST https://westeurope.api.cognitive.microsoft.com/documentintelligence/documentClassifiers:build?api-version=2024-02-29-preview
{
"ClassifierId": "test-01",
"DocTypes": {
"CreditNote": {
"AzureBlobFileListSource": {
"ContainerUrl": "https://<storageaccount>.blob.core.windows.net/<container>",
"FileList": "CreditNote.jsonl"
}
},
"Invoice": {
"AzureBlobFileListSource": {
"ContainerUrl": "https://<storageaccount>.blob.core.windows.net/<container>",
"FileList": "Invoice.jsonl"
}
},
"Salary": {
"AzureBlobFileListSource": {
"ContainerUrl": "https://<storageaccount>.blob.core.windows.net/<container>",
"FileList": "Salary.jsonl"
}
},
"Settlement": {
"AzureBlobFileListSource": {
"ContainerUrl": "https://<storageaccount>.blob.core.windows.net/<container>",
"FileList": "Settlement.jsonl"
}
}
}
}
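For anyone who wants to reproduce the request, here is a minimal sketch of how a body with this shape can be generated. The function name `build_classifier_request` is hypothetical, and the camelCase property names (`classifierId`, `docTypes`, `azureBlobFileListSource`, `containerUrl`, `fileList`) are an assumption based on how the REST reference spells them, as far as I can tell. The PascalCase body above was accepted by the service (the build operation was created), so the casing here may not matter.

```python
import json


def build_classifier_request(classifier_id, container_url, doc_types):
    """Build a documentClassifiers:build body with one JSONL file list per doc type.

    Assumption: each doc type has a file list named "<DocType>.jsonl"
    at the root of the container, as in the layout shown in this post.
    """
    return {
        "classifierId": classifier_id,
        "docTypes": {
            name: {
                "azureBlobFileListSource": {
                    "containerUrl": container_url,
                    "fileList": f"{name}.jsonl",
                }
            }
            for name in doc_types
        },
    }


if __name__ == "__main__":
    body = build_classifier_request(
        "test-01",
        "https://<storageaccount>.blob.core.windows.net/<container>",
        ["CreditNote", "Invoice", "Salary", "Settlement"],
    )
    print(json.dumps(body, indent=2))
```

The sketch only builds the JSON; sending it (and authenticating) is left out on purpose.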
The Document Intelligence resource has a system-assigned managed identity with the "Storage Blob Data Contributor" role on the storage account. I've also tried SAS tokens (and verified their access manually), with the same result.
Here's the content of the container:
$ tree .
.
├── CreditNote
│ ├── 698618.pdf
│ ├── 698683.pdf
│ ├── 699183.pdf
│ ├── 699335.pdf
│ └── 699444.pdf
├── CreditNote.jsonl
├── Invoice
│ ├── 700001.pdf
│ ├── 700007.pdf
│ ├── 700021.pdf
│ ├── 700030.pdf
│ └── 700073.pdf
├── Invoice.jsonl
├── Salary
│ ├── 696517.pdf
│ ├── 696841.pdf
│ ├── 698397.pdf
│ ├── 699050.pdf
│ └── 699055.pdf
├── Salary.jsonl
├── Settlement
│ ├── 699952.pdf
│ ├── 700115.pdf
│ ├── 700139.pdf
│ ├── 700145.pdf
│ └── 700147.pdf
└── Settlement.jsonl
And the contents of the jsonl files:
$ for p in *.jsonl; do printf '\n%s\n' "$p"; cat "$p"; done
CreditNote.jsonl
{"File":"CreditNote/698618.pdf"}
{"File":"CreditNote/699444.pdf"}
{"File":"CreditNote/699183.pdf"}
{"File":"CreditNote/698683.pdf"}
{"File":"CreditNote/699335.pdf"}
Invoice.jsonl
{"File":"Invoice/700001.pdf"}
{"File":"Invoice/700073.pdf"}
{"File":"Invoice/700007.pdf"}
{"File":"Invoice/700030.pdf"}
{"File":"Invoice/700021.pdf"}
Salary.jsonl
{"File":"Salary/696517.pdf"}
{"File":"Salary/699050.pdf"}
{"File":"Salary/698397.pdf"}
{"File":"Salary/696841.pdf"}
{"File":"Salary/699055.pdf"}
Settlement.jsonl
{"File":"Settlement/700147.pdf"}
{"File":"Settlement/700145.pdf"}
{"File":"Settlement/700139.pdf"}
{"File":"Settlement/700115.pdf"}
{"File":"Settlement/699952.pdf"}
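Since the error says the build content could not be read, one thing worth ruling out is the format of the file lists themselves. Below is a minimal sketch of a local check, run against a local copy of the container. The helper name `check_file_list` is hypothetical. It assumes the service expects a lowercase `file` property on each line, which is how the documentation examples I have seen spell it; whether the service is actually case-sensitive about this is a guess, not a confirmed cause.

```python
import json
from pathlib import Path


def check_file_list(jsonl_path, root, key="file"):
    """Return a list of problems found in one JSONL file list.

    Assumption: every non-empty line should be a JSON object whose
    `key` property is a path relative to `root` (the container root).
    """
    problems = []
    raw = Path(jsonl_path).read_bytes()
    # A byte order mark at the start can trip up strict JSON parsers.
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 byte order mark")
        raw = raw[3:]
    for number, line in enumerate(raw.decode("utf-8").splitlines(), start=1):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: not valid JSON ({exc.msg})")
            continue
        value = entry.get(key) if isinstance(entry, dict) else None
        if not isinstance(value, str):
            problems.append(f"line {number}: no string '{key}' property")
            continue
        if not (Path(root) / value).is_file():
            problems.append(f"line {number}: {value} not found under root")
    return problems


if __name__ == "__main__":
    # Check every file list in the current directory.
    for path in sorted(Path(".").glob("*.jsonl")):
        for problem in check_file_list(path, "."):
            print(f"{path.name}: {problem}")
```

With the files exactly as listed above, this would flag every line, because they use `File` rather than `file`. That makes the property casing a cheap thing to try changing.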
Do you have any ideas as to what I can try next to resolve this very unspecific error?
Any help would be much appreciated. Thank you.