Requirements and limitations for a document processing model
Languages supported
The following languages are supported when training a document processing model and selecting Structured and semi-structured documents as document type: Afrikaans, Albanian, Angika (Devanagari), Arabic, Asturian, Awadhi-Hindi (Devanagari), Azerbaijani (Latin), Bagheli, Basque, Belarusian (Cyrillic), Belarusian (Latin), Bhojpuri-Hindi (Devanagari), Bislama, Bodo (Devanagari), Bosnian (Latin), Brajbha, Breton, Bulgarian, Bundeli, Buryat (Cyrillic), Catalan, Cebuano, Chamling, Chamorro, Chhattisgarhi (Devanagari), Chinese Simplified, Chinese Traditional, Cornish, Corsican, Crimean Tatar (Latin), Croatian, Czech, Danish, Dari, Dhimal (Devanagari), Dogri (Devanagari), Dutch, English, Erzya (Cyrillic), Estonian, Faroese, Fijian, Filipino, Finnish, French, Friulian, Gagauz (Latin), Galician, German, Gilbertese, Gondi (Devanagari), Greenlandic, Gurung (Devanagari), Haitian Creole, Halbi (Devanagari), Hani, Haryanvi, Hawaiian, Hindi, Hmong Daw (Latin), Ho(Devanagiri), Hungarian, Icelandic, Inari Sami, Indonesian, Interlingua, Inuktitut (Latin), Irish, Italian, Japanese, Jaunsari (Devanagari), Javanese, Kabuverdianu, Kachin (Latin), Kangri (Devanagari), Karachay-Balkar, Kara-Kalpak (Cyrillic), Kara-Kalpak (Latin), Kashubian, Kazakh (Cyrillic), Kazakh (Latin), Khaling, Khasi, K'iche', Korean, Korku, Koryak, Kosraean, Kumyk (Cyrillic), Kurdish (Arabic), Kurdish (Latin), Kurukh (Devanagari), Kyrgyz (Cyrillic), Lakota, Latin, Lithuanian, Lower Sorbian, Lule Sami, Luxembourgish, Mahasu Pahari (Devanagari), Malay (Latin), Maltese, Malto (Devanagari), Manx, Maori, Marathi, Mongolian (Cyrillic), Montenegrin (Cyrillic), Montenegrin (Latin), Neapolitan, Nepali, Niuean, Nogay, Northern Sami (Latin), Norwegian, Occitan, Ossetic, Pashto, Persian, Polish, Portuguese, Punjabi (Arabic), Ripuarian, Romanian, Romansh, Russian, Sadri (Devanagari), Samoan (Latin), Sanskrit (Devanagari), Santali(Devanagiri), Scots, Scottish Gaelic, Serbian (Latin), Sherpa (Devanagari), Sirmauri (Devanagari), Skolt Sami, Slovak, Slovenian, Somali (Arabic), Southern Sami, Spanish, Swahili (Latin), Swedish, Tajik (Cyrillic), Tatar (Latin), Tetum, Thangmi, Tongan, Turkish, Turkmen (Latin), Tuvan, Upper Sorbian, Urdu, Uyghur (Arabic), Uzbek (Arabic), Uzbek (Cyrillic), Uzbek (Latin), Volapük, Walser, Welsh, Western Frisian, Yucatec Maya, Zhuang, Zulu
The following language is supported when training a document processing model and selecting Unstructured and free-form documents as document type: English
Requirements
Document processing works on input documents that meet the following requirements:
- JPG, PNG, or PDF format (text or scanned). Text-embedded PDFs are better, because there won't be any errors in character extraction and location.
- TIFF files can't be used for training. You'll need to use documents in PDF, JPG or PNG format to train a model. Once the model has been trained, it can extract data from TIFF files when the model is used in a Power Automate cloud flow.
- If your PDFs are password-locked, you must remove the lock before submitting them.
- Maximum document size to process must not exceed 20 MB.
- For images, dimensions must be between 50 × 50 and 10,000 × 10,000 pixels.
- If scanned from paper documents, scans should be high-quality images.
- You can create up to 200 collections per model.
- In a cloud flow, the limit of fields that can be tagged for document processing is 300.
Note
- Extracting signatures from documents isn't currently supported.
- Fields that split across page boundaries aren't currently supported.
- Rows that break from one page to another aren't currently supported.
Optimization tip
Learn how to improve the performance of document processing models.
Next step
Feedback
Submit and view feedback for