Content metadata properties used in Azure AI Search
Several indexer-supported data sources, including Azure Blob Storage, Azure Data Lake Storage Gen2, and SharePoint, contain standalone files or embedded objects of various content types. Many of those content types have metadata properties that can be useful to index. Just as you can create search fields for standard blob properties like metadata_storage_name
, you can create fields in a search index for metadata properties that are specific to a document format.
Supported document formats
Azure AI Search supports blob indexing and SharePoint document indexing for the following document formats:
- CSV (see Indexing CSV blobs)
- EML
- EPUB
- GZ
- HTML
- JSON (see Indexing JSON blobs)
- KML (XML for geographic representations)
- Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML (both 2003 and 2006 WORD XML)
- Open Document formats: ODT, ODS, ODP
- Plain text files (see also Indexing plain text)
- RTF
- XML
- ZIP
Document format properties
The following table summarizes processing for each document format, and describes the metadata properties extracted by a blob indexer and the SharePoint Online indexer.
Document format / content type | Extracted metadata | Processing details |
---|---|---|
CSV (text/csv) | metadata_content_type metadata_content_encoding |
Extract text NOTE: If you need to extract multiple document fields from a CSV blob, see Index CSV blobs |
DOC (application/msword) | metadata_content_type metadata_author metadata_character_count metadata_creation_date metadata_last_modified metadata_page_count metadata_word_count |
Extract text, including embedded documents |
DOCM (application/vnd.ms-word.document.macroenabled.12) | metadata_content_type metadata_author metadata_character_count metadata_creation_date metadata_last_modified metadata_page_count metadata_word_count |
Extract text, including embedded documents |
DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document) | metadata_content_type metadata_author metadata_character_count metadata_creation_date metadata_last_modified metadata_page_count metadata_word_count |
Extract text, including embedded documents |
EML (message/rfc822) | metadata_content_type metadata_message_from metadata_message_to metadata_message_cc metadata_creation_date metadata_subject |
Extract text, including attachments |
EPUB (application/epub+zip) | metadata_content_type metadata_author metadata_creation_date metadata_title metadata_description metadata_language metadata_keywords metadata_identifier metadata_publisher |
Extract text from all documents in the archive |
GZ (application/gzip) | metadata_content_type |
Extract text from all documents in the archive |
HTML (text/html or application/xhtml+xml) | metadata_content_encoding metadata_content_type metadata_language metadata_description metadata_keywords metadata_title |
Strip HTML elements and extract text |
JSON (application/json) | metadata_content_type metadata_content_encoding |
Extract text NOTE: If you need to extract multiple document fields from a JSON blob, see Index JSON blobs |
KML (application/vnd.google-earth.kml+xml) | metadata_content_type metadata_content_encoding metadata_language |
Strip XML elements and extract text |
MSG (application/vnd.ms-outlook) | metadata_content_type metadata_message_from metadata_message_from_email metadata_message_to metadata_message_to_email metadata_message_cc metadata_message_cc_email metadata_message_bcc metadata_message_bcc_email metadata_creation_date metadata_last_modified metadata_subject |
Extract text, including text extracted from attachments. metadata_message_to_email , metadata_message_cc_email , and metadata_message_bcc_email are string collections. The rest of the fields are strings. |
ODP (application/vnd.oasis.opendocument.presentation) | metadata_content_type metadata_author metadata_creation_date metadata_last_modified metadata_title |
Extract text, including embedded documents |
ODS (application/vnd.oasis.opendocument.spreadsheet) | metadata_content_type metadata_author metadata_creation_date metadata_last_modified |
Extract text, including embedded documents |
ODT (application/vnd.oasis.opendocument.text) | metadata_content_type metadata_author metadata_character_count metadata_creation_date metadata_last_modified metadata_page_count metadata_word_count |
Extract text, including embedded documents |
PDF (application/pdf) | metadata_content_type metadata_language metadata_author metadata_title metadata_creation_date |
Extract text, including embedded documents (excluding images) |
Plain text (text/plain) | metadata_content_type metadata_content_encoding metadata_language |
Extract text |
PPT (application/vnd.ms-powerpoint) | metadata_content_type metadata_author metadata_creation_date metadata_last_modified metadata_slide_count metadata_title |
Extract text, including embedded documents |
PPTM (application/vnd.ms-powerpoint.presentation.macroenabled.12) | metadata_content_type metadata_author metadata_creation_date metadata_last_modified metadata_slide_count metadata_title |
Extract text, including embedded documents |
PPTX (application/vnd.openxmlformats-officedocument.presentationml.presentation) | metadata_content_type metadata_author metadata_creation_date metadata_last_modified metadata_slide_count metadata_title |
Extract text, including embedded documents |
RTF (application/rtf) | metadata_content_type metadata_author metadata_character_count metadata_creation_date metadata_last_modified metadata_page_count metadata_word_count |
Extract text |
WORD 2003 XML (application/vnd.ms-wordml) | metadata_content_type metadata_author metadata_creation_date |
Strip XML elements and extract text |
WORD XML (application/vnd.ms-word2006ml) | metadata_content_type metadata_author metadata_character_count metadata_creation_date metadata_last_modified metadata_page_count metadata_word_count |
Strip XML elements and extract text |
XLS (application/vnd.ms-excel) | metadata_content_type metadata_author metadata_creation_date metadata_last_modified |
Extract text, including embedded documents |
XLSM (application/vnd.ms-excel.sheet.macroenabled.12) | metadata_content_type metadata_author metadata_creation_date metadata_last_modified |
Extract text, including embedded documents |
XLSX (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) | metadata_content_type metadata_author metadata_creation_date metadata_last_modified |
Extract text, including embedded documents |
XML (application/xml) | metadata_content_type metadata_content_encoding metadata_language |
Strip XML elements and extract text |
ZIP (application/zip) | metadata_content_type |
Extract text from all documents in the archive |