Content metadata properties used in Azure AI Search

Several indexer-supported data sources, including Azure Blob Storage, Azure Data Lake Storage Gen2, and SharePoint, contain standalone files or embedded objects of various content types. Many of those content types have metadata properties that can be useful to index. Just as you can create search fields for standard blob properties like metadata_storage_name, you can create fields in a search index for metadata properties that are specific to a document format.
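As a rough illustration of the paragraph above, the sketch below builds a few field definitions in the JSON shape used for index fields. The index name `docs-metadata-index`, the `id` key field, the helper `string_field`, and the choice of `Edm.String` with these attributes are assumptions made for the example, not requirements stated in this article.

```python
import json


def string_field(name: str, *, filterable: bool = False) -> dict:
    """Return a string field definition shaped like an index field (illustrative helper)."""
    return {
        "name": name,
        "type": "Edm.String",  # assumed type for these properties
        "searchable": True,
        "filterable": filterable,
    }


# Hypothetical selection: one standard blob property plus format-specific ones.
fields = [
    {"name": "id", "type": "Edm.String", "key": True},  # assumed key field
    string_field("metadata_storage_name", filterable=True),
    string_field("metadata_content_type", filterable=True),
    string_field("metadata_author", filterable=True),
    string_field("metadata_title"),
]

index_definition = {"name": "docs-metadata-index", "fields": fields}
print(json.dumps(index_definition, indent=2))
```

The fields are named after the metadata properties so that each one is easy to trace back to the property it holds.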

Supported document formats

Azure AI Search supports blob indexing and SharePoint document indexing for the following document formats:

  • CSV (see Indexing CSV blobs)
  • EML
  • EPUB
  • GZ
  • HTML
  • JSON (see Indexing JSON blobs)
  • KML (XML for geographic representations)
  • Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML (both 2003 and 2006 WORD XML)
  • Open Document formats: ODT, ODS, ODP
  • PDF
  • Plain text files (see also Indexing plain text)
  • RTF
  • XML
  • ZIP
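The list above can be turned into a simple local pre-check before files are uploaded. This is only a sketch: the helper `is_supported_format` is hypothetical, and the mapping from format names to file extensions (for example `.txt` for plain text and `.htm`/`.html` for HTML) is an assumption, since the article lists formats rather than extensions.

```python
from pathlib import PurePosixPath

# Extensions assumed to correspond to the formats listed above.
SUPPORTED_EXTENSIONS = frozenset({
    ".csv", ".eml", ".epub", ".gz", ".htm", ".html", ".json", ".kml",
    ".docx", ".doc", ".docm", ".xlsx", ".xls", ".xlsm",
    ".pptx", ".ppt", ".pptm", ".msg",
    ".odt", ".ods", ".odp",
    ".pdf", ".txt", ".rtf", ".xml", ".zip",
})


def is_supported_format(blob_name: str) -> bool:
    """Return True if the blob's final extension is in the assumed supported set."""
    return PurePosixPath(blob_name).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ("reports/Q1.PDF", "mail/inbox/hello.msg", "images/photo.png"):
    print(name, is_supported_format(name))
```

The check looks only at the file name, so it is a convenience filter rather than a guarantee about how a file's content will be detected.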

Document format properties

The following table summarizes how each document format is processed and lists the metadata properties that a blob indexer and the SharePoint Online indexer extract.

Document format / content type | Extracted metadata | Processing details
CSV (text/csv) | metadata_content_type, metadata_content_encoding | Extract text. NOTE: If you need to extract multiple document fields from a CSV blob, see Index CSV blobs
DOC (application/msword) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text, including embedded documents
DOCM (application/vnd.ms-word.document.macroenabled.12) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text, including embedded documents
DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text, including embedded documents
EML (message/rfc822) | metadata_content_type, metadata_message_from, metadata_message_to, metadata_message_cc, metadata_creation_date, metadata_subject | Extract text, including attachments
EPUB (application/epub+zip) | metadata_content_type, metadata_author, metadata_creation_date, metadata_title, metadata_description, metadata_language, metadata_keywords, metadata_identifier, metadata_publisher | Extract text from all documents in the archive
GZ (application/gzip) | metadata_content_type | Extract text from all documents in the archive
HTML (text/html or application/xhtml+xml) | metadata_content_encoding, metadata_content_type, metadata_language, metadata_description, metadata_keywords, metadata_title | Strip HTML elements and extract text
JSON (application/json) | metadata_content_type, metadata_content_encoding | Extract text. NOTE: If you need to extract multiple document fields from a JSON blob, see Index JSON blobs
KML (application/vnd.google-earth.kml+xml) | metadata_content_type, metadata_content_encoding, metadata_language | Strip XML elements and extract text
MSG (application/vnd.ms-outlook) | metadata_content_type, metadata_message_from, metadata_message_from_email, metadata_message_to, metadata_message_to_email, metadata_message_cc, metadata_message_cc_email, metadata_message_bcc, metadata_message_bcc_email, metadata_creation_date, metadata_last_modified, metadata_subject | Extract text, including text extracted from attachments. metadata_message_to_email, metadata_message_cc_email, and metadata_message_bcc_email are string collections. The rest of the fields are strings.
ODP (application/vnd.oasis.opendocument.presentation) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_title | Extract text, including embedded documents
ODS (application/vnd.oasis.opendocument.spreadsheet) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified | Extract text, including embedded documents
ODT (application/vnd.oasis.opendocument.text) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text, including embedded documents
PDF (application/pdf) | metadata_content_type, metadata_language, metadata_author, metadata_title, metadata_creation_date | Extract text, including embedded documents (excluding images)
Plain text (text/plain) | metadata_content_type, metadata_content_encoding, metadata_language | Extract text
PPT (application/vnd.ms-powerpoint) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title | Extract text, including embedded documents
PPTM (application/vnd.ms-powerpoint.presentation.macroenabled.12) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title | Extract text, including embedded documents
PPTX (application/vnd.openxmlformats-officedocument.presentationml.presentation) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title | Extract text, including embedded documents
RTF (application/rtf) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text
WORD 2003 XML (application/vnd.ms-wordml) | metadata_content_type, metadata_author, metadata_creation_date | Strip XML elements and extract text
WORD XML (application/vnd.ms-word2006ml) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Strip XML elements and extract text
XLS (application/vnd.ms-excel) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified | Extract text, including embedded documents
XLSM (application/vnd.ms-excel.sheet.macroenabled.12) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified | Extract text, including embedded documents
XLSX (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified | Extract text, including embedded documents
XML (application/xml) | metadata_content_type, metadata_content_encoding, metadata_language | Strip XML elements and extract text
ZIP (application/zip) | metadata_content_type | Extract text from all documents in the archive
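
The MSG entry in the table is the only one that states field types: three of the email properties are string collections and the rest are strings. A minimal sketch of how that distinction might be carried into field definitions follows. The helper `msg_field_definitions` and the `searchable` attribute are assumptions for illustration; the property names and the string versus string-collection split come from the table.

```python
# MSG metadata properties, in the order listed in the table.
MSG_PROPERTIES = [
    "metadata_content_type",
    "metadata_message_from",
    "metadata_message_from_email",
    "metadata_message_to",
    "metadata_message_to_email",
    "metadata_message_cc",
    "metadata_message_cc_email",
    "metadata_message_bcc",
    "metadata_message_bcc_email",
    "metadata_creation_date",
    "metadata_last_modified",
    "metadata_subject",
]

# Properties the table describes as string collections.
MSG_COLLECTION_PROPERTIES = frozenset({
    "metadata_message_to_email",
    "metadata_message_cc_email",
    "metadata_message_bcc_email",
})


def msg_field_definitions() -> list[dict]:
    """Build one field definition per MSG property (hypothetical helper)."""
    return [
        {
            "name": name,
            "type": (
                "Collection(Edm.String)"
                if name in MSG_COLLECTION_PROPERTIES
                else "Edm.String"
            ),
            "searchable": True,  # assumed attribute
        }
        for name in MSG_PROPERTIES
    ]


for field in msg_field_definitions():
    print(f'{field["name"]}: {field["type"]}')
```

Keeping the collection properties in a separate set makes the one documented type difference explicit instead of hiding it in a long list of hand-written definitions.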