Index data from SharePoint document libraries

Important

SharePoint indexer support is currently in public preview under Supplemental Terms of Use. Request access to this feature, and after access is enabled, use a preview REST API (2020-06-30-preview or later) to index your content. There is currently limited portal support and no .NET SDK support.

This article explains how to configure a search indexer to index documents stored in SharePoint document libraries for full text search in Azure Cognitive Search. Configuration steps are followed by a deeper exploration of behaviors and scenarios you're likely to encounter.

Note

SharePoint supports a granular authorization model that determines per-user access at the document level. The SharePoint indexer does not pull these permissions into the search index, and Cognitive Search does not support document-level authorization. When a document is indexed from SharePoint into a search service, the content is available to anyone who has read access to the index. If you require document-level permissions, you should investigate security filters to trim results of unauthorized content.

Functionality

An indexer in Azure Cognitive Search is a crawler that extracts searchable data and metadata from a data source. The SharePoint indexer will connect to your SharePoint site and index documents from one or more document libraries. The indexer provides the following functionality:

  • Indexing of content and metadata from one or more document libraries.
  • Incremental indexing, where the indexer identifies which files have changed and indexes only the updated content. For example, if five PDFs are originally indexed and one is updated, only the updated PDF is indexed.
  • Built-in deletion detection. If a document is deleted from a document library, the indexer detects the deletion on the next indexer run and removes the document from the index.
  • Default extraction of text and normalized images from the indexed documents. Optionally, a skillset can be added to the pipeline for AI enrichment.

Prerequisites

Supported document formats

The SharePoint indexer can extract text from the following document formats:

  • CSV (see Indexing CSV blobs)
  • EML
  • EPUB
  • GZ
  • HTML
  • JSON (see Indexing JSON blobs)
  • KML (XML for geographic representations)
  • Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML (both 2003 and 2006 Word XML)
  • Open Document formats: ODT, ODS, ODP
  • PDF
  • Plain text files (see also Indexing plain text)
  • RTF
  • XML
  • ZIP

Configure the SharePoint indexer

To set up the SharePoint indexer, you'll need to perform some tasks in the Azure portal and others through the preview REST API.

Step 1 (Optional): Enable system-assigned managed identity

When a system-assigned managed identity is enabled, Azure creates an identity for your search service that can be used by the indexer. This identity is used to automatically detect the tenant the search service is provisioned in.

If the SharePoint site is in the same tenant as the search service, you'll need to enable the system-assigned managed identity for the search service in the Azure portal. If the SharePoint site is in a different tenant from the search service, skip this step.

After selecting Save, you'll see an Object ID that has been assigned to your search service.

Step 2: Decide which permissions the indexer requires

The SharePoint indexer supports both delegated and application permissions. Choose which permissions you want to use based on your scenario:

  • Delegated permissions, where the indexer runs under the identity of the user or app sending the request. Data access is limited to the sites and files to which the user has access. To support delegated permissions, the indexer requires a device code prompt to sign in on behalf of the user.

  • Application permissions, where the indexer runs under the identity of the SharePoint tenant with access to all sites and files within the SharePoint tenant. The indexer requires a client secret to access the SharePoint tenant. The indexer will also require tenant admin approval before it can index any content.

If your Azure Active Directory organization has Conditional Access enabled and your administrator isn't able to grant any device access for Delegated permissions, you should consider Application permissions instead. For more information, see SharePoint Conditional Access policies.

Step 3: Create an Azure AD application

The SharePoint indexer will use this Azure Active Directory (Azure AD) application for authentication.

  1. Sign in to the Azure portal.

  2. Search for or navigate to Azure Active Directory, then select App registrations.

  3. Select + New registration:

    1. Provide a name for your app.
    2. Select Single tenant.
    3. Skip the URI designation step. No redirect URI required.
    4. Select Register.
  4. On the left, select API permissions, then Add a permission, then Microsoft Graph.

    • If the indexer is using delegated API permissions, select Delegated permissions and add the following:

      • Delegated - Files.Read.All
      • Delegated - Sites.Read.All
      • Delegated - User.Read

      Delegated permissions allow the search client to connect to SharePoint under the security identity of the current user.

    • If the indexer is using application API permissions, then select Application permissions and add the following:

      • Application - Files.Read.All
      • Application - Sites.Read.All

      Using application permissions means that the indexer accesses the SharePoint site in a service context. When you run the indexer, it has access to all content in the SharePoint tenant, which requires tenant admin approval. A client secret is also required for authentication. Setting up the client secret is described later in this article.

  5. Give admin consent.

    Tenant admin consent is required when using application API permissions. Some tenants are locked down in such a way that tenant admin consent is required for delegated API permissions as well. If either of these conditions applies, you’ll need to have a tenant admin grant consent for this Azure AD application before creating the indexer.

  6. Select the Authentication tab.

  7. Set Allow public client flows to Yes, then select Save.

  8. Select + Add a platform, then Mobile and desktop applications, then check https://login.microsoftonline.com/common/oauth2/nativeclient, then Configure.

  9. (Application API Permissions only) To authenticate to the Azure AD application using application permissions, the indexer requires a client secret.

    • Select Certificates & Secrets from the menu on the left, then Client secrets, then New client secret.

    • In the menu that pops up, enter a description for the new client secret. Adjust the expiration date if necessary. If the secret expires, you'll need to create a new one and update the indexer with the new secret.

    • The new client secret will appear in the secret list. Once you navigate away from the page, the secret will no longer be visible, so copy it using the copy button and save it in a secure location.

Step 4: Create data source

Important

Starting in this section, you need to use the preview REST API for the remaining steps. If you’re not familiar with the Azure Cognitive Search REST API, we suggest taking a look at this Quickstart.

A data source specifies which data to index, credentials needed to access the data, and policies to efficiently identify changes in the data (new, modified, or deleted rows). A data source can be used by multiple indexers in the same search service.

For SharePoint indexing, the data source must have the following required properties:

  • name is the unique name of the data source within your search service.
  • type must be "sharepoint". This value is case-sensitive.
  • credentials provide the SharePoint endpoint and the Azure AD application (client) ID. An example SharePoint endpoint is https://microsoft.sharepoint.com/teams/MySharePointSite. You can get the endpoint by navigating to the home page of your SharePoint site and copying the URL from the browser.
  • container specifies which document library to index. More information on creating the container can be found in the Controlling which documents are indexed section of this document.

To create a data source, call Create Data Source using preview API version 2020-06-30-Preview or later.

POST https://[service name].search.windows.net/datasources?api-version=2020-06-30-Preview
Content-Type: application/json
api-key: [admin key]

{
    "name" : "sharepoint-datasource",
    "type" : "sharepoint",
    "credentials" : { "connectionString" : "[connection-string]" },
    "container" : { "name" : "defaultSiteLibrary", "query" : null }
}
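
If you script this step, the request above could be assembled as in the following minimal Python sketch. It uses only the standard library and builds the request without sending it. The function name, the service name, and the placeholder values are illustrative assumptions, not part of any SDK.

```python
import json
import urllib.request

# Preview API version named in this article.
API_VERSION = "2020-06-30-Preview"


def build_create_datasource_request(service_name, admin_key, connection_string,
                                    container_name="defaultSiteLibrary", query=None):
    """Return an unsent urllib Request for the Create Data Source call.

    Hypothetical helper: it mirrors the JSON body shown in this article.
    """
    body = {
        "name": "sharepoint-datasource",
        "type": "sharepoint",  # this value is case-sensitive
        "credentials": {"connectionString": connection_string},
        "container": {"name": container_name, "query": query},
    }
    url = (f"https://{service_name}.search.windows.net/datasources"
           f"?api-version={API_VERSION}")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="POST",
    )


# Placeholder values; replace them with your own before sending.
request = build_create_datasource_request(
    "my-service", "<admin key>", "<connection string>")
print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request. Keeping construction separate from sending makes it easy to inspect the body first.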

Connection string format

The format of the connection string changes based on whether the indexer is using delegated API permissions or application API permissions:

  • Delegated API permissions connection string format

    SharePointOnlineEndpoint=[SharePoint site url];ApplicationId=[Azure AD App ID];TenantId=[SharePoint site tenant id]

  • Application API permissions connection string format

    SharePointOnlineEndpoint=[SharePoint site url];ApplicationId=[Azure AD App ID];ApplicationSecret=[Azure AD App client secret];TenantId=[SharePoint site tenant id]

Note

If the SharePoint site is in the same tenant as the search service and system-assigned managed identity is enabled, TenantId doesn't have to be included in the connection string. If the SharePoint site is in a different tenant from the search service, TenantId must be included.
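
As an illustration, the two formats can be produced by one small helper. This is a sketch only: the function name and the sample values are hypothetical, and the key names are the ones listed in the formats above.

```python
def build_connection_string(site_url, application_id,
                            application_secret=None, tenant_id=None):
    """Build a SharePoint data source connection string.

    Hypothetical helper. Omit application_secret for delegated permissions.
    Omit tenant_id only when the site is in the same tenant as the search
    service and the system-assigned managed identity is enabled.
    """
    parts = [
        f"SharePointOnlineEndpoint={site_url}",
        f"ApplicationId={application_id}",
    ]
    if application_secret is not None:
        parts.append(f"ApplicationSecret={application_secret}")
    if tenant_id is not None:
        parts.append(f"TenantId={tenant_id}")
    return ";".join(parts)


# Delegated permissions, site in a different tenant (sample values).
print(build_connection_string(
    "https://microsoft.sharepoint.com/teams/MySharePointSite",
    "<Azure AD App ID>",
    tenant_id="<SharePoint site tenant id>",
))
```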

Step 5: Create an index

The index specifies the fields in a document, attributes, and other constructs that shape the search experience.

To create an index, call Create Index:

POST https://[service name].search.windows.net/indexes?api-version=2020-06-30
Content-Type: application/json
api-key: [admin key]

{
    "name" : "sharepoint-index",
    "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "metadata_spo_item_name", "type": "Edm.String", "key": false, "searchable": true, "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_path", "type": "Edm.String", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_content_type", "type": "Edm.String", "key": false, "searchable": false, "filterable": true, "sortable": false, "facetable": true },
        { "name": "metadata_spo_item_last_modified", "type": "Edm.DateTimeOffset", "key": false, "searchable": false, "filterable": false, "sortable": true, "facetable": false },
        { "name": "metadata_spo_item_size", "type": "Edm.Int64", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
        { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
    ]
}

Important

Only metadata_spo_site_library_item_id may be used as the key field in an index populated by the SharePoint indexer. If a key field doesn't exist in the data source, metadata_spo_site_library_item_id is automatically mapped to the key field.

Step 6: Create an indexer

An indexer connects a data source with a target search index and provides a schedule to automate the data refresh. Once the index and data source have been created, you're ready to create the indexer.

During this section you’ll be asked to sign in with your organization credentials that have access to the SharePoint site. If possible, we recommend creating a new organizational user account and giving that new user the exact permissions that you want the indexer to have.

There are a few steps to creating the indexer:

  1. Send a Create Indexer request:

    POST https://[service name].search.windows.net/indexers?api-version=2020-06-30-Preview
    Content-Type: application/json
    api-key: [admin key]
    
    {
        "name" : "sharepoint-indexer",
        "dataSourceName" : "sharepoint-datasource",
        "targetIndexName" : "sharepoint-index",
        "parameters": {
            "batchSize": null,
            "maxFailedItems": null,
            "maxFailedItemsPerBatch": null,
            "base64EncodeKeys": null,
            "configuration": {
                "indexedFileNameExtensions" : ".pdf, .docx",
                "excludedFileNameExtensions" : ".png, .jpg",
                "dataToExtract": "contentAndMetadata"
            }
        },
        "schedule" : { },
        "fieldMappings" : [
            {
                "sourceFieldName" : "metadata_spo_site_library_item_id",
                "targetFieldName" : "id",
                "mappingFunction" : {
                    "name" : "base64Encode"
                }
            }
        ]
    }
    
  2. The first time you create the indexer, the request fails and returns the following error. Go to the link in the error message. If you don’t go to the link within 10 minutes, the code will expire and you’ll need to recreate the data source.

    {
        "error": {
            "code": "",
            "message": "Error with data source: To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code <CODE> to authenticate.  Please adjust your data source definition in order to proceed."
        }
    }
    
  3. Provide the code that was included in the error message.

  4. The SharePoint indexer will access the SharePoint content as the signed-in user, which is the account you use to sign in during this step. If that account doesn’t have access to a document in the document library that you want to index, the indexer won’t have access to that document either.

    If possible, we recommend creating a new user account and giving that new user the exact permissions that you want the indexer to have.

  5. Approve the permissions that are being requested.

  6. Resend the indexer create request. This time the request should succeed.

    POST https://[service name].search.windows.net/indexers?api-version=2020-06-30-Preview
    Content-Type: application/json
    api-key: [admin key]
    
    {
        "name" : "sharepoint-indexer",
        "dataSourceName" : "sharepoint-datasource",
        "targetIndexName" : "sharepoint-index",
        "parameters": {
            "batchSize": null,
            "maxFailedItems": null,
            "maxFailedItemsPerBatch": null,
            "base64EncodeKeys": null,
            "configuration": {
                "dataToExtract": "contentAndMetadata",
                "indexedFileNameExtensions" : ".pdf, .docx",
                "excludedFileNameExtensions" : ".png, .jpg"
            }
        },
        "schedule" : { },
        "fieldMappings" : [
            {
                "sourceFieldName" : "metadata_spo_site_library_item_id",
                "targetFieldName" : "id",
                "mappingFunction" : {
                    "name" : "base64Encode"
                }
            }
        ]
    }
    

Note

If the Azure AD application requires admin approval and was not approved before you signed in, you may see a screen stating that admin approval is required to continue.
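
If you automate the sign-in flow from this step, the sign-in URL and device code can be pulled out of the error body with a regular expression. The following Python sketch assumes the message wording shown in the sample error above. The function name and the sample code value are hypothetical, and the pattern would need adjusting if the service changes the wording.

```python
import json
import re

# Pattern based on the error message wording shown in this article.
DEVICE_LOGIN_PATTERN = re.compile(
    r"open the page (https://\S+) and enter the code (\S+) to authenticate"
)


def parse_device_login(error_body):
    """Return (login_url, device_code), or None if the body is a different error.

    Hypothetical helper: error_body is the JSON text of the failed
    Create Indexer response.
    """
    message = json.loads(error_body).get("error", {}).get("message", "")
    match = DEVICE_LOGIN_PATTERN.search(message)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Sample body with a made-up code value.
sample = json.dumps({
    "error": {
        "code": "",
        "message": "Error with data source: To sign in, use a web browser to "
                   "open the page https://microsoft.com/devicelogin and enter "
                   "the code ABC123XYZ to authenticate.  Please adjust your "
                   "data source definition in order to proceed.",
    }
})
print(parse_device_login(sample))
```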

Step 7: Check the indexer status

After the indexer has been created, you can call Get Indexer Status:

GET https://[service name].search.windows.net/indexers/sharepoint-indexer/status?api-version=2020-06-30-Preview
Content-Type: application/json
api-key: [admin key]

Updating the data source

If there are no updates to the data source object, the indexer can run on a schedule without any user interaction. However, every time the Azure Cognitive Search data source object is updated, you'll need to sign in again in order for the indexer to run. For example, if you change the data source query, sign in again at https://microsoft.com/devicelogin with a new code.

Once the data source has been updated, follow these steps:

  1. Call Run Indexer to manually kick off indexer execution.

    POST https://[service name].search.windows.net/indexers/sharepoint-indexer/run?api-version=2020-06-30-Preview  
    Content-Type: application/json
    api-key: [admin key]
    
  2. Check the indexer status. If the last indexer run has an error telling you to go to https://microsoft.com/devicelogin, go to that page and provide the new code.

    GET https://[service name].search.windows.net/indexers/sharepoint-indexer/status?api-version=2020-06-30-Preview
    Content-Type: application/json
    api-key: [admin key]
    
  3. Sign in.

  4. Manually run the indexer again and check the indexer status. This time the indexer run should successfully start.
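
The check in step 2 can be scripted against the parsed status response. This Python sketch assumes that the Get Indexer Status response carries the most recent run under `lastResult`, with an `errorMessage` field. Treat both field names as assumptions to verify against the REST reference. The function name and sample payload are illustrative.

```python
DEVICE_LOGIN_URL = "https://microsoft.com/devicelogin"


def needs_device_login(status):
    """Return True if the last run's error points at the device login page.

    Hypothetical helper: status is the parsed JSON of a Get Indexer Status
    response. The lastResult and errorMessage names are assumptions.
    """
    last_result = status.get("lastResult") or {}
    message = last_result.get("errorMessage") or ""
    return DEVICE_LOGIN_URL in message


# Made-up payload that imitates a run waiting for a new sign-in.
sample_status = {
    "status": "running",
    "lastResult": {
        "status": "transientFailure",
        "errorMessage": "To sign in, use a web browser to open the page "
                        "https://microsoft.com/devicelogin and enter the "
                        "code <CODE> to authenticate.",
    },
}
print(needs_device_login(sample_status))
```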

Indexing document metadata

If you have set the indexer to index document metadata ("dataToExtract": "contentAndMetadata"), the following metadata will be available to index.

| Identifier | Type | Description |
|---|---|---|
| metadata_spo_site_library_item_id | Edm.String | The combination key of site ID, library ID, and item ID, which uniquely identifies an item in a document library for a site. |
| metadata_spo_site_id | Edm.String | The ID of the SharePoint site. |
| metadata_spo_library_id | Edm.String | The ID of the document library. |
| metadata_spo_item_id | Edm.String | The ID of the (document) item in the library. |
| metadata_spo_item_last_modified | Edm.DateTimeOffset | The last modified date/time (UTC) of the item. |
| metadata_spo_item_name | Edm.String | The name of the item. |
| metadata_spo_item_size | Edm.Int64 | The size (in bytes) of the item. |
| metadata_spo_item_content_type | Edm.String | The content type of the item. |
| metadata_spo_item_extension | Edm.String | The extension of the item. |
| metadata_spo_item_weburi | Edm.String | The URI of the item. |
| metadata_spo_item_path | Edm.String | The combination of the parent path and item name. |

The SharePoint indexer also supports metadata specific to each document type. More information can be found in Content metadata properties used in Azure Cognitive Search.

Note

To index custom metadata, "additionalColumns" must be specified in the query parameter of the data source.

Include or exclude by file type

You can control which files are indexed by setting inclusion and exclusion criteria in the "parameters" section of the indexer definition.

Include specific file extensions by setting "indexedFileNameExtensions" to a comma-separated list of file extensions (with a leading dot). Exclude specific file extensions by setting "excludedFileNameExtensions" to the extensions that should be skipped. If the same extension is in both lists, it will be excluded from indexing.

PUT /indexers/[indexer name]?api-version=2020-06-30
{
    "parameters" : { 
        "configuration" : { 
            "indexedFileNameExtensions" : ".pdf, .docx",
            "excludedFileNameExtensions" : ".png, .jpeg" 
        } 
    }
}
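
To make the precedence rule concrete, the following Python sketch models how the two lists combine. It is an illustration of the rule described above, not the service's implementation. The function names are hypothetical, and two behaviors are assumptions: extensions are compared case-insensitively, and an unset inclusion list means every extension is eligible.

```python
import os


def parse_extensions(value):
    """Split a comma-separated extension list such as ".pdf, .docx" into a set."""
    if not value:
        return set()
    return {ext.strip().lower() for ext in value.split(",") if ext.strip()}


def is_indexed(file_name, indexed=None, excluded=None):
    """Model the include/exclude rule: exclusion wins when both lists match.

    Assumptions: case-insensitive matching, and no inclusion list means
    all extensions are eligible.
    """
    extension = os.path.splitext(file_name)[1].lower()
    if extension in parse_extensions(excluded):
        return False
    included = parse_extensions(indexed)
    return not included or extension in included


print(is_indexed("report.pdf", ".pdf, .docx", ".png, .jpeg"))
print(is_indexed("scan.pdf", ".pdf, .docx", ".pdf"))
```

The first call models a file that is listed for inclusion only. The second models an extension present in both lists, which is excluded.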

Controlling which documents are indexed

A single SharePoint indexer can index content from one or more document libraries. Use the "container" parameter on the data source definition to indicate which sites and document libraries to index from. The data source "container" section has two properties for this task: "name" and "query".

Name

The "name" property is required and must be one of three values:

| Value | Description |
|---|---|
| defaultSiteLibrary | Index all content from the site's default document library. |
| allSiteLibraries | Index all content from all document libraries in a site. Document libraries from a subsite are out of scope. If you need content from subsites, choose "useQuery" and specify "includeLibrariesInSite". |
| useQuery | Only index the content defined in the "query". |

Query

The "query" parameter of the data source is made up of keyword/value pairs. The following keywords can be used. The values are either site URLs or document library URLs.

Note

To get the value for a particular keyword, we recommend navigating to the document library that you’re trying to include/exclude and copying the URI from the browser. This is the easiest way to get the value to use with a keyword in the query.
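
Once you have the URLs, the query string can be assembled in code. The following Python sketch joins keyword/value pairs with semicolons and escapes commas and semicolons in column names with a backslash, following the keyword descriptions in this section. The function names are hypothetical. Note that the double backslash seen in JSON samples is the JSON encoding of a single backslash, which `json.dumps` produces automatically.

```python
import json


def escape_column(name):
    """Escape commas and semicolons in a column name with a backslash."""
    return name.replace(",", "\\,").replace(";", "\\;")


def build_query(include_library=None, include_libraries_in_site=None,
                exclude_library=None, additional_columns=None):
    """Assemble a "query" value from keyword/value pairs (hypothetical helper)."""
    pairs = []
    if include_libraries_in_site:
        pairs.append(f"includeLibrariesInSite={include_libraries_in_site}")
    if include_library:
        pairs.append(f"includeLibrary={include_library}")
    if exclude_library:
        pairs.append(f"excludeLibrary={exclude_library}")
    if additional_columns:
        columns = ",".join(escape_column(c) for c in additional_columns)
        pairs.append(f"additionalColumns={columns}")
    return ";".join(pairs)


query = build_query(
    include_library="https://mycompany.sharepoint.com/mysite/MyDocumentLibrary",
    additional_columns=["MyCustomColumnWith,", "MyCustomColumnWith;"],
)
container = {"name": "useQuery", "query": query}
# json.dumps doubles each backslash, as in the JSON samples in this article.
print(json.dumps(container))
```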

Keywords, value descriptions, and examples:

  • null: If null or empty, index either the default document library or all document libraries, depending on the container name.

    Example:

    "container" : { "name" : "defaultSiteLibrary", "query" : null }

  • includeLibrariesInSite: Index content from all libraries under the specified site in the connection string. The scope includes any subsites of your site. The value should be the URI of the site or subsite.

    Example:

    "container" : { "name" : "useQuery", "query" : "includeLibrariesInSite=https://mycompany.sharepoint.com/mysite" }

  • includeLibrary: Index all content from this library. The value is the fully qualified path to the library, which can be copied from your browser.

    Example 1 (fully qualified path):

    "container" : { "name" : "useQuery", "query" : "includeLibrary=https://mycompany.sharepoint.com/mysite/MyDocumentLibrary" }

    Example 2 (URI copied from your browser):

    "container" : { "name" : "useQuery", "query" : "includeLibrary=https://mycompany.sharepoint.com/teams/mysite/MyDocumentLibrary/Forms/AllItems.aspx" }

  • excludeLibrary: Don't index content from this library. The value is the fully qualified path to the library, which can be copied from your browser.

    Example 1 (fully qualified path):

    "container" : { "name" : "useQuery", "query" : "includeLibrariesInSite=https://mysite.sharepoint.com/subsite1; excludeLibrary=https://mysite.sharepoint.com/subsite1/MyDocumentLibrary" }

    Example 2 (URI copied from your browser):

    "container" : { "name" : "useQuery", "query" : "includeLibrariesInSite=https://mycompany.sharepoint.com/teams/mysite; excludeLibrary=https://mycompany.sharepoint.com/teams/mysite/MyDocumentLibrary/Forms/AllItems.aspx" }

  • additionalColumns: Index columns from the document library. The value is a comma-separated list of column names you want to index. Use a double backslash to escape semicolons and commas in column names.

    Example 1 (additionalColumns=MyCustomColumn,MyCustomColumn2):

    "container" : { "name" : "useQuery", "query" : "includeLibrary=https://mycompany.sharepoint.com/mysite/MyDocumentLibrary;additionalColumns=MyCustomColumn,MyCustomColumn2" }

    Example 2 (escape characters using double backslash):

    "container" : { "name" : "useQuery", "query" : "includeLibrary=https://mycompany.sharepoint.com/teams/mysite/MyDocumentLibrary/Forms/AllItems.aspx;additionalColumns=MyCustomColumnWith\\,,MyCustomColumnWith\\;" }

Handling errors

By default, the SharePoint indexer stops as soon as it encounters a document with an unsupported content type (for example, an image). You can use the excludedFileNameExtensions parameter to skip certain content types. However, you may need to index documents without knowing all the possible content types in advance. To continue indexing when an unsupported content type is encountered, set the failOnUnsupportedContentType configuration parameter to false:

PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2020-06-30-Preview
Content-Type: application/json
api-key: [admin key]

{
    ... other parts of indexer definition
    "parameters" : { "configuration" : { "failOnUnsupportedContentType" : false } }
}

For some documents, Azure Cognitive Search is unable to determine the content type, or unable to process a document of otherwise supported content type. To ignore this failure mode, set the failOnUnprocessableDocument configuration parameter to false:

"parameters" : { "configuration" : { "failOnUnprocessableDocument" : false } }

Azure Cognitive Search limits the size of documents that are indexed. These limits are documented in Service Limits in Azure Cognitive Search. Oversized documents are treated as errors by default. However, you can still index the storage metadata of oversized documents if you set the indexStorageMetadataOnlyForOversizedDocuments configuration parameter to true:

"parameters" : { "configuration" : { "indexStorageMetadataOnlyForOversizedDocuments" : true } }

You can also continue indexing if errors happen at any point of processing, either while parsing documents or while adding documents to an index. To ignore a specific number of errors, set the maxFailedItems and maxFailedItemsPerBatch configuration parameters to the desired values. For example:

{
    ... other parts of indexer definition
    "parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 10 }
}
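
If you manage indexer definitions in code, the error-handling settings from this section could be applied in one place before sending the update. The following Python sketch merges them into an existing definition without modifying the original. The function name and its default thresholds are illustrative assumptions, and the parameter names are the ones described above.

```python
import copy


def with_error_tolerance(indexer, max_failed_items=10,
                         max_failed_items_per_batch=10):
    """Return a copy of an indexer definition with lenient error handling.

    Hypothetical helper: existing parameters and configuration values are
    kept unless they are one of the settings overwritten below.
    """
    updated = copy.deepcopy(indexer)
    parameters = updated.get("parameters") or {}
    configuration = parameters.get("configuration") or {}
    configuration.update({
        "failOnUnsupportedContentType": False,
        "failOnUnprocessableDocument": False,
        "indexStorageMetadataOnlyForOversizedDocuments": True,
    })
    parameters["configuration"] = configuration
    parameters["maxFailedItems"] = max_failed_items
    parameters["maxFailedItemsPerBatch"] = max_failed_items_per_batch
    updated["parameters"] = parameters
    return updated


definition = {
    "name": "sharepoint-indexer",
    "dataSourceName": "sharepoint-datasource",
    "targetIndexName": "sharepoint-index",
    "parameters": {"configuration": {"dataToExtract": "contentAndMetadata"}},
}
tolerant = with_error_tolerance(definition)
print(tolerant["parameters"])
```

The resulting dictionary can be serialized as the body of the PUT request shown earlier in this section.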
