Excluding Specific Site Libraries When Indexing SharePoint Subsites with Azure Cognitive Search AI

86371849 0 Reputation points
2023-11-28T11:21:43.15+00:00

I am encountering an issue with Azure AI Search (formerly known as Azure Cognitive Search). My objective is to index my SharePoint content with Azure AI Search so that I can integrate it with Azure OpenAI. In my scenario, I have a site, let's call it "Root," with multiple subsites such as subsite-1, subsite-2, subsite-3, and so on. I want to index all the libraries within the subsites. Each subsite's libraries have additional columns that I also want to include in the index. However, the libraries in the "Root" site itself do not have these additional columns, and I want to skip indexing them.

Datasource:

{
    "name": "prod-sharepoint-datasource",
    "type": "sharepoint",
    "credentials": {
        "connectionString": "SharePointOnlineEndpoint=https://xxx.sharepoint.com/sites/Root/;ApplicationId=xxx;ApplicationSecret=xxx"
    },
    "container": {
        "name": "useQuery",
        "query": "includeLibrariesInSite=https://xxx.sharepoint.com/sites/Root;additionalColumns=MyCustomColumn,MyCustomColumn2,MyCustomColumn3"
    }
}

I have attempted to exclude libraries from the "Root" site using the excludeLibrary property, but it did not work as expected. Here's an example of what I tried:

{
    "name": "prod-sharepoint-datasource",
    "type": "sharepoint",
    "credentials": {
        "connectionString": "SharePointOnlineEndpoint=https://xxx.sharepoint.com/sites/Root/;ApplicationId=xxx;ApplicationSecret=xxx"
    },
    "container": {
        "name": "useQuery",
        "query": "includeLibrariesInSite=https://xxx.sharepoint.com/sites/Root;additionalColumns=MyCustomColumn,MyCustomColumn2,MyCustomColumn3;excludeLibrary=https://xxx.sharepoint.com/sites/Root/default.aspx;excludeLibrary=https://xxx.sharepoint.com/sites/Root/Library1.aspx;excludeLibrary=https://xxx.sharepoint.com/sites/Root/Library2.aspx;excludeLibrary=https://xxx.sharepoint.com/sites/Root/Library3.aspx"
    }
}

I also provided the JSON for the index and the indexer configurations. If anyone has insights on how to properly exclude libraries from the "Root" site or exclude the entire "Root" site so that only the libraries from its subsites are indexed, I would greatly appreciate the assistance.

Indexes:

{
    "name": "prod-sharepoint-indexes",
    "fields": [
        { "name": "column1", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false },
        { "name": "column2", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false },
        { "name": "column3", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false },
        { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false },
        { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "metadata_spo_item_name", "type": "Edm.String", "key": false, "searchable": true, "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_path", "type": "Edm.String", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_weburi", "type": "Edm.String", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_content_type", "type": "Edm.String", "key": false, "searchable": false, "filterable": true, "sortable": false, "facetable": true },
        { "name": "metadata_spo_item_last_modified", "type": "Edm.DateTimeOffset", "key": false, "searchable": false, "filterable": false, "sortable": true, "facetable": false },
        { "name": "metadata_spo_item_size", "type": "Edm.Int64", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false }
    ]
}

Indexer:

{
    "name": "prod-sharepoint-indexer",
    "dataSourceName": "prod-sharepoint-datasource",
    "targetIndexName": "prod-sharepoint-indexes",
    "parameters": {
        "batchSize": null,
        "maxFailedItems": null,
        "maxFailedItemsPerBatch": null,
        "base64EncodeKeys": null,
        "configuration": {
            "indexedFileNameExtensions": ".pdf, .docx, .msg, .xlsx, .eml, .ppt, .doc",
            "excludedFileNameExtensions": ".png, .jpg",
            "dataToExtract": "contentAndMetadata"
        }
    },
    "schedule": { },
    "fieldMappings": [
        {
            "sourceFieldName": "metadata_spo_site_library_item_id",
            "targetFieldName": "id",
            "mappingFunction": {
                "name": "base64Encode"
            }
        }
    ]
}

2 answers

  1. Q&A Assist
    2023-11-28T11:21:43.2266667+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    To exclude libraries from the "Root" site, use the excludeLibrary keyword in the query property of the container section in the data source definition. The query is a list of keyword=value pairs separated by semicolons, so excludeLibrary is appended as one more pair after includeLibrariesInSite and additionalColumns. Each excludeLibrary value should be the fully qualified path to a document library. In the JSON you provided, the values point at .aspx pages such as default.aspx and Library1.aspx, which may be why the exclusion had no effect. Here's an example of how to exclude a library from the "Root" site:

    {
        "name": "prod-sharepoint-datasource",
        "type": "sharepoint",
        "credentials": {
            "connectionString": "SharePointOnlineEndpoint=https://xxx.sharepoint.com/sites/Root/;ApplicationId=xxx;ApplicationSecret=xxx"
        },
        "container": {
            "name": "useQuery",
            "query": "includeLibrariesInSite=https://xxx.sharepoint.com/sites/Root;additionalColumns=MyCustomColumn,MyCustomColumn2,MyCustomColumn3;excludeLibrary=https://xxx.sharepoint.com/sites/Root/Library1"
        }
    }

    In the example above, excludeLibrary is appended as a further semicolon-separated pair. This excludes the "Library1" library of the "Root" site while the site's other libraries stay in scope.

    To exclude the entire "Root" site so that only the libraries from its subsites are indexed, name only the subsites in includeLibrariesInSite and leave the "Root" site itself out of the query. Here's an example:

    {
        "name": "prod-sharepoint-datasource",
        "type": "sharepoint",
        "credentials": {
            "connectionString": "SharePointOnlineEndpoint=https://xxx.sharepoint.com/sites/Root/;ApplicationId=xxx;ApplicationSecret=xxx"
        },
        "container": {
            "name": "useQuery",
            "query": "includeLibrariesInSite=https://xxx.sharepoint.com/sites/Root/subsite-1;includeLibrariesInSite=https://xxx.sharepoint.com/sites/Root/subsite-2;additionalColumns=MyCustomColumn,MyCustomColumn2,MyCustomColumn3"
        }
    }
    

    In the example above, the includeLibrariesInSite parameter is used to specify the subsites to include, and the "Root" site is excluded. This will only index the libraries from the "subsite-1" and "subsite-2" subsites.
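
With many subsites, the query string becomes long and easy to mistype, so it can help to generate it. The following is a minimal sketch, not part of any Azure SDK: build_container_query is a hypothetical helper, and it assumes the query is a semicolon-separated list of keyword=value pairs with one includeLibrariesInSite pair per subsite. It also assumes column names contain no commas or semicolons.

```python
from typing import Iterable


def build_container_query(
    sites: Iterable[str],
    additional_columns: Iterable[str] = (),
    excluded_libraries: Iterable[str] = (),
) -> str:
    """Join keyword=value pairs with semicolons (hypothetical helper).

    Assumption: one includeLibrariesInSite pair per site and one
    excludeLibrary pair per library, all separated by semicolons.
    """
    # One pair per site; trailing slashes are dropped for consistency.
    pairs = [f"includeLibrariesInSite={site.rstrip('/')}" for site in sites]

    # Column names are comma-separated inside a single pair.
    columns = list(additional_columns)
    if columns:
        pairs.append("additionalColumns=" + ",".join(columns))

    # One pair per excluded library.
    pairs.extend(f"excludeLibrary={lib.rstrip('/')}" for lib in excluded_libraries)
    return ";".join(pairs)


root = "https://xxx.sharepoint.com/sites/Root"
query = build_container_query(
    sites=[f"{root}/subsite-1", f"{root}/subsite-2"],
    additional_columns=["MyCustomColumn", "MyCustomColumn2", "MyCustomColumn3"],
)

# This dict corresponds to the "container" section of the data source JSON.
datasource_container = {"name": "useQuery", "query": query}
print(datasource_container["query"])
```

The resulting dictionary can be serialized with json.dumps and pasted into the "container" section of the data source definition. Only the subsites passed in sites appear in the query, so the "Root" site is left out.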



  2. brtrach-MSFT 15,791 Reputation points Microsoft Employee
    2023-12-03T21:22:32.8+00:00

    @86371849 To exclude libraries from the "Root" site, you can use the excludeLibrary keyword in the query. However, it is possible that the issue is with the fully qualified path to the library. To ensure that you are using the correct path, we recommend navigating to the document library that you are trying to exclude and copying the URI from the browser. This is the easiest way to get the value to use with the excludeLibrary keyword in the query.

    Here's an example of how you can exclude libraries from the "Root" site:

    {
        "name": "prod-sharepoint-datasource",
        "type": "sharepoint",
        "credentials": {
            "connectionString": "SharePointOnlineEndpoint=https://xxx.sharepoint.com/sites/Root/;ApplicationId=xxx;ApplicationSecret=xxx"
        },
        "container": {
            "name": "useQuery",
            "query": "includeLibrariesInSite=https://xxx.sharepoint.com/sites/Root;additionalColumns=MyCustomColumn,MyCustomColumn2,MyCustomColumn3;excludeLibrary=https://xxx.sharepoint.com/sites/Root/Library1;excludeLibrary=https://xxx.sharepoint.com/sites/Root/Library2;excludeLibrary=https://xxx.sharepoint.com/sites/Root/Library3"
        } 
    }

    In this example, we excluded three libraries from the "Root" site: Library1, Library2, and Library3. Please ensure that the fully qualified path to the library is correct.
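
Before updating the data source, a quick local check can catch excludeLibrary paths that do not sit under any included site. This is only a sketch of a sanity check on the query string and does not reproduce how the indexer resolves libraries. The function names are hypothetical, and the code assumes semicolon-separated keyword=value pairs, values without escaped semicolons, and case-insensitive URL comparison.

```python
from typing import List, Tuple


def parse_container_query(query: str) -> List[Tuple[str, str]]:
    """Split a container query into (keyword, value) pairs.

    Assumption: pairs are separated by ';' and no value contains an
    escaped semicolon.
    """
    pairs = []
    for part in query.split(";"):
        part = part.strip()
        if not part:
            continue
        keyword, _, value = part.partition("=")
        pairs.append((keyword.strip(), value.strip()))
    return pairs


def excluded_outside_included_sites(query: str) -> List[str]:
    """Return excludeLibrary values that are not under any included site."""
    pairs = parse_container_query(query)
    # Normalize each site to end with '/' so that ".../Root" does not
    # accidentally match ".../RootOther/Library".
    sites = [
        value.rstrip("/").lower() + "/"
        for keyword, value in pairs
        if keyword == "includeLibrariesInSite"
    ]
    excluded = [value for keyword, value in pairs if keyword == "excludeLibrary"]
    return [
        lib for lib in excluded
        if not any(lib.lower().startswith(site) for site in sites)
    ]


example_query = (
    "includeLibrariesInSite=https://xxx.sharepoint.com/sites/Root;"
    "additionalColumns=MyCustomColumn,MyCustomColumn2,MyCustomColumn3;"
    "excludeLibrary=https://xxx.sharepoint.com/sites/Root/Library1;"
    "excludeLibrary=https://xxx.sharepoint.com/sites/Other/Library2"
)
suspicious = excluded_outside_included_sites(example_query)
print(suspicious)
```

An empty result only means that every excluded path starts with an included site URL. It does not confirm that the library exists, so copying the URI from the browser remains the most reliable source for the value.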

    If you want to exclude the entire "Root" site so that only the libraries from its subsites are indexed, you can use the includeLibrariesInSite keyword to specify the subsites that you want to include. Here's an example:

    {
        "name": "prod-sharepoint-datasource",
        "type": "sharepoint",
        "credentials": {
            "connectionString": "SharePointOnlineEndpoint=https://xxx.sharepoint.com/sites/Root/;ApplicationId=xxx;ApplicationSecret=xxx"
        },
        "container": {
            "name": "useQuery",
            "query": "includeLibrariesInSite=https://xxx.sharepoint.com/sites/Root/Subsite1;includeLibrariesInSite=https://xxx.sharepoint.com/sites/Root/Sub
    
    