Enterprise Websites cloud Microsoft Graph connector
The Enterprise Websites cloud Microsoft Graph connector allows your organization to index webpages and content from your company-owned websites or public websites on the internet. After you configure the connector and index content from the website, end users can search for that content in Microsoft Search and Microsoft 365 Copilot.
This article is for Microsoft 365 administrators or anyone who configures, runs, and monitors an Enterprise Websites cloud Microsoft Graph connector.
Important
To index websites hosted on-premises or on private clouds, use the Enterprise Websites on-premises Microsoft Graph connector.
Capabilities
- Index webpages from cloud-accessible websites.
- Index up to 50 websites in a single connection.
- Exclude webpages from crawl using exclusion rules.
- Use Semantic search in Copilot to enable users to find relevant content.
Supported file types
File Extension | File Type | Description |
---|---|---|
.pdf | PDF | Portable Document Format |
.odt | OpenDocument Text | OpenDocument Text Document |
.ods | OpenDocument Spreadsheet | OpenDocument Spreadsheet |
.odp | OpenDocument Presentation | OpenDocument Presentation |
.odg | OpenDocument Graphics | OpenDocument Graphics |
.xls | Excel (Old) | Excel Spreadsheet (Old Format) |
.xlsx | Excel (New) | Excel Spreadsheet (New Format) |
.ppt | PowerPoint (Old) | PowerPoint Presentation (Old Format) |
.pptx | PowerPoint (New) | PowerPoint Presentation (New Format) |
.doc | Word (Old) | Word Document (Old Format) |
.docx | Word (New) | Word Document (New Format) |
.csv | CSV | Comma-Separated Values |
.txt | Plain Text | Plain Text File |
.xml | XML | Extensible Markup Language |
.md | Markdown | Markdown File |
.rtf | Rich Text Format | Rich Text Format |
.tsv | Tab Separated Values | Tab-Separated Values |
Supported MIME types
MIME Type | Description |
---|---|
text/html | HyperText Markup Language (HTML) used to format the structure of a webpage. |
text/webviewhtml | MIME type used for web content rendered in WebView controls. |
text/x-server-parsed-html | Server-parsed HTML documents, often used for Server Side Includes (SSI). |
Limitations
- The connector doesn't support authentication mechanisms such as SAML, JWT, and forms-based authentication.
- The connector doesn't support crawling of dynamic content in webpages.
Prerequisites
- You must be the search admin for your organization's Microsoft 365 tenant.
- Website URLs: To connect to your website content, you need the URL to the website. You can index multiple websites (up to 50) in a single connection.
- Service Account (optional): A service account is only needed when your websites require authentication. Public websites don't require authentication and can be crawled directly. For websites that require authentication, we recommend using a dedicated account to authenticate and crawl the content.
Get Started
1. Display name
A display name is used to identify each citation in Copilot, helping users easily recognize the associated file or item. The display name also signifies trusted content and is used as a content source filter. A default value is present for this field, but you can customize it to a name that users in your organization recognize.
2. Website URLs to index
Specify the root of the website that you'd like to crawl. The Enterprise Websites cloud Microsoft Graph connector uses this URL as the starting point and follows all the links from this URL for its crawl. You can index up to 50 different site URLs in a single connection. In the URLs field, enter the site URLs separated by commas (,). For example, https://www.contoso.com,https://www.contosoelectronics.com.
Note
The connector always starts crawling from the root of the URL. For example, if your provided URL is https://www.contoso.com/electronics, the connector starts crawling from https://www.contoso.com.
The connector only crawls webpages in the domain of the root URLs and doesn't support crawling of out-of-domain URLs. Redirection is only supported within the same domain. If the webpages to be crawled contain redirections, you can add the redirected URL directly to the list of URLs to be crawled.
Use sitemap for crawling
When selected, the connector only crawls the URLs listed in the sitemap. This option also allows you to configure incremental crawling during a later step. If not selected or no sitemap is found, the connector does a deep crawl of all the links found on the root URL of the site.
When this option is selected, the crawler performs the following steps:
a. The crawler looks for the robots.txt file in the root location. For example, if your provided URL is https://www.contoso.com, the crawler looks for the robots.txt file at https://www.contoso.com/robots.txt.
b. Upon locating the robots.txt file, the crawler finds the sitemap links in the robots.txt file.
c. The crawler then crawls all webpages as listed in the sitemap files.
d. If any of these steps fails, the crawler performs a deep crawl of the website without throwing an error.
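For reference, the crawler discovers sitemaps through the Sitemap directive in robots.txt. The following snippets are illustrative only; the URLs and paths are placeholders for your own site.

```txt
# https://www.contoso.com/robots.txt (illustrative)
User-agent: *

# The crawler reads this directive to locate the sitemap
Sitemap: https://www.contoso.com/sitemap.xml
```

A minimal sitemap listing the pages the connector would crawl might look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.contoso.com/products/laptops</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
  <url>
    <loc>https://www.contoso.com/support/returns</loc>
    <lastmod>2024-06-15</lastmod>
  </url>
</urlset>
```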
3. Authentication Type
The authentication method you choose applies to all websites you provide in a connection. To authenticate and sync content from websites, choose one of the four supported methods:
a. None
Select this option if your websites are publicly accessible without any authentication requirements.
b. Basic authentication
Enter your account's username and password to authenticate using basic authentication.
c. SiteMinder
SiteMinder authentication requires a properly formatted URL (https://custom_siteminder_hostname/smapi/rest/createsmsession), a username, and a password.
d. Microsoft Entra OAuth 2.0 Client credentials
OAuth 2.0 with Microsoft Entra ID requires a resource ID, client ID, and a client secret.
The resource ID, client ID, and client secret values depend on how you set up Microsoft Entra ID-based authentication for your website. One of the following two options might be suitable for your website:
If you're using a Microsoft Entra application both as an identity provider and as the client app to access the website, the client ID and the resource ID are the application ID of this single application, and the client secret is the secret that you generated in this application.
Note
For detailed steps to configure a client application as an Identity provider, see Quickstart: Register an application with the Microsoft identity platform and Configure your App Service or Azure Functions app to use Microsoft Entra login.
After the client app is configured, make sure you create a new client secret by going to the Certificates & secrets section of the app. Copy the client secret value shown on the page because it isn't displayed again.
The following screenshots show the steps to obtain the client ID and client secret and to set up the app if you're creating the app on your own.
View of the settings in the branding section:
View of the settings in authentication section:
Note
Your website isn't required to have the above-specified Redirect URI route. You only need the route if your website uses the user token sent by Azure for authentication.
View of the client ID on the Essentials section:
View of the client secret on the Certificates & secrets section:
If you're using one application (the first app) as an identity provider for your website (the resource) and a different application (the second app) to access the website, the client ID is the application ID of your second app and the client secret is the secret configured in the second app. However, the resource ID is the application ID of your first app.
Note
For steps to configure a client application as an identity provider see Quickstart: Register an application with the Microsoft identity platform and Configure your App Service or Azure Functions app to use Microsoft Entra login.
You don't need to configure a client secret in this application, but you need to add an app role in the App roles section, which is later assigned to your client application. Refer to the images to see how to add an app role.
Creating a new app role:
Editing the new app role:
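If you prefer to verify the result, the app role also appears in the app registration's manifest. The following JSON is a hedged sketch of what an app role entry might look like; the display name, description, and value are hypothetical, and the id must be a unique GUID generated for your role.

```json
"appRoles": [
  {
    "allowedMemberTypes": [ "Application" ],
    "description": "Allows the client app to access website content for crawling.",
    "displayName": "Crawler access",
    "id": "00000000-0000-0000-0000-000000000000",
    "isEnabled": true,
    "value": "Crawler.Access"
  }
]
```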
After configuring the resource app, create the client app and give it permission to access the resource app by adding the app role configured above in the API permissions of the client app.
Note
To see how to grant permissions to the client app see Quickstart: Configure a client application to access a web API.
The following screenshots show the section to grant permissions to the client app.
Adding a permission:
Selecting the permissions:
Adding the permissions:
Once the permissions are assigned, you need to create a new client secret for this application by going to the Certificates & secrets section. Copy the client secret value shown on the page as it isn't displayed again. Use the application ID from this app as the client ID, the secret from this app as the client secret, and the application ID of the first app as the resource ID.
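Optionally, before you create the connection, you can check that the resource ID, client ID, and client secret work together by requesting a token with the OAuth 2.0 client credentials flow yourself. The following Python sketch uses the MSAL library with placeholder values; your tenant ID, application IDs, secret, and website URL will differ.

```python
# pip install msal requests
import msal
import requests

# Placeholder values -- substitute your own tenant and app registration details.
TENANT_ID = "<your-tenant-id>"
CLIENT_ID = "<application-id-of-the-client-app>"
CLIENT_SECRET = "<client-secret-of-the-client-app>"
RESOURCE_ID = "<application-id-of-the-resource-app>"  # same as CLIENT_ID if one app plays both roles

app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)

# Request an app-only token for the resource app; ".default" picks up the app
# roles granted to the client app in the steps above.
result = app.acquire_token_for_client(scopes=[f"{RESOURCE_ID}/.default"])

if "access_token" not in result:
    raise RuntimeError(f"Token request failed: {result.get('error_description')}")

# Call a protected page on your website with the bearer token to confirm access.
response = requests.get(
    "https://www.contoso.com/",  # replace with a page on your website
    headers={"Authorization": f"Bearer {result['access_token']}"},
    timeout=30,
)
print(response.status_code)
```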
4. Roll out to limited audience
Deploy this connection to a limited user base if you want to validate it in Copilot and other Search surfaces before expanding the rollout to a broader audience. To learn more about limited rollout, see staged rollout.
At this point, you're ready to create the connection for your cloud websites. You can click Create to publish your connection and index webpages from your websites.
For other settings, such as Access Permissions, Data Inclusion Rules, Schema, and Crawl frequency, we have defaults based on what works best with websites. The default values are shown below:
Users | Description |
---|---|
Access permissions | Everyone in your organization will see this content |
Content | Description |
---|---|
URLs to exclude | None |
Manage Properties | To check default properties and their schema, see content |
Sync | Description |
---|---|
Incremental Crawl | Frequency: Every 15 mins (only supported with sitemap crawling) |
Full Crawl | Frequency: Every Day |
If you want to edit any of these values, you need to choose the "Custom Setup" option.
Custom Setup
Custom setup is for admins who want to edit the default values for the settings listed in the above table. Once you click the "Custom Setup" option, you see three more tabs - Users, Content, and Sync.
Users
Access Permissions
The Enterprise Websites cloud connector supports search permissions visible to Everyone only. Indexed data appears in the search results for all users in your organization.
Content
Add URLs to exclude (Optional crawl restrictions)
There are two ways to prevent pages from being crawled: disallow them in your robots.txt file or add them to the Exclusion list.
Support for robots.txt
The connector checks to see if there's a robots.txt file for your root site. If one exists, it follows and respects the directions found within that file. If you don't want the connector to crawl certain pages or directories on your site, include the pages or directories in the "Disallow" declarations in your robots.txt file.
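For example, a robots.txt file at the root of your site might exclude directories like the following; the paths shown are illustrative placeholders.

```txt
User-agent: *
Disallow: /internal/
Disallow: /drafts/
Disallow: /search-results/
```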
Add URLs to exclude
You can optionally create an Exclusion list to exclude some URLs from getting crawled if that content is sensitive or not worth crawling. To create an exclusion list, browse through the root URL. You can add the excluded URLs to the list during the configuration process.
Manage Properties
Here, you can add or remove available properties from your websites, assign a schema to the property (define whether a property is searchable, queryable, retrievable, or refinable), change the semantic label and add an alias to the property. Properties that are selected by default are listed below.
Source Property | Label | Description | Schema |
---|---|---|---|
Authors | Authors | People who participated on the item in the data source | Query, Retrieve |
Content | Content | All text content in a webpage | Search |
CreatedDateTime | Created date time | Date and time that the item was created in the data source | Query, Retrieve |
Description | Description | | Retrieve, Search |
FileType | File extension | The file extension of crawled content | Query, Refine, Retrieve |
IconURL | IconUrl | Icon URL of the webpage | Retrieve |
LastModifiedBy | Last modified by | Person who last modified the item in data source | Query, Retrieve |
LastModifiedDateTime | Last modified date time | Date and time the item was last modified in the data source. | Query, Retrieve |
Title | Title | The title of the item that you want shown in Copilot and other search experiences | Retrieve, Search |
URL | url | The target URL of the item in the data source | Retrieve |
The Enterprise Websites cloud connector supports two types of source properties:
Meta tag
The connector fetches any meta tags your root URLs may have and shows them. You can select which tags to include for crawling. A selected tag gets indexed for all provided URLs, if available.
Selected meta tags can be used to create custom properties. Also, on the schema page, you can manage them further (Queryable, Searchable, Retrievable, Refinable).
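Meta tags are the meta elements in a page's head section. For example, a page might expose tags like the following; the tag names and values here are illustrative, and the connector lists whatever tag names it finds (such as author or department) for you to select.

```html
<head>
  <title>Contoso returns policy</title>
  <!-- Meta tags the connector can surface as selectable source properties -->
  <meta name="author" content="Megan Bowen" />
  <meta name="department" content="Customer Service" />
  <meta name="description" content="How to return a Contoso product." />
</head>
```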
Custom property settings
You can enrich your indexed data by creating custom properties for your selected meta tags or the connector's default properties.
To add a custom property:
- Enter a property name. This name appears in search results from this connector.
- For the value, select Static or String/Regex Mapping. A static value is included in all search results from this connector. A string/regex value varies based on the rules you add.
- If you selected a static value, enter the value you want to appear.
- If you selected a String/Regex value:
- In the Add expressions section, in the Property list, select a default property or meta tag from the list. For Sample value, enter a string to represent the type of values that could appear. This sample is used when you preview your rule. For Expression, enter a regex expression to define the portion of the property value that should appear in search results. You can add up to three expressions.
- In the Create formula section, enter a formula to combine the values extracted from the expressions.
To learn more about regex expressions, see .NET regular expressions or search the web for a regex expression reference guide.
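To make the expression and formula steps concrete, the following Python sketch mirrors the idea with the standard re module. The sample value, patterns, and formula are hypothetical; the admin center evaluates your expressions for you, so this is only a convenient way to prototype a pattern before entering it.

```python
import re

# Hypothetical meta tag value, used as the "Sample value" for the rule.
sample_value = "Customer Service - Returns - 2024"

# Expression 1: capture the department name before the first hyphen.
department = re.search(r"^([^-]+)", sample_value).group(1).strip()

# Expression 2: capture the four-digit year at the end of the value.
year = re.search(r"(\d{4})$", sample_value).group(1)

# Formula: combine the extracted values into the custom property value.
custom_property = f"{department} ({year})"
print(custom_property)  # Customer Service (2024)
```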
Sync
The refresh interval determines how often your data is synced between the data source and the Graph connector index. There are two types of refresh intervals - full crawl and incremental crawl. For more details, see refresh settings.
You can change the default refresh interval values here if you want to.
Note
Incremental crawl is only supported when the sitemap crawling option is selected.
Troubleshooting
After publishing your connection, you can review the status under the Data Sources tab in the admin center. To learn how to make updates and deletions, see Manage your connector. You can find troubleshooting steps for commonly seen issues here.
If you have issues or want to provide feedback, contact Microsoft Graph | Support.