Create and run a scan on registered websites with tracker scanning (preview)
After registering a website and creating tracker categories, you’re ready to scan the website. A scan works to identify up to four tracker technologies on a website: cookies, fingerprints, local storage objects, and web beacons.
You can set up multiple scans on a website to look for different parameters, such as the region the website will be scanned from. To create, edit, or run a scan, you start from the details page of the registered website you want to scan.
To set up a scan on a website, follow these steps:
On the Registered websites page, select a website name to open its details page.
On the website’s details page, select New compliance scan.
On the Scan flyout pane, enter a Compliance scan name. Once a scan is created, its name can’t be changed later, even though the parameters of the scan itself can be edited.
Enter a brief description of the scan.
At Scan region, select the region from which the scan will be run, to mimic the behavior of a website in a specific region. For example, if your consent banner is only turned on in regions where required, you can run your scan in a scanning region that best reflects the regulatory region in scope. The currently supported regions are: EastUS2, UK South, and West Europe (learn more about region availability).
At Crawl definition, choose which level of scan to run from the dropdown menu:
Registered URL only: Scan the first page of the registered domain only.
Crawl (registered URL + two levels of pages): Scan the registered URL and two levels of webpages under the domain; for example, contoso.com and any pages with contoso.com/something or contoso.com/something/something. When you select this option, you can choose to limit the scan to a certain number of pages. Scan time depends on the number of internal links in a website.
Sitemap: If your website has a sitemap, select this option to scan all the pages identified in the sitemap. Provide the sitemap URL with the full syntax; for example, https://www.contoso.com/store/collections.xml. The sitemap content size shouldn't exceed 3 MB. If the sitemap has many links, it could result in increased scan times and costs.
Limit scan to appears if you select Crawl or Sitemap. Define the maximum number of pages to scan. Raising the limit is likely to lengthen the scan and can affect costs.
Show scan time estimates appears if you select Crawl or Sitemap. Select this option to view an estimate of the scanned and unscanned pages. This is a prerequisite to using the Manage URLs capability, which allows you to exclude specific URLs from subsequent scans to save time and help your scans run more efficiently. Then select Continue.
The Include access steps for authentication or website interaction step is optional and should be skipped if not needed, as it instructs the tool to perform one or more actions before scanning. If your website does not require the scan to bypass basic authentication or emulate specific visitor click or fill interactions, select Continue without selecting an Access Type.
Note
There are additional steps for scanning authenticated sites. We suggest coming back to this step later and finishing the initial scan creation process to ensure the scan is set up to run as expected.
At Tracker technology & tags, select the items you want to scan across your website. Get details in the scan definition instructions. Then select Continue.
At Set a scan trigger, set the frequency of the scan to run Recurring or Once. Get details in the scan trigger instructions. Then select Continue.
At Review your scan, review the scan settings. Then select Save and run, which offers two options: Save, which only saves the scan, and Save and run, which also runs a first scan.
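The Crawl option described in the steps above covers the registered URL plus two levels of pages. The following Python sketch shows one way to reason about which URLs fall inside that scope. It assumes a "level" corresponds to one path segment, as in the contoso.com/something/something example; the scanner itself may count levels differently (for example, by link depth). The function names are hypothetical and aren't part of the product.

```python
from typing import Optional
from urllib.parse import urlparse


def crawl_level(url: str, registered_host: str = "contoso.com") -> Optional[int]:
    """Return how many path levels a URL sits below the registered URL.

    Returns None when the URL is on a different host.
    Assumption: "www." is treated as the same host as the bare domain.
    """
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = (parsed.hostname or "").removeprefix("www.")
    if host != registered_host:
        return None
    # Count non-empty path segments: /something/something -> 2
    return len([segment for segment in parsed.path.split("/") if segment])


def in_crawl_scope(url: str, registered_host: str = "contoso.com",
                   max_levels: int = 2) -> bool:
    """True when the URL is on the registered host and within max_levels."""
    level = crawl_level(url, registered_host)
    return level is not None and level <= max_levels


for candidate in ["contoso.com", "contoso.com/something",
                  "contoso.com/something/something", "contoso.com/a/b/c"]:
    print(candidate, in_crawl_scope(candidate))
```

A check like this can help you estimate, before you create the scan, whether the pages you care about would be reached by Crawl or whether Sitemap is a better fit.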
When you select Save and run after creating or editing a scan, the scan runs for the first time right away, even if you set a recurring schedule at the Set a scan trigger step.
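If you plan to use the Sitemap option, it can help to check the sitemap against the 3 MB size guidance and count its links before you create the scan. The sketch below is one possible pre-check using only the Python standard library. It assumes the limit means 3 × 1024 × 1024 bytes and that the sitemap is a standard XML file with loc elements; check_sitemap and its page_limit parameter are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Assumption: "3 MB" is interpreted as 3 * 1024 * 1024 bytes.
MAX_SITEMAP_BYTES = 3 * 1024 * 1024


def check_sitemap(content: bytes, page_limit: Optional[int] = None) -> dict:
    """Report the size and number of page URLs in a sitemap document."""
    root = ET.fromstring(content)
    # Sitemap elements are namespaced, so compare only the local tag name.
    urls = [el.text.strip() for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "loc" and el.text]
    return {
        "size_ok": len(content) <= MAX_SITEMAP_BYTES,
        "url_count": len(urls),
        "over_page_limit": page_limit is not None and len(urls) > page_limit,
    }


# Illustrative sitemap content; in practice you would read the real file.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.contoso.com/store/</loc></url>
  <url><loc>https://www.contoso.com/store/widgets</loc></url>
</urlset>"""

print(check_sitemap(sample, page_limit=1))
```

If over_page_limit is true, some pages listed in the sitemap would fall outside the Limit scan to value you plan to set.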
On the registered website’s details page, all of its scans are listed on the Compliance scans tab. When a scan runs, select Refresh to view its current status:
Queued: Scan is waiting for other processes to finish before starting the compliance scan.
In progress: Scan is running and will continue until finished or an error occurs.
Completed: Scan successfully completed and results can be viewed.
Canceled: Scan was canceled by a user and won’t show any results outside of the runtime, if any.
Failed, More info: An error occurred while running; more info can be found by selecting More info.
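If you track scan runs in your own notes or tooling, the statuses above can be modeled as a small enumeration. This is only an illustrative sketch: the class and helper names are hypothetical, and tracker scanning doesn't expose them as an API.

```python
from enum import Enum


class ScanStatus(Enum):
    QUEUED = "Queued"
    IN_PROGRESS = "In progress"
    COMPLETED = "Completed"
    CANCELED = "Canceled"
    FAILED = "Failed"


# Statuses after which the run will not change again.
FINISHED = {ScanStatus.COMPLETED, ScanStatus.CANCELED, ScanStatus.FAILED}


def is_finished(status: ScanStatus) -> bool:
    """True when there is no need to keep selecting Refresh."""
    return status in FINISHED


def has_results(status: ScanStatus) -> bool:
    """Only a completed scan has results to view."""
    return status is ScanStatus.COMPLETED


print(is_finished(ScanStatus.IN_PROGRESS), has_results(ScanStatus.COMPLETED))
```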
You can initiate a scan at any time by selecting Run scan now on a scan’s details page.
To find the scan’s details page, go to the Registered websites page and select a website name from the list. On the registered website's details page, select Compliance scans on the page's left navigation, then select a scan name to open its details page. From here you can edit the scan, view and manage URLs, view scan history and results, and view trackers and tags.
If the website you want to scan is an authenticated site requiring credentials and sign-in procedures, you can provide credentials in the scan setup process so that the scan can run. If you don’t provide credentials, the scan can’t bypass the sign-in portal and you might not be able to access and scan webpages that require a visitor interaction.
Note
Only basic authentication is supported. Multifactor authentication and cognitive-based techniques aren't supported because the scanner needs all of its access steps defined before the scan runs.
Tracker scanning can scan two access types:
Authentication: A scan capability to bypass basic authentication of username and password by using credentials stored in Azure Key Vault.
Website interaction: A scan capability to emulate specific visitor click or fill interactions prior to scanning for trackers and compliance objects.
Important
To run an Authentication scan, the admin for your organization must first set up a connection between Azure Key Vault and the Microsoft Purview account you use for compliance scans. Visit Credentials for source authentication in Microsoft Purview to create a key vault and connect it to your account before setting up an Authentication scan.
If you’re creating a new scan, use the instructions below when you select the Include access steps for authentication or website interaction option (at step 9 of Set up a scan). If you already created a scan, open the registered website’s details page. On the Compliance scans tab, select the scan, then select Edit scan. Then follow these steps:
Select Continue to advance past the first page.
On the Include access steps for authentication or website interaction page, for Access type, select Authentication.
For Credential, select the credential to use for the scan. The dropdown options come from your key vault. If you don’t see the credentials to use, make sure they're added to key vault and connected to your organization's Microsoft Purview account.
Select Add step to create the access steps that replicate a website visitor’s activity to get past basic authentication. This process involves collecting the location paths of various web components, which you can do manually or by using the Microsoft Edge extension to generate a JSON file for upload.
Within a step, for Action type, select Click, Select, Check, or Enter.
Enter must be associated with either Custom, Username, or Password.
If Username or Password is selected, their Field values autopopulate.
Example setup: Each step requires the user to provide location paths:
- Custom – login button
- Username – Enter
- Password – Enter
- Custom – Submit login
For Object name, you can enter a name for the field as reference.
Capture and provide the location path, or XPath, for the field you’re referencing in the step. Get instructions for collecting location paths.
Repeat for each step of the flow.
When done, confirm that only the required steps are included, delete any blank steps, and select Continue.
Continue building your scan on the Set up scan definition page.
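Before entering access steps in the portal, you might find it useful to draft them and check them against the rules above: each step has an action type, Enter must be associated with Custom, Username, or Password, and each step needs a location path. The following sketch models the example setup. The AccessStep class, the validate_steps function, and the XPath values are all hypothetical; this isn't the format of the JSON file that the Microsoft Edge extension generates. Credentials themselves stay in your key vault and don't appear here.

```python
from dataclasses import dataclass
from typing import List, Optional

ACTION_TYPES = {"Click", "Select", "Check", "Enter"}
ENTER_FIELDS = {"Custom", "Username", "Password"}


@dataclass
class AccessStep:
    action_type: str
    location_path: str               # XPath of the element the step acts on
    object_name: str = ""            # optional name for your own reference
    enter_field: Optional[str] = None  # only meaningful for Enter steps


def validate_steps(steps: List[AccessStep]) -> List[str]:
    """Return a list of problems found; an empty list means the draft looks consistent."""
    problems = []
    for number, step in enumerate(steps, start=1):
        if step.action_type not in ACTION_TYPES:
            problems.append(f"step {number}: unknown action type {step.action_type!r}")
        if step.action_type == "Enter" and step.enter_field not in ENTER_FIELDS:
            problems.append(f"step {number}: Enter needs Custom, Username, or Password")
        if not step.location_path.strip():
            problems.append(f"step {number}: missing location path")
    return problems


# Draft of the example setup, with made-up XPaths.
login_flow = [
    AccessStep("Click", '//*[@id="login-button"]', "Login button"),
    AccessStep("Enter", '//*[@id="username"]', "Username", enter_field="Username"),
    AccessStep("Enter", '//*[@id="password"]', "Password", enter_field="Password"),
    AccessStep("Click", '//*[@id="submit"]', "Submit login"),
]

print(validate_steps(login_flow))
```

Running the same check on a draft with a blank step surfaces it early, which mirrors the instruction to delete any blank steps before you select Continue.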
If you’re creating a new scan, use the instructions below when you select the Include access steps for authentication or website interaction option (at step 9 of Set up a scan). If you already created a scan, open the registered website’s details page. On the Compliance scans tab, select the scan, then select Edit scan. Then follow these steps:
Select Continue to advance past the first page.
On the Include access steps for authentication or website interaction page, for Access type, select Website interaction.
Capture and provide the full location path, or XPath, for the field you’re referencing in the step. You can do this manually or by using the Microsoft Edge extension to generate a JSON file for upload. Visit the instructions for collecting location paths, and repeat for each step of the flow.
At Method to set up access steps, select either Manually add XPaths or Upload XPaths file.
If you chose to upload a file, add it in the Upload XPaths file field.
When done, confirm that only the required steps are included, delete any blank steps, and select Continue.
Continue building your scan at the Scan definition page.
The Scan definition page in the scan creation process is where you tell the scan what to look for on each webpage. The possible elements consist of trackers, tags, and various compliance objects that you need to confirm are present.
At Tracker technology, select the box next to the trackers you want to scan for and whether to capture associated tags and relationships. These trackers and tags are deployed from your website onto a visitor’s browser or device.
Cookies: Scan for first-party and third-party cookies deployed upon loading the webpage being scanned. Given the attributes of a cookie, these are always considered trackers.
Fingerprints: Scan for fingerprinting. Some websites use fingerprints for standard website UX configuration, such as screen size and preferred language, but in some instances fingerprints could be used as tracking technologies to build user profiles in combination with other trackers or data. Fingerprints are captured as part of a scan when all the relevant conditions are met; some of them might be nontracking web components.
Local Storage Objects (LSOs): Captured as part of a scan when all the relevant conditions are met. LSOs can sometimes be nontracking web components.
Web beacons (1x1 pixel): Captured as part of a scan when all the relevant conditions are met. Web beacons can sometimes be nontracking web components.
The Capture tags and relationships for all selected trackers option captures the preceding tag, allowing users to view the tracker relationship in scan results. Trackers require an HTML Tag (Script, iFrame, or Image) for deployment on a site visitor’s device.
Tracker relationship: The capability to view in list or graphical form the relationships between trackers and tags. There can be multiple trackers deployed by a single tag. Tracker relationships are captured on a per-scan basis.
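Because a single tag can deploy multiple trackers, a tracker relationship is essentially a one-to-many mapping from tags to trackers. The sketch below illustrates that idea with made-up data; the tag and tracker names are hypothetical and don't reflect the format of actual scan results.

```python
from collections import defaultdict

# Hypothetical (tag, tracker) pairs as a single scan might report them.
observations = [
    ("script: https://cdn.example.com/analytics.js", "cookie: _visitor_id"),
    ("script: https://cdn.example.com/analytics.js", "local storage: last_visit"),
    ("image: https://ads.example.com/pixel.gif", "web beacon: pixel.gif"),
]


def trackers_by_tag(pairs):
    """Group trackers under the tag that deployed them."""
    grouped = defaultdict(list)
    for tag, tracker in pairs:
        grouped[tag].append(tracker)
    return dict(grouped)


for tag, trackers in trackers_by_tag(observations).items():
    print(tag, "->", trackers)
```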
Compliance objects are displayed as tiles for common website compliance objects, such as Consent banner and an externally facing Privacy statement. Selecting these elements can help verify if the compliance object is present.
You can add your own compliance object; be sure to enter a name carefully, as it shows up in reporting. Select one or more compliance objects that you want to scan for, and within each object’s tile, enter its location path in the text field. Collecting the location path can be performed manually or by using the Microsoft Edge extension to generate a JSON file for upload. Get instructions for collecting location paths.
Important
Be sure to check the box on each compliance object’s tile that you want to include in a scan.
The Set a scan trigger page in the scan creation process is where you set the frequency of the scan to run once or on a recurring basis. A recurring scan lets you monitor the website on a regular cadence for potential compliance issues.
If you select Recurring, you see options to select specific days within a weekly or monthly cadence. Select a time and select start and end dates. The first scan runs after you complete the scan setup process, and then the recurring options you selected will take effect.
When you’re done, select Continue to advance to the review step before you save and run it.
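To preview what a weekly recurring trigger implies, you can list the dates on which a scan would fall between the start and end dates. The sketch below does this with the Python standard library. It assumes the end date is inclusive and ignores the time of day; weekly_run_dates is a hypothetical helper, not a product feature, and it doesn't include the first run that happens when you complete setup.

```python
from datetime import date, timedelta
from typing import List, Set


def weekly_run_dates(start: date, end: date, weekdays: Set[int]) -> List[date]:
    """List the dates between start and end (inclusive) that fall on the chosen weekdays.

    weekdays uses date.weekday() numbering: Monday is 0, Sunday is 6.
    """
    runs = []
    current = start
    while current <= end:
        if current.weekday() in weekdays:
            runs.append(current)
        current += timedelta(days=1)
    return runs


# Mondays and Thursdays over two weeks starting Monday, January 1, 2024.
for run in weekly_run_dates(date(2024, 1, 1), date(2024, 1, 14), {0, 3}):
    print(run.isoformat())
```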
The Microsoft Priva: Scan Set Up Tool is a browser extension you can install to help collect location paths, or XPaths. You can also collect location paths manually instead of using the browser extension by following the steps below.
On any location on a webpage, right-click and select Inspect to open DevTools.
The DevTools area appears and displays the Elements page, highlighting the web element that was inspected.
Right-click the highlighted web element, select Copy, then select Copy XPath.
Go back to the scan setup page in Tracker Scanning and paste the copied XPath into the Location path field.
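A copied XPath such as //*[@id="accept-cookies"] identifies an element by its position or attributes in the page. If you want to sanity-check a path before pasting it, the sketch below counts how many elements it matches in a small, well-formed snippet. This is only an approximation: Python's xml.etree.ElementTree supports a limited subset of XPath and requires well-formed markup, which real webpages often aren't. The snippet and the count_matches helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a webpage.
PAGE = """<html><body>
  <div id="banner"><button id="accept-cookies">Accept</button></div>
  <form><input id="username" /><input id="password" /></form>
</body></html>"""


def count_matches(xpath: str, markup: str = PAGE) -> int:
    """Count the elements in markup that the XPath selects."""
    root = ET.fromstring(markup)
    # ElementTree only accepts relative paths, so "//..." becomes ".//...".
    relative = "." + xpath if xpath.startswith("//") else xpath
    return len(root.findall(relative))


print(count_matches('//*[@id="accept-cookies"]'))
print(count_matches('//*[@id="missing"]'))
```

A path that matches nothing, or more elements than you intended, is a sign that you may want to copy it again from the element itself.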
Follow the steps below to install the Microsoft Priva: Scan Set Up Tool browser extension:
Open a Microsoft Edge browser and navigate to: https://microsoftedge.microsoft.com/addons/detail/bldbcilhcjhoookkgcbmglgjdlbjihgo.
Select Get, then select Add extension.
To the right of your browser’s address bar, select the Extensions icon.
Next to the Microsoft Priva: Scan Set Up Tool extension name, select the hidden eye icon, which will show the extension’s icon on your browser’s toolbar.
Open a new Microsoft Edge window and enter the URL you’re creating a scan for. The window can’t be an InPrivate window.
Select the Microsoft Priva: Scan Set Up Tool icon next to the browser’s address bar, and select a collection mode:
Collect steps to access pages: Use this during Authentication or Website interaction step setup.
Collect paths of compliance objects: Use this for adding compliance objects during Scan definition setup.
Hover over the object on the website you want to capture. A shading appears over the area, and a window appears underneath it with a Collect this location path button.
Slowly move your cursor down to select Collect this Location Path. You’ll see a confirmation message at the top that the location path has been collected. A Review button in the confirmation message lets you review the collected XPath and provides an option to Confirm & continue, or Discard location path so you can try again.
Repeat step 3 to capture all the XPaths you need. As you collect XPaths, a number on the extension icon indicates the number collected.
When you’re done collecting XPaths, select the extension icon, then select Download XPaths collected in a single file. The XPaths are downloaded in a JSON file.
Go back to the scan setup in tracker scanning, at either the access steps (Authentication or Website interaction) or the compliance object definitions. Select the option to Upload location path file. Select the downloaded file titled either WebInteractions# or ComplianceObjects#, and select Open from the file explorer.
You'll see compliance objects or the access steps collected. Make any modifications and select Continue when done.
After running your first scan, you can exclude specific URLs from subsequent scans to save time and help your scans run more efficiently. This option is available if Show scan time estimates is selected during scan setup. For example, if you’re scanning a product page (such as www.contoso.com/products) that has a large number of subpages for individual products (such as www.contoso.com/products/widgets), you can exclude the product page so that future scans don’t continue to run on all of its subpages.
Excluding URLs can also help you avoid exceeding a scan limit you set when you created the scan and thus possibly missing other important pages to scan.
Note
All scan results are preserved in the audit trail of previous scans, so even if you exclude URLs from future scans, a record remains of their scan results from before the exclusion.
You view and manage your excluded URLs on the URL exclude list tab on a scan’s details page. To exclude URLs and manage a scan’s URL list, follow these steps:
Go to the registered website’s details page and select the Compliance scans tab on the left navigation.
Select the scan name to open its details page.
Go to the URL exclude list tab and select Manage URLs to open the Manage URLs flyout pane.
The flyout pane lists all the URLs detected in the first scan of the website. Check the box next to the URLs you want to exclude. Details to note:
Selecting Exclude will exclude the subpage and underlying pages for all future scans.
Each URL listed shows the scan state (Scanned or Unscanned), the number of pages scanned out of the total number of subpages for that URL, and the estimated scan time. As you mark URLs to exclude, the Total estimated scan time and Total estimated pages tiles update to reflect the proposed exclusions.
The Page limit field at the top of the flyout pane allows you to modify the page limit you set when you created a scan. If you adjust the page limit here, you don’t need to go back into the scan to edit it when you’re done with this process.
The registered website URL can’t be excluded.
When done, select Save.
The flyout pane closes and the excluded URLs appear on the URL exclude list page.
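The effect of an exclusion can be pictured as prefix matching: excluding a URL also excludes the pages beneath it, and the registered website URL itself can't be excluded. The sketch below models that behavior under stated assumptions. It treats "underlying pages" as URLs whose path starts with the excluded path, and treats www.contoso.com and contoso.com as the same host; the scanner's actual matching rules may differ. The function names are hypothetical.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def _normalize(url: str) -> Tuple[str, str]:
    """Split a URL into (host, path). Assumption: "www." is ignored."""
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = (parsed.hostname or "").removeprefix("www.")
    return host, parsed.path.rstrip("/")


def is_excluded(url: str, exclude_list: List[str]) -> bool:
    """True when url equals an excluded URL or sits beneath one."""
    host, path = _normalize(url)
    for excluded in exclude_list:
        ex_host, ex_path = _normalize(excluded)
        if host == ex_host and (path == ex_path or path.startswith(ex_path + "/")):
            return True
    return False


def disallowed_exclusions(registered_url: str, exclude_list: List[str]) -> List[str]:
    """Return entries that match the registered URL, which can't be excluded."""
    registered = _normalize(registered_url)
    return [url for url in exclude_list if _normalize(url) == registered]


exclusions = ["www.contoso.com/products"]
print(is_excluded("www.contoso.com/products/widgets", exclusions))
print(is_excluded("www.contoso.com/about", exclusions))
```

Matching on a trailing slash boundary keeps a page such as www.contoso.com/products-archive from being treated as a subpage of www.contoso.com/products.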