Scan data sources in Microsoft Purview

In Microsoft Purview, after you register your data source, you can scan your source to capture technical metadata, extract schema, and apply classifications to your data.

In this article, you'll learn the basic steps for scanning any data source.

Tip

Each source has its own instructions and prerequisites for scanning. For the most complete scanning instructions, select your source from the supported sources list and review its scanning instructions.

Prerequisites

See the supported sources list for all the sources that are currently available to register and scan in Microsoft Purview.

Before you can scan your data source, you must take these steps:

  1. Register your data source - This essentially gives Microsoft Purview the address of your data source, and maps it to a collection or domain in the Microsoft Purview Data Map.
  2. Consider your network and choose the right integration runtime configuration for your scenario.
  3. Decide which credentials you'll use to connect to your source. Each source page has a Scan section that details which authentication types are available.

Create a scan

In the steps below we'll be using Azure Blob Storage as an example, and authenticating with the Microsoft Purview Managed Identity.

Important

These are the general steps for creating a scan, but you should refer to the source page for source-specific prerequisites and scanning instructions.

  1. Open the Microsoft Purview portal and navigate to the Data map -> Data sources to view your registered sources either in a map or table view.

    Tip

    If your data map has a large number of registered sources, the table view may be more performant.

  2. Find your source and select the New Scan icon.

    Screenshot of the new scan button highlighted by a registered source and the new scan window.

  3. Provide a Name for the scan.

  4. Select your authentication method. Here we chose the Purview MSI (managed identity).

    Screenshot that shows the managed identity option to run the scan.

  5. Choose the current domain, collection, or a subcollection for the scan. The collection or domain you choose will house the metadata discovered during the scan.

    Note

    A scan is always in the same domain as the registered source, but you can select a subcollection within that domain.

  6. Select Test connection. If the connection isn't successful, see the Troubleshooting section of this article. On a successful connection, select Continue.

  7. Depending on the source, you can scope your scan to a specific subset of data. For Azure Blob Storage, we can select folders and subfolders by choosing the appropriate items in the list.

    Screenshot showing the scope your scan window with files and folders selected.

  8. Select a scan rule set. The scan rule set contains the kinds of data classifications your scan will check for. You can choose between the system default (that will contain all classifications available for the source), existing custom rule sets made by others in your organization, or create a new rule set inline.

    Note

    You can only select the credentials and scan rule sets associated with the domain where your source is registered.

    Screenshot of the select a scan rule set page with the default set selected.

  9. Choose your scan trigger. You can set up a schedule or run the scan once. Learn more about the supported schedule options.

    Screenshot of the set a scan trigger page showing a recurring monthly schedule.

  10. Review your scan and select Save and run.

    Screenshot of the scan review page with the save and run button highlighted.
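The choices made in the steps above (authentication method, collection, and scan rule set) can also be pictured as a single scan definition. The following is a minimal sketch, not the portal's actual implementation. The field names (`kind`, `scanRulesetName`, `scanRulesetType`, `collection`) and the `AzureStorageMsi` value are assumptions modeled on the shape of the Microsoft Purview scanning REST API, and the collection name is hypothetical. Verify every name against the API reference for your source before relying on it.

```python
import json


def build_scan_definition(collection_name, ruleset_name="AzureStorage",
                          ruleset_type="System"):
    """Return a dict describing a managed-identity scan of Azure Blob Storage.

    All keys and the "AzureStorageMsi" kind are assumptions for illustration;
    check them against the scanning API reference before use.
    """
    return {
        # Step 4: authenticate with the Purview managed identity (assumed kind).
        "kind": "AzureStorageMsi",
        "properties": {
            # Step 8: the scan rule set that decides which classifications apply.
            "scanRulesetName": ruleset_name,
            "scanRulesetType": ruleset_type,
            # Step 5: the collection that will house the discovered metadata.
            "collection": {
                "referenceName": collection_name,
                "type": "CollectionReference",
            },
        },
    }


# "finance-subcollection" is a hypothetical collection name.
scan = build_scan_definition("finance-subcollection")
print(json.dumps(scan, indent=2))
```

Keeping the definition as plain data makes it easy to review before a scan is saved, which mirrors the review page in the last step.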

Schedule a scan

When setting up the scan, you can choose to run it once (on demand) or on a recurring schedule. You can configure the following schedule options:

  • Time zone: Select the time zone you'd like to align your scan schedule with. If the time zone you select observes daylight saving time, the trigger automatically adjusts for the difference.
  • Recurrence: You can select a daily, weekly, or monthly scan recurrence.
    • Daily recurrence: Set recurrence to every X day(s), and specify the scan start time of the day.
    • Weekly recurrence: Set recurrence to every X week(s), select one or multiple day(s) of the week, and specify the scan start time of the day.
    • Monthly recurrence: Set recurrence to every X month(s), choose between by month days or by weekdays, select one or multiple day(s)/weekday(s) of the month, and specify the scan start time of the day.
  • Start recurrence at: Set when the scan schedule begins.
  • Specify recurrence end date (optional): If you want the scan schedule to stop after a certain date, select the check box to enable this option and provide an end date.

Screenshot of the set a scan trigger page.
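To make the weekly recurrence option concrete, here is a minimal sketch that expands an "every X weeks on selected weekdays" schedule into run times. It is an illustration only: the function name `weekly_runs` is hypothetical, weeks are assumed to start on Monday, and time zone and daylight saving handling are left out, so the service's own trigger may compute dates differently.

```python
from datetime import date, datetime, time, timedelta


def weekly_runs(start, every_x_weeks, weekdays, start_time, count):
    """Return the first `count` run times of an every-X-weeks schedule.

    `weekdays` uses Python's convention: Monday is 0 and Sunday is 6.
    Weeks are assumed to start on Monday (an assumption of this sketch).
    """
    if not weekdays or every_x_weeks < 1:
        raise ValueError("need at least one weekday and an interval >= 1")
    # Align to the Monday of the week that contains the start date.
    week_start = start - timedelta(days=start.weekday())
    runs = []
    while len(runs) < count:
        for weekday in sorted(weekdays):
            day = week_start + timedelta(days=weekday)
            # Skip days that fall before "Start recurrence at".
            if day >= start and len(runs) < count:
                runs.append(datetime.combine(day, start_time))
        week_start += timedelta(weeks=every_x_weeks)
    return runs


# Every 2 weeks on Monday and Thursday at 02:00, starting Wednesday 2024-01-03.
runs = weekly_runs(date(2024, 1, 3), 2, {0, 3}, time(2, 0), 4)
for run in runs:
    print(run.isoformat())
```

The first Monday is skipped because it falls before the start date, so the schedule begins on the Thursday of the first week and then jumps two weeks at a time.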

View a scan

Depending on the amount of data in your data source, a scan can take some time to run, so here's how you can check on progress and see results when the scan is complete.

  1. You can view your scan from the collection, domain, or from the source itself.

  2. To view from the collection or domain, navigate to your Collection or Domain in the data map, and select the Scans button.

    Screenshot of the collection page with the scans button highlighted.

  3. Select your scan name to see details.

    Screenshot of the scans in the collection list with the most recent scan name highlighted.

  4. Or, you can navigate directly to the data source in its Collection or Domain and select View Details to check the status of the scan.

    Screenshot of the data map with a source's view details button highlighted.

  5. The scan details indicate the progress of the scan in the Last run status and the number of assets scanned and classified.

    Screenshot of a source detail page, with the assets and scans highlighted.

  6. The Last run status is updated to In progress and then to Completed once the entire scan has run successfully.

    Screenshot of a source detail page with a scan showing an in progress status.

    Screenshot of a source detail page with a scan showing a completed status.
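If you check on scans from a script rather than the portal, the status progression above maps naturally to a polling loop. The sketch below is hypothetical: `wait_for_scan` and the `get_status` callable are illustrative names, and the status strings are taken from the portal labels in this article rather than from any API response. You would supply your own function that fetches the real status.

```python
import time


def wait_for_scan(get_status, poll_seconds=30.0, max_polls=120,
                  sleep=time.sleep):
    """Poll `get_status()` until it stops returning "In progress".

    Returns the last status seen. Gives up after `max_polls` checks so a
    long-running scan cannot block the caller forever.
    """
    status = get_status()
    polls = 1
    while status == "In progress" and polls < max_polls:
        sleep(poll_seconds)
        status = get_status()
        polls += 1
    return status


# Stand-in for a real status lookup: two in-progress checks, then completion.
fake_statuses = iter(["In progress", "In progress", "Completed"])
final = wait_for_scan(lambda: next(fake_statuses), poll_seconds=0)
print(final)
```

Passing the sleep function as a parameter keeps the loop easy to test, and the poll limit matters because large sources can take a long time to scan.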

Manage a scan

After a scan is complete, it can be managed or run again.

  1. Select the Scan name from either the collections list or the source page to manage the scan.

    Screenshot of a source details page with the scan name link highlighted.

  2. You can run the scan again, edit the scan, or delete the scan.

    Screenshot of a manage scan page with the run, edit, and delete buttons highlighted.

  3. You can run a full scan, which will scan all the content in your scope, but some sources also have incremental scan available. Incremental scan will scan only those resources that have been updated since the last scan. Check the supported capabilities table in your source page to see if incremental scan is available for your source after the first scan.

    Screenshot of the run scan now button showing the full and incremental scan options.
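The difference between a full and an incremental scan can be sketched as a filter on last-modified times. This is a conceptual illustration only: the function name `select_for_scan` and the sample paths are hypothetical, and the actual change detection that Microsoft Purview uses for each source isn't described in this article.

```python
from datetime import datetime


def select_for_scan(resources, last_scan_time=None):
    """Return the resource paths a scan would visit, in sorted order.

    `resources` maps a path to its last-modified time. With no previous
    scan time, everything in scope is selected (a full scan). Otherwise
    only resources updated since the last scan are selected (incremental).
    """
    if last_scan_time is None:
        return sorted(resources)
    return sorted(path for path, modified in resources.items()
                  if modified > last_scan_time)


# Hypothetical blobs and their last-modified times.
blobs = {
    "sales/2024/jan.csv": datetime(2024, 1, 31, 9, 0),
    "sales/2024/feb.csv": datetime(2024, 2, 29, 9, 0),
    "hr/roster.parquet": datetime(2023, 12, 1, 9, 0),
}

full = select_for_scan(blobs)
incremental = select_for_scan(blobs, last_scan_time=datetime(2024, 2, 1))
print(full)
print(incremental)
```

The first scan has no earlier scan time to compare against, which is why incremental scanning only becomes available after an initial full scan.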

Troubleshooting

Setting up the connection for your scan can be complex, since the setup is specific to your network and your credentials.

If you're unable to connect to your source, follow these steps:

  1. Review your source page prerequisites to make sure there's nothing you've missed.
  2. Review your authentication option in the Scan section of your source page to confirm you have set up the authentication method correctly.
  3. Review our troubleshoot connections page.
  4. Create a support request, so our support team can help you troubleshoot your specific environment.

Next steps