Connect to Azure Data Lake Gen1 in Microsoft Purview

This article outlines the process to register an Azure Data Lake Storage Gen1 data source in Microsoft Purview including instructions to authenticate and interact with the Azure Data Lake Storage Gen1 source.

Note

Azure Data Lake Storage Gen1 will be retired on February 29, 2024. For more information, see the official announcement. After that date, scanning Azure Data Lake Storage Gen1 accounts will not be supported. If you have migrated your data lake from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2, follow the guidance in Connect to Azure Data Lake Storage in Microsoft Purview to register an Azure Data Lake Storage Gen2 data source and run a scan. Existing Azure Data Lake Storage Gen1 data assets will be retained in Microsoft Purview; delete them manually if they are no longer needed.

Supported capabilities

Capability            Supported
Metadata Extraction   Yes
Full Scan             Yes
Incremental Scan      Yes
Scoped Scan           Yes
Classification        Yes
Labeling              Yes
Access Policy         No
Lineage               Limited**
Data Sharing          No
Live view             No

** Lineage is supported if the dataset is used as a source or sink in a Data Factory Copy activity.

Prerequisites

Register

This section will enable you to register the ADLS Gen1 data source and set up an appropriate authentication mechanism to ensure successful scanning of the data source.

Steps to register

It is important to register the data source in Microsoft Purview prior to setting up a scan for the data source.

  1. Open the Microsoft Purview governance portal by:

  2. Navigate to the Data Map --> Sources

    Screenshot that shows the link to open Microsoft Purview governance portal

    Screenshot that navigates to the Sources link in the Data Map

  3. Create the Collection hierarchy using the Collections menu and assign permissions to individual subcollections, as required

    Screenshot that shows the collection menu to create collection hierarchy

  4. Navigate to the appropriate collection under the Sources menu and select the Register icon to register a new ADLS Gen1 data source

    Screenshot that shows the collection used to register the data source

  5. Select the Azure Data Lake Storage Gen1 data source and select Continue

    Screenshot that allows selection of the data source

  6. Provide a suitable Name for the data source, select the relevant Azure subscription, existing Data Lake Store account name and the collection and select Apply

    Screenshot that shows the details to be entered in order to register the data source

  7. The ADLS Gen1 storage account will be shown under the selected Collection

    Screenshot that shows the data source mapped to the collection to initiate scanning
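
The registration details entered in the steps above (name, subscription, Data Lake Store account, and collection) can also be pictured as a request body for a programmatic registration. The following is a minimal sketch, assuming the Microsoft Purview Scanning REST API accepts a data source of kind `AdlsGen1` with the property names shown. The helper name, the `adl://` endpoint format, and every property name are assumptions that this article does not confirm, so check them against the current REST reference before use. The sketch only builds the payload and makes no network call.

```python
import json


def build_adls_gen1_registration(name, subscription_id, resource_group,
                                 account_name, collection_name):
    """Build a registration body for an ADLS Gen1 source (hypothetical helper).

    The kind and property names are assumptions based on the Purview
    Scanning REST API; they are not stated in this article.
    """
    return {
        "name": name,
        "kind": "AdlsGen1",  # assumed kind identifier
        "properties": {
            # Assumed endpoint format for an ADLS Gen1 account
            "endpoint": f"adl://{account_name}.azuredatalakestore.net/",
            "subscriptionId": subscription_id,
            "resourceGroup": resource_group,
            "resourceName": account_name,
            # The collection the source is registered under
            "collection": {
                "referenceName": collection_name,
                "type": "CollectionReference",
            },
        },
    }


# Placeholder values for illustration only
example = build_adls_gen1_registration(
    name="adls-gen1-sales",
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group="example-rg",
    account_name="examplelake",
    collection_name="finance",
)
print(json.dumps(example, indent=2))
```

Keeping the payload in one function makes it easy to review the values that the portal form would otherwise collect.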

Scan

Prerequisites for scan

To scan the data source, you need to configure an authentication method for the ADLS Gen1 storage account. The following options are supported:

Note

If a firewall is enabled for the storage account, you must use the managed identity authentication method when setting up a scan.

  • System-assigned managed identity (Recommended) - As soon as the Microsoft Purview account is created, a system-assigned managed identity (SAMI) is created automatically in the Microsoft Entra tenant. Depending on the type of resource, specific RBAC role assignments are required for the Microsoft Purview SAMI to perform the scans.

  • User-assigned managed identity (preview) - Similar to a system-assigned managed identity, a user-assigned managed identity is a credential resource that allows Microsoft Purview to authenticate against Microsoft Entra ID. For more information, see the user-assigned managed identity guide.

  • Service Principal - In this method, you can create a new or use an existing service principal in your Microsoft Entra tenant.

Authentication for a scan

Using system or user-assigned managed identity for scanning

Your Microsoft Purview account needs permission to scan the ADLS Gen1 data source. You can add the system-assigned or user-assigned managed identity at the Subscription, Resource Group, or Resource level, depending on what you want it to have scan permissions on.

Note

You need to be an owner of the subscription to be able to add a managed identity on an Azure resource.

  1. From the Azure portal, find either the subscription, resource group, or resource (for example, an Azure Data Lake Storage Gen1 storage account) that you would like to allow the catalog to scan.

  2. Select Overview and then select Data explorer

    Screenshot that shows the storage account

  3. Select Access in the top navigation

    Screenshot that shows the Data explorer for the storage account

  4. Choose Select, and in the Select user or group menu, add either the Microsoft Purview account name (which is the system-assigned managed identity) or the user-assigned managed identity (preview) that has already been registered in Microsoft Purview.

  5. Select Read and Execute permissions. In the Add options, make sure to choose This folder and all children, and An access permission entry and a default permission entry, as shown in the screenshot below. Select OK.

    Screenshot that shows the details to assign permissions for the Microsoft Purview account

Tip

An access permission entry is a permission entry on current files and folders. A default permission entry is a permission entry that will be inherited by new files and folders. To grant permission only to currently existing files, choose an access permission entry. To grant permission to scan files and folders that will be added in future, include a default permission entry.
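
ADLS Gen1 uses POSIX-style access control lists, so the difference between an access entry and a default entry can be written down as an ACL specification string. The following is a minimal sketch of building such a string for the Read and Execute permissions described above. The helper name and the object ID are hypothetical placeholders, and applying the resulting string to a folder (for example through a CLI or SDK) is outside the scope of this sketch.

```python
def build_acl_spec(object_id, include_default=True):
    """Return a POSIX-style ACL spec granting Read and Execute to one identity.

    object_id is the Microsoft Entra object ID of the managed identity or
    service principal (a placeholder value is used in the example below).
    """
    permissions = "r-x"  # Read and Execute, no Write

    # Access entry: applies to files and folders that exist now
    entries = [f"user:{object_id}:{permissions}"]

    # Default entry: inherited by files and folders created later
    if include_default:
        entries.append(f"default:user:{object_id}:{permissions}")

    return ",".join(entries)


# Hypothetical object ID, for illustration only
print(build_acl_spec("11111111-2222-3333-4444-555555555555"))
```

Passing `include_default=False` corresponds to granting permission only to files and folders that already exist.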

Using Service Principal for scanning

Creating a new service principal

To create a new service principal, you need to register an application in your Microsoft Entra tenant and give the service principal access to your data sources. Your Microsoft Entra Application Administrator can perform this operation.

Getting the Service Principal's application ID

  1. Copy the Application (client) ID shown in the Overview of the service principal that was already created

    Screenshot that shows the Application (client) ID for the Service Principal

Granting the Service Principal access to your ADLS Gen1 account

Your service principal needs permission to scan the ADLS Gen1 data source. You can add access for the service principal at the Subscription, Resource Group, or Resource level, depending on what permissions it needs.

Note

You need to be an owner of the subscription to be able to add a service principal on an Azure resource.

  1. Provide the service principal access to the storage account by opening the storage account and selecting Overview --> Data Explorer

    Screenshot that shows the storage account

  2. Select Access in the top navigation

    Screenshot that shows the Data explorer for the storage account

  3. Choose Select and add the service principal in the Select user or group menu.

  4. Select Read and Execute permissions. Make sure to choose This folder and all children, and An access permission entry and a default permission entry in the Add options. Select OK

    Screenshot that shows the details to assign permissions for the service principal

Creating the scan

  1. Open your Microsoft Purview account and select Open Microsoft Purview governance portal

  2. Navigate to the Data map --> Sources to view the collection hierarchy

    Screenshot that shows the collection hierarchy

  3. Select the New Scan icon under the ADLS Gen1 data source registered earlier

    Screenshot that shows the data source with the new scan icon

  4. Choose the Azure integration runtime if your source is publicly accessible, a managed virtual network integration runtime if you use a managed virtual network, or a self-hosted integration runtime if your source is in a private virtual network. For more information about which integration runtime to use, see the article on choosing the right integration runtime configuration.

If using system or user-assigned managed identity

Provide a Name for the scan, select the system or user-assigned managed identity under Credential, choose the appropriate collection for the scan, and select Test connection. On a successful connection, select Continue.

Screenshot that shows the managed identity option to run the scan

If using Service Principal

  1. Provide a Name for the scan, choose the appropriate collection for the scan, and select the + New under Credential

    Screenshot that shows the service principal option

  2. Select the appropriate Key vault connection and the Secret name that was used while creating the Service Principal. The Service Principal ID is the Application (client) ID copied as indicated earlier

    Screenshot that shows the service principal key vault option

  3. Select Test connection. On a successful connection, select Continue

    Screenshot that shows the test connection for service principal
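
The two credential paths above differ only in whether the scan carries a credential reference. The following is a minimal sketch of that difference, assuming the Microsoft Purview Scanning REST API uses the scan kinds `AdlsGen1Msi` and `AdlsGen1Credential` and the property names shown. The helper name, the kind names, the default rule set name, and the property names are all assumptions not confirmed by this article. The sketch covers only the system-assigned managed identity and service principal cases, and it builds a payload without calling any service.

```python
def build_scan_definition(collection_name, credential_name=None,
                          ruleset_name="AdlsGen1"):
    """Build a scan body for an ADLS Gen1 source (hypothetical helper).

    credential_name=None models the system-assigned managed identity.
    Any other value models a service principal credential that was
    created in Microsoft Purview from a Key Vault secret.
    """
    properties = {
        "scanRulesetName": ruleset_name,  # assumed system default rule set name
        "scanRulesetType": "System",
        "collection": {
            "referenceName": collection_name,
            "type": "CollectionReference",
        },
    }

    if credential_name is None:
        kind = "AdlsGen1Msi"  # assumed kind for managed identity scans
    else:
        kind = "AdlsGen1Credential"  # assumed kind for credential-based scans
        properties["credential"] = {
            "referenceName": credential_name,
            "credentialType": "ServicePrincipal",
        }

    return {"kind": kind, "properties": properties}


# Placeholder names for illustration only
print(build_scan_definition("finance"))
print(build_scan_definition("finance", credential_name="example-sp-credential"))
```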

Scoping and running the scan

  1. You can scope your scan to specific folders and subfolders by choosing the appropriate items in the list.

    Scope your scan

  2. Then select a scan rule set. You can choose between the system default, existing custom rule sets, or create a new rule set inline.

    Scan rule set

  3. If creating a new scan rule set, select the file types to be included in the scan rule.

    Scan rule set file types

  4. You can select the classification rules to be included in the scan rule

    Scan rule set classification rules

    Scan rule set selection

  5. Choose your scan trigger. You can set up a schedule or run the scan once.

    scan trigger

    scan trigger selection

  6. Review your scan and select Save and run.

    review scan
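
Step 1 above scopes the scan to selected folders and their subfolders. As an illustration of what such a scope means, the following minimal sketch checks whether a file path falls under any selected folder. It is not how Microsoft Purview implements scoping; the function name and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def in_scope(path, selected_folders):
    """Return True if path is a selected folder or lies beneath one."""
    candidate = PurePosixPath(path)
    for folder in selected_folders:
        root = PurePosixPath(folder)
        # Compare whole path components, so "/sales" does not match "/salesforce"
        if candidate == root or root in candidate.parents:
            return True
    return False


# Hypothetical folder selection, for illustration only
selected = ["/sales", "/hr/payroll"]
print(in_scope("/sales/2023/q1.csv", selected))
print(in_scope("/salesforce/leads.csv", selected))
```

Comparing path components rather than raw string prefixes avoids matching sibling folders whose names merely start with the same characters.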

Viewing Scan

  1. Navigate to the data source in the Collection and select View Details to check the status of the scan

    view scan

  2. The scan details indicate the progress of the scan in the Last run status and the number of assets scanned and classified

    view scan detail

  3. The Last run status will be updated to In progress and then Completed once the entire scan has run successfully

    view scan in progress

    view scan completed
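
The status progression described above (In progress, then Completed) lends itself to a simple polling loop when a scan is monitored from a script. The following is a minimal sketch that takes a caller-supplied function returning the current status text. How that function obtains the status is left open, and the `Failed` and `Canceled` values are assumed terminal states that this article does not list.

```python
import time


def wait_for_scan(get_status, poll_seconds=30.0, max_polls=120,
                  sleep=time.sleep):
    """Poll get_status() until the scan reaches a terminal status.

    get_status is a caller-supplied callable (hypothetical) that returns
    the Last run status as text, such as "In progress" or "Completed".
    """
    # "Completed" comes from the article; the other two are assumptions
    terminal = {"Completed", "Failed", "Canceled"}

    for _ in range(max_polls):
        status = get_status()
        if status in terminal:
            return status
        sleep(poll_seconds)  # wait before checking again

    raise TimeoutError("scan did not reach a terminal status in time")


# Simulated status source, for illustration only
simulated = iter(["In progress", "In progress", "Completed"])
print(wait_for_scan(lambda: next(simulated), poll_seconds=0))
```

Injecting `sleep` as a parameter keeps the loop easy to exercise without real delays.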

Managing Scan

Scans can be managed or run again on completion.

  1. Select the Scan name to manage the scan

    manage scan

  2. You can run the scan again, edit the scan, or delete the scan

    manage scan options

    Note

    • Deleting your scan does not delete catalog assets created from previous scans.
    • If you edit an asset's description on the Schema tab in Microsoft Purview, the asset is no longer updated with schema changes, even if the source table has changed and you re-scan it.

  3. You can run an incremental scan or a full scan again.

    manage scan full or incremental

    manage scan results

Next steps

Now that you have registered your source, follow the guides below to learn more about Microsoft Purview and your data.