Microsoft Purview Data Lineage best practices

Data lineage is broadly understood as the lifecycle of data: where it originates and how it moves over time across the data estate. Microsoft Purview can capture lineage for data in different parts of your organization's data estate and at different levels of preparation, including:

  • Raw data staged from various platforms
  • Transformed and prepared data
  • Data used by visualization platforms

Why do you need to adopt lineage?

Data lineage is the process of describing what data exists, where it's stored, and how it flows between systems. There are many reasons why data lineage is important, but at a high level they fall into three categories:

  • Track data in reports
  • Impact analysis
  • Capture the changes and where the data has resided through the data life cycle

Azure Data Factory Lineage best practice and considerations

Azure Data Factory instance

  • Data lineage isn't reported to the catalog until the Data Factory connection status is Connected. Connections in the other statuses, Disconnected and CannotAccess, can't capture lineage.

    Screenshot showing a data factory connection list.

  • Each Data Factory instance can connect to only one Microsoft Purview account. You can establish a new connection from another Microsoft Purview account, but doing so changes the existing connection to Disconnected.

    Screenshot showing warning to disconnect Azure Data Factory.

  • The data factory's managed identity is used to authenticate lineage push operations to the Microsoft Purview account. The managed identity needs the Data Curator role on the Microsoft Purview root collection.

  • Currently, only 10 data factories can be connected at a time. To add more than 10, add the connections in batches of up to 10 using the wizard, or use the API to connect more than 10 data factories in one operation.
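The connection rules above can be sketched in a few lines of Python. This is only an illustration for planning purposes, not a Microsoft Purview API: the function names and the factory names are hypothetical, while the status strings and the limit of 10 come from the list above.

```python
from typing import Iterator, List, Sequence

# Status values named in the text; only "Connected" lets lineage reach the catalog.
LINEAGE_CAPABLE_STATUS = "Connected"
# Limit stated in the text for connections added at one time with the wizard.
MAX_FACTORIES_PER_BATCH = 10


def reports_lineage(status: str) -> bool:
    """Return True only for the status that allows lineage to be reported."""
    return status == LINEAGE_CAPABLE_STATUS


def batches(factories: Sequence[str],
            size: int = MAX_FACTORIES_PER_BATCH) -> Iterator[List[str]]:
    """Yield successive groups of at most `size` data factory names."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(factories), size):
        yield list(factories[start:start + size])


if __name__ == "__main__":
    # 23 hypothetical factory names, split into groups for the wizard.
    names = [f"adf-{i:02d}" for i in range(1, 24)]
    for group in batches(names):
        print(len(group), group[0], "to", group[-1])
```

With 23 factories, the helper yields two groups of 10 and one group of 3, which matches adding connections 10 at a time.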

Azure Data Factory activities

  • Microsoft Purview captures runtime lineage from the following Azure Data Factory activities: Copy, Data flow, and Execute SSIS Package.

  • Microsoft Purview drops lineage if the source or destination uses an unsupported data storage system.

  • Microsoft Purview can't capture lineage if the Azure Data Factory copy activity uses any of the features listed under "Limitations on copy activity lineage" in Connect to Azure Data Factory.

  • For the Data flow activity, Microsoft Purview supports lineage only for the source and sink. Lineage for data flow transformations isn't supported yet.

  • Data flow lineage doesn't integrate with Microsoft Purview resource sets. Resource set example:

    Qualified name: https://myblob.blob.core.windows.net/sample-data/data{N}.csv
    Display name: "data"

  • For the Execute SSIS Package activity, Microsoft Purview supports lineage only for the source and destination. Lineage for transformations isn't supported yet.

    Screenshot of the Execute SSIS lineage in Microsoft Purview.

  • For instructions, refer to the step-by-step guide to push Azure Data Factory lineage to Microsoft Purview.
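To make the resource set example in the list above more concrete, here is a minimal sketch of how one `{N}` qualified name can stand for many individual files. It assumes that `{N}` represents a run of digits, which is a simplification of the actual resource set rules in Microsoft Purview. The helper name `resource_set_to_regex` is hypothetical and isn't part of any Purview SDK.

```python
import re

# Qualified name taken from the resource set example in the text.
RESOURCE_SET = "https://myblob.blob.core.windows.net/sample-data/data{N}.csv"


def resource_set_to_regex(qualified_name: str) -> "re.Pattern[str]":
    """Build a regex in which every {N} placeholder matches one or more digits.

    Assumption: {N} means "a number". Real resource set grouping uses
    more pattern kinds than this sketch handles.
    """
    literal_parts = [re.escape(part) for part in qualified_name.split("{N}")]
    return re.compile(r"\d+".join(literal_parts))


if __name__ == "__main__":
    pattern = resource_set_to_regex(RESOURCE_SET)
    base = "https://myblob.blob.core.windows.net/sample-data/"
    for file_name in ("data1.csv", "data42.csv", "dataA.csv"):
        matched = pattern.fullmatch(base + file_name) is not None
        print(file_name, "belongs to the resource set:", matched)
```

Under this assumption, `data1.csv` and `data42.csv` fall under the single resource set asset named "data", while `dataA.csv` doesn't.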

Build custom lineage manually or with REST APIs

One of the important platform features of Microsoft Purview is the ability to show the lineage between datasets created by data processes. Systems like Data Factory, Data Share, and Power BI capture the lineage of data as it moves. In some situations, the lineage that Microsoft Purview generates automatically is incomplete or missing, which limits its usefulness for visualization and enterprise reporting. In those scenarios, you can create custom lineage entries manually in the Microsoft Purview portal, or programmatically through Apache Atlas hooks and the REST API. Using the REST APIs to build custom lineage also helps you work around the functional limitations of manual lineage.

To build custom lineage manually, you can follow this user guide: Manual lineage entries in Microsoft Purview.

To build custom lineage in Microsoft Purview using the REST APIs, follow this user guide: Microsoft Purview - Building Custom Lineage using REST APIs.
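As a rough illustration of the REST approach, the sketch below builds an Apache Atlas style `Process` entity that links one input asset to one output asset. It only constructs and prints the JSON payload and doesn't call any service. The helper names, the asset type names (`azure_blob_path`, `azure_sql_table`), and all qualified names are assumptions for illustration; check the linked guide for the endpoint, the authentication steps, and the type names that apply to your account.

```python
import json
from typing import Any, Dict, List


def asset_ref(type_name: str, qualified_name: str) -> Dict[str, Any]:
    """Reference an existing catalog asset by its unique qualifiedName."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_process_entity(name: str,
                         qualified_name: str,
                         inputs: List[Dict[str, Any]],
                         outputs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build an Atlas-style Process entity whose inputs and outputs form the lineage edge."""
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "name": name,
                "qualifiedName": qualified_name,
                "inputs": inputs,
                "outputs": outputs,
            },
        }
    }


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    payload = build_process_entity(
        name="copy_sales",
        qualified_name="custom://etl/copy_sales",
        inputs=[asset_ref(
            "azure_blob_path",
            "https://example.blob.core.windows.net/raw/sales.csv")],
        outputs=[asset_ref(
            "azure_sql_table",
            "mssql://example.database.windows.net/db/dbo/sales")],
    )
    print(json.dumps(payload, indent=2))
```

Referencing assets by `qualifiedName` rather than by GUID keeps the sketch independent of any particular catalog instance, but it assumes the input and output assets already exist in the catalog.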

Tip

In some cases, the REST APIs can provide more input and customization options than building the lineage entries manually through the portal.

Next steps