Duplicate Lineage got inserted into Azure Purview

Sri Lakshman Velugubantla 20 Reputation points
2025-01-02T17:40:17.5266667+00:00

Hi Microsoft Team,

I have registered and scanned the Databricks Unity Catalog in Azure Purview. Initially, the metadata for all tables and views was imported into Purview, but the lineage was not pulled in. After I enabled the access schema in Unity Catalog, which includes the table_lineage and column_lineage tables containing lineage information, the lineage data was extracted and pulled into Azure Purview.

However, I am now encountering an issue with duplicate lineages for the same table in Azure Purview. This is because the table_lineage and column_lineage tables contain historical lineage data. We run notebooks daily, and each run stores lineage data in these tables with different entity_ids, resulting in duplicate lineages for each table.

I cannot clean up the lineage data in the Databricks tables as it would affect other aspects of the Databricks catalog.

Could you please help me resolve this issue? Is there a way for Purview to take only the latest lineage from these tables? Can I create a new table with only the latest lineage data and scan that table instead? Will this approach work?

If this works, which columns does Purview require to create a lineage process? Please let me know, and I will create a table with those columns, insert the latest lineage data, and scan again.

Microsoft Security | Microsoft Purview

2 answers

  1. David Broggy 6,371 Reputation points MVP Volunteer Moderator
    2025-01-02T19:40:05.4733333+00:00

    Hi @sri lakshman,

    I'm not aware of a feature for Purview to read just the latest lineage.

    You would need to create another table which contains only the latest lineage and have Purview scan that.

    I appreciate that would mean you need to maintain a new table but that's my recommended solution.

    Good luck.
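
As an illustration of the "latest lineage only" table suggested above, here is a minimal Python sketch of the deduplication step. It assumes each lineage row carries `source_table_full_name`, `target_table_full_name`, `entity_id`, and `event_time` fields; those names are assumptions about the layout of table_lineage, and the `test.silver.orders` rows are made-up sample data. The thread does not confirm which columns Purview needs or that Purview will build lineage from a custom table, so treat this as a sketch of the filtering logic only.

```python
from datetime import datetime


def latest_lineage(rows):
    """Keep only the most recent row per (source table, target table) pair."""
    latest = {}
    for row in rows:
        key = (row["source_table_full_name"], row["target_table_full_name"])
        kept = latest.get(key)
        # Replace the stored row when this one is newer.
        if kept is None or row["event_time"] > kept["event_time"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample: two daily runs of the same notebook plus one other edge.
rows = [
    {
        "source_table_full_name": "test.silver.customer",
        "target_table_full_name": "test.gold.customer",
        "entity_id": "run-1",
        "event_time": datetime(2025, 1, 1, 6, 0),
    },
    {
        "source_table_full_name": "test.silver.customer",
        "target_table_full_name": "test.gold.customer",
        "entity_id": "run-2",
        "event_time": datetime(2025, 1, 2, 6, 0),
    },
    {
        "source_table_full_name": "test.silver.orders",
        "target_table_full_name": "test.gold.orders",
        "entity_id": "run-3",
        "event_time": datetime(2025, 1, 1, 6, 0),
    },
]

deduped = latest_lineage(rows)
for row in deduped:
    print(row["source_table_full_name"], "->", row["target_table_full_name"], row["entity_id"])
```

The same "newest row per source and target pair" rule could be written as a window query in Databricks and materialized into the separate table, which would then need to be refreshed after each daily run.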


  2. Chandra Boorla 14,675 Reputation points Microsoft External Staff Moderator
    2025-01-10T21:24:41.01+00:00

    @Sri Lakshman Velugubantla

    Unfortunately, as of now, Azure Purview does not natively support combining multiple column-level lineage records into a single process during ingestion. Each row in the lineage table is ingested as an independent process, which is why you're observing separate processes for each column mapping.

    You're right that Azure Purview, when ingesting lineage from a table like the one you described with source_column and target_column, will indeed create separate processes for each column mapping.

    This means that instead of one process showing:

    Process: Customer Transformation (JOIN)
      - Source: test.silver.customer
      - Target: test.gold.customer
          - Source Column: CustomerName
          - Target Column: CustomerNm
          - Source Column: CustomerNumber
          - Target Column: CustomerNum
    

    You would actually see two separate processes:

    Process 1: CustomerName to CustomerNm (JOIN)
      - Source: test.silver.customer
          - Source Column: CustomerName
      - Target: test.gold.customer
          - Target Column: CustomerNm

    Process 2: CustomerNumber to CustomerNum (JOIN)
      - Source: test.silver.customer
          - Source Column: CustomerNumber
      - Target: test.gold.customer
          - Target Column: CustomerNum
    

    This is a key limitation of using a table to import lineage into Purview: it cannot aggregate multiple column mappings into a single process that represents a broader transformation.
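
To make the difference between the two shapes concrete, the following Python sketch groups per-column rows into one entry per source and target table pair. The `source_column` and `target_column` fields come from the discussion above; `source_table` and `target_table` are hypothetical field names chosen for the example. The code only reshapes data in memory. It does not call any Purview API and does not change how Purview ingests the rows.

```python
from collections import defaultdict


def group_column_mappings(rows):
    """Collect column mappings under their (source table, target table) pair."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["source_table"], row["target_table"])
        grouped[key].append((row["source_column"], row["target_column"]))
    return dict(grouped)


# One row per column mapping, as in the two-process example above.
rows = [
    {
        "source_table": "test.silver.customer",
        "target_table": "test.gold.customer",
        "source_column": "CustomerName",
        "target_column": "CustomerNm",
    },
    {
        "source_table": "test.silver.customer",
        "target_table": "test.gold.customer",
        "source_column": "CustomerNumber",
        "target_column": "CustomerNum",
    },
]

processes = group_column_mappings(rows)
for (source, target), mappings in processes.items():
    print(f"{source} -> {target}")
    for source_column, target_column in mappings:
        print(f"  {source_column} -> {target_column}")
```

Two input rows collapse into a single grouped entry here, which is the aggregated view; ingesting row by row yields one process per row instead.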

    I would appreciate it if you could share this feedback on our feedback channel, which is open for the user community to upvote and comment on. This allows our product teams to prioritize your request against the existing feature backlog and gives insight into the potential impact of implementing the suggested feature.

    I hope this information helps.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.
