Duplicate Lineage got inserted into Azure Purview

Question

Sri Lakshman Velugubantla 20

Hi Microsoft Team,

I have registered and scanned the Databricks Unity Catalog in Azure Purview. Initially, all tables and views metadata were imported into Purview, but the lineage was not pulled in. After enabling the access schema in Unity Catalog, which includes the table_lineage and column_lineage tables containing lineage information, the lineage data was extracted and pulled into Azure Purview.

However, I am now encountering an issue with duplicate lineages for the same table in Azure Purview. This is because the table_lineage and column_lineage tables contain historical lineage data. We run notebooks daily, and each run stores lineage data in these tables with different entity_ids, resulting in duplicate lineages for each table.

I cannot clean up the lineage data in the Databricks tables as it would affect other aspects of the Databricks catalog.

Could you please help me resolve this issue? Is there a way for Purview to take only the latest lineage from these tables? Can I create a new table with only the latest lineage data and scan that table instead? Will this approach work?

If this works, which columns does Purview require to create a lineage process? Please let me know, and I will create a table with those columns, insert the latest lineage data, and scan again.

Chandra Boorla 14,675 Reputation points Microsoft External Staff Moderator

2025-01-06T17:23:52.9733333+00:00

Hi @Sri Lakshman Velugubantla

In addition to @David Broggy

Can you please tell me which columns are needed for purview to pick up lineage from this newly created table?

Thank you for your patience and apologies for the delay in response. To ensure that Azure Purview can successfully extract lineage information from your newly created table, you will need to include the following key columns:

source_entity_id - This column should identify the source of the data transformation. This could be the fully qualified name of a table, view, or even a specific file. For example: database.schema.table_name.

target_entity_id - This column should identify the destination of the data transformation. Similar to the source entity, it should be the fully qualified name. For example: database.schema.table_name.

relationship_type - While not strictly required for basic lineage, including a column describing the transformation (e.g., "SELECT," "JOIN," "FILTER," "AGGREGATE") can significantly enhance the lineage visualization and understanding within Purview.

timestamp - A timestamp column is crucial for identifying the latest lineage. This allows you to easily filter and select the most recent lineage information when populating your new table. This could be a timestamp of the operation execution or a lineage update timestamp.

column_name (optional) - If you need column-level lineage, you'll need additional columns to specify the source and target columns involved in the transformation. This might involve pairs of columns like source_column and target_column.

Once you have created this new table with the necessary columns, you can register and scan it in Azure Purview. This should help in avoiding duplicate lineage entries and ensure that only the latest lineage data is captured.

I hope this information helps. Please do let us know if you have any further queries.

Thank you.
Chandra Boorla 14,675 Reputation points Microsoft External Staff Moderator

2025-01-07T16:56:15+00:00

@Sri Lakshman Velugubantla

We haven’t heard back from you on the last response and are checking in to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, please respond with more details and we will try to help.
Sri Lakshman Velugubantla 20 Reputation points

2025-01-07T17:14:19.01+00:00

Hi @Chandra Boorla ,

I considered creating a separate table to hold the latest lineage, but updating it would require manual effort every time. Since the lineage is already available in the Databricks Unity Catalog lineage tables, is there any solution we can implement to get only the latest lineage from those tables into Azure Purview? Could you please check and let us know?
Chandra Boorla 14,675 Reputation points Microsoft External Staff Moderator

2025-01-08T15:57:06.27+00:00

@Sri Lakshman Velugubantla

Thank you for your detailed explanation of the issue you are facing with duplicate lineage entries in Azure Purview. I understand that maintaining a separate table for the latest lineage data can be cumbersome and requires manual effort.

Azure Purview does not currently have a built-in feature to automatically filter for the latest lineage from the historical data in the Databricks Unity Catalog lineage tables.

The best approach is to create views in Databricks on top of your existing lineage tables. These views should use window functions to select only the latest lineage based on the timestamp. Then, register and scan these views in Purview. This way, Purview will only ingest the latest lineage data automatically.
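As a rough illustration of the "latest lineage" view described above, here is a minimal sketch. It runs against an in-memory SQLite database purely as a stand-in for Databricks. The column names (entity_id, source_table_full_name, target_table_full_name, event_time), the view name latest_table_lineage, and the sample rows are assumptions for illustration, so please check them against the actual lineage tables in your workspace. Whether a Purview scan picks up lineage from such a view is not confirmed in this thread.

```python
import sqlite3

# In-memory SQLite stands in for Databricks here; only the SQL pattern matters.
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified lineage table. Column names are assumptions.
conn.execute(
    """
    CREATE TABLE table_lineage (
        entity_id TEXT,
        source_table_full_name TEXT,
        target_table_full_name TEXT,
        event_time TEXT
    )
    """
)

# Made-up sample data: three daily runs record the same source-to-target pair.
rows = [
    ("run-001", "test.silver.customer", "test.gold.customer", "2025-01-05T02:00:00"),
    ("run-002", "test.silver.customer", "test.gold.customer", "2025-01-06T02:00:00"),
    ("run-003", "test.silver.customer", "test.gold.customer", "2025-01-07T02:00:00"),
    ("run-004", "test.silver.orders", "test.gold.orders", "2025-01-07T03:00:00"),
]
conn.executemany("INSERT INTO table_lineage VALUES (?, ?, ?, ?)", rows)

# The view keeps one row per (source, target) pair: the most recent event.
conn.execute(
    """
    CREATE VIEW latest_table_lineage AS
    SELECT entity_id, source_table_full_name, target_table_full_name, event_time
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY source_table_full_name, target_table_full_name
                   ORDER BY event_time DESC
               ) AS rn
        FROM table_lineage
    )
    WHERE rn = 1
    """
)

latest = conn.execute(
    "SELECT entity_id, source_table_full_name, target_table_full_name, event_time "
    "FROM latest_table_lineage ORDER BY target_table_full_name"
).fetchall()

for row in latest:
    print(row)
```

Because it is a view rather than a copied table, it reflects new rows in the underlying lineage table without a manual refresh, which addresses the maintenance concern raised earlier in the thread.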

Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

I hope this information helps.

Thank you.
Sri Lakshman Velugubantla 20 Reputation points

2025-01-09T12:21:07.8033333+00:00

Hi @Chandra Boorla ,

I need to create a table with the columns below, right?

source_entity_id: format is catalog.schema.table

target_entity_id: format is catalog.schema.table

relationship_type: JOIN, SELECT, AGG

timestamp: Date

source_column: Name of the source column

target_column: Name of the target column

Say I create the above table and add lineage for one table, as shown below:

test.silver.customer,test.gold.customer,JOIN,2025-09,CustomerName,CustomerNm

test.silver.customer,test.gold.customer,JOIN,2025-09,CustomerNumber,CustomerNum

When I then scan it in Purview, what will the lineage look like? Will it create one process covering both columns, or two processes between the same two tables? If it creates two processes between the tables, one per column, that will be an issue.

If it creates two processes, how can I resolve that?

I am expecting one process that connects all source tables to that target table and carries the lineage for all columns.

Could you please help me with this?
Sri Lakshman Velugubantla 20 Reputation points

2025-01-10T18:35:03.2566667+00:00

Hi @Chandra Boorla, could you please answer my question above?
Chandra Boorla 14,675 Reputation points Microsoft External Staff Moderator

2025-01-13T06:06:02.21+00:00

@Sri Lakshman Velugubantla

We haven’t heard back from you on the last response and are checking in to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, please respond with more details and we will try to help.
Chandra Boorla 14,675 Reputation points Microsoft External Staff Moderator

2025-01-15T18:04:30.0466667+00:00

@Sri Lakshman Velugubantla

We haven’t heard back from you on the last response and are checking in to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, please respond with more details and we will try to help.

Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

Thank you.

2 answers

Answer 1

David Broggy 6,371 MVP Volunteer Moderator

Hi @sri lakshman,

I'm not aware of a feature for Purview to read just the latest lineage.

You would need to create another table which contains only the latest lineage and have Purview scan that.

I appreciate that would mean you need to maintain a new table but that's my recommended solution.

Good luck.

Sri Lakshman Velugubantla 20 Reputation points

2025-01-03T04:51:01.3733333+00:00

Hi @David Broggy ,

Can you please tell me which columns are needed for purview to pick up lineage from this newly created table?

Answer 2

@Sri Lakshman Velugubantla

Unfortunately, as of now, Azure Purview does not natively support combining multiple column-level lineage records into a single process during ingestion. Each row in the lineage table is ingested as an independent process, which is why you're observing separate processes for each column mapping.

You're right that Azure Purview, when ingesting lineage from a table like the one you described with source_column and target_column, will indeed create separate processes for each column mapping.

This means that instead of one process showing:

Process: Customer Transformation (JOIN)
    - Source: test.silver.customer
    - Target: test.gold.customer
        - Source Column: CustomerName
        - Target Column: CustomerNm
        - Source Column: CustomerNumber
        - Target Column: CustomerNum

You would actually see two separate processes:

Process 1: CustomerName to CustomerNm (JOIN)
    - Source: test.silver.customer
        - Source Column: CustomerName
    - Target: test.gold.customer
        - Target Column: CustomerNm

Process 2: CustomerNumber to CustomerNum (JOIN)
    - Source: test.silver.customer
        - Source Column: CustomerNumber
    - Target: test.gold.customer
        - Target Column: CustomerNum

This is a key limitation when using a table to import lineage into Purview. It doesn't have the capability to aggregate multiple column mappings into a single process representing a broader transformation.
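To make the difference between the two shapes concrete, here is a small sketch in plain Python that groups the per-column rows from the question into one record per source/target pair. The function name group_column_mappings and the tuple layout are hypothetical and only mirror the sample rows posted above. This only illustrates the aggregation the asker expected. It does not show a Purview feature, and nothing in this thread establishes that Purview can ingest the grouped form.

```python
from collections import defaultdict

# Sample rows copied from the question:
# (source, target, relationship_type, timestamp, source_column, target_column)
rows = [
    ("test.silver.customer", "test.gold.customer", "JOIN", "2025-09",
     "CustomerName", "CustomerNm"),
    ("test.silver.customer", "test.gold.customer", "JOIN", "2025-09",
     "CustomerNumber", "CustomerNum"),
]


def group_column_mappings(lineage_rows):
    """Collect column mappings under one key per (source, target, relationship)."""
    grouped = defaultdict(list)
    for source, target, relationship, _timestamp, src_col, tgt_col in lineage_rows:
        grouped[(source, target, relationship)].append((src_col, tgt_col))
    return dict(grouped)


processes = group_column_mappings(rows)

# One "process" per table pair, each carrying all of its column mappings.
for (source, target, relationship), mappings in processes.items():
    print(f"Process: {source} -> {target} ({relationship})")
    for src_col, tgt_col in mappings:
        print(f"    {src_col} -> {tgt_col}")
```

Row-per-mapping input yields two entries when each row is treated as its own process, whereas grouping by table pair yields a single entry with two column mappings, which is the shape described as unavailable above.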

We would appreciate it if you could share this feedback on our feedback channel, which is open for the user community to upvote and comment on. This allows our product teams to effectively prioritize your request against our existing feature backlog and gives insight into the potential impact of implementing the suggested feature.

I hope this information helps.

Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

Thank you.
