Databricks Unity Catalog Lineage

S, Santhosh M 0 Reputation points
2024-06-20T06:15:49.5666667+00:00

Hi,

I'm looking for help with data lineage in Databricks Unity Catalog. I'm trying to establish lineage between two schemas, each containing about 50 tables. Data for the first schema is loaded from source files via an ADF pipeline. The Scala code is set up so that data for the second schema is loaded from the first schema when the ADF pipeline for the second schema runs.

PS: I've confirmed that the schemas satisfy all the requirements for capturing lineage, i.e. the tables are in the Unity Catalog metastore, the queries use Spark DataFrames, I've been granted "All Privileges", the ADF pipeline spins up a new cluster every time it runs (with version >14), and outbound firewall rules are set up as expected.

Note: There are 3 sets of Scala code that are triggered at run time to load data from the first schema into the second.

Thanks

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. PRADEEPCHEEKATLA-MSFT 85,511 Reputation points Microsoft Employee
    2024-06-24T08:33:46.89+00:00

    @S, Santhosh M I apologize for the confusion earlier. You are correct that the Hive Metastore does not have the capability for lineage visualization.

    Regarding your specific query about establishing lineage between two schemas in Azure Databricks Unity Catalog, it is possible that the lineage is not being captured because the data for tables within the second schema is loaded from the first schema using Scala code.

    To capture lineage in this scenario, you can try the following steps:

    1. Make sure that the tables in both schemas are registered in the Azure Databricks Unity Catalog.
    2. Ensure that the Scala code used to load data from the first schema to the second schema is using Spark DataFrame APIs.
    3. Check if the Scala code is creating temporary views for the tables in both schemas. If not, you can create temporary views for the tables in both schemas using the createOrReplaceTempView method.
    4. Once the temporary views are created, you can use the spark.sql method to execute SQL queries that join the tables in both schemas. This should help the lineage to be captured in Azure Databricks Unity Catalog.
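
Steps 3 and 4 above can be sketched roughly as follows. This is only an illustration under assumptions, not the code from the original pipeline: the catalog, schema, and table names (main, bronze, silver, orders) and the helper names are hypothetical, and the target table is assumed to already exist with a compatible schema. The sketch is in Python (PySpark) for brevity; spark.table, createOrReplaceTempView, and spark.sql carry the same names in the Scala API. Whether routing the load through a temporary view changes what lineage records is not confirmed here.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a backtick-quoted three-level Unity Catalog name."""
    return ".".join(f"`{part}`" for part in (catalog, schema, table))


def build_load_statement(view_name: str, target: str) -> str:
    """Build the SQL that writes the temporary view into the target table."""
    # Assumption: the target table already exists with a compatible schema.
    return f"INSERT OVERWRITE TABLE {target} SELECT * FROM {view_name}"


def load_table(spark, catalog: str, source_schema: str,
               target_schema: str, table: str) -> str:
    """Load one table from the first schema into the second via a temp view.

    `spark` is an existing SparkSession (predefined in Databricks notebooks).
    Returns the SQL statement that was executed.
    """
    source = qualified_name(catalog, source_schema, table)
    target = qualified_name(catalog, target_schema, table)
    view_name = f"{table}_source_vw"  # hypothetical naming convention

    # Step 3: expose the source table as a temporary view.
    spark.table(source).createOrReplaceTempView(view_name)

    # Step 4: run the load through spark.sql.
    statement = build_load_statement(view_name, target)
    spark.sql(statement)
    return statement
```

In a Databricks notebook, a call such as load_table(spark, "main", "bronze", "silver", "orders") would run the load for one table; looping over the table names would cover the rest of the schema.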

    If lineage between the two schemas is still not being captured, I recommend checking the logs and error messages to identify the root cause. You can also refer to the Azure Databricks documentation for more information on data lineage in Unity Catalog.
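
If lineage system tables are enabled for the account, one way to check whether any lineage events were recorded between the two schemas is to query system.access.table_lineage. The following is a minimal sketch, assuming hypothetical catalog and schema names (main, bronze, silver) and read access to the system.access schema. The helper name build_lineage_check is illustrative, and the column names follow the documented table_lineage schema as best understood, so verify them against your workspace.

```python
import re

_SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def _checked(name: str) -> str:
    """Reject names that are unsafe to place inside a SQL string literal."""
    if not _SAFE_NAME.match(name):
        raise ValueError(f"unexpected characters in name: {name!r}")
    return name


def build_lineage_check(catalog: str, source_schema: str,
                        target_schema: str, days: int = 7) -> str:
    """Build a query listing recent lineage events between two schemas."""
    catalog = _checked(catalog)
    source_schema = _checked(source_schema)
    target_schema = _checked(target_schema)
    return (
        "SELECT source_table_full_name, target_table_full_name, "
        "entity_type, event_time\n"
        "FROM system.access.table_lineage\n"
        f"WHERE source_table_catalog = '{catalog}'\n"
        f"  AND source_table_schema = '{source_schema}'\n"
        f"  AND target_table_catalog = '{catalog}'\n"
        f"  AND target_table_schema = '{target_schema}'\n"
        f"  AND event_date >= date_sub(current_date(), {int(days)})\n"
        "ORDER BY event_time DESC"
    )
```

Passing the result to spark.sql(...) in a notebook and displaying it would show which source and target table pairs, if any, were recorded in the chosen window. An empty result for tables that were just loaded suggests the load did not produce lineage events.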

    I hope this helps! Let me know if you have any further questions.