How to get lineage from Airflow into Microsoft Purview (Preview)
Airflow is an open-source workflow automation and scheduling platform that can be used to author and manage data pipelines. Microsoft Purview supports collecting Airflow lineage by integrating with OpenLineage, an open framework for data lineage collection and analysis. To learn how Airflow works with OpenLineage, see the OpenLineage documentation.
Enabling OpenLineage in Airflow automatically tracks metadata and lineage about jobs and datasets as DAGs execute. The information is sent to an Azure Event Hubs instance that you configure. Microsoft Purview subscribes to the events, parses them, and ingests them into the data map.
Important
This feature is currently in preview. The Supplemental Terms of Use for Microsoft Azure Previews include additional legal terms that apply to Azure features that are in beta, in preview, or otherwise not yet released into general availability.
Supported capabilities
The supported Airflow versions are 1.10+ and 2.0-2.7.
Microsoft Purview supports metadata and lineage collection when the following types of data sources are used in Airflow:
Amazon RDS for PostgreSQL
Azure Database for PostgreSQL
Google BigQuery
PostgreSQL
Snowflake
The following Airflow metadata is also captured:
Airflow workspace
Airflow DAG
Airflow task
Lineage is collected into Microsoft Purview upon successful DAG runs in an event-based manner.
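To make "event-based" more concrete, the sketch below builds a trimmed-down version of the kind of run event that OpenLineage emits when a task completes. It's only an illustration: the function name, the job namespace, and the dataset names are hypothetical, and real events carry more fields (for example, producer and schema information) than shown here.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, List, Tuple

Dataset = Tuple[str, str]  # (namespace, name), for example a database URI and a table


def build_complete_event(dag_id: str, task_id: str,
                         inputs: List[Dataset], outputs: List[Dataset]) -> Dict:
    """Build a simplified OpenLineage-style event for a successful task run."""
    return {
        "eventType": "COMPLETE",  # emitted when the run finishes successfully
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        # "my_airflow" is a hypothetical job namespace chosen for this sketch.
        "job": {"namespace": "my_airflow", "name": f"{dag_id}.{task_id}"},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }


# Hypothetical PostgreSQL source and target tables.
event = build_complete_event(
    "orders_dag",
    "load_orders",
    inputs=[("postgres://db.example.com:5432", "shop.public.orders")],
    outputs=[("postgres://db.example.com:5432", "shop.public.orders_daily")],
)
print(json.dumps(event, indent=2))
```

The input and output datasets in an event like this are what let a lineage consumer connect a task to the tables it reads and writes.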
Known limitations
Column-level lineage isn't currently supported, although the schema of the data assets is captured.
If database views are referenced in the tasks, they're currently captured as table assets.
All the metadata is ingested into the Microsoft Purview root collection. Assets that already exist in the data map stay in their configured collection.
How to bring Airflow lineage into Microsoft Purview
As a prerequisite, you need a running Airflow instance.
To get lineage from Airflow into Microsoft Purview, you need to:
Create an event hub and name it “microsoft_internal_openlineage”.
Go to your “microsoft_internal_openlineage” event hub -> Access control (IAM) -> Add role assignment, and assign the “Azure Event Hubs Data Receiver” role to your Microsoft Purview account’s managed identity. For detailed steps, see Assign Azure roles using the Azure portal.
Configure Event Hubs to publish messages to Microsoft Purview
Go to your Microsoft Purview account -> Managed resources tab, and disable the managed Event Hubs namespace.
Go to Kafka configuration tab -> + Add configuration -> Hook configuration, enter a name, and select the Event Hubs namespace and the event hub you created in the previous step.
Configure your Airflow with OpenLineage
Installation:
To download and install the latest ‘openlineage-airflow’ library, update the ‘requirements.txt’ file of your running Airflow instance with:
openlineage-airflow
Note
The Airflow version and the openlineage-airflow version need to match. For example, when you use Airflow 2.7.1, you can use openlineage-airflow version 1.1.0 or 1.2.0. You can view matching versions on this website.
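Before you compare versions, it can help to confirm what is actually installed in the Airflow environment. The following is a minimal sketch that uses only the Python standard library; the helper names are made up for this example, and it assumes the packages are installed under their PyPI distribution names, apache-airflow and openlineage-airflow.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it's missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def report_versions() -> str:
    """List the Airflow and OpenLineage integration versions side by side."""
    lines = []
    for name in ("apache-airflow", "openlineage-airflow"):
        lines.append(f"{name}: {installed_version(name) or 'not installed'}")
    return "\n".join(lines)


print(report_versions())
```

Run the script with the same Python interpreter that Airflow uses, and then check the reported pair against the published compatibility information.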
Configuration:
Next, configure your Azure Event Hubs instance as the target to which OpenLineage sends the events.
Create an ‘openlineage.yml’ file under your Airflow root path. The file defines the transport that OpenLineage uses to send events to your event hub, and it takes two values from your Event Hubs namespace: {EVENTHUB_SERVER} and {PASSWORD}.
Get the {EVENTHUB_SERVER} from your Event Hubs namespace -> Host name.
Get the {PASSWORD} from your Event Hubs namespace -> Shared access policies -> Connection string - primary key or Connection string - secondary key.
Restart your Airflow server.
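As a sketch of what the ‘openlineage.yml’ file might contain, the following helper renders a configuration for the OpenLineage Kafka transport from the two values described above. The layout is an assumption based on the OpenLineage Kafka transport settings and the Kafka endpoint of Azure Event Hubs (port 9093, SASL_SSL, and the literal user name $ConnectionString). The function name and the client.id value are hypothetical, so compare the output with the current documentation before you rely on it.

```python
from string import Template

# Assumed file layout; "$$" renders as a literal "$" in the output.
OPENLINEAGE_TEMPLATE = Template("""\
transport:
  type: kafka
  config:
    bootstrap.servers: "${eventhub_server}:9093"
    security.protocol: SASL_SSL
    sasl.mechanism: PLAIN
    sasl.username: "$$ConnectionString"
    sasl.password: "${password}"
    client.id: airflow-client
  topic: microsoft_internal_openlineage
  flush: true
""")


def render_openlineage_config(eventhub_server: str, password: str) -> str:
    """Fill in the Event Hubs host name and connection string."""
    return OPENLINEAGE_TEMPLATE.substitute(
        eventhub_server=eventhub_server,
        password=password,
    )


# The placeholders are left unresolved here; pass your own values instead.
print(render_openlineage_config("{EVENTHUB_SERVER}", "{PASSWORD}"))
```

Because the connection string is a secret, consider keeping the rendered file out of source control.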
Run Airflow jobs and view the assets/lineage
You can now run your Airflow jobs, then go to the Microsoft Purview governance portal to browse, search, and view assets. The assets should show up shortly after a successful DAG run.
Browse Airflow assets:
View Airflow task asset details with lineage:
Troubleshooting tips
If you run the Airflow job but don’t see the corresponding assets or lineage show up in Microsoft Purview:
Check if your Airflow use case is supported by Microsoft Purview. Refer to the supported capabilities section.
Go to your Event Hubs instance to check whether there are any incoming requests and messages. If there aren't, double-check your OpenLineage configuration in Airflow.
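When you double-check the OpenLineage configuration, a small script can catch the most common mistakes before you restart Airflow. The sketch below inspects settings that have already been loaded into a Python dictionary (for example, with a YAML parser of your choice). It assumes the Kafka transport layout used by OpenLineage and the usual Event Hubs Kafka endpoint settings; the function name and the specific checks are illustrative, not an official validation tool.

```python
from typing import Any, Dict, List

EXPECTED_TOPIC = "microsoft_internal_openlineage"  # event hub name used in this article


def find_config_problems(settings: Dict[str, Any]) -> List[str]:
    """Return readable descriptions of likely mistakes in the parsed settings."""
    problems: List[str] = []
    transport = settings.get("transport") or {}
    kafka = transport.get("config") or {}

    if transport.get("type") != "kafka":
        problems.append("transport.type should be 'kafka'")
    if transport.get("topic") != EXPECTED_TOPIC:
        problems.append(f"transport.topic should be '{EXPECTED_TOPIC}'")
    # The Kafka endpoint of Event Hubs listens on port 9093.
    if not str(kafka.get("bootstrap.servers", "")).endswith(":9093"):
        problems.append("bootstrap.servers should end with ':9093'")
    if kafka.get("security.protocol") != "SASL_SSL":
        problems.append("security.protocol should be 'SASL_SSL'")
    if kafka.get("sasl.mechanism") != "PLAIN":
        problems.append("sasl.mechanism should be 'PLAIN'")
    if kafka.get("sasl.username") != "$ConnectionString":
        problems.append("sasl.username should be the literal '$ConnectionString'")
    # The password is the full connection string, not only the key.
    if not str(kafka.get("sasl.password", "")).startswith("Endpoint=sb://"):
        problems.append("sasl.password should be the full connection string")
    return problems
```

An empty list means none of these checks failed. It doesn't prove that events reach the event hub, so still confirm incoming messages in the Azure portal.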