How to get lineage from Airflow into Microsoft Purview (Preview)

Airflow is an open-source workflow automation and scheduling platform that can be used to author and manage data pipelines. Microsoft Purview supports collecting Airflow lineage by integrating with OpenLineage, an open framework for data lineage collection and analysis. For details on how Airflow works with OpenLineage, see the OpenLineage documentation.

When OpenLineage is enabled in Airflow, it automatically tracks metadata and lineage about jobs and datasets as DAGs execute. The information is sent to an Azure Event Hubs instance that you configure. Microsoft Purview subscribes to the events, parses them, and ingests them into the data map.

Important

This feature is currently in preview. The Supplemental Terms of Use for Microsoft Azure Previews include additional legal terms that apply to Azure features that are in beta, in preview, or otherwise not yet released into general availability.

Supported capabilities

The supported Airflow versions are 1.10+ and 2.0-2.7.

Microsoft Purview supports metadata and lineage collection when the following types of data sources are used in Airflow:

  • Amazon RDS for PostgreSQL
  • Azure Database for PostgreSQL
  • Google BigQuery
  • PostgreSQL
  • Snowflake

The following Airflow metadata is also captured:

  • Airflow workspace
  • Airflow DAG
  • Airflow task

Lineage is collected into Microsoft Purview upon successful DAG runs in an event-based manner.
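
To make the event-based flow concrete, the following Python sketch builds a simplified OpenLineage run event of the kind emitted when a task finishes. It is an illustration only: the DAG, task, host, and table names are hypothetical, the producer URL is a placeholder, and a real event carries additional fields (such as a schema URL and facets) that are omitted here.

```python
import json
import uuid
from datetime import datetime, timezone


def build_example_event(dag_id: str, task_id: str) -> dict:
    """Build a simplified, illustrative OpenLineage run event for one Airflow task."""
    return {
        "eventType": "COMPLETE",  # emitted when the task run finishes successfully
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        # Hypothetical job namespace; the job name combines DAG and task IDs.
        "job": {"namespace": "example-airflow", "name": f"{dag_id}.{task_id}"},
        # Hypothetical PostgreSQL datasets read and written by the task.
        "inputs": [
            {"namespace": "postgres://example-host:5432", "name": "sales.public.orders"}
        ],
        "outputs": [
            {"namespace": "postgres://example-host:5432", "name": "sales.public.orders_summary"}
        ],
        "producer": "https://example.com/hypothetical-producer",
    }


event = build_example_event("orders_dag", "summarize_orders")
print(json.dumps(event, indent=2))
```

The job entry corresponds to the Airflow DAG and task assets, and the inputs and outputs correspond to the data source assets that the lineage connects.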

Known limitations

  • Column-level lineage is currently not supported. The schema of the data assets is captured.
  • If database views are referenced in the tasks, they're currently captured as table assets.
  • All metadata is ingested into the Microsoft Purview root collection. Assets that already exist in the data map are kept in their configured collection.

How to bring Airflow lineage into Microsoft Purview

As a prerequisite, you need a running Airflow instance.

To get lineage from Airflow into Microsoft Purview, you need to:

  1. Set up an Azure Event Hubs
  2. Configure Event Hubs to publish messages to Microsoft Purview
  3. Configure your Airflow with OpenLineage
  4. Run Airflow jobs and view the assets/lineage

Set up Azure Event Hubs

Set up an Azure Event Hubs instance as the receiver of the metadata and lineage tracked by OpenLineage in Airflow.

  1. Create an Azure Event Hubs namespace.

  2. Create an event hub. Name the event hub “microsoft_internal_openlineage”.

    Screenshot of creating a new event hub in Azure Event Hubs

  3. Go to your “microsoft_internal_openlineage” event hub -> Access control (IAM) -> Add role assignment, and assign the “Azure Event Hubs Data Receiver” role to your Microsoft Purview account’s managed identity. For detailed steps, see Assign Azure roles using the Azure portal.

Configure Event Hubs to publish messages to Microsoft Purview

Microsoft Purview supports consuming events from, and pushing events to, your own Event Hubs. To configure Event Hubs for Microsoft Purview, see Configure Event Hubs with Microsoft Purview to send and receive Atlas Kafka topics messages.

In summary:

  • Go to your Microsoft Purview account -> Managed resources tab, disable the managed Event Hubs namespace.

  • Go to the Kafka configuration tab -> + Add configuration -> Hook configuration, enter a name, and select the Event Hubs namespace and event hub you created in the previous step.

    Screenshot of configure Event Hubs to publish messages to Microsoft Purview

Configure your Airflow with OpenLineage

Installation:

To download and install the latest ‘openlineage-airflow’ library, update the ‘requirements.txt’ file of your running Airflow instance with:

openlineage-airflow

Note

The Airflow version and the openlineage-airflow version need to match. For example, when you use Airflow 2.7.1, you can use openlineage-airflow version 1.1.0 or 1.2.0. You can view matching versions on this website.

Configuration:

Next, configure your Azure Event Hubs instance as the target to which OpenLineage sends the events.

  1. Create an ‘openlineage.yml’ file under your Airflow root path with the following content:

    transport:
      type: "kafka"
      config:
        bootstrap.servers: "{EVENTHUB_SERVER}:9093"
        security.protocol: "SASL_SSL"
        sasl.mechanism: "PLAIN"
        sasl.username: "$ConnectionString"
        sasl.password: "{PASSWORD}"
        client.id: "airflow-client"
      topic: "microsoft_internal_openlineage"
      flush: True
    

    Replace the two placeholders with your own values:

    • Get the {EVENTHUB_SERVER} from your Event Hubs namespace -> Host name:

      Screenshot of Get host name from Event Hubs namespace

    • Get the {PASSWORD} from your Event Hubs namespace -> Shared access policies -> Connection string - primary key or Connection string - secondary key:

      Screenshot of Get event hub connection string key

  2. Restart your Airflow server.
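
As an optional alternative to editing the file in step 1 by hand, the Python sketch below fills in the same template from an Event Hubs connection string. It assumes the connection string has the usual `Endpoint=sb://<namespace>.servicebus.windows.net/;...` form, so that the host name can be derived from it. The helper names and the example connection string are hypothetical and not part of any Microsoft or OpenLineage tooling.

```python
# Template mirroring the openlineage.yml content described in step 1.
TEMPLATE = """\
transport:
  type: "kafka"
  config:
    bootstrap.servers: "{server}:9093"
    security.protocol: "SASL_SSL"
    sasl.mechanism: "PLAIN"
    sasl.username: "$ConnectionString"
    sasl.password: "{password}"
    client.id: "airflow-client"
  topic: "microsoft_internal_openlineage"
  flush: True
"""


def host_from_connection_string(conn_str: str) -> str:
    """Extract the namespace host name from an Event Hubs connection string."""
    for part in conn_str.split(";"):
        key, _, value = part.partition("=")
        if key.strip() == "Endpoint":
            # Assumed form: sb://<namespace>.servicebus.windows.net/
            return value.strip().split("://", 1)[-1].rstrip("/")
    raise ValueError("connection string has no Endpoint field")


def render_openlineage_yml(conn_str: str) -> str:
    """Fill the template: host name as the server, full connection string as the password."""
    return TEMPLATE.format(
        server=host_from_connection_string(conn_str),
        password=conn_str,
    )


# Hypothetical connection string; copy the real one from Shared access policies.
example_conn_str = (
    "Endpoint=sb://example-ns.servicebus.windows.net/;"
    "SharedAccessKeyName=RootManageSharedAccessKey;"
    "SharedAccessKey=REPLACE_ME"
)
print(render_openlineage_yml(example_conn_str))
```

The rendered text can be saved as ‘openlineage.yml’ under the Airflow root path. Because the file then contains a shared access key, treat it as a secret.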

Run Airflow jobs and view the assets/lineage

You can now run your Airflow jobs, then go to the Microsoft Purview governance portal to browse, search, and view assets. The assets should show up shortly after a successful DAG run.

Browse Airflow assets:

Screenshot of Browse data catalog by source type

Screenshot of Browse Airflow source to see details

View Airflow task asset details with lineage:

Screenshot of View Airflow lineage details on Airflow task asset

Troubleshooting tips

If you run the Airflow job but don’t see the corresponding assets or lineage show up in Microsoft Purview:

  • Check whether your Airflow use case is supported by Microsoft Purview. Refer to the supported capabilities section.
  • Go to your Event Hubs instance and check whether there are any incoming requests and messages. If there are none, double-check your OpenLineage configuration in Airflow.
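
When double-checking the OpenLineage configuration, a small script can catch the most common mistakes before you restart Airflow. The Python sketch below is one possible check, not an official tool: it scans the text of ‘openlineage.yml’ line by line (it is not a full YAML parser) and reports leftover placeholders and values that differ from the ones this article describes. The function name and the checks are assumptions made for illustration.

```python
from typing import List

EXPECTED_TOPIC = "microsoft_internal_openlineage"


def check_openlineage_config(text: str) -> List[str]:
    """Return a list of likely problems found in the openlineage.yml text."""
    problems = []
    if "{EVENTHUB_SERVER}" in text or "{PASSWORD}" in text:
        problems.append("a placeholder was not replaced")

    # Naive "key: value" scan; enough for the flat settings used in this file.
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            settings[key.strip()] = value.strip().strip('"')

    if not settings.get("bootstrap.servers", "").endswith(":9093"):
        problems.append("bootstrap.servers should end with port 9093")
    if settings.get("sasl.username") != "$ConnectionString":
        problems.append('sasl.username should be the literal "$ConnectionString"')
    if not settings.get("sasl.password", "").startswith("Endpoint=sb://"):
        problems.append("sasl.password should be the full connection string")
    if settings.get("topic") != EXPECTED_TOPIC:
        problems.append("topic should be " + EXPECTED_TOPIC)
    return problems


# Example: a config in which the password placeholder was left in place.
sample = """\
transport:
  type: "kafka"
  config:
    bootstrap.servers: "example-ns.servicebus.windows.net:9093"
    sasl.username: "$ConnectionString"
    sasl.password: "{PASSWORD}"
  topic: "microsoft_internal_openlineage"
"""
for problem in check_openlineage_config(sample):
    print("Problem:", problem)
```

An empty result only means that these basic checks passed. It does not confirm that the access key is valid or that the Event Hubs namespace is reachable from the Airflow host.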

Next steps