Configure PostgreSQL for ingestion into Azure Databricks

Important

The PostgreSQL connector for Lakeflow Connect is in Public Preview. Reach out to your Databricks account team to enroll in the Public Preview.

This page describes the source setup tasks for ingestion from PostgreSQL into Azure Databricks using Lakeflow Connect.

Logical replication for change data capture

The PostgreSQL connector uses logical replication to track changes in the source tables. Logical replication allows the connector to capture data modifications (inserts, updates, and deletes) without requiring triggers or significant overhead on the source database.

Logical replication with Lakeflow Connect requires the following:

  1. PostgreSQL version 13 or later. Lakeflow Connect supports data replication from PostgreSQL 13 and later.

  2. A database configured for logical replication: the PostgreSQL parameter wal_level must be set to logical.

  3. Publications that include all tables you want to replicate.

  4. A replication slot for each database that will be replicated.

Note

You must create publications before you create replication slots.

For more information about logical replication, see the Logical Replication documentation on the PostgreSQL website.

Overview of source setup tasks

Complete the following tasks in PostgreSQL before ingesting data into Azure Databricks:

  1. Verify PostgreSQL 13 or above

  2. Configure network access (security groups, firewall rules, or VPN)

  3. Configure logical replication:

    • Enable logical replication (wal_level = logical)
    • Create a replication user
    • Set the replica identity for each table
    • Create a publication and a replication slot in each database

  4. Optional: Configure inline DDL tracking for automatic schema change detection. If you want to use inline DDL tracking, contact Databricks Support.

Important

If you plan to replicate from multiple PostgreSQL databases, you must create a separate publication and replication slot for each database. The inline DDL tracking script (if used) must also be executed in each database.

Configure logical replication

To enable logical replication in PostgreSQL, configure the database settings and set up the necessary objects.

Set the WAL level to logical

The Write-Ahead Log (WAL) must be configured for logical replication. This setting typically requires a database restart.

  1. Check the current wal_level setting:

    SHOW wal_level;
    
  2. If the value is not logical, set wal_level = logical in the server configuration and restart the PostgreSQL service.
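
If you have superuser access on a self-managed instance, one way to apply the setting is with ALTER SYSTEM (a sketch; managed services typically expose wal_level through their own parameter or flag interfaces instead):

-- Requires superuser; the new value takes effect only after a server restart
ALTER SYSTEM SET wal_level = logical;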

Create a replication user

Create a dedicated user for Databricks ingestion with replication privileges:

CREATE USER databricks_replication WITH PASSWORD 'your_secure_password';
GRANT CONNECT ON DATABASE your_database TO databricks_replication;
ALTER USER databricks_replication WITH REPLICATION;

For detailed privilege requirements, see PostgreSQL database user requirements.
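
Beyond the replication privilege, the user typically also needs read access to the replicated tables for the initial snapshot. A minimal sketch, assuming the tables live in a schema named schema_name (adjust to your environment; the linked requirements page is authoritative):

GRANT USAGE ON SCHEMA schema_name TO databricks_replication;
GRANT SELECT ON ALL TABLES IN SCHEMA schema_name TO databricks_replication;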

Set replica identity for tables

For each table you want to replicate, configure the replica identity. The correct setting depends on the table structure:

Table structure | Required REPLICA IDENTITY | Command
Table has a primary key and does not contain TOASTable columns (for example, TEXT, BYTEA, or VARCHAR(n) with large values) | DEFAULT | ALTER TABLE schema_name.table_name REPLICA IDENTITY DEFAULT;
Table has a primary key but includes large variable-length (TOASTable) columns | FULL | ALTER TABLE schema_name.table_name REPLICA IDENTITY FULL;
Table does not have a primary key | FULL | ALTER TABLE schema_name.table_name REPLICA IDENTITY FULL;

For more information about replica identity settings, see Replica Identity in the PostgreSQL documentation.
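
To decide which setting each table needs, it can help to list the tables in a schema that lack a primary key; these must use REPLICA IDENTITY FULL. A sketch against the system catalogs (your_schema is a placeholder):

-- Tables without a primary key constraint (contype = 'p') need REPLICA IDENTITY FULL
SELECT n.nspname AS schema_name, c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE c.relkind = 'r'
  AND n.nspname = 'your_schema'
  AND NOT EXISTS (
    SELECT 1 FROM pg_constraint con
    WHERE con.conrelid = c.oid AND con.contype = 'p'
  );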

Create a publication

Create a publication in each database that includes the tables you want to replicate:

-- Create a publication for specific tables
CREATE PUBLICATION databricks_publication FOR TABLE schema_name.table1, schema_name.table2;

-- Or create a publication for all tables in a database
CREATE PUBLICATION databricks_publication FOR ALL TABLES;

Note

You must create a separate publication in each PostgreSQL database that you want to replicate.
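
After creating the publication, you can confirm which tables it covers by querying the pg_publication_tables view:

SELECT pubname, schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'databricks_publication';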

Configure replication slot parameters

Before creating replication slots, configure the following server parameters:

Limit WAL retention for replication slots

Parameter: max_slot_wal_keep_size

Do not set max_slot_wal_keep_size to -1 (the default value): this allows unbounded WAL growth when a replication slot lags or becomes inactive. Set this parameter to a finite value appropriate for your workload.

Learn more about the max_slot_wal_keep_size parameter in the official PostgreSQL documentation.
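
For illustration, the following caps retained WAL at 50 GB (an example value only; size it to your WAL volume and the longest connector lag you expect). max_slot_wal_keep_size can be changed with a configuration reload, without a restart:

ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();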

Note

Some managed cloud providers do not allow modification of this parameter and instead rely on built-in slot monitoring and auto-cleanup. Review the platform behavior before setting operational alerts.

Configure replication slot capacity

Parameter: max_replication_slots

Each PostgreSQL database being replicated requires one logical replication slot. Set this parameter to at least the number of databases being replicated, plus any existing replication needs.

Configure WAL senders

Parameter: max_wal_senders

This parameter defines the maximum number of concurrent WAL sender processes that stream WAL data to subscribers. In most cases, you should have one WAL sender process for each replication slot to ensure efficient and consistent data replication.

Configure max_wal_senders to be at least equal to the number of replication slots in use, accounting for any other existing usage. It is recommended to set it slightly higher to provide operational flexibility.
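
As a sketch, the following sizes both parameters for two replicated databases on a server with no other replication consumers (example values only; adjust for your environment):

-- Both settings take effect only after a server restart
ALTER SYSTEM SET max_replication_slots = 4;
ALTER SYSTEM SET max_wal_senders = 6;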

Create a replication slot

In each database, create a replication slot that the Databricks ingestion gateway uses to track changes:

-- Create a replication slot with the pgoutput plugin
SELECT pg_create_logical_replication_slot('databricks_slot', 'pgoutput');

Important

  • Replication slots hold WAL data until consumed by the connector. Configure the max_slot_wal_keep_size parameter to limit WAL retention and prevent unbounded WAL growth. See Configure replication slot parameters for details.
  • When you delete an ingestion pipeline, you must manually drop the associated replication slot. See Clean up replication slots.
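
To monitor how much WAL each slot is retaining, and to drop a slot after deleting its pipeline, you can use queries like the following:

-- Report the WAL held back by each slot
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Drop a slot once its ingestion pipeline is deleted (the slot must be inactive)
SELECT pg_drop_replication_slot('databricks_slot');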

Optional: Configure inline DDL tracking

Inline DDL tracking is an optional feature that allows the connector to automatically detect and apply schema changes from the source database. This feature is disabled by default.

Warning

Inline DDL tracking is currently in preview and requires contacting Databricks Support to enable it for your workspace.

For information about which schema changes are handled automatically and which require a full refresh, see How do managed connectors handle schema evolution? and Schema evolution.

Set up inline DDL tracking

If inline DDL tracking has been enabled for your workspace, complete these steps in each PostgreSQL database:

  1. Download and run the lakeflow_pg_ddl_change_tracking.sql script:

    \i lakeflow_pg_ddl_change_tracking.sql
    
  2. Verify that the triggers and audit table were created successfully:

    -- Check for the DDL audit table
    SELECT * FROM pg_tables WHERE tablename = 'lakeflow_ddl_audit';
    
    -- Check for the event triggers
    SELECT * FROM pg_event_trigger WHERE evtname LIKE 'lakeflow%';
    
  3. Add the DDL audit table to your publication:

    ALTER PUBLICATION databricks_publication ADD TABLE public.lakeflow_ddl_audit;
    

Cloud-specific configuration notes

AWS RDS and Aurora

  • Ensure that the rds.logical_replication parameter is set to 1 in the parameter group.

  • Configure security groups to allow connections from the Databricks workspace.

  • The replication user requires the rds_replication role:

    GRANT rds_replication TO databricks_replication;
    

Azure Database for PostgreSQL

  • Enable logical replication in the server parameters through the Azure portal or CLI.
  • Configure firewall rules to allow connections from the Databricks workspace.
  • For Flexible Server, logical replication is supported. For Single Server, ensure you're using a supported tier.
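
For example, on Flexible Server you can set the parameter with the Azure CLI (a sketch; <resource-group> and <server-name> are placeholders, and wal_level changes require a server restart):

az postgres flexible-server parameter set --resource-group <resource-group> --server-name <server-name> --name wal_level --value logical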

GCP Cloud SQL for PostgreSQL

  • Enable the cloudsql.logical_decoding flag in the instance settings.
  • Configure authorized networks to allow connections from the Databricks workspace.
  • Ensure that the cloudsql.enable_pglogical flag is set to on if using pglogical extensions.
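
For example, you can set the flag with the gcloud CLI (a sketch; INSTANCE_NAME is a placeholder, and note that --database-flags replaces all flags currently set on the instance, so include any you want to keep):

gcloud sql instances patch INSTANCE_NAME --database-flags=cloudsql.logical_decoding=on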

Verify the configuration

After completing the setup tasks, verify that logical replication is properly configured:

  1. Check that the wal_level is set to logical:

    SHOW wal_level;
    
  2. Verify that the replication user has the replication privilege:

    SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'databricks_replication';
    
  3. Confirm that the publication exists:

    SELECT * FROM pg_publication WHERE pubname = 'databricks_publication';
    
  4. Verify that the replication slot exists:

    SELECT slot_name, slot_type, active, restart_lsn
    FROM pg_replication_slots
    WHERE slot_name = 'databricks_slot';
    
  5. Check the replica identity for your tables:

    SELECT n.nspname AS schemaname, c.relname AS tablename, c.relreplident
    FROM pg_class c
    JOIN pg_namespace n ON c.relnamespace = n.oid
    WHERE n.nspname = 'your_schema' AND c.relkind = 'r';


    The relreplident column shows d for DEFAULT replica identity and f for FULL.

Next steps

After completing the source setup, you can create an ingestion gateway and pipeline to ingest data from PostgreSQL. See Ingest data from PostgreSQL.