Important
The PostgreSQL connector for Lakeflow Connect is in Public Preview. Reach out to your Databricks account team to enroll in the Public Preview.
This page describes the source setup tasks for ingestion from PostgreSQL into Azure Databricks using Lakeflow Connect.
Logical replication for change data capture
The PostgreSQL connector uses logical replication to track changes in the source tables. Logical replication allows the connector to capture data modifications (inserts, updates, and deletes) without requiring triggers or significant overhead on the source database.
Logical replication for the Lakeflow Connect PostgreSQL connector requires the following:

- Lakeflow Connect supports data replication from PostgreSQL version 13 and later.
- Configure the database for logical replication:
  - The PostgreSQL parameter wal_level must be set to logical.
  - Create publications that include all tables you want to replicate.
  - Create replication slots for each catalog that will be replicated.
Note
Publications must be created before creating replication slots.
For more information about logical replication, see the Logical Replication documentation on the PostgreSQL website.
Overview of source setup tasks
Complete the following tasks in PostgreSQL before ingesting data into Azure Databricks:
- Verify that you're running PostgreSQL 13 or above. A version check sketch follows this list.
- Configure network access (security groups, firewall rules, or VPN).
- Configure logical replication:
  - Enable logical replication (wal_level = logical).
  - Create a replication user with the required privileges. See PostgreSQL database user requirements.
  - Set the replica identity for tables. See Set replica identity for tables.
  - Create publications and replication slots.
- Optional: Configure inline DDL tracking for automatic schema change detection. If you want to use inline DDL tracking, reach out to Databricks Support.
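If you want to confirm the version requirement from a SQL client first, a minimal check (the 130000 threshold corresponds to PostgreSQL 13):

-- Show the full server version string
SELECT version();

-- Or compare the numeric version; 130000 or higher means PostgreSQL 13+
SELECT current_setting('server_version_num')::int >= 130000 AS meets_minimum_version;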
Important
If you plan to replicate from multiple PostgreSQL databases, you must create a separate publication and replication slot for each database. The inline DDL tracking script (if used) must also be executed in each database.
Configure logical replication
To enable logical replication in PostgreSQL, configure the database settings and set up the necessary objects.
Set the WAL level to logical
The Write-Ahead Log (WAL) must be configured for logical replication. This setting typically requires a database restart.
Check the current wal_level setting:

SHOW wal_level;

If the value is not logical, set wal_level = logical in the server configuration and restart the PostgreSQL service.
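If you manage the server yourself and have superuser access, one way to change the setting is with ALTER SYSTEM (a sketch; managed cloud services usually expose this through their parameter or flag settings instead):

-- Requires superuser; persists the setting to postgresql.auto.conf
ALTER SYSTEM SET wal_level = logical;
-- The new value takes effect only after the PostgreSQL service restarts.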
Create a replication user
Create a dedicated user for Databricks ingestion with replication privileges:
-- Create a dedicated user for the Databricks connector
CREATE USER databricks_replication WITH PASSWORD 'your_secure_password';
-- Allow the user to connect to the source database
GRANT CONNECT ON DATABASE your_database TO databricks_replication;
-- Grant the REPLICATION attribute so the user can consume logical replication slots
ALTER USER databricks_replication WITH REPLICATION;
For detailed privilege requirements, see PostgreSQL database user requirements.
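Depending on the privilege requirements linked above, the connector's initial load typically also needs read access to the schemas and tables being ingested. A sketch, where schema_name is a placeholder for your source schema:

-- Allow the replication user to read the source schema for the initial snapshot
GRANT USAGE ON SCHEMA schema_name TO databricks_replication;
GRANT SELECT ON ALL TABLES IN SCHEMA schema_name TO databricks_replication;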
Set replica identity for tables
For each table you want to replicate, configure the replica identity. The correct setting depends on the table structure:
| Table structure | Required REPLICA IDENTITY | Command |
|---|---|---|
| Table has a primary key and does not contain TOASTable columns (for example, TEXT, BYTEA, or VARCHAR(n) with large values) | DEFAULT | ALTER TABLE schema_name.table_name REPLICA IDENTITY DEFAULT; |
| Table has a primary key but includes large variable-length (TOASTable) columns | FULL | ALTER TABLE schema_name.table_name REPLICA IDENTITY FULL; |
| Table does not have a primary key | FULL | ALTER TABLE schema_name.table_name REPLICA IDENTITY FULL; |
For more information about replica identity settings, see Replica Identity in the PostgreSQL documentation.
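To decide which tables need REPLICA IDENTITY FULL, it can help to list the tables in a schema that have no primary key. A sketch, where schema_name is a placeholder for your source schema:

-- List ordinary tables in a schema that have no primary key
SELECT c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'schema_name'
  AND c.relkind = 'r'
  AND NOT EXISTS (
    SELECT 1 FROM pg_constraint con
    WHERE con.conrelid = c.oid AND con.contype = 'p'
  );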
Create a publication
Create a publication in each database that includes the tables you want to replicate:
-- Create a publication for specific tables
CREATE PUBLICATION databricks_publication FOR TABLE schema_name.table1, schema_name.table2;
-- Or create a publication for all tables in a database
CREATE PUBLICATION databricks_publication FOR ALL TABLES;
Note
You must create a separate publication in each PostgreSQL database that you want to replicate.
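To confirm what a publication covers, or to widen it later, you can query pg_publication_tables and alter the publication. A sketch, where schema_name.table3 is a placeholder:

-- List the tables currently included in the publication
SELECT * FROM pg_publication_tables WHERE pubname = 'databricks_publication';

-- Add another table to an existing publication
ALTER PUBLICATION databricks_publication ADD TABLE schema_name.table3;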
Configure replication slot parameters
Before creating replication slots, configure the following server parameters:
Limit WAL retention for replication slots
Parameter: max_slot_wal_keep_size
It is recommended not to set max_slot_wal_keep_size to -1 (the default value), as this allows unbounded WAL bloat due to retention by lagging or inactive replication slots. Depending on your workload, set this parameter to a finite value.
Learn more about the max_slot_wal_keep_size parameter in the official PostgreSQL documentation.
Note
Some managed cloud providers do not allow modification of this parameter and instead rely on built-in slot monitoring and auto-cleanup. Review the platform behavior before setting operational alerts.
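Where the platform allows it, a sketch of checking and capping slot WAL retention (the 10GB value is illustrative; size it for your workload):

-- Check the current retention limit (-1 means unbounded)
SHOW max_slot_wal_keep_size;

-- Cap the WAL that replication slots may retain; takes effect after a configuration reload
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();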
Configure replication slot capacity
Parameter: max_replication_slots
Each PostgreSQL database being replicated requires one logical replication slot. Set this parameter to at least the number of databases being replicated, plus any existing replication needs.
Configure WAL senders
Parameter: max_wal_senders
This parameter defines the maximum number of concurrent WAL sender processes that stream WAL data to subscribers. In most cases, you should have one WAL sender process for each replication slot to ensure efficient and consistent data replication.
Configure max_wal_senders to be at least equal to the number of replication slots in use, accounting for any other existing usage. It is recommended to set it slightly higher to provide operational flexibility.
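A sketch of comparing the current limits against usage before the ingestion gateway connects (both parameters require a server restart to change, and the values shown are illustrative):

-- Current limits
SHOW max_replication_slots;
SHOW max_wal_senders;

-- Slots and WAL sender processes already in use
SELECT count(*) AS slots_in_use FROM pg_replication_slots;
SELECT count(*) AS wal_senders_in_use FROM pg_stat_replication;

-- Raise the limits if needed; the new values apply after a restart
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 10;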
Create a replication slot
In each database, create a replication slot that the Databricks ingestion gateway will use to track changes:
-- Create a replication slot with the pgoutput plugin
SELECT pg_create_logical_replication_slot('databricks_slot', 'pgoutput');
Important
- Replication slots hold WAL data until consumed by the connector. Configure the max_slot_wal_keep_size parameter to limit WAL retention and prevent unbounded WAL growth. See Configure replication slot parameters for details.
- When you delete an ingestion pipeline, you must manually drop the associated replication slot. See Clean up replication slots.
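When you clean up, note that a slot can be dropped only while no consumer is attached to it. A sketch of dropping the slot created earlier:

-- Drop the replication slot after the ingestion pipeline that used it is deleted
SELECT pg_drop_replication_slot('databricks_slot');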
Optional: Configure inline DDL tracking
Inline DDL tracking is an optional feature that allows the connector to automatically detect and apply schema changes from the source database. This feature is disabled by default.
Warning
Inline DDL tracking is currently in preview and requires contacting Databricks Support to enable it for your workspace.
For information about which schema changes are handled automatically and which require a full refresh, see How do managed connectors handle schema evolution? and Schema evolution.
Set up inline DDL tracking
If inline DDL tracking has been enabled for your workspace, complete these steps in each PostgreSQL database:
1. Download and run the lakeflow_pg_ddl_change_tracking.sql script:

   \i lakeflow_pg_ddl_change_tracking.sql

2. Verify that the triggers and audit table were created successfully:

   -- Check for the DDL audit table
   SELECT * FROM pg_tables WHERE tablename = 'lakeflow_ddl_audit';

   -- Check for the event triggers
   SELECT * FROM pg_event_trigger WHERE evtname LIKE 'lakeflow%';

3. Add the DDL audit table to your publication:

   ALTER PUBLICATION databricks_publication ADD TABLE public.lakeflow_ddl_audit;
Cloud-specific configuration notes
AWS RDS and Aurora
- Ensure that the rds.logical_replication parameter is set to 1 in the parameter group.
- Configure security groups to allow connections from the Databricks workspace.
- The replication user requires the rds_replication role:

  GRANT rds_replication TO databricks_replication;
Azure Database for PostgreSQL
- Enable logical replication in the server parameters through the Azure portal or CLI.
- Configure firewall rules to allow connections from the Databricks workspace.
- For Flexible Server, logical replication is supported. For Single Server, ensure you're using a supported tier.
GCP Cloud SQL for PostgreSQL
- Enable the cloudsql.logical_decoding flag in the instance settings.
- Configure authorized networks to allow connections from the Databricks workspace.
- Ensure that the cloudsql.enable_pglogical flag is set to on if using pglogical extensions.
Verify the configuration
After completing the setup tasks, verify that logical replication is properly configured:
1. Check that the wal_level is set to logical:

   SHOW wal_level;

2. Verify that the replication user has the replication privilege:

   SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'databricks_replication';

3. Confirm that the publication exists:

   SELECT * FROM pg_publication WHERE pubname = 'databricks_publication';

4. Verify that the replication slot exists:

   SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots WHERE slot_name = 'databricks_slot';

5. Check the replica identity for your tables:

   SELECT schemaname, tablename, relreplident
   FROM pg_tables t
   JOIN pg_class c ON t.tablename = c.relname
   WHERE schemaname = 'your_schema';

   The relreplident column should show f for FULL replica identity.
Next steps
After completing the source setup, you can create an ingestion gateway and pipeline to ingest data from PostgreSQL. See Ingest data from PostgreSQL.