PostgreSQL connector FAQs

Important

The PostgreSQL connector for Lakeflow Connect is in Public Preview. Reach out to your Databricks account team to enroll in the Public Preview.

This page answers frequently asked questions about the PostgreSQL connector in Databricks Lakeflow Connect.

General managed connector FAQs

The answers in Managed connector FAQs apply to all managed connectors in Lakeflow Connect. Keep reading for connector-specific FAQs.

How does Databricks connect to PostgreSQL?

Databricks connects to PostgreSQL using Transport Layer Security (TLS) and a JDBC connection. Credentials are stored securely in Unity Catalog and can only be retrieved if the user who runs the ingestion flow has the appropriate permissions. Databricks recommends creating a separate replication user in PostgreSQL for ingesting data. If there are databases or tables that you do not want to expose to this user, you can restrict its access using built-in PostgreSQL permissions.
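
For example, the following is a minimal sketch of creating a dedicated replication user on a self-managed PostgreSQL instance and granting it read access only to the objects you want to ingest. The role, schema, and table names are placeholders; managed services such as Amazon RDS typically grant replication through a service-specific role instead of the REPLICATION attribute.

```sql
-- Placeholder role and object names; adjust to your environment.
CREATE ROLE lakeflow_replication WITH LOGIN REPLICATION PASSWORD 'choose-a-strong-password';

-- Grant read access only to the schema and tables you want to ingest.
GRANT USAGE ON SCHEMA sales TO lakeflow_replication;
GRANT SELECT ON sales.orders, sales.customers TO lakeflow_replication;
```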

If the pipeline fails, does ingestion resume without data loss?

Yes. Databricks keeps track of what the connector has extracted from the source and applied in the destination. If the pipeline fails, Databricks can resume from that point as long as the replication slot and the Write-Ahead Log (WAL) data remain available on the source database. If the pipeline does not run before the WAL retention period expires or the replication slot limits are reached, the connector can no longer resume, and a full refresh of the target tables is required.

Which PostgreSQL variations does the connector support?

The connector supports Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL, PostgreSQL on Amazon EC2, Azure Database for PostgreSQL, PostgreSQL on Azure virtual machines, and GCP Cloud SQL for PostgreSQL. The connector also supports on-premises PostgreSQL over Azure ExpressRoute, AWS Direct Connect, or VPN, provided that sufficient bandwidth is available.

How does the connector incrementally pull data?

The connector uses PostgreSQL logical replication with the pgoutput plugin. Logical replication captures all data modification operations (inserts, updates, and deletes) through the Write-Ahead Log without significant performance impact on the source database.
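
If you want to confirm that changes are flowing through logical replication, you can inspect the replication slots on the source database. The exact slot name depends on your pipeline; this query is a sketch of what to look for.

```sql
-- Inspect logical replication slots on the source database.
-- The plugin column shows pgoutput, and confirmed_flush_lsn shows how far
-- the consumer (the ingestion gateway) has acknowledged changes.
SELECT slot_name, plugin, slot_type, active, confirmed_flush_lsn
FROM pg_replication_slots;
```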

Does the connector capture time zones for date and time columns?

The connector preserves the time zone information for TIMESTAMP WITH TIME ZONE columns. TIMESTAMP WITHOUT TIME ZONE and TIME columns are ingested as strings in their original format without time zone conversion.
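
As an illustration, consider the following hypothetical table; the comments describe how each column type is handled.

```sql
-- Hypothetical table illustrating the behavior described above.
CREATE TABLE events (
    occurred_at_tz TIMESTAMP WITH TIME ZONE,     -- time zone information is preserved
    occurred_at    TIMESTAMP WITHOUT TIME ZONE,  -- ingested as a string, no time zone conversion
    occurred_time  TIME                          -- ingested as a string, no time zone conversion
);
```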

Can I customize the schedule of the ingestion gateway?

No, the ingestion gateway must run in continuous mode. This is critical for PostgreSQL to prevent Write-Ahead Log (WAL) bloat and ensure that replication slots do not accumulate unconsumed changes. If the gateway is stopped for an extended period, the replication slot can cause WAL files to accumulate on the source database, potentially filling up the disk space.
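
To check how much WAL a replication slot is holding back while the gateway is stopped, you can run a query along these lines on the source database.

```sql
-- Show how much WAL each replication slot is preventing from being removed.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```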

How does the connector handle a table without a primary key?

The connector can replicate tables without a primary key if the table's replica identity is set to FULL. In this case, the connector treats all columns except large object columns as a composite primary key. If the source table contains duplicate rows, those rows are ingested as a single row in the destination table unless you enable history tracking.
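
For a table without a primary key, you can set the replica identity on the source table as follows; the table name is a placeholder.

```sql
-- Required for logical replication of UPDATE and DELETE operations
-- on tables that have no primary key.
ALTER TABLE public.events_without_pk REPLICA IDENTITY FULL;
```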

How often can I schedule the ingestion pipeline to run?

There is no limit on how often you can schedule the ingestion pipeline to run. However, Databricks recommends an interval of at least 5 minutes between runs because serverless compute takes some time to start up. Databricks does not support running the ingestion pipeline in continuous mode.

Why am I not seeing all of the rows from my database in the initial pipeline run?

The ingestion gateway extracts historical and CDC data as soon as it starts running. The ingestion pipeline might run before all of this data has been extracted, which results in only part of the data being applied to the target tables. It can take a few pipeline runs before all of the data is extracted and applied to the target tables.

Can I ingest from a read replica or a standby instance?

No. Support is limited to primary PostgreSQL instances because logical replication is not supported on read replicas or standby instances.

What happens to the replication slot when I delete a pipeline?

When you delete an ingestion pipeline, the replication slot is not automatically removed from the source PostgreSQL database. You must drop the replication slot manually to prevent Write-Ahead Log (WAL) accumulation. See Clean up replication slots for instructions.
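
A minimal sketch of finding and dropping a leftover slot on the source database follows; the slot name is a placeholder, and you should confirm that the slot belongs to the deleted pipeline before dropping it.

```sql
-- List inactive logical replication slots.
SELECT slot_name, plugin, active FROM pg_replication_slots WHERE NOT active;

-- Drop the slot that belonged to the deleted pipeline (placeholder name).
SELECT pg_drop_replication_slot('lakeflow_pipeline_slot');
```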

What PostgreSQL version is required?

PostgreSQL 13 or above is required.
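
You can check the version of the source database with the following query.

```sql
-- Returns the server version, for example "15.4".
SHOW server_version;
```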

Is wal_level = logical mandatory for CDC ingestion?

Yes. The wal_level parameter must be set to logical to enable logical replication.
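
You can verify the setting and, on a self-managed instance, change it as shown below. Managed services such as Amazon RDS and Azure Database for PostgreSQL expose this through their own server parameters instead.

```sql
-- Check the current setting.
SHOW wal_level;

-- On a self-managed instance, enable logical replication
-- (requires a server restart to take effect).
ALTER SYSTEM SET wal_level = logical;
```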

Can I replicate tables from multiple PostgreSQL databases in one pipeline?

Yes. You can specify multiple source databases in the source_catalog field of the ingestion_definition. However, each source database requires its own Unity Catalog connection and publication configuration.
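
For example, on each source database you might create a publication covering the tables you plan to ingest. The publication name and table list are placeholders; the exact publication configuration the connector expects depends on your setup.

```sql
-- Run on each source database; publication and table names are placeholders.
CREATE PUBLICATION lakeflow_publication
    FOR TABLE sales.orders, sales.customers;
```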

How many tables can I ingest in a single pipeline?

Databricks recommends ingesting 250 or fewer tables per pipeline for optimal performance. However, there is no hard limit on the number of rows or columns supported within those tables.

Does the connector support user-defined types and extensions?

The connector supports most PostgreSQL data types, including arrays and JSONB. User-defined types and third-party extension types are ingested as strings. See PostgreSQL connector reference for a complete list of supported type mappings.
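
For example, values of a user-defined enum type like the following hypothetical one would arrive in the destination table as strings.

```sql
-- Hypothetical user-defined type; its values are ingested as strings.
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id     BIGINT PRIMARY KEY,
    status order_status
);
```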