Query-based connector limitations

General limitations

  • Query-based connectors query the source on a schedule. They don't provide continuous ingestion. If you need lower latency, use a managed CDC database connector.
  • The cursor column must be a single column. Composite cursors (multiple columns combined) aren't supported. The column value must be monotonically increasing. Rows with cursor values at or below the stored high-water mark aren't reingested.
  • Rows with a NULL cursor column are not ingested.

Source table requirements and recommendations

Consider the following source table characteristics when you design a query-based ingestion workload. They affect how a table is ingested and how well the workload performs.

Primary key, unique key, and cursor

Query-based connectors use a primary key (PK) or unique key (UK) to identify rows and a cursor column to detect changes. The ingestion behavior depends on which of these are available on the source table:

Source table Ingestion behavior
PK or UK with a cursor Regular snapshot and incremental ingestion.
PK or UK without a cursor Batch snapshot mode (APPLY CHANGES FROM SNAPSHOT). The connector re-reads the full table on each run, so this mode is best suited to smaller tables.
No PK or UK, cursor available The connector generates a synthetic key from all columns. This works for tables with no duplicate rows. For tables with fully duplicate rows, use APPEND_ONLY storage mode.
No PK or UK, no cursor The connector generates a synthetic key from all columns. Use APPEND_ONLY storage mode with any column as the cursor, and always full refresh. For tables with fully duplicate rows, APPEND_ONLY is required because the synthetic key can't distinguish identical rows.

Partitioning and indexing recommendations

  • Partitioning column: The connector automatically selects a column to partition the source query for parallel reads. It first tries the leading primary key column, provided it's a numeric, string, date, or timestamp type. If that column isn't suitable, it selects a unique key column instead. For balanced partitioning, the column should have:

    • High cardinality: At least 1,000 distinct values is recommended for tables approaching 1 TB.
    • Low data skew: Avoid columns whose values cluster heavily around a few values. Skewed columns create oversized partitions that can cause query timeouts and lose progress when a run restarts.

    If the automatically selected column doesn't meet these criteria, contact your Databricks account team to request a partitioning column override.

  • Indexed cursor column: Index the cursor column on the source database. An unindexed cursor column degrades incremental query performance.

API-only features

The following features are supported for query-based connectors, but only using the API:

  • Row filtering
  • Soft-deletion tracking (deletion_condition)
  • Hard-deletion tracking (Beta)
  • Append-only ingestion (APPEND_ONLY SCD mode)
  • Multi-destination catalog and schema