Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
General limitations
- Query-based connectors query the source on a schedule. They don't provide continuous ingestion. If you need lower latency, use a managed CDC database connector.
- The cursor column must be a single column. Composite cursors (multiple columns combined) aren't supported. The column value must be monotonically increasing. Rows with cursor values at or below the stored high-water mark aren't reingested.
- Rows with a NULL cursor column are not ingested.
Source table requirements and recommendations
Consider the following source table characteristics when you design a query-based ingestion workload. They affect how a table is ingested and how well the workload performs.
Primary key, unique key, and cursor
Query-based connectors use a primary key (PK) or unique key (UK) to identify rows and a cursor column to detect changes. The ingestion behavior depends on which of these are available on the source table:
| Source table | Ingestion behavior |
|---|---|
| PK or UK with a cursor | Regular snapshot and incremental ingestion. |
| PK or UK without a cursor | Batch snapshot mode (APPLY CHANGES FROM SNAPSHOT). The connector re-reads the full table on each run, so this mode is best suited to smaller tables. |
| No PK or UK, cursor available | The connector generates a synthetic key from all columns. This works for tables with no duplicate rows. For tables with fully duplicate rows, use APPEND_ONLY storage mode. |
| No PK or UK, no cursor | The connector generates a synthetic key from all columns. Use APPEND_ONLY storage mode with any column as the cursor, and always full refresh. For tables with fully duplicate rows, APPEND_ONLY is required because the synthetic key can't distinguish identical rows. |
Partitioning and indexing recommendations
Partitioning column: The connector automatically selects a column to partition the source query for parallel reads. It first tries the leading primary key column, provided it's a numeric, string, date, or timestamp type. If that column isn't suitable, it selects a unique key column instead. For balanced partitioning, the column should have:
- High cardinality: At least 1,000 distinct values is recommended for tables approaching 1 TB.
- Low data skew: Avoid columns whose values cluster heavily around a few values. Skewed columns create oversized partitions that can cause query timeouts and lose progress when a run restarts.
If the automatically selected column doesn't meet these criteria, contact your Databricks account team to request a partitioning column override.
Indexed cursor column: Index the cursor column on the source database. An unindexed cursor column degrades incremental query performance.
API-only features
The following features are supported for query-based connectors, but only using the API:
- Row filtering
- Soft-deletion tracking (
deletion_condition) - Hard-deletion tracking (Beta)
- Append-only ingestion (
APPEND_ONLYSCD mode) - Multi-destination catalog and schema