NEAREST BY clause

Applies to: Databricks Runtime 18.3 and above

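A top-k ranking extension to JOIN over a custom distance or similarity expression. For each row in the query (left) table_reference, it finds up to num_results top-matching rows from the target (right) table according to ranking_expression and returns them as joined rows.

ranking_expression can be any orderable scalar expression that scores a pair of rows from the two tables, for example vector_cosine_similarity, vector_l2_distance, vector_inner_product, or a composite expression that combines multiple functions.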

Syntax

{ INNER | LEFT [ OUTER ] } JOIN target_table_reference
  { APPROX | EXACT } NEAREST [ num_results ]
  BY { DISTANCE | SIMILARITY } ranking_expression

Parameters

target_table_reference

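The target table to search. Can be a table, a subquery, or a CTE.
{ INNER | LEFT [ OUTER ] }

Optional. The join type. The default is INNER.
- INNER discards query rows that have no matching candidates.
- LEFT OUTER returns every query row. The target-side columns are NULL when no candidates exist, for example when the target table is empty or ranking_expression evaluates to NULL for every candidate. If a query row has fewer than num_results candidates, only the available candidates are returned.
Other join types (RIGHT, FULL, SEMI, ANTI, CROSS, NATURAL) raise NEAREST_BY_JOIN.UNSUPPORTED_JOIN_TYPE.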
{ APPROX | EXACT }

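Controls the result set contract.
- EXACT returns the exact top-k rows under ranking_expression.
- APPROX returns a top-k set that approximates the exact ranking. The optimizer may use a faster, approximate search strategy instead of evaluating every candidate.
NEAREST [ num_results ]

An optional positive integer literal. The default is 1. Must be in the range [1, 100000]. If the target table has fewer matching rows than num_results, only the available rows are returned.

Values outside this range raise NEAREST_BY_JOIN.NUM_RESULTS_OUT_OF_RANGE.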
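
As a minimal sketch of the num_results default, the query below omits the number after NEAREST, so it should return at most one product per user. It assumes the users and products views defined under Examples; the aliases q and t are only illustrative.

```sql
-- NEAREST without a number: num_results defaults to 1.
-- Carol's zero-magnitude embedding yields a NULL similarity for every
-- product, so the INNER join is expected to drop her row.
SELECT q.user_id, q.name, t.product_id, t.name AS product
  FROM users q
  INNER JOIN products t
    EXACT NEAREST BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
```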
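BY { DISTANCE | SIMILARITY }

Sets the sort order applied to ranking_expression.
- DISTANCE orders rows by smallest value first (nearest = smallest distance).
- SIMILARITY orders rows by largest value first (nearest = highest similarity).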
ranking_expression

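A scalar expression that can reference columns from both tables.

Common choices include:
- Similarity functions such as vector_cosine_similarity and vector_inner_product.
- Distance functions such as vector_l2_distance.
- Numeric distances such as the Manhattan distance: vector_norm(zip_with(a.col, b.col, (x, y) -> x - y), 1.0f).
If the expression returns a data type that does not support ordering, such as MAP, Azure Databricks raises DATATYPE_MISMATCH.INVALID_ORDERING_TYPE.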
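
To illustrate a metric that is not a single built-in function, the sketch below ranks products by Manhattan (L1) distance using the vector_norm and zip_with pattern from the list above. It assumes the users and products views defined under Examples. Because the expression is a distance, the query uses BY DISTANCE so that smaller values rank first.

```sql
-- For each user, the single product with the smallest L1 distance.
-- zip_with builds the element-wise difference; vector_norm(..., 1.0f)
-- sums the absolute values of that difference.
SELECT q.user_id, q.name, t.product_id, t.name AS product
  FROM users q
  INNER JOIN products t
    EXACT NEAREST 1 BY DISTANCE
      vector_norm(zip_with(q.embedding, t.embedding, (x, y) -> x - y), 1.0f);
```

Unlike cosine similarity, this distance is defined for a zero-magnitude vector, so a row such as Carol's would be expected to receive a match here.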

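Notes

Asymmetry

NEAREST BY is not commutative. The query side anchors the result: each query row produces at most num_results output rows.

- When 100 rows from users are joined with NEAREST 5 to 1,000 rows from products, the join returns at most 500 rows.
- If you swap the sides and join products to users, the join can return up to 5,000 rows.

Swapping the sides asks a different question, so the results differ even for an INNER JOIN.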
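
The two queries below sketch this asymmetry using the users and products views defined under Examples. Only the sides are swapped; the ranking expression is the same.

```sql
-- Query side = users: at most 1 output row per user (up to 3 rows).
SELECT q.name AS user_name, t.name AS product
  FROM users q
  INNER JOIN products t
    EXACT NEAREST 1 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);

-- Query side = products: at most 1 output row per product (up to 5 rows).
SELECT q.name AS product, t.name AS user_name
  FROM products q
  INNER JOIN users t
    EXACT NEAREST 1 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
```

The first query answers "which product is closest to each user?", while the second answers "which user is closest to each product?".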

Streaming

NEAREST BY does not support streaming DataFrames or Datasets. A query against a streaming source raises NEAREST_BY_JOIN.STREAMING_NOT_SUPPORTED.

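Embedding inputs

When you use vector scoring functions, both vector arguments must be ARRAY<FLOAT> values of the same dimension. See the vector_cosine_similarity function for the type and NULL handling rules.

To compute embeddings from string values, use a Databricks-hosted embedding model such as databricks-gte-large-en with ai_query.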
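
A sketch of how this might look, assuming a serving endpoint named databricks-gte-large-en is available in your workspace and that ai_query returns its embedding as ARRAY<FLOAT>. The search text, the CTE names, and the choice to embed the name column of the products view (defined under Examples) are illustrative only.

```sql
-- Assumption: ai_query(endpoint, request) returns an ARRAY<FLOAT> embedding
-- for this endpoint. Both sides use the same model so the dimensions match.
WITH query_vec AS (
  SELECT ai_query('databricks-gte-large-en', 'shoes for mountain trails') AS embedding
),
product_vecs AS (
  SELECT product_id, name,
         ai_query('databricks-gte-large-en', name) AS embedding
    FROM products
)
SELECT t.product_id, t.name
  FROM query_vec q
  INNER JOIN product_vecs t
    APPROX NEAREST 3 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
```

In practice you would likely compute and store the target embeddings once rather than calling the model for every query.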

Common error conditions

DATATYPE_MISMATCH.INVALID_ORDERING_TYPE
NEAREST_BY_JOIN.NUM_RESULTS_OUT_OF_RANGE
NEAREST_BY_JOIN.STREAMING_NOT_SUPPORTED
NEAREST_BY_JOIN.UNSUPPORTED_JOIN_TYPE
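
The statements below sketch how three of these conditions could be triggered, based on the rules stated under Parameters. Each is expected to fail rather than return rows. They assume the users and products views defined under Examples.

```sql
-- RIGHT is not a supported join type:
-- expected NEAREST_BY_JOIN.UNSUPPORTED_JOIN_TYPE.
SELECT q.user_id, t.product_id
  FROM users q
  RIGHT JOIN products t
    EXACT NEAREST 2 BY DISTANCE vector_l2_distance(q.embedding, t.embedding);

-- 100001 is outside the range [1, 100000]:
-- expected NEAREST_BY_JOIN.NUM_RESULTS_OUT_OF_RANGE.
SELECT q.user_id, t.product_id
  FROM users q
  INNER JOIN products t
    EXACT NEAREST 100001 BY DISTANCE vector_l2_distance(q.embedding, t.embedding);

-- A MAP value cannot be ordered:
-- expected DATATYPE_MISMATCH.INVALID_ORDERING_TYPE.
SELECT q.user_id, t.product_id
  FROM users q
  INNER JOIN products t
    EXACT NEAREST 2 BY DISTANCE
      map('d', vector_l2_distance(q.embedding, t.embedding));
```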

Examples

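The following examples use these tables. For brevity, the embeddings are three-dimensional vectors; in practice they have many more dimensions and are computed by an embedding model.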

> CREATE TEMP VIEW users(user_id, name, embedding) AS
    VALUES
      (1, 'Alice', ARRAY(1.0f, 0.0f, 0.0f)),
      (2, 'Bob',   ARRAY(0.0f, 1.0f, 0.0f)),
      (3, 'Carol', ARRAY(0.0f, 0.0f, 0.0f));

> CREATE TEMP VIEW products(product_id, name, price, country, embedding) AS
    VALUES
      ('P1', 'Trail running shoes', 120, 'EU', ARRAY(0.9f, 0.1f, 0.1f)),
      ('P2', 'Hiking boots',        180, 'EU', ARRAY(0.8f, 0.2f, 0.0f)),
      ('P3', 'Office shoes',         95, 'US', ARRAY(0.1f, 0.9f, 0.1f)),
      ('P4', 'Sandals',              45, 'US', ARRAY(0.0f, 0.8f, 0.2f)),
      ('P5', 'Running shoes',       110, 'EU', ARRAY(0.5f, 0.5f, 0.0f));

-- Ad-hoc vector search with an explicit query vector.
> SELECT t.product_id, t.name
    FROM (SELECT ARRAY(1.0f, 0.0f, 0.0f) AS embedding) q
    INNER JOIN products t
      APPROX NEAREST 3 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
 product_id  name
 ----------  -------------------
 P1          Trail running shoes
 P2          Hiking boots
 P5          Running shoes

-- Batch recommendations: for every user, return the 2 nearest products.
> SELECT q.user_id, q.name, t.product_id, t.name AS product
    FROM users q
    INNER JOIN products t
      APPROX NEAREST 2 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
 user_id  name   product_id  product
 -------  -----  ----------  -------------------
 1        Alice  P1          Trail running shoes
 1        Alice  P2          Hiking boots
 2        Bob    P3          Office shoes
 2        Bob    P4          Sandals

-- Pre-filter the target table via a subquery (EU products only).
> SELECT q.user_id, q.name, t.product_id, t.name AS product, t.price
    FROM users q
    INNER JOIN (SELECT * FROM products WHERE country = 'EU') AS t
      APPROX NEAREST 2 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
 user_id  name   product_id  product              price
 -------  -----  ----------  -------------------  -----
 1        Alice  P1          Trail running shoes  120
 1        Alice  P2          Hiking boots         180
 2        Bob    P5          Running shoes        110
 2        Bob    P2          Hiking boots         180

-- LEFT OUTER returns every query row. Carol's embedding has zero magnitude,
-- so vector_cosine_similarity returns NULL for all comparisons and her row
-- is preserved with NULL target columns.
> SELECT q.user_id, q.name, t.product_id, t.name AS product
    FROM users q
    LEFT OUTER JOIN products t
      APPROX NEAREST 2 BY SIMILARITY vector_cosine_similarity(q.embedding, t.embedding);
 user_id  name   product_id  product
 -------  -----  ----------  -------------------
 1        Alice  P1          Trail running shoes
 1        Alice  P2          Hiking boots
 2        Bob    P3          Office shoes
 2        Bob    P4          Sandals
 3        Carol  NULL        NULL

-- EXACT returns the exact top-k under the ranking expression.
> SELECT t.product_id, t.name
    FROM (SELECT ARRAY(1.0f, 0.0f, 0.0f) AS embedding) q
    INNER JOIN products t
      EXACT NEAREST 3 BY DISTANCE vector_l2_distance(q.embedding, t.embedding);
 product_id  name
 ----------  -------------------
 P1          Trail running shoes
 P2          Hiking boots
 P5          Running shoes
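
Because ranking_expression can be a composite expression, a query can blend vector similarity with other columns. The sketch below subtracts a small price penalty from the cosine similarity, using the users and products views defined above. The 0.001 weight is an arbitrary illustrative value, not a recommended setting.

```sql
-- Hypothetical scoring: favor similar products, slightly penalize price.
SELECT q.user_id, q.name, t.product_id, t.name AS product, t.price
  FROM users q
  INNER JOIN products t
    EXACT NEAREST 2 BY SIMILARITY
      vector_cosine_similarity(q.embedding, t.embedding) - 0.001 * t.price;
```

BY SIMILARITY still ranks the largest combined score first, so a cheaper product can overtake a slightly more similar but more expensive one.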

Last updated on 2026-06-01