Vector Search external embedding model example (OpenAI)

This notebook shows how to use the Vector Search Python SDK, which provides a VectorSearchClient as the primary API for working with Vector Search.

It uses Databricks support for external models to call an OpenAI embedding model and generate embeddings.

%pip install --upgrade --force-reinstall databricks-vectorsearch tiktoken
dbutils.library.restartPython()

from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient(disable_notice=True)

# Display help for the Vector Search Client
help(VectorSearchClient)

Load a toy dataset into the source Delta table

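The following cell defines the source Delta table name and loads a small sample of Wikipedia articles. The table itself is written after the data is chunked.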

# Specify the catalog and schema to use. You must have the USE CATALOG privilege on the catalog and the USE SCHEMA and CREATE TABLE privileges on the schema.
# Change the catalog and schema here if necessary.

catalog_name = "main"
schema_name = "default"


source_table_name = "wiki_articles_demo"
source_table_fullname = f"{catalog_name}.{schema_name}.{source_table_name}"

# Uncomment the following line if you want to start from scratch.

# spark.sql(f"DROP TABLE IF EXISTS {source_table_fullname}")

source_df = spark.read.parquet("/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet").limit(10)
display(source_df)

Chunk the sample dataset

Chunking the sample dataset helps you avoid exceeding the embedding model's context limit. The OpenAI model supports up to 8192 tokens. However, Databricks recommends splitting the data into smaller context chunks so that you can feed a wide range of examples into the reasoning model for your RAG application.

import tiktoken
import pandas as pd


max_chunk_tokens = 1024
encoding = tiktoken.get_encoding("cl100k_base")


def chunk_text(text):
    # Encode the text to tokens, then decode each fixed-size window back to text
    tokens = encoding.encode(text)
    chunks = []
    while tokens:
        chunk_tokens = tokens[:max_chunk_tokens]
        # Append the decoded window directly; a local named chunk_text
        # would shadow this function's own name
        chunks.append(encoding.decode(chunk_tokens))
        tokens = tokens[max_chunk_tokens:]
    return chunks

# Process the data and store in a new list
pandas_df = source_df.toPandas()
processed_data = []
for index, row in pandas_df.iterrows():
    text_chunks = chunk_text(row['text'])
    chunk_no = 0
    for chunk in text_chunks:
        row_data = row.to_dict()

        # Replace the id column with a new unique chunk id
        # and the text column with the text chunk
        row_data['id'] = f"{row['id']}_{chunk_no}"
        row_data['text'] = chunk

        processed_data.append(row_data)
        chunk_no += 1

chunked_pandas_df = pd.DataFrame(processed_data)
chunked_spark_df = spark.createDataFrame(chunked_pandas_df)

# Write the chunked DataFrame to a Delta table
spark.sql(f"DROP TABLE IF EXISTS {source_table_fullname}")
chunked_spark_df.write.format("delta") \
    .option("delta.enableChangeDataFeed", "true") \
    .saveAsTable(source_table_fullname)

display(spark.sql(f"SELECT * FROM {source_table_fullname}"))
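The windowing and chunk-ID scheme used above can be checked without Spark or tiktoken. The following is a minimal sketch that works on a token count alone; `plan_chunks` is a hypothetical helper name, not part of any SDK, and the document ID "12" is a made-up placeholder.

```python
def plan_chunks(doc_id, token_count, max_chunk_tokens=1024):
    """Return (chunk_id, start, end) token ranges for one document.

    Mirrors the notebook's scheme: fixed-size windows with no overlap,
    and chunk IDs of the form "<original id>_<chunk number>".
    """
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    plan = []
    for chunk_no, start in enumerate(range(0, token_count, max_chunk_tokens)):
        end = min(start + max_chunk_tokens, token_count)
        plan.append((f"{doc_id}_{chunk_no}", start, end))
    return plan


# A 2500-token article yields two full chunks and one partial chunk.
for chunk_id, start, end in plan_chunks("12", 2500):
    print(chunk_id, start, end)
# 12_0 0 1024
# 12_1 1024 2048
# 12_2 2048 2500
```

Because the chunk ID replaces the original `id` column, it also serves as the primary key of the index, so it has to stay unique across all documents.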

Create a vector search endpoint

vector_search_endpoint_name = "vector-search-demo-endpoint"

vsc.create_endpoint(
    name=vector_search_endpoint_name,
    endpoint_type="STANDARD" # or "STORAGE_OPTIMIZED"
)

vsc.get_endpoint(
  name=vector_search_endpoint_name
)

Register the OpenAI embedding model endpoint

For usage details, see the external models documentation on configuring an OpenAI endpoint.

Use the Databricks secret manager to specify credentials.

embedding_model_endpoint_name = "openai-embedding-endpoint"

import mlflow.deployments

mlflow_deploy_client = mlflow.deployments.get_deploy_client("databricks")

# Store the OpenAI API key in the secret manager and provide the
# correct scope and key name below.

mlflow_deploy_client.create_endpoint(
    name=embedding_model_endpoint_name,
    config={
        "served_entities": [{
            "external_model": {
                "name": "text-embedding-ada-002",
                "provider": "openai",
                "task": "llm/v1/embeddings",
                "openai_config": {
                    "openai_api_key": "{{secrets/demo/openai-api-key}}" # CHANGE ME
                }
            }
        }]
    }
)
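The secret reference in the configuration above follows the pattern `{{secrets/<scope>/<key>}}`. As a minimal sketch, the configuration dictionary can be built from a scope and key name so that the reference is not typed by hand. The helper names `secret_ref` and `build_openai_embedding_config` are hypothetical, not part of MLflow or Databricks; the dictionary layout is copied from the cell above.

```python
def secret_ref(scope, key):
    """Format a secret reference as used in the endpoint config above."""
    if not scope or not key:
        raise ValueError("scope and key must be non-empty")
    return "{{secrets/" + scope + "/" + key + "}}"


def build_openai_embedding_config(scope, key, model_name="text-embedding-ada-002"):
    """Build the external-model config dict shown in the cell above."""
    return {
        "served_entities": [{
            "external_model": {
                "name": model_name,
                "provider": "openai",
                "task": "llm/v1/embeddings",
                "openai_config": {
                    "openai_api_key": secret_ref(scope, key),
                },
            }
        }]
    }


config = build_openai_embedding_config("demo", "openai-api-key")
print(config["served_entities"][0]["external_model"]["openai_config"]["openai_api_key"])
# {{secrets/demo/openai-api-key}}
```

The resulting dictionary could then be passed as the `config` argument of `mlflow_deploy_client.create_endpoint`, exactly as in the cell above.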

Create a vector index

# Vector index
vs_index = f"{source_table_name}_openai_index"
vs_index_fullname = f"{catalog_name}.{schema_name}.{vs_index}"

index = vsc.create_delta_sync_index(
  endpoint_name=vector_search_endpoint_name,
  source_table_name=source_table_fullname,
  index_name=vs_index_fullname,
  pipeline_type='TRIGGERED',
  primary_key="id",
  embedding_source_column="text",
  embedding_model_endpoint_name=embedding_model_endpoint_name
)
index.describe()['status']['message']

# Wait for index to come online. Expect this command to take several minutes.
# You can also track the status of the index build in Catalog Explorer in the
# Overview tab for the vector index.

import time
index = vsc.get_index(endpoint_name=vector_search_endpoint_name,index_name=vs_index_fullname)
while not index.describe().get('status')['ready']:
  print("Waiting for index to be ready...")
  time.sleep(30)
print("Index is ready!")
index.describe()
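The wait loop above runs until the index reports ready, so it never stops if the build fails. A minimal sketch of a polling helper with a timeout follows. `wait_until` is a hypothetical name, not part of the Vector Search SDK, and the default timeout and interval are arbitrary assumptions.

```python
import time


def wait_until(is_ready, timeout_s=1800, interval_s=30,
               sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the condition was met, False on timeout.
    sleep and clock are injectable so the helper can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Stand-in readiness check that succeeds on the third poll.
states = iter([False, False, True])
print(wait_until(lambda: next(states), timeout_s=5, interval_s=0))
# True
```

In the notebook, the readiness check would be the same expression the loop above uses, for example `wait_until(lambda: index.describe().get('status')['ready'])`.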

Similarity search

The following cells show how to query the vector index to find similar documents.

results = index.similarity_search(
  query_text="Greek myths",
  columns=["id", "text", "title"],
  num_results=5
  )
rows = results['result']['data_array']
for (id, text, title, score) in rows:
  if len(text) > 32:
    # trim text output for readability
    text = text[0:32] + "..."
  print(f"id: {id}  title: {title} text: '{text}' score: {score}")

# Search with a filter. Note that the syntax depends on the endpoint type.

# Standard endpoint syntax
results = index.similarity_search(
  query_text="Greek myths",
  columns=["id", "text", "title"],
  num_results=5,
  filters={"title NOT": "Hercules"}
)

# Storage-optimized endpoint syntax
# results = index.similarity_search(
#   query_text="Greek myths",
#   columns=["id", "text", "title"],
#   num_results=5,
#   filters='title != "Hercules"'
#   )

rows = results['result']['data_array']
for (id, text, title, score) in rows:
  if len(text) > 32:
    # trim text output for readability
    text = text[0:32] + "..."
  print(f"id: {id}  title: {title} text: '{text}' score: {score}")
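Both result loops above unpack each row positionally as the requested columns followed by the score. As a minimal sketch, the rows can be turned into dictionaries keyed by column name, which avoids repeating the unpacking logic. `rows_to_dicts` is a hypothetical helper, and `sample_results` is a made-up placeholder that only imitates the `result` / `data_array` layout used above; it is not real query output.

```python
def rows_to_dicts(results, columns):
    """Convert data_array rows into dicts keyed by column name.

    Assumes each row holds the requested columns in order, followed by
    the similarity score, as the loops above do.
    """
    names = list(columns) + ["score"]
    return [dict(zip(names, row)) for row in results["result"]["data_array"]]


# Placeholder response with invented values, for illustration only.
sample_results = {
    "result": {
        "data_array": [
            ["1_0", "placeholder text A", "Placeholder title A", 0.5],
            ["2_3", "placeholder text B", "Placeholder title B", 0.25],
        ]
    }
}

for hit in rows_to_dicts(sample_results, ["id", "text", "title"]):
    print(hit["id"], hit["title"], hit["score"])
```

With a real response, the call would be `rows_to_dicts(results, ["id", "text", "title"])`, using the same column list that was passed to `similarity_search`.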

Delete the vector index

vsc.delete_index(
  endpoint_name=vector_search_endpoint_name,
  index_name=vs_index_fullname
)

Last updated on 2026-04-25