Vector Search OpenAI(외부 포함 모델) 예제

이 Notebook은 벡터 검색을 위한 기본 API로서 제공되는 VectorSearchClient을 사용하는 Python SDK 사용 방법을 보여준다.

이 노트북은 Databricks의 외부 모델 지원을 활용하여 OpenAI 임베딩 모델에 액세스하고 임베딩을 생성합니다.

%pip install --upgrade --force-reinstall databricks-vectorsearch tiktoken
dbutils.library.restartPython()

from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient(disable_notice=True)

# Display help for the Vector Search Client
help(VectorSearchClient)

원본 델타 테이블에 토이 데이터 세트 로드

다음은 원본 델타 테이블을 만듭니다.

# Specify the catalog and schema to use. You must have USE_CATALOG privilege on the catalog and USE_SCHEMA and CREATE_TABLE privileges on the schema.
# Change the catalog and schema here if necessary.

catalog_name = "main"
schema_name = "default"


source_table_name = "wiki_articles_demo"
source_table_fullname = f"{catalog_name}.{schema_name}.{source_table_name}"

# Uncomment the following line if you want to start from scratch.

# spark.sql(f"DROP TABLE {source_table_fullname}")

source_df = spark.read.parquet("/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet").limit(10)
display(source_df)

청크 샘플 데이터 세트

샘플 데이터 세트를 청크하면 포함 모델의 컨텍스트 제한을 초과하지 않도록 방지할 수 있습니다. OpenAI 모델은 최대 8192개의 토큰을 지원합니다. 그러나 Databricks는 RAG 애플리케이션의 추론 모델에 더 다양한 예제를 공급할 수 있도록 데이터를 더 작은 컨텍스트 청크로 분할하는 것이 좋습니다.

import tiktoken
import pandas as pd


max_chunk_tokens = 1024
encoding = tiktoken.get_encoding("cl100k_base")


def chunk_text(text):
    # Encode and then decode within the UDF
    tokens = encoding.encode(text)
    chunks = []
    while tokens:
        chunk_tokens = tokens[:max_chunk_tokens]
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)
        tokens = tokens[max_chunk_tokens:]
    return chunks

# Process the data and store in a new list
pandas_df = source_df.toPandas()
processed_data = []
for index, row in pandas_df.iterrows():
    text_chunks = chunk_text(row['text'])
    chunk_no = 0
    for chunk in text_chunks:
        row_data = row.to_dict()

        # Replace the id column with a new unique chunk id
        # and the text column with the text chunk
        row_data['id'] = f"{row['id']}_{chunk_no}"
        row_data['text'] = chunk

        processed_data.append(row_data)
        chunk_no += 1

chunked_pandas_df = pd.DataFrame(processed_data)
chunked_spark_df = spark.createDataFrame(chunked_pandas_df)

# Write the chunked DataFrame to a Delta table
spark.sql(f"DROP TABLE IF EXISTS {source_table_fullname}")
chunked_spark_df.write.format("delta") \
    .option("delta.enableChangeDataFeed", "true") \
    .saveAsTable(source_table_fullname)

display(spark.sql(f"SELECT * FROM {source_table_fullname}"))

벡터 검색 엔드포인트 만들기

vector_search_endpoint_name = "vector-search-demo-endpoint"

vsc.create_endpoint(
    name=vector_search_endpoint_name,
    endpoint_type="STANDARD" # or "STORAGE_OPTIMIZED"
)

vsc.get_endpoint(
  name=vector_search_endpoint_name
)

OpenAI 포함 모델 엔드포인트 등록

자세한 사용량 정보는 OpenAI 엔드포인트를 구성하기 위한 외부 모델 설명서를 참조하세요.

자격 증명을 제공하려면 Databricks 비밀 관리자를 사용합니다.

embedding_model_endpoint_name = "openai-embedding-endpoint"

import mlflow.deployments

mlflow_deploy_client = mlflow.deployments.get_deploy_client("databricks")

# Configure the secret manager with the OpenAPI key and provide the
# correct scope and key name below.

mlflow_deploy_client.create_endpoint(
    name=embedding_model_endpoint_name,
    config={
        "served_entities": [{
            "external_model": {
                "name": "text-embedding-ada-002",
                "provider": "openai",
                "task": "llm/v1/embeddings",
                "openai_config": {
                    "openai_api_key": "{{secrets/demo/openai-api-key}}" # CHANGE ME
                }
            }
    }]
    }
)

벡터 인덱스 만들기

# Vector index
vs_index = f"{source_table_name}_openai_index"
vs_index_fullname = f"{catalog_name}.{schema_name}.{vs_index}"

index = vsc.create_delta_sync_index(
  endpoint_name=vector_search_endpoint_name,
  source_table_name=source_table_fullname,
  index_name=vs_index_fullname,
  pipeline_type='TRIGGERED',
  primary_key="id",
  embedding_source_column="text",
  embedding_model_endpoint_name=embedding_model_endpoint_name
)
index.describe()['status']['message']

# Wait for index to come online. Expect this command to take several minutes.
# You can also track the status of the index build in Catalog Explorer in the
# Overview tab for the vector index.

import time
index = vsc.get_index(endpoint_name=vector_search_endpoint_name,index_name=vs_index_fullname)
while not index.describe().get('status')['ready']:
  print("Waiting for index to be ready...")
  time.sleep(30)
print("Index is ready!")
index.describe()

유사성 검색

다음 셀에서는 비슷한 문서를 찾기 위해 벡터 인덱스 쿼리 방법을 보여 있습니다.

results = index.similarity_search(
  query_text="Greek myths",
  columns=["id", "text", "title"],
  num_results=5
  )
rows = results['result']['data_array']
for (id, text, title, score) in rows:
  if len(text) > 32:
    # trim text output for readability
    text = text[0:32] + "..."
  print(f"id: {id}  title: {title} text: '{text}' score: {score}")

# Search with a filter. Note that the syntax depends on the endpoint type.

# Standard endpoint syntax
results = index.similarity_search(
  query_text="Greek myths",
  columns=["id", "text", "title"],
  num_results=5,
  filters={"title NOT": "Hercules"}
)

# Storage-optimized endpoint syntax
# results = index.similarity_search(
#   query_text="Greek myths",
#   columns=["id", "text", "title"],
#   num_results=5,
#   filters='title != "Hercules"'
#   )

rows = results['result']['data_array']
for (id, text, title, score) in rows:
  if len(text) > 32:
    # trim text output for readability
    text = text[0:32] + "..."
  print(f"id: {id}  title: {title} text: '{text}' score: {score}")

벡터 인덱스 삭제

vsc.delete_index(
  endpoint_name=vector_search_endpoint_name,
  index_name=vs_index_fullname
)

예제 노트

Vector Search OpenAI(외부 포함 모델) 예제

노트북 받기

피드백

이 페이지가 도움이 되었나요?

Last updated on 2026-04-25

Vector Search OpenAI(외부 포함 모델) 예제

원본 델타 테이블에 토이 데이터 세트 로드

청크 샘플 데이터 세트

벡터 검색 엔드포인트 만들기

OpenAI 포함 모델 엔드포인트 등록

벡터 인덱스 만들기

유사성 검색

벡터 인덱스 삭제

예제 노트

Vector Search OpenAI(외부 포함 모델) 예제

피드백

추가 리소스