Tutorial: Use an eventhouse as a vector database with LLM embeddings

In this tutorial, you learn how to use an eventhouse as a vector database to store and query vector data in Real-Time Intelligence. For general information about vector databases, see Vector databases.

The given scenario involves the use of semantic searches on Wikipedia pages to find pages with common themes. You use an available sample dataset, which includes vectors for tens of thousands of Wikipedia pages. These pages are embedded with an OpenAI model to produce vectors for each page. You store the vectors, along with some pertinent metadata related to the page, in an eventhouse. You can use this dataset to find pages that are similar to each other, or to find pages that are similar to some theme you want to find. For example, say you want to look up "famous female scientists of the 19th century." You encode this phrase using the same OpenAI model, and then run a vector similarity search over the stored Wikipedia page data to find the pages with the highest semantic similarity.

Specifically, in this tutorial you:

Prepare a table in the eventhouse with Vector16 encoding for the vector columns.
Store vector data from a pre-embedded dataset to an eventhouse.
Embed a natural language query by using the OpenAI model.
Use the series_cosine_similarity KQL function to calculate the similarities between the query embedding vector and those of the wiki pages.
View rows of the highest similarity to get the wiki pages that are most relevant to your search query.

You can visualize this flow as follows:

Prerequisites

A workspace with a Microsoft Fabric-enabled capacity.
An eventhouse in your workspace.
An Azure OpenAI resource with the text-embedding-ada-002 (Version 2) model deployed. This model is currently only available in certain regions. For more information, see Create a resource.
Download the sample notebook from the GitHub repository. The notebook is used to ingest the pre-embedded Wikipedia dataset regardless of which embedding approach you use for querying.
For the KQL queryset approach only: the ai_embeddings plugin configured with a callout policy and managed identity on your eventhouse.

Prepare your eventhouse environment

In this setup step, you create a table in an eventhouse with the necessary columns and encoding policies to store the vector data.

Browse to your workspace homepage in Real-Time Intelligence.
Select the eventhouse you created in the prerequisites.
Select the target database where you want to store the vector data. If you don't have a database, create one by selecting Add database.
Expand the database tree, select the embedded queryset, and copy and paste the following KQL query to create a table called Wiki with the necessary columns:
```
.create table Wiki (id:string,url:string,['title']:string,text:string,title_vector:dynamic,content_vector:dynamic,vector_id:long)
```

Copy and paste the following commands to set the encoding policy of the vector columns. Run these commands sequentially.

.alter column Wiki.title_vector policy encoding type='Vector16'

.alter column Wiki.content_vector policy encoding type='Vector16'

Write vector data to an eventhouse

Use the following steps to import the embedded Wikipedia data and write it in an eventhouse:

Import notebook

Download the sample vector-database-eventhouse-notebook notebook from the GitHub repository.
Browse to your Fabric environment. In the experience switcher, choose Fabric and then your workspace.
Select Import > Notebook > From this computer > Upload and then choose the notebook you downloaded in a previous step.
When the import finishes, open the imported notebook from your workspace.

Write data to the eventhouse

Run the cells to set up your environment.

%%configure -f
{"conf":
    {
        "spark.rpc.message.maxSize": "1024"
    }
}

%pip install wget

%pip install openai

Run the cells to download the precomputed embeddings.

import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so it might take some time
wget.download(embeddings_url)

import zipfile

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("/lakehouse/default/Files/data")

import pandas as pd

from ast import literal_eval

article_df = pd.read_csv('/lakehouse/default/Files/data/vector_database_wikipedia_articles_embedded.csv')
# Read vectors from strings back into a list
article_df["title_vector"] = article_df.title_vector.apply(literal_eval)
article_df["content_vector"] = article_df.content_vector.apply(literal_eval)
article_df.head()

To write to the eventhouse, enter your Cluster URI, which you can find on the system overview page, and the name of the database. The table is created in the notebook and later referenced in the query.
```
# replace with your eventhouse Query URI, Database name, and Table name
KUSTO_CLUSTER =  "eventhouse Cluster URI"
KUSTO_DATABASE = "Database name"
KUSTO_TABLE = "Wiki"
```

Run the remaining cells to write the data to the eventhouse. This operation can take some time to execute.

kustoOptions = {"kustoCluster": KUSTO_CLUSTER, "kustoDatabase" :KUSTO_DATABASE, "kustoTable" : KUSTO_TABLE }

access_token=mssparkutils.credentials.getToken(kustoOptions["kustoCluster"])

#Pandas data frame to spark dataframe
sparkDF=spark.createDataFrame(article_df)

# Write data to a table in eventhouse
sparkDF.write. \
format("com.microsoft.kusto.spark.synapse.datasource"). \
option("kustoCluster",kustoOptions["kustoCluster"]). \
option("kustoDatabase",kustoOptions["kustoDatabase"]). \
option("kustoTable", kustoOptions["kustoTable"]). \
option("accessToken", access_token). \
option("tableCreateOptions", "CreateIfNotExist").\
mode("Append"). \
save()

View the data in the eventhouse

At this point, you can verify the data is written to the eventhouse by browsing to the database details page.

Browse to your workspace homepage in Real-Time Intelligence.
Select the database item that you provided in the previous section. You see a summary of the data that was written to the "Wiki" table. If the database is already opened, refresh it to see the new data.

Generate embedding for the search term

After you store the embedded wiki data in your eventhouse, embed a search term by using the same Azure OpenAI model. Then, compare it against the stored vectors to find similar Wikipedia pages.

Notebook
KQL queryset

To call the Azure OpenAI embedding API from the notebook, you need the following values:

Variable name	Value
endpoint	Find this value in the Keys & Endpoint section when you examine your resource in the Azure portal. An example endpoint is: `https://docs-test-001.openai.azure.com/`.
API key	Find this value in the Keys & Endpoint section when you examine your resource in the Azure portal. Use either KEY1 or KEY2.
deployment id	Find this value under the Deployments section in Azure OpenAI Studio.

Run the following cell to connect to Azure OpenAI and define the embedding function. Replace the placeholder values with your endpoint, API key, and deployment ID.

import openai

openai.api_version = '2022-12-01'
openai.api_base = 'endpoint'      # Add your endpoint here
openai.api_type = 'azure'
openai.api_key = 'api key'        # Add your API key here

def embed(query):
    # Creates embedding vector from user query
    embedded_query = openai.Embedding.create(
            input=query,
            deployment_id="deployment id",  # Add your deployment ID here
            chunk_size=1
    )["data"][0]["embedding"]
    return embedded_query

Run the following cell to generate an embedding for your search term:

searchedEmbedding = embed("most difficult gymnastics moves in the olympics")

With the KQL queryset approach, you embed data inline by using the ai_embeddings plugin. You need the model endpoint URL, which combines your Azure OpenAI resource endpoint and deployment name.

Variable name	Value
model_endpoint	Constructed from your Azure OpenAI endpoint and deployment name, in the format: `https://<deployment-name>.openai.azure.com/openai/deployments/text-embedding-ada-002/embeddings?api-version=2024-10-21;impersonate`. Find the endpoint in the Keys & Endpoint section of your resource in the Azure portal.

Run the following query in your KQL queryset to verify the embedding is generated correctly. Use this same pattern in the next step to run the full similarity search.

let model_endpoint = 'https://<deployment-name>.openai.azure.com/openai/deployments/text-embedding-ada-002/embeddings?api-version=2024-10-21;impersonate';
let searchedEmbedding = toscalar(evaluate ai_embeddings("most difficult gymnastics moves in the olympics", model_endpoint));
print searchedEmbedding

Query the similarity

Use cosine similarity to compare the search term embedding against the stored Wikipedia page vectors and return the top 10 most similar pages. You can change the search term and rerun to explore different results.

Notebook
KQL queryset

The query runs via the Kusto Spark connector in the notebook, using the searchedEmbedding vector from the previous step.

kustoQuery = "Wiki | extend similarity = series_cosine_similarity(dynamic(" + str(searchedEmbedding) + "), content_vector) | top 10 by similarity desc"
accessToken = mssparkutils.credentials.getToken(KUSTO_CLUSTER)
kustoDf = spark.read \
    .format("com.microsoft.kusto.spark.synapse.datasource") \
    .option("accessToken", accessToken) \
    .option("kustoCluster", KUSTO_CLUSTER) \
    .option("kustoDatabase", KUSTO_DATABASE) \
    .option("kustoQuery", kustoQuery).load()

kustoDf.show()

The following query embeds the search term and runs the similarity search in a single KQL statement. Replace the model_endpoint value with your own.

let model_endpoint = 'https://<deployment-name>.openai.azure.com/openai/deployments/text-embedding-ada-002/embeddings?api-version=2024-10-21;impersonate';
let searchedEmbedding = toscalar(evaluate ai_embeddings("most difficult gymnastics moves in the olympics", model_endpoint));
Wiki
| extend similarity = series_cosine_similarity(searchedEmbedding, content_vector)
| top 10 by similarity desc

Clean up resources

When you finish the tutorial, delete the resources you created to avoid incurring extra costs. To delete the resources, follow these steps:

Browse to your workspace homepage.
Delete the notebook you created in this tutorial.
Delete the eventhouse or database used in this tutorial.

Feedback

Was this page helpful?

Last updated on 2026-06-23