Vector databases

A vector database stores and manages data in the form of vectors, which are numerical representations of data points.

The use of vectors allows for complex queries and analyses, because you can compare and analyze vectors by using advanced techniques such as vector similarity search, quantization, and clustering. Traditional databases aren't well-suited for handling the high-dimensional data that's becoming increasingly common in data analytics. However, vector databases are designed to handle high-dimensional data, such as text, images, and audio, by representing them as vectors. Vector databases are useful for tasks such as machine learning, natural language processing, and image recognition, where the goal is to identify patterns or similarities in large datasets.

This article gives some background about vector databases and explains conceptually how you can use an Eventhouse as a vector database in Real-Time Intelligence in Microsoft Fabric. For a practical example, see Tutorial: Use an Eventhouse as a vector database.

Key concepts

The following key concepts are used in vector databases:

Vector similarity

Vector similarity is a measure of how different (or similar) two or more vectors are. Vector similarity search is a technique used to find similar vectors in a dataset. You compare vectors by using a distance metric, such as Euclidean distance or cosine similarity. The closer two vectors are, the more similar they are.
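The two distance metrics mentioned above are easy to sketch in plain Python. This is illustrative only; as described later in this article, Eventhouse provides built-in functions for these calculations:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|). 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Euclidean distance: smaller values mean more similar vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, different magnitude
c = [-1.0, 0.0, 1.0]

print(cosine_similarity(a, b))   # 1.0: identical direction, regardless of magnitude
print(cosine_similarity(a, c))   # much lower: different direction
print(euclidean_distance(a, b))  # nonzero: magnitude matters for this metric
```

Note that cosine similarity ignores vector magnitude while Euclidean distance doesn't, which is why cosine similarity is the common choice for comparing text embeddings.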

Embeddings

Embeddings are a common way of representing data in a vector format for use in vector databases. An embedding is a mathematical representation of a piece of data, such as a word, text document, or image, that's designed to capture its semantic meaning. You create embeddings by using algorithms that analyze the data and generate a set of numerical values that represent its key features. For example, an embedding for a word might represent its meaning, its context, and its relationship to other words. While you can create embeddings by using standard Python packages (for example, spaCy, sent2vec, Gensim), large language models (LLMs) generate the highest quality embeddings for semantic text search. For example, you can send text to an embedding model in Azure OpenAI, and it generates a vector representation that you can store for analysis. For more information, see Understand embeddings in Azure OpenAI Service.
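To illustrate only the general idea of mapping text to a fixed-length vector (not how a trained embedding model works internally), here is a minimal token-hashing sketch. A real model learns dense vectors that capture semantics; this hash trick does not:

```python
import hashlib
import math

def toy_embedding(text, dim=8):
    """Hash each token into one of `dim` buckets, count occurrences,
    and L2-normalize. Purely illustrative: a trained embedding model
    (for example, in Azure OpenAI) captures semantic meaning instead."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

print(toy_embedding("vector databases store embeddings as vectors"))
```

Whatever produces the vectors, the key property is that the same input always maps to the same vector, and similar inputs map to nearby vectors.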

General workflow

Figure: Schematic of how to embed, store, and query text stored as vectors.

The general workflow for using a vector database is as follows:

  1. Embed data: Convert data into vector format using an embedding model. For example, you can embed text data using an OpenAI model.
  2. Store vectors: Store the embedded vectors in a vector database. You can send the embedded data to an Eventhouse to store and manage the vectors.
  3. Embed query: Convert the query data into vector format using the same embedding model used to embed the stored data.
  4. Query vectors: Use vector similarity search to find entries in the database that are similar to the query.
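The four steps above can be sketched end to end in Python, with a plain in-memory list standing in for the vector database and hard-coded vectors standing in for a real embedding model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Steps 1-2: embed documents and store the vectors (precomputed here for illustration).
store = [
    ("doc1", [0.9, 0.1, 0.0]),
    ("doc2", [0.0, 0.8, 0.6]),
    ("doc3", [0.7, 0.7, 0.1]),
]

# Step 3: embed the query with the same model used for the stored data.
query_vec = [1.0, 0.2, 0.0]

# Step 4: rank stored vectors by similarity to the query vector.
ranked = sorted(store, key=lambda row: cosine(row[1], query_vec), reverse=True)
print(ranked[0][0])  # doc1, the entry most similar to the query
```

In an Eventhouse, steps 2 and 4 are replaced by ingesting the vectors into a table and querying it, as described in the next section.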

Eventhouse as a vector database

At the core of vector similarity search is the ability to store, index, and query vector data. Eventhouses provide a solution for handling and analyzing large volumes of data, particularly in scenarios requiring real-time analytics and exploration. This capability makes Eventhouse an excellent choice for storing and searching vectors.

The following components of the Eventhouse enable you to use it as a vector database:

  • The dynamic data type, which can store unstructured data such as arrays and property bags. Use this data type to store vector values. You can further augment the vector value by storing metadata related to the original object as separate columns in your table.
  • The Vector16 encoding type, which is designed for storing vectors of floating-point numbers in 16-bit precision. This encoding uses the bfloat16 floating-point format instead of the default 64 bits. Use this encoding to store ML vector embeddings, because it reduces storage requirements by a factor of four and accelerates vector processing functions such as series_dot_product() and series_cosine_similarity() by orders of magnitude.
  • The series_cosine_similarity function, which you can use to perform vector similarity searches on top of the vectors stored in Eventhouse.
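The factor-of-four storage saving is simple to verify: a 64-bit double takes 8 bytes per coefficient, while a 16-bit float takes 2. The sketch below uses IEEE half precision from Python's struct module for illustration; Eventhouse's Vector16 uses bfloat16, which is also 16 bits wide but divides them differently between exponent and mantissa:

```python
import struct

coeffs = [0.1, -2.5, 3.75, 0.0]

# 64-bit doubles: 8 bytes per coefficient (the default dynamic encoding width).
full = struct.pack(f"{len(coeffs)}d", *coeffs)

# 16-bit floats: 2 bytes per coefficient (IEEE half here; Vector16 uses bfloat16).
half = struct.pack(f"{len(coeffs)}e", *coeffs)

print(len(full), len(half), len(full) // len(half))  # 32 8 4
```

The reduced precision is typically acceptable for embeddings, because similarity rankings are robust to small per-coefficient rounding.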

Optimize for scale

For more information on optimizing vector similarity search, see the related blog post.

To maximize performance and the resulting search times, follow these steps:

  1. Set the encoding of the embeddings column to Vector16, the 16-bit encoding of the vector coefficients (instead of the default 64-bit).
  2. Store the embedding vectors table on all cluster nodes, with at least one shard per processor. To achieve this, follow these steps:
    1. Limit the number of embedding vectors per shard by altering the ShardEngineMaxRowCount of the sharding policy. The sharding policy balances data on all nodes with multiple extents per node so the search can use all available processors.
    2. Change the RowCountUpperBoundForMerge of the merging policy. The merge policy suppresses merging of extents after ingestion.

Example optimization steps

In the following example, you define a static vector table for storing 1M vectors. You define the encoding policy as Vector16, and set the sharding and merging policies to optimize the table for vector similarity search. For this example, assume the cluster has 20 nodes, each with 16 processors. The table’s shards should contain at most 1,000,000/(20*16)=3,125 rows.
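The shard-count arithmetic can be sketched as a one-line helper (the cluster dimensions are the hypothetical ones assumed in this example):

```python
def max_rows_per_shard(total_vectors, nodes, processors_per_node):
    """Cap shard row count so that there is at least one shard per processor."""
    return total_vectors // (nodes * processors_per_node)

print(max_rows_per_shard(1_000_000, 20, 16))  # 3125
```

Use the result as the value for both ShardEngineMaxRowCount and RowCountUpperBoundForMerge in the commands that follow.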

  1. Run the following KQL commands one by one to create the empty table and set the required policies and encoding:

    .create table embedding_vectors(vector_id:long, vector:dynamic)                                  // This is a sample selection of columns; you can add more columns
    
    .alter column embedding_vectors.vector policy encoding type = 'Vector16'                         // Store the coefficients in 16 bits instead of 64, accelerating dot product calculation and suppressing redundant indexing
    
    .alter-merge table embedding_vectors policy sharding '{ "ShardEngineMaxRowCount" : 3125 }'       // Balance data on all nodes, with multiple extents per node, so the search can use all processors
    
    .alter-merge table embedding_vectors policy merge '{ "RowCountUpperBoundForMerge" : 3125 }'      // Suppress merging extents after ingestion
    
    
  2. Ingest the data to the table created and defined in the previous step.

Next step