How to search across multiple vector indexes in PostgreSQL with PGVector

Arik Levy 20 Reputation points
2024-11-13T15:26:58.6033333+00:00

TLDR: Is there a way to query both embedding columns in the same query, and ensure it uses the respective indexes e.g IVFFLAT and HNSW respectively in the same query?

I've created this table with pgvector extension enabled:

CREATE TABLE IF NOT EXISTS entities (
    id SERIAL PRIMARY KEY,       
    name TEXT,
    longitude DOUBLE PRECISION, 
    latitude DOUBLE PRECISION,  
    source TEXT,               
    name_embedding VECTOR(384),    
    geo_embedding VECTOR(32) 
);
--------------------------------------------------------
CREATE INDEX name_embedding_cosine_idx ON entities USING ivfflat (name_embedding vector_cosine_ops) WITH (lists = 10);
CREATE INDEX geo_embedding_hnsw_idx ON entities USING hnsw (geo_embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);

If I run these queries separetly:

EXPLAIN ANALYZE
SELECT id, source, name, latitude, longitude, name_embedding <=> '{name_embedding}' AS similarity
FROM entities
ORDER BY similarity ASC 
LIMIT 10;
EXPLAIN ANALYZE
SELECT id, source, name, latitude, longitude, (geo_embedding <-> '{geo_embedding}'::vector) AS similarity
FROM entities
ORDER BY similarity ASC 
LIMIT 10;

I can see I'm correctly using the indexes, however if I try to query both (and apply weights to it):

SELECT 
 id, source, name, latitude, longitude, 
  (name_embedding <=> '{name_embedding_str}'::vector) AS name_similarity,
  (geo_embedding <-> '{geo_embedding_str}'::vector) AS geo_similarity,
  (0.3 * (name_embedding <=> '{name_embedding_str}'::vector)) + 
  (0.8 * (geo_embedding <-> '{geo_embedding_str}'::vector)) AS combined_similarity
FROM 
  entities
ORDER BY 
  combined_similarity ASC
LIMIT 10;

It analyses the results sequentially, which works well for my POC with 9 records in it but will inevitably fail with hundreds of millions of records.

Is there a way to query both embedding columns in the same query, and ensure it uses the respective indexes e.g IVFFLAT and HNSW respectively in the same query?

Azure Database for PostgreSQL
{count} votes

Accepted answer
  1. Sina Salam 13,371 Reputation points
    2024-11-13T21:27:46.22+00:00

    Hello Arik Levy,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you would like to know how to work with search across multiple vector indexes in PostgreSQL with PGVector.

    Regarding your questions, PostgreSQL does not natively support using multiple vector indexes (e.g., IVFFLAT and HNSW) in the same query when calculating a combined similarity score across different embedding columns. The query planner can only use one index per query for ordering purposes, which means the combined query will fall back to a sequential scan after computing the similarity for both embeddings.

    1. PostgreSQL’s query planner typically selects a single index for sorting the results when using ORDER BY. Even if both name_embedding and geo_embedding have separate indexes, PostgreSQL cannot combine the results of both indexes in a single query for the final ordering.
    2. The pgvector extension, though powerful, relies on PostgreSQL's planner, which does not yet optimize for multiple vector operations in the same query.

    PostgreSQL’s query planner doesn’t support combining two vector indexes in a single query efficiently. Using CTEs or similar approaches can help achieve the desired result while still leveraging the indexes. Due to query planner limitations, you can break down the query into two parts, use the indexes separately, and then combine the results.

    WITH name_results AS (
        SELECT 
            id, 
            source, 
            name, 
            latitude, 
            longitude, 
            (name_embedding <=> '{name_embedding_str}'::vector) AS name_similarity
        FROM 
            entities
        ORDER BY 
            name_similarity ASC
        LIMIT 100 -- Adjust this as necessary
    ),
    geo_results AS (
        SELECT 
            id, 
            (geo_embedding <-> '{geo_embedding_str}'::vector) AS geo_similarity
        FROM 
            entities
        ORDER BY 
            geo_similarity ASC
        LIMIT 100 -- Adjust this as necessary
    )
    SELECT 
        e.id, 
        e.source, 
        e.name, 
        e.latitude, 
        e.longitude, 
        nr.name_similarity, 
        gr.geo_similarity,
        (0.3 * nr.name_similarity) + (0.8 * gr.geo_similarity) AS combined_similarity
    FROM 
        name_results nr
    JOIN 
        geo_results gr
    ON 
        nr.id = gr.id
    JOIN 
        entities e 
    ON 
        e.id = nr.id
    ORDER BY 
        combined_similarity ASC
    LIMIT 10;
    

    How it will work:

    • Perform two separate queries to leverage the ivfflat and hnsw indexes effectively.
    • Use a JOIN on id to merge the results and calculate combined_similarity.
    • Use LIMIT in each CTE to reduce the number of rows considered.

    PostgreSQL will apply the indexes to each CTE query, using ivfflat for name_embedding and hnsw for geo_embedding. The final combination ensures a balance between the two similarity scores.

    If performance becomes an issue, you can alternative approaches:

    • Consider using tools like pg_bouncer or partitioned queries to improve parallelism.
    • Precompute similarity scores for frequent queries and store them in a materialized view.
    • Fine-tune the ivfflat and hnsw parameters (e.g., lists, m, ef_construction) for better performance on larger datasets.
    • Partition your dataset to reduce the number of rows per query.

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful.


0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.