Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
The pg_fts extension adds production-quality, BM25-ranked full-text search to Azure HorizonDB. BM25 is the same relevance algorithm used by Elasticsearch, Solr, and Azure AI Search - pg_fts brings it inside Postgres as a custom index, so you can do keyword search natively next to your relational data without standing up a separate search service or copy-syncing data into one.
pg_fts is the recommended full-text search option on Azure HorizonDB. It works on its own, and it composes with pgvector and DiskANN to power hybrid search.
Note
pg_fts is in preview.
When to use pg_fts vs. built-in tsvector
PostgreSQL has had built-in full-text search through tsvector and tsquery for years. pg_fts doesn't replace that - it solves the cases where built-in FTS falls short:
| Need | Built-in tsvector + GIN |
pg_fts |
|---|---|---|
| Ranking algorithm | ts_rank - no term saturation, no length normalization, no native IDF |
BM25 - industry-standard ranker with all three |
| Latency at 100K+ rows on multi-keyword queries | Often hundreds of ms to seconds | Single-digit to low double-digit ms |
| Scale to billions of documents | Degrades - GIN posting lists grow large | Designed for scale via a custom index |
| Fuzzy / typo tolerance | Manual pg_trgm plumbing |
First-class fuzzy queries |
| Phrase proximity (words within N positions) | Limited | First-class |
| CJK languages | Requires custom dictionaries | Built-in analyzers for Chinese, Japanese, Korean, Thai |
If you have a small, low-traffic search workload and you're already happy with ts_rank, the built-in path is fine. For anything closer to a real search experience - product catalog, support content, log search, agent retrieval - use pg_fts.
Why BM25
BM25 (Best Matching 25) solves three problems that ts_rank doesn't:
- Term frequency saturation. Repeated occurrences of a keyword have diminishing returns, so a keyword-stuffed document can't dominate results.
- Document length normalization. A short product title that mentions "wireless headphones" outranks a 10,000-word blog post that happens to mention the same phrase once.
- Inverse document frequency (IDF). Common words (
the,error) get down-weighted; rare, discriminating terms ("PG-4012," "replication") get up-weighted.
That's why every modern search engine uses BM25 (or a close variant) as its baseline. With pg_fts, you get the same quality without leaving Postgres.
Enable pg_fts
To use the pg_fts extension, allow the extension at the instance level, then create the extension on each database where you want to use it.
CREATE EXTENSION IF NOT EXISTS pg_fts;
For convenient access to the functions and operators, add the pgfts schema to your search_path for the session:
SET search_path = public, pgfts;
To verify the install:
SELECT pgfts.hello_pg_fts();
To remove the extension from the current database:
DROP EXTENSION IF EXISTS pg_fts;
Create a full-text search index
pg_fts exposes a custom index access method called fts. Create an index on one or more text columns:
CREATE TABLE products (
id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT NOT NULL,
description TEXT NOT NULL
);
CREATE INDEX idx_products_fts
ON products
USING fts (name, description);
Key differences from a GIN index on tsvector:
- The data column stays as plain
text- notsvectorcolumn to maintain. - The index updates automatically on
INSERT,UPDATE, andDELETE. NoREFRESHstep. - Because results come from a custom scan,
pg_stat_user_indexesdoesn't show the index the same way GIN does.
Run searches
Basic keyword search
pgfts.fts_query() is a boolean filter that names the index to search. Results come back in BM25 rank order automatically.
SELECT id, name, description
FROM products
WHERE pgfts.fts_query('wireless noise cancelling headphones', 'idx_products_fts')
LIMIT 10;
Project the BM25 score
Use pgfts.fts_score() to surface the relevance score alongside the row. This function requires pg_fts in shared_preload_libraries.
SELECT id, name, description,
pgfts.fts_score(description) AS score
FROM products
WHERE pgfts.fts_query('wireless headphones', 'idx_products_fts')
ORDER BY score DESC
LIMIT 10;
Boolean queries (AND / OR / NOT)
SELECT id, name
FROM products
WHERE pgfts.fts_query('wireless AND headphones NOT earbuds', 'idx_products_fts')
LIMIT 10;
Fuzzy search for typo tolerance
Use the JSON DSL to match terms within an edit distance of 0, 1, or 2. This approach handles real-world misspellings without needing to add pg_trgm.
SELECT id, name
FROM products
WHERE pgfts.fts_query(
'{"fuzzy": {"description": {"value": "headhpones", "fuzziness": 1}}}'::jsonb,
'idx_products_fts')
LIMIT 10;
Phrase proximity
Find words that appear within N positions of each other.
-- Exact adjacent phrase
SELECT id, name
FROM products
WHERE pgfts.fts_query('"noise cancelling"~0', 'idx_products_fts');
-- Words within 5 positions of each other
SELECT id, name
FROM products
WHERE pgfts.fts_query('"wireless headphones"~5', 'idx_products_fts');
The @@? operator
For simple, single-keyword filters on a text column, use the @@? operator directly. It doesn't support boolean syntax - use pgfts.fts_query() for AND, OR, and NOT.
SELECT id, name
FROM products
WHERE description OPERATOR(pgfts.@@?) 'wireless headphones';
Multi-language support
pg_fts includes analyzers for major non-Latin-script languages.
| Analyzer | Language | Description |
|---|---|---|
default |
Multiple | Simple tokenizer with lowercase filter, suitable for English, and most Latin-script languages |
chinese |
Chinese | Jieba segmentation |
japanese |
Japanese | Lindera with IPADIC dictionary |
korean |
Korean | Lindera with mecab-ko-dic dictionary |
thai |
Thai | ICU4X word segmentation |
You can inspect tokenization for any analyzer by using the debug helper:
SELECT *
FROM pgfts.debug_analyze_text('japanese', '{}', '東京の天気');
To list available analyzers:
SELECT * FROM pgfts.list_fts_analyzers();
Combine pg_fts with vector search (hybrid search)
pg_fts is designed to work with vector search. The standard pattern is Reciprocal Rank Fusion (RRF): run BM25 and vector search separately, then combine the ranks.
WITH bm25_results AS (
SELECT id, ROW_NUMBER() OVER () AS bm25_rank
FROM products
WHERE pgfts.fts_query('wireless noise cancelling', 'idx_products_fts')
LIMIT 20
),
vector_results AS (
SELECT id,
ROW_NUMBER() OVER (
ORDER BY embedding <=> azure_openai.create_embeddings(
'text-embedding-3-small',
'<query>')::vector
) AS vec_rank
FROM products
ORDER BY embedding <=> azure_openai.create_embeddings(
'text-embedding-3-small',
'<query>')::vector
LIMIT 20
)
SELECT COALESCE(b.id, v.id) AS id,
(1.0 / (60 + COALESCE(b.bm25_rank, 999))) +
(1.0 / (60 + COALESCE(v.vec_rank, 999))) AS rrf_score
FROM bm25_results b
FULL OUTER JOIN vector_results v ON b.id = v.id
ORDER BY rrf_score DESC
LIMIT 10;
For an end-to-end walkthrough - including embedding generation in SQL and adding a semantic reranker - see Hybrid search in Azure HorizonDB (Preview).
Performance notes
- LIMIT pushdown. The
pg_ftscustom scan pushesLIMITinto the index and only retrieves as many candidates as you ask for. This feature makes multi-keyword queries fast on large tables. - Index size. The
ftsindex is denser than a GIN index overtsvectorbecause it stores positions, frequencies, and language-specific analyzer state. Plan disk accordingly. - Updates. The index continuously applies inserts and updates. There's no separate refresh step.
ORDER BY score. When you order bypgfts.fts_score(), the planner still uses the FTS custom scan - it doesn't rerank the whole table.
Limitations (Preview)
pgfts.fts_score()requirespg_ftsinshared_preload_libraries. Without it, onlypgfts.fts_query()(which already returns rows in rank order) works.- The
@@?operator doesn't support boolean (AND/OR/NOT) syntax. Usepgfts.fts_query()for those queries. - You can inspect CJK analyzers by using
pgfts.debug_analyze_text(), but you can't yet select them at index creation time via aWITH (analyzer = '...')option. - The index isn't represented in
pg_stat_user_indexesthe same way GIN is, because results come through a custom scan.