Best practices for Mosaic AI Vector Search
This article provides tips for using Mosaic AI Vector Search effectively.
Recommendations for optimizing latency
- Use the service principal authorization flow to take advantage of network-optimized routes, as in the sketch after this list.
- Use the latest version of the Python SDK.
- When testing, start with a concurrency of around 16 to 32; higher concurrency does not yield higher throughput.
- Use a model served with provisioned throughput (for example, bge-large-en or a fine-tuned version) instead of a pay-per-token foundation model.
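The following is a minimal sketch of querying an index with service principal credentials through the Python SDK. The workspace URL, credential values, endpoint name, index name, columns, and query text are all placeholders, not values from this article.

```python
# A minimal sketch of service principal authentication with the Python SDK.
# All names and credentials below are placeholders.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient(
    workspace_url="https://<your-workspace>.cloud.databricks.com",
    service_principal_client_id="<sp-client-id>",
    service_principal_client_secret="<sp-client-secret>",
)

index = client.get_index(
    endpoint_name="vs_endpoint",           # placeholder endpoint name
    index_name="catalog.schema.my_index",  # placeholder index name
)

# Query the index; columns and query text are illustrative.
results = index.similarity_search(
    query_text="example query",
    columns=["id", "text"],
    num_results=5,
)
print(results)
```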
When to use GPUs
- Use CPUs only for basic testing and for small datasets (up to hundreds of rows).
- For GPU compute type, Databricks recommends using GPU-small or GPU-medium (see the sketch after this list).
- For GPU compute scale-out, higher concurrency might improve ingestion times, but this depends on factors such as total dataset size and index metadata.
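As an illustration, the following sketch creates an embedding model serving endpoint on GPU_SMALL compute through the Model Serving REST API. The host, token, endpoint name, and the served model name and version are assumptions for the example, not values prescribed by this article.

```python
# A hedged sketch: serving an embedding model on GPU_SMALL compute via the
# Model Serving REST API. Host, token, and the entity name/version are
# placeholders; replace them with your own values.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<api-token>"

resp = requests.post(
    f"{host}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "bge-embeddings",  # placeholder endpoint name
        "config": {
            "served_entities": [
                {
                    "entity_name": "<catalog>.<schema>.<registered_model>",  # placeholder
                    "entity_version": "1",
                    "workload_type": "GPU_SMALL",  # or "GPU_MEDIUM", per the guidance above
                    "workload_size": "Small",      # controls scale-out concurrency
                    "scale_to_zero_enabled": False,
                }
            ]
        },
    },
)
resp.raise_for_status()
```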
Working with images, video, or non-text data
- Pre-compute the embeddings and use a Delta Sync Index with self-managed embeddings, as in the sketch after this list.
- Don’t store binary formats such as images as metadata; this adversely affects latency. Instead, store the file path as metadata.
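The following sketch creates a Delta Sync Index over pre-computed embeddings. The endpoint, index, table, and column names are placeholders, and the source Delta table is assumed to hold one embedding vector per row along with a file-path metadata column rather than raw image bytes.

```python
# A minimal sketch of a Delta Sync Index with self-managed embeddings.
# All names below are placeholders.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

index = client.create_delta_sync_index(
    endpoint_name="vs_endpoint",                           # placeholder
    index_name="catalog.schema.image_index",               # placeholder
    source_table_name="catalog.schema.image_embeddings",   # placeholder
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_dimension=1024,             # must match your embedding model
    embedding_vector_column="embedding",  # pre-computed vectors, not raw images
)
```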
Embedding sequence length
- Check the embedding model's sequence length to make sure documents are not being truncated, as sketched below. For example, BGE supports a context of 512 tokens. For longer context requirements, use gte-large-en-v1.5.
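One way to catch truncation before indexing is to count tokens with the model's tokenizer. This sketch assumes the Hugging Face `transformers` library and the `BAAI/bge-large-en-v1.5` tokenizer with its 512-token limit; adjust both for your model.

```python
# A hedged sketch: flagging documents that exceed the embedding model's
# sequence length before indexing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
MAX_TOKENS = 512  # BGE context length

def is_truncated(text: str) -> bool:
    # Count tokens without truncating so overflow is detectable.
    n_tokens = len(tokenizer.encode(text, truncation=False))
    return n_tokens > MAX_TOKENS

docs = ["a short document", "a much longer document ..."]
for doc in docs:
    if is_truncated(doc):
        print("Document exceeds 512 tokens; chunk it or use gte-large-en-v1.5")
```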
Use Triggered sync mode to reduce costs
- The most cost-effective option for updating a vector search index is Triggered. Select Continuous only if you need to incrementally sync the index with changes to the source table at a latency of seconds. Both sync modes perform incremental updates: only data that has changed since the last sync is processed. A triggered sync can be run on demand, as in the sketch below.
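The following is a minimal sketch of running an on-demand sync for a Triggered-mode index with the Python SDK. The endpoint and index names are placeholders.

```python
# A minimal sketch: trigger an incremental sync on demand.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(
    endpoint_name="vs_endpoint",           # placeholder
    index_name="catalog.schema.my_index",  # placeholder
)

# Incrementally processes only rows changed since the last sync.
index.sync()
```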