Best practices for Mosaic AI Vector Search

This article provides tips for using Mosaic AI Vector Search effectively.

Recommendations for optimizing latency

  • Use the service principal authorization flow to take advantage of network-optimized routes.
  • Use the latest version of the Python SDK.
  • When load testing, start with a concurrency of around 16 to 32; higher concurrency does not yield higher throughput. A concurrency sketch follows this list.
  • Use a model served with provisioned throughput (for example, bge-large-en or a fine-tuned version) instead of a pay-per-token foundation model.
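
For example, the following sketch load tests an existing index with the databricks-vectorsearch Python SDK using service principal credentials. The workspace URL, client ID and secret, endpoint name, and index name are placeholders; the 16-worker pool reflects the recommended starting concurrency.

```python
import concurrent.futures

from databricks.vector_search.client import VectorSearchClient

# Authenticate as a service principal (all credentials are placeholders).
client = VectorSearchClient(
    workspace_url="https://<workspace-url>",
    service_principal_client_id="<client-id>",
    service_principal_client_secret="<client-secret>",
)

index = client.get_index(
    endpoint_name="vector-search-endpoint",  # placeholder endpoint name
    index_name="main.default.docs_index",    # placeholder index name
)

def query(text: str):
    return index.similarity_search(
        query_text=text,
        columns=["id", "text"],
        num_results=10,
    )

queries = [f"sample query {i}" for i in range(256)]

# Start load tests at a concurrency of 16 to 32; higher rarely helps.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(query, queries))
```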

When to use GPUs

  • Use CPUs only for basic testing and for small datasets (up to hundreds of rows).
  • For the GPU compute type, Databricks recommends GPU-small or GPU-medium (a sketch of configuring a GPU serving endpoint follows this list).
  • For GPU compute scale-out, increasing concurrency can improve ingestion times, but the benefit depends on factors such as total dataset size and index metadata.
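
As one sketch of the GPU recommendation, the following uses the databricks-sdk Python client to serve a registered embedding model on GPU-small compute. The endpoint name and Unity Catalog model name are hypothetical, and you should confirm the workload type values available in your workspace.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()

# Serve an embedding model on GPU compute; all names are hypothetical.
w.serving_endpoints.create(
    name="bge-embeddings",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.default.bge_large_en",  # hypothetical UC model
                entity_version="1",
                workload_type="GPU_SMALL",  # or "GPU_MEDIUM"
                workload_size="Small",
                scale_to_zero_enabled=False,
            )
        ]
    ),
)
```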

Working with images, video, or other non-text data

  • Pre-compute the embeddings and use a Delta Sync Index with self-managed embeddings, as sketched below.
  • Don’t store binary formats such as images as metadata, because this adversely affects latency. Instead, store the path of the file as metadata.
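
For example, here is a minimal sketch of creating a Delta Sync Index with self-managed embeddings using the databricks-vectorsearch SDK. It assumes a source Delta table that already holds a pre-computed embedding column (dimension 1024 here) and an image_path column that stores file paths rather than image bytes; all names are illustrative.

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# The source table contains pre-computed embeddings plus an
# `image_path` metadata column instead of raw image bytes.
index = client.create_delta_sync_index(
    endpoint_name="vector-search-endpoint",  # placeholder endpoint name
    index_name="main.default.images_index",  # placeholder index name
    source_table_name="main.default.image_embeddings",
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_dimension=1024,  # must match your embedding model
    embedding_vector_column="embedding",
)
```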

Embedding sequence length

  • Check the embedding model’s sequence length to make sure documents are not being truncated. For example, BGE models support a context of 512 tokens; for longer context requirements, use gte-large-en-v1.5. A token-count check is sketched below.
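
One way to check for truncation is to count tokens with the model’s tokenizer before embedding. The sketch below assumes the Hugging Face transformers library and uses BAAI/bge-large-en-v1.5 as an example checkpoint.

```python
from transformers import AutoTokenizer

# BGE models have a 512-token context window; chunks longer than this
# are silently truncated at embedding time.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
MAX_TOKENS = 512

def exceeds_context(chunk: str) -> bool:
    return len(tokenizer.encode(chunk)) > MAX_TOKENS

chunks = ["a short passage", "a much longer document ..."]
too_long = [c for c in chunks if exceeds_context(c)]
if too_long:
    print(f"{len(too_long)} chunk(s) exceed {MAX_TOKENS} tokens and would be truncated")
```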

Use Triggered sync mode to reduce costs

  • The most cost-effective option for updating a vector search index is Triggered sync mode. Select Continuous only if you need to incrementally sync the index with changes in the source table at a latency of seconds. Both sync modes perform incremental updates: only data that has changed since the last sync is processed. A sketch of triggering a sync on demand follows.
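
For example, with a Triggered index you run a sync on demand after batch updates land in the source table. The endpoint and index names below are placeholders.

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

index = client.get_index(
    endpoint_name="vector-search-endpoint",  # placeholder endpoint name
    index_name="main.default.docs_index",    # placeholder index name
)

# With pipeline_type="TRIGGERED", the index updates only when you ask.
# Only rows changed since the last sync are processed.
index.sync()
```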