Integrated data chunking and embedding in Azure AI Search

Important

Integrated data chunking and vectorization is in public preview under Supplemental Terms of Use. The 2023-10-01-Preview REST API provides this feature.

Integrated vectorization adds data chunking and text-to-vector conversions during indexing and at query time.

For data chunking and text-to-vector conversions during indexing, you need:

  • An indexer to retrieve data from a supported data source.
  • A skillset to call the Text Split skill to chunk the data.
  • An embedding model, called from the same skillset. The model is accessed through either the AzureOpenAIEmbedding skill, attached to text-embedding-ada-002 on Azure OpenAI, or a custom skill that points to another embedding model, such as any supported embedding model on OpenAI.
  • A vector index to receive the chunked and vectorized content.
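
As a rough illustration of the skillset described above, the following Python sketch assembles a request body that chains a Text Split skill and an AzureOpenAIEmbedding skill. The property names are recalled from the 2023-10-01-Preview REST API and should be checked against the reference. The resource URI, the skillset name, the chunk sizes, and the `pages` and `vector` target names are hypothetical. Authentication settings for the embedding skill are omitted.

```python
import json

# Hypothetical values; replace them with your own resource names.
AOAI_ENDPOINT = "https://my-openai-resource.openai.azure.com"
AOAI_DEPLOYMENT = "text-embedding-ada-002"


def build_skillset(name: str) -> dict:
    """Return a skillset body that chunks text, then embeds each chunk."""
    split_skill = {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "context": "/document",
        "textSplitMode": "pages",
        "maximumPageLength": 2000,  # illustrative chunk size, in characters
        "pageOverlapLength": 500,   # illustrative overlap between chunks
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "textItems", "targetName": "pages"}],
    }
    embedding_skill = {
        "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
        # Runs once per chunk produced by the split skill.
        "context": "/document/pages/*",
        "resourceUri": AOAI_ENDPOINT,
        "deploymentId": AOAI_DEPLOYMENT,
        "inputs": [{"name": "text", "source": "/document/pages/*"}],
        "outputs": [{"name": "embedding", "targetName": "vector"}],
    }
    return {"name": name, "skills": [split_skill, embedding_skill]}


skillset = build_skillset("chunk-and-embed-skillset")
print(json.dumps(skillset, indent=2))
```

The embedding skill's context is the collection of chunks, so each chunk gets its own vector.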

For text-to-vector queries, you need:

  • A vectorizer defined in the index schema, assigned to a vector field, and used automatically at query time to convert a text query to a vector.
  • A query that specifies one or more vector fields.
  • A text string that's converted to a vector at query time.
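
A query of this kind might look like the following sketch, which builds the request body in Python without sending it. The `vectorQueries` shape with `kind` set to `text` is recalled from the 2023-10-01-Preview REST API and should be verified against the reference. The field name `contentVector` and the query string are hypothetical.

```python
import json


def build_text_vector_query(text: str, vector_fields: list, k: int = 5) -> dict:
    """Return a search request body that asks the service to vectorize `text`."""
    if not vector_fields:
        raise ValueError("at least one vector field is required")
    return {
        "vectorQueries": [
            {
                "kind": "text",                      # raw text, not a precomputed vector
                "text": text,                        # converted to a vector at query time
                "fields": ",".join(vector_fields),   # comma-separated vector field names
                "k": k,                              # number of nearest neighbors to return
            }
        ]
    }


query = build_text_vector_query(
    "how do indexers detect changes", ["contentVector"], k=3
)
print(json.dumps(query, indent=2))
```

The body contains no vector; the vectorizer assigned to the field in the index schema is expected to produce it.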

Vector conversions are one-way: text-to-vector. There's no vector-to-text conversion for queries or results (for example, you can't convert a vector result to a human-readable string).

Component diagram

The following diagram shows the components of integrated vectorization.

[Diagram: components in an integrated vectorization workflow.]

Here's a checklist of the components responsible for integrated vectorization:

  • A supported data source for indexer-based indexing.
  • An index that specifies vector fields, and a vectorizer definition assigned to vector fields.
  • A skillset providing a Text Split skill for data chunking, and a skill for vectorization (either the AzureOpenAIEmbedding skill or a custom skill pointing to an external embedding model).
  • Optionally, index projections (also defined in a skillset) to push chunked data to a secondary index.
  • An embedding model, deployed on Azure OpenAI or available through an HTTP endpoint.
  • An indexer for driving the process end-to-end. An indexer also specifies a schedule, field mappings, and properties for change detection.

This checklist focuses on integrated vectorization, but your solution isn't limited to this list. You can add more skills for AI enrichment, create a knowledge store, add semantic ranking and relevance tuning, and use other query features.

Availability and pricing

Integrated vectorization is available in all regions and tiers. However, if you're using Azure OpenAI and the AzureOpenAIEmbedding skill, check regional availability of that service.

If you're using a custom skill and an Azure hosting mechanism (such as an Azure function app, Azure Web App, or Azure Kubernetes Service), check the product by region page for feature availability.

Data chunking (Text Split skill) is free and available on all Azure AI services in all regions.

Note

Some older search services created before January 1, 2019 are deployed on infrastructure that doesn't support vector workloads. If you try to add a vector field to a schema and get an error, the service is running on that older infrastructure. In this situation, you must create a new search service to try out the vector feature.

What scenarios can integrated vectorization support?

  • Subdivide large documents into chunks, useful for vector and non-vector scenarios. For vectors, chunks help you meet the input constraints of embedding models. For non-vector scenarios, you might have a chat-style search app where GPT is assembling responses from indexed chunks. You can use vectorized or non-vectorized chunks for chat-style search.

  • Build a vector store where all of the fields are vector fields, and the document ID (required for a search index) is the only string field. Query the vector store to retrieve document IDs, and then send the document's vector fields to another model.

  • Combine vector and text fields for hybrid search, with or without semantic ranking. Integrated vectorization simplifies all of the scenarios supported by vector search.
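
To make the first scenario above concrete, here is a minimal Python sketch of fixed-size chunking with overlap. It's a conceptual illustration only and is not the algorithm used by the Text Split skill. The function name and parameters are hypothetical.

```python
def chunk_text(text: str, max_len: int, overlap: int) -> list:
    """Split text into chunks of at most max_len characters.

    Consecutive chunks share `overlap` characters so that content cut at a
    boundary still appears whole in at least one chunk.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")

    step = max_len - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_len])
        # Stop once a chunk reaches the end of the text.
        if start + max_len >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", max_len=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk stays within the size limit, which is what lets long documents fit the input constraints of an embedding model.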

When to use integrated vectorization

We recommend using the built-in vectorization support of Azure AI Studio. If this approach doesn't meet your needs, you can create indexers and skillsets that invoke integrated vectorization using the programmatic interfaces of Azure AI Search.

How to use integrated vectorization

For query-only vectorization:

  1. Add a vectorizer to an index. It should use the same embedding model that generated the vectors in the index.
  2. Assign the vectorizer to a vector profile, and then assign that vector profile to the vector field.
  3. Formulate a vector query that specifies the text string to vectorize.
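
The first two steps above could be sketched as the following index definition, built as a Python dictionary. The `vectorizers` and `profiles` property names are recalled from the 2023-10-01-Preview REST API and should be checked against the reference. All names (`my-vectorizer`, `my-profile`, `my-hnsw`, `contentVector`) and the resource URI are hypothetical. The 1536 dimensions match text-embedding-ada-002.

```python
import json


def build_index(name: str, dimensions: int = 1536) -> dict:
    """Return an index body whose vector field is wired to a vectorizer."""
    return {
        "name": name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
            {
                "name": "contentVector",
                "type": "Collection(Edm.Single)",
                "searchable": True,
                "dimensions": dimensions,
                # Step 2: the field points at a profile.
                "vectorSearchProfile": "my-profile",
            },
        ],
        "vectorSearch": {
            "algorithms": [{"name": "my-hnsw", "kind": "hnsw"}],
            # Step 1: the vectorizer names the embedding model deployment.
            "vectorizers": [
                {
                    "name": "my-vectorizer",
                    "kind": "azureOpenAI",
                    "azureOpenAIParameters": {
                        "resourceUri": "https://my-openai-resource.openai.azure.com",
                        "deploymentId": "text-embedding-ada-002",
                    },
                }
            ],
            # Step 2: the profile ties an algorithm to the vectorizer.
            "profiles": [
                {
                    "name": "my-profile",
                    "algorithm": "my-hnsw",
                    "vectorizer": "my-vectorizer",
                }
            ],
        },
    }


index = build_index("my-vector-index")
print(json.dumps(index, indent=2))
```

The chain runs from field to profile to vectorizer, so a text query against `contentVector` can be resolved to a specific embedding deployment.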

For the more common scenario, data chunking and vectorization during indexing:

  1. Create a data source connection to a supported data source for indexer-based indexing.
  2. Create a skillset that calls the Text Split skill for chunking, and the AzureOpenAIEmbedding skill or a custom skill to vectorize the chunks.
  3. Create an index that specifies a vectorizer for query time, and assign it to vector fields.
  4. Create an indexer to drive everything, from data retrieval, to skillset execution, through indexing.
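
The order of the four steps above can be sketched as a plan of REST calls. This Python sketch only builds the method and URL for each call and the indexer body that links the other three objects together; it sends nothing. The service URL and all object names are hypothetical, and the request bodies for the data source, skillset, and index are left out.

```python
API_VERSION = "2023-10-01-Preview"
SERVICE = "https://my-search-service.search.windows.net"  # hypothetical service


def plan_requests(data_source: str, skillset: str, index: str, indexer: str):
    """Return the ordered create-or-update calls and the indexer body."""

    def url(collection: str, name: str) -> str:
        return f"{SERVICE}/{collection}/{name}?api-version={API_VERSION}"

    # The indexer references the other three objects by name,
    # which is why it's created last.
    indexer_body = {
        "name": indexer,
        "dataSourceName": data_source,
        "skillsetName": skillset,
        "targetIndexName": index,
    }
    calls = [
        ("PUT", url("datasources", data_source)),
        ("PUT", url("skillsets", skillset)),
        ("PUT", url("indexes", index)),
        ("PUT", url("indexers", indexer)),
    ]
    return calls, indexer_body


calls, indexer_body = plan_requests("my-ds", "my-skillset", "my-index", "my-indexer")
for method, target in calls:
    print(method, target)
```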

Optionally, create secondary indexes for advanced scenarios where chunked content is in one index, and non-chunked in another index. Chunked indexes (or secondary indexes) are useful for RAG apps.

Tip

Try the new Import and vectorize data wizard in the Azure portal to explore integrated vectorization before writing any code.

Or, configure a Jupyter notebook to run the same workflow, cell by cell, to see how each step works.

Limitations

Make sure you know the Azure OpenAI quotas and limits for embedding models. Azure AI Search has retry policies, but if the quota is exhausted, retries fail.

Azure OpenAI token-per-minute limits are per model, per subscription. Keep this in mind if you're using one embedding model for both query and indexing workloads, because the two workloads draw on the same quota. If possible, deploy a separate embedding model for each workload, and try to deploy them in different subscriptions.
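
For a rough sense of how a token-per-minute limit constrains indexing, the following Python sketch computes a lower bound on embedding time. The chunk count, average tokens per chunk, and quota are made-up numbers for illustration, not published limits. The function name is hypothetical.

```python
import math


def min_minutes_to_embed(num_chunks: int, avg_tokens_per_chunk: int,
                         tokens_per_minute: int) -> int:
    """Lower bound on minutes needed to embed all chunks under a TPM limit.

    Ignores query traffic, retries, and request-level limits, all of which
    can only make the real duration longer.
    """
    if tokens_per_minute <= 0:
        raise ValueError("tokens_per_minute must be positive")
    total_tokens = num_chunks * avg_tokens_per_chunk
    return math.ceil(total_tokens / tokens_per_minute)


# Illustrative numbers only: 10,000 chunks of about 500 tokens each,
# against a hypothetical quota of 240,000 tokens per minute.
minutes = min_minutes_to_embed(10_000, 500, 240_000)
print(minutes)  # 21
```

If queries share the same deployment, the quota available to indexing is smaller, and the estimate grows accordingly.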

On Azure AI Search, remember there are service limits by tier and workloads.

Finally, the following features aren't currently supported:

Benefits of integrated vectorization

Here are some of the key benefits of integrated vectorization:

  • No separate data chunking and vectorization pipeline. Code is simpler to write and maintain.

  • Automate indexing end-to-end. When data changes in the source (such as in Azure Storage, Azure SQL, or Cosmos DB), the indexer can move those updates through the entire pipeline, from retrieval, to document cracking, through optional AI-enrichment, data chunking, vectorization, and indexing.

  • Project chunked content to secondary indexes. Secondary indexes are created as you would any search index (a schema with fields and other constructs), but they're populated in tandem with a primary index by an indexer. Content from each source document flows to fields in primary and secondary indexes during the same indexing run.

    Secondary indexes are intended for question-and-answer or chat-style apps. The secondary index contains granular information for more specific matches, but the parent index has more information and can often produce a more complete answer. When a match is found in the secondary index, the query returns the parent document from the primary index. For example, assuming a large PDF as a source document, the primary index might have basic information (title, date, author, description), while a secondary index has chunks of searchable content.
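
As a hedged illustration of projecting chunks to a secondary index, the following Python sketch builds an `indexProjections` section for a skillset. The property names are recalled from the 2023-10-01-Preview REST API and should be verified against the reference. The index name, field names, and enrichment paths are hypothetical, and assume a skillset that writes chunks to `/document/pages/*` with a `vector` output on each chunk.

```python
import json


def build_index_projections(target_index: str, parent_key_field: str) -> dict:
    """Return an indexProjections section mapping each chunk to one document."""
    return {
        "selectors": [
            {
                "targetIndexName": target_index,
                # Field in the secondary index that stores the parent document key.
                "parentKeyFieldName": parent_key_field,
                # One secondary-index document per chunk.
                "sourceContext": "/document/pages/*",
                "mappings": [
                    {"name": "chunk", "source": "/document/pages/*"},
                    {"name": "vector", "source": "/document/pages/*/vector"},
                ],
            }
        ]
    }


projections = build_index_projections("chunks-index", "parent_id")
print(json.dumps(projections, indent=2))
```

The parent key field is what lets a match on a chunk be traced back to its source document in the primary index.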

Next steps