Index projections in Azure AI Search

Important

Index projections are in public preview under supplemental terms of use. They're available through the Azure portal, the 2023-10-01-Preview REST APIs, and beta client libraries that have been updated to include the feature.

Index projections are a component of a skillset definition that defines the shape of a secondary index, supporting a one-to-many index pattern, where content from an enrichment pipeline can target multiple indexes.

Index projections take AI-enriched content generated by an enrichment pipeline and index it into a secondary index (different from the one that an indexer targets by default) on your search service. Index projections also let you reshape the data before indexing it, separating an array of enriched items into multiple search documents in the target index, otherwise known as "one-to-many" indexing. "One-to-many" indexing is useful for data chunking scenarios, where you might want a primary index for unchunked content and a secondary index for chunked content.

If you've used cognitive skills in the past, you already know that skillsets create enriched content. Skillsets move a document through a sequence of enrichments that invoke atomic transformations, such as recognizing entities or translating text. By default, one document processed within a skillset maps to a single document in the search index. This means that if you perform chunking of an input text and then perform enrichments on each chunk, the result in the index when mapped via outputFieldMappings is an array of the generated enrichments. With index projections, you define a context at which to map each chunk of enriched data to its own search document. This allows you to apply a one-to-many mapping of a document's enriched data to the search index.
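
For example, with the default one-to-one behavior, the array of pages produced by the Split skill could be mapped into a single collection field on the parent search document through an output field mapping in the indexer definition. The following sketch assumes a hypothetical Collection(Edm.String) field named chunks in the index:

"outputFieldMappings": [
    {
        "sourceFieldName": "/document/pages/*",
        "targetFieldName": "chunks"
    }
]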

Index projections definition

Index projections are defined inside a skillset definition, primarily as an array of selectors, where each selector corresponds to a different target index on the search service. Each selector requires the following parameters as part of its definition:

  • targetIndexName: The name of the index on the search service that the index projection data is indexed into.
  • parentKeyFieldName: The name of the field in the target index that contains the value of the key for the parent document.
  • sourceContext: The enrichment annotation that defines the granularity at which to map data into individual search documents. For more information, see Skill context and input annotation language.
  • mappings: An array of mappings of enriched data to fields in the search index. Each mapping consists of:
    • name: The name of the field in the search index that the data should be indexed into.
    • source: The enrichment annotation path that the data should be pulled from.

Each mapping can also recursively define data with an optional sourceContext and inputs field, similar to the knowledge store or Shaper Skill. These parameters allow you to shape data to be indexed into fields of type Edm.ComplexType in the search index.
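
For example, a mapping that shapes enriched values into an Edm.ComplexType field might look similar to the following sketch. The field names (metadata, title, keyPhrases) and annotation paths are illustrative; they assume matching fields in the target index and skills that produce those outputs.

"mappings": [
    {
        "name": "chunk",
        "source": "/document/pages/*"
    },
    {
        "name": "metadata",
        "sourceContext": "/document/pages/*",
        "inputs": [
            {
                "name": "title",
                "source": "/document/title"
            },
            {
                "name": "keyPhrases",
                "source": "/document/pages/*/keyPhrases/*"
            }
        ]
    }
]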

The index defined in the targetIndexName parameter has the following requirements:

  • Must already have been created on the search service before the skillset containing the index projections definition is created.
  • Must contain a field with the name defined in the parentKeyFieldName parameter. This field must be of type Edm.String, can't be the key field, and must have filterable set to true.
  • The key field must have searchable set to true and be defined with the keyword analyzer.
  • Must have fields defined for each of the names defined in mappings, none of which can be the key field.

Here's an example payload for an index projections definition that you might use to project individual pages output by the Split skill as their own documents in the search index.

"indexProjections": {
    "selectors": [
        {
            "targetIndexName": "myTargetIndex",
            "parentKeyFieldName": "ParentKey",
            "sourceContext": "/document/pages/*",
            "mappings": [
                {
                    "name": "chunk",
                    "source": "/document/pages/*"
                }
            ]
        }
    ]
}
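
For reference, a target index schema that satisfies the requirements listed earlier and matches this projection definition might look similar to the following sketch. The key field name (id) and the attribute choices shown here are illustrative:

{
    "name": "myTargetIndex",
    "fields": [
        {
            "name": "id",
            "type": "Edm.String",
            "key": true,
            "searchable": true,
            "analyzer": "keyword"
        },
        {
            "name": "ParentKey",
            "type": "Edm.String",
            "filterable": true
        },
        {
            "name": "chunk",
            "type": "Edm.String",
            "searchable": true
        }
    ]
}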

Handling parent documents

Because index projections effectively generate "child" documents for each "parent" document that runs through a skillset, you also have the following choices as to how to handle the indexing of the "parent" documents.

  • To keep parent and child documents in separate indexes, ensure that the targetIndexName for your indexer definition is different from the targetIndexName defined in your index projection selector (see the indexer sketch after this list).

  • To index parent and child documents into the same index, make sure that the schema for the target index works with both the fieldMappings and outputFieldMappings in your indexer definition and the mappings in your index projection selector. Then provide the same targetIndexName for your indexer definition and your index projection selector.

  • To ignore parent documents and only index child documents, you still need to provide a targetIndexName in your indexer definition (you can just provide the same one that you do for the index projection selector). Then define a separate parameters object next to your selectors definition with a projectionMode key set to skipIndexingParentDocuments, as shown here:

    "indexProjections": {
        "selectors": [
            ...
        ],
        "parameters": {
            "projectionMode": "skipIndexingParentDocuments"
        }
    }
    

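For the first option (separate indexes), the relevant part of the indexer definition might look similar to the following sketch, where myParentIndex is a hypothetical index that receives the parent documents while the selector's myTargetIndex receives the child documents:

{
    "name": "my-indexer",
    "dataSourceName": "my-datasource",
    "skillsetName": "my-skillset",
    "targetIndexName": "myParentIndex"
}
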
REST API version 2023-10-01-Preview can be used to create index projections through additions to a skillset.
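
For example, a skillset that includes index projections might be created or updated with a request similar to the following sketch, where the service name, skillset name, and skill contents are placeholders:

PUT https://[service-name].search.windows.net/skillsets/my-skillset?api-version=2023-10-01-Preview
Content-Type: application/json
api-key: [admin-key]

{
    "name": "my-skillset",
    "skills": [ ... ],
    "indexProjections": {
        "selectors": [ ... ]
    }
}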

Content lifecycle

If the indexer data source supports change tracking and deletion detection, the indexing process can synchronize the primary and secondary indexes to pick up those changes.

Each time you run the indexer and skillset, the index projections are updated if the skillset or underlying source data has changed. Any changes picked up by the indexer are propagated through the enrichment process to the projections in the index, ensuring that your projected data is a current representation of content in the originating data source.
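
For example, for a data source that supports it (such as Azure SQL), change tracking might be configured with a high water mark policy on the data source definition, similar to this sketch; the column name is a placeholder:

"dataChangeDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
    "highWaterMarkColumnName": "LastUpdated"
}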

Note

While you can manually edit the data in the projected documents using the index push API, any edits will be overwritten on the next pipeline invocation, assuming the document in source data is updated.

Projected key value

Each index projection document contains a unique identifying key that the indexer generates in order to ensure uniqueness and allow for change and deletion tracking to work correctly. This key contains the following segments:

  • A random hash to guarantee uniqueness. This hash changes if the parent document is updated across indexer runs.
  • The parent document's key.
  • The enrichment annotation path that identifies the context from which that document was generated.

For example, if you split a parent document with key value "123" into four pages, and then each of those pages is projected as its own document via index projections, the key for the third page of text would look something like "01f07abfe7ed_123_pages_2". If the parent document is then updated to add a fifth page, the new key for the third page might, for example, be "9d800bdacc0e_123_pages_2", since the random hash value changes between indexer runs even though the rest of the projection data didn't change.
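
The parent document's key is also stored in the field named by parentKeyFieldName, which must be filterable, so one way to retrieve every child document projected from a given parent is to filter on that field. A sketch of such a query, using the names from the earlier examples, might look like this:

POST https://[service-name].search.windows.net/indexes/myTargetIndex/docs/search?api-version=2023-10-01-Preview
Content-Type: application/json
api-key: [query-key]

{
    "search": "*",
    "filter": "ParentKey eq '123'"
}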

Changes or additions

If a parent document is changed such that the data within a projected index document changes (an example would be if a word was changed in a particular page but no net new pages were added), the data in the target index for that particular projection is updated to reflect that change.

If a parent document is changed such that there are new projected child documents that weren't there before (an example would be if one or more pages worth of text were added to the document), those new child documents are added next time the indexer runs.

In both of these cases, all projected documents are updated to have a new hash value in their key, regardless of whether their particular content was updated.

Deletions

If a parent document is changed such that a child document generated by index projections no longer exists (for example, if the text is shortened so there are fewer chunks than before), the corresponding child document in the search index is deleted. The remaining child documents also have their keys updated to include a new hash value, even if their content didn't otherwise change.

If a parent document is completely deleted from the datasource, the corresponding child documents only get deleted if the deletion is detected by a dataDeletionDetectionPolicy defined on the datasource definition. If you don't have a dataDeletionDetectionPolicy configured and need to delete a parent document from the datasource, then you should manually delete the child documents if they're no longer wanted.
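
For example, for a data source that uses a soft-delete column, deletion detection might be configured on the data source definition with something similar to the following sketch, where the column name and marker value are placeholders:

"dataDeletionDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
    "softDeleteColumnName": "IsDeleted",
    "softDeleteMarkerValue": "true"
}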