Map enriched output to fields in a search index in Azure AI Search

07/30/2024

This article explains how to set up output field mappings, which define a data path between in-memory data generated during skillset processing and target fields in a search index. During indexer execution, skill-generated information exists in memory only. To persist this information in a search index, you need to tell the indexer where to send the data.

An output field mapping is defined in an indexer and has the following elements:

"outputFieldMappings": [
  {
    "sourceFieldName": "/document/path-to-a-node-in-an-enriched-document",
    "targetFieldName": "some-search-field-in-an-index",
    "mappingFunction": null
  }
],

In contrast with a fieldMappings definition that maps a path between verbatim source fields and index fields, an outputFieldMappings definition maps in-memory enrichments to fields in a search index.

Prerequisites

Indexer, index, data source, and skillset.
Target fields in the index must be top-level simple fields or collections. You can't output to a complex type, but you can use an output field mapping to flatten parts of a complex type and send them to a collection in the search index.

When to use an output field mapping

Output field mappings are required if your indexer has an attached skillset that creates new information that you want in your index. Examples include:

Vectors from embedding skills
OCR text from image skills
Locations, organizations, or people from entity recognition skills

Output field mappings can also be used to:

Create multiple copies of your generated content (one-to-many output field mappings).
Flatten a source document's complex type. For example, assume source documents have a complex type, such as a multipart address, and you want just the city. You can use an output field mapping to select the nested node and send its value to a string collection in your search index.

Output field mappings apply to search indexes only. If you're populating a knowledge store, use projections for data path configuration.

Define an output field mapping

Output field mappings are added to the outputFieldMappings array in an indexer definition, typically placed after the fieldMappings array. An output field mapping consists of three parts.

You can use the REST API or an Azure SDK to define output field mappings.

Tip

Indexers created by the Import data wizard include output field mappings generated by the wizard. If you need examples, run the wizard over your data source to see the output field mappings in the indexer.

Use Create Indexer or Create or Update Indexer or an equivalent method in an Azure SDK. Here's an example of an indexer definition.

{
   "name": "myindexer",
   "description": null,
   "dataSourceName": "mydatasource",
   "targetIndexName": "myindex",
   "schedule": { },
   "parameters": { },
   "fieldMappings": [],
   "outputFieldMappings": [],
   "disabled": false,
   "encryptionKey": { }
 }

Fill out the outputFieldMappings array to specify the mappings. An output field mapping consists of three parts.

"outputFieldMappings": [
  {
    "sourceFieldName": "/document/path-to-a-node-in-an-enriched-document",
    "targetFieldName": "some-search-field-in-an-index",
    "mappingFunction": null
  }
]

Property	Description
sourceFieldName	Required. Specifies a path to enriched content. An example might be `/document/content`. See Reference enrichments in an Azure AI Search skillset for path syntax and examples.
targetFieldName	Optional. Specifies the search field that receives the enriched content. Target fields must be top-level simple fields or collections. It can't be a path to a subfield in a complex type. If you want to retrieve specific nodes in a complex structure, you can flatten individual nodes in memory, and then send the output to a string collection in your index.
mappingFunction	Optional. Adds extra processing provided by mapping functions supported by indexers. For enrichment nodes, encoding and decoding are the most commonly used functions.

The targetFieldName is always the name of the field in the search index.

The sourceFieldName is a path to a node in the enriched document. It's the output of a skill. The path always starts with /document, and if you're indexing from a blob, the second element of the path is /content. The third element is the value produced by the skill. For more information and examples, see Reference enrichments in an Azure AI Search skillset.
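The rules above for source paths and target fields can be checked locally before you submit an indexer definition. The following is a minimal sketch in Python, not part of any Azure SDK: the function name check_output_field_mappings and the sample index and indexer dictionaries are hypothetical, and the checks cover only the two rules stated in this article (the source path starts with /document, and the target is a top-level field that isn't a complex type).

```python
def check_output_field_mappings(indexer, index):
    """Return a list of problems found; an empty list means no issues were detected."""
    # Index the top-level fields by name. Subfields of complex types are
    # deliberately ignored because they can't be mapping targets.
    top_level = {field["name"]: field for field in index.get("fields", [])}
    problems = []
    for mapping in indexer.get("outputFieldMappings", []):
        source = mapping.get("sourceFieldName", "")
        target = mapping.get("targetFieldName")
        if not source.startswith("/document"):
            problems.append(f"{source!r}: source path should start with /document")
        if target is None:
            continue  # nothing further to check for this mapping
        field = top_level.get(target)
        if field is None:
            problems.append(f"{target!r}: not a top-level field in the index")
        elif "ComplexType" in field.get("type", ""):
            problems.append(f"{target!r}: target can't be a complex type")
    return problems


# Hypothetical index and indexer definitions, for illustration only.
index = {
    "name": "myindex",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "orgNames", "type": "Collection(Edm.String)"},
        {"name": "address", "type": "Edm.ComplexType",
         "fields": [{"name": "city", "type": "Edm.String"}]},
    ],
}
indexer = {
    "name": "myindexer",
    "outputFieldMappings": [
        {"sourceFieldName": "/document/content/organizations", "targetFieldName": "orgNames"},
        {"sourceFieldName": "document/content/sentiment", "targetFieldName": "sentiment"},
        {"sourceFieldName": "/document/address/city", "targetFieldName": "address"},
    ],
}

problems = check_output_field_mappings(indexer, index)
for problem in problems:
    print(problem)
```

In this sample, the first mapping passes, the second is flagged twice (missing leading slash, and a target field that isn't in the index), and the third is flagged because its target is a complex type.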

This example adds entities and sentiment labels extracted from a blob's content property to fields in a search index.

{
    "name": "myIndexer",
    "dataSourceName": "myDataSource",
    "targetIndexName": "myIndex",
    "skillsetName": "myFirstSkillSet",
    "fieldMappings": [],
    "outputFieldMappings": [
        {
            "sourceFieldName": "/document/content/organizations/*/description",
            "targetFieldName": "descriptions",
            "mappingFunction": {
                "name": "base64Decode"
            }
        },
        {
            "sourceFieldName": "/document/content/organizations",
            "targetFieldName": "orgNames"
        },
        {
            "sourceFieldName": "/document/content/sentiment",
            "targetFieldName": "sentiment"
        }
    ]
}

Assign any mapping functions needed to transform the content of a field before it's stored in the index. For enrichment nodes, encoding and decoding are the most commonly used functions.
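As a rough illustration of submitting a definition like the one above through Create or Update Indexer, the following Python sketch builds the PUT request with the standard library only. The service name, the admin key, and the API version are placeholders chosen for this example; substitute the values for your own search service. The request is constructed but not sent, so the sketch runs without a live service.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with your own service name and admin key.
SERVICE_NAME = "my-search-service"
API_KEY = "<admin-api-key>"
API_VERSION = "2024-07-01"  # assumption: use an API version your service supports


def build_put_indexer_request(indexer):
    """Build (but don't send) a Create or Update Indexer request."""
    url = (
        f"https://{SERVICE_NAME}.search.windows.net"
        f"/indexers/{indexer['name']}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(indexer).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": API_KEY},
        method="PUT",
    )


indexer = {
    "name": "myIndexer",
    "dataSourceName": "myDataSource",
    "targetIndexName": "myIndex",
    "skillsetName": "myFirstSkillSet",
    "fieldMappings": [],
    "outputFieldMappings": [
        {"sourceFieldName": "/document/content/organizations", "targetFieldName": "orgNames"},
        {"sourceFieldName": "/document/content/sentiment", "targetFieldName": "sentiment"},
    ],
}

request = build_put_indexer_request(indexer)
print(request.get_method(), request.full_url)

# To actually send the request, you need a real service and key:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```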

In the Azure SDK for .NET, use the FieldMapping class, which provides "SourceFieldName" and "TargetFieldName" properties and an optional "MappingFunction" reference.

Specify output field mappings when constructing the indexer, or later by directly setting SearchIndexer.OutputFieldMappings. The following C# example sets the output field mappings when constructing an indexer.

string indexerName = "cog-search-demo";
SearchIndexer indexer = new SearchIndexer(
    indexerName,
    dataSourceConnectionName,
    indexName)
{
    // Field mappings omitted for this example (assume default mappings)
    OutputFieldMappings =
    {
        new FieldMapping("/document/content/organizations") { TargetFieldName = "orgNames" },
        new FieldMapping("/document/content/sentiment") { TargetFieldName = "sentiment" }
    },
    SkillsetName = skillsetName
};

await indexerClient.CreateIndexerAsync(indexer);

One-to-many output field mapping

You can use an output field mapping to route a single source field to multiple fields in a search index. You might do this for comparison testing or if you want fields with different attributes.

Assume a skillset that generates embeddings for a vector field, and an index that has multiple vector fields that vary by algorithm and compression settings. Within the indexer, map the embedding skill's output to each of the multiple vector fields in a search index.

"outputFieldMappings": [
    { "sourceFieldName" : "/document/content/text_vector", "targetFieldName" : "vector_hnsw" }, 
    { "sourceFieldName" : "/document/content/text_vector", "targetFieldName" : "vector_eknn" },
    { "sourceFieldName" : "/document/content/text_vector", "targetFieldName" : "vector_narrow" }, 
    { "sourceFieldName" : "/document/content/text_vector", "targetFieldName" : "vector_no_stored" },
    { "sourceFieldName" : "/document/content/text_vector", "targetFieldName" : "vector_scalar" }       
  ]

The source field path is the skill output. In this example, the output is text_vector, which is the targetName assigned to the embedding skill's output in the skillset shown next. targetName is an optional property of a skill output. If you don't give the skill output a target name, the node is named embedding, and the path is /document/content/embedding.

{
  "name": "test-vector-size-ss",  
  "description": "Generate embeddings using AOAI",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "#1",
      "description": null,
      "context": "/document/content",
      "resourceUri": "https://my-demo-eastus.openai.azure.com",
      "apiKey": null,
      "deploymentId": "text-embedding-ada-002",
      "dimensions": 1536,
      "modelName": "text-embedding-ada-002",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "text_vector"
        }
      ],
      "authIdentity": null
    }
  ]
}
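When one source path feeds many target fields, the mappings differ only in targetFieldName, so they can be generated rather than typed by hand. The following Python sketch assumes the field names from the one-to-many example in this section; the helper name fan_out is hypothetical.

```python
import json


def fan_out(source_path, target_fields):
    """Create one output field mapping per target field, all sharing one source path."""
    return [
        {"sourceFieldName": source_path, "targetFieldName": target}
        for target in target_fields
    ]


mappings = fan_out(
    "/document/content/text_vector",
    ["vector_hnsw", "vector_eknn", "vector_narrow", "vector_no_stored", "vector_scalar"],
)

# Print a fragment that can be pasted into an indexer definition.
print(json.dumps({"outputFieldMappings": mappings}, indent=2))
```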

Flatten complex structures into a string collection

If your source data is composed of nested or hierarchical JSON, you can't use field mappings to set up the data paths. Instead, your search index must mirror the source data structure at each level for a full import.

This section walks you through an import process that produces a one-to-one reflection of a complex document on both the source and target sides. Next, it uses the same source document to illustrate the retrieval and flattening of individual nodes into string collections.

Here's an example of a document in Azure Cosmos DB with nested JSON:

{
   "palette":"primary colors",
   "colors":[
      {
         "name":"blue",
         "medium":[
            "acrylic",
            "oil",
            "pastel"
         ]
      },
      {
         "name":"red",
         "medium":[
            "acrylic",
            "pastel",
            "watercolor"
         ]
      },
      {
         "name":"yellow",
         "medium":[
            "acrylic",
            "watercolor"
         ]
      }
   ]
}

If you wanted to fully index the above source document, you'd create an index definition where the field names, levels, and types are reflected as a complex type. Because field mappings aren't supported for complex types in the search index, your index definition must mirror the source document.

{
  "name": "my-test-index",
  "defaultScoringProfile": "",
  "fields": [
    { "name": "id", "type": "Edm.String", "searchable": false, "retrievable": true, "key": true},
    { "name": "palette", "type": "Edm.String", "searchable": true, "retrievable": true },
    { "name": "colors", "type": "Collection(Edm.ComplexType)",
      "fields": [
        {
          "name": "name",
          "type": "Edm.String",
          "searchable": true,
          "retrievable": true
        },
        {
          "name": "medium",
          "type": "Collection(Edm.String)",
          "searchable": true,
          "retrievable": true,
        }
      ]
    }
  ]
}

Here's a sample indexer definition that executes the import (notice there are no field mappings and no skillset).

{
  "name": "my-test-indexer",
  "dataSourceName": "my-test-ds",
  "skillsetName": null,
  "targetIndexName": "my-test-index",

  "fieldMappings": [],
  "outputFieldMappings": []
}

The result is the following sample search document, similar to the original in Azure Cosmos DB.

{
  "value": [
    {
      "@search.score": 1,
      "id": "240a98f5-90c9-406b-a8c8-f50ff86f116c",
      "palette": "primary colors",
      "colors": [
        {
          "name": "blue",
          "medium": [
            "acrylic",
            "oil",
            "pastel"
          ]
        },
        {
          "name": "red",
          "medium": [
            "acrylic",
            "pastel",
            "watercolor"
          ]
        },
        {
          "name": "yellow",
          "medium": [
            "acrylic",
            "watercolor"
          ]
        }
      ]
    }
  ]
}

An alternative rendering in a search index is to flatten individual nodes in the source's nested structure into a string collection in a search index.

To accomplish this task, you need an outputFieldMappings definition that maps an in-memory node to a string collection in the index. Although output field mappings primarily apply to skill outputs, you can also use them to address nodes after "document cracking", where the indexer opens a source document and reads it into memory.

Below is a sample index definition, using string collections to receive flattened output:

{
  "name": "my-new-flattened-index",
  "defaultScoringProfile": "",
  "fields": [
    { "name": "id", "type": "Edm.String", "searchable": false, "retrievable": true, "key": true },
    { "name": "palette", "type": "Edm.String", "searchable": true, "retrievable": true },
    { "name": "color_names", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true },
    { "name": "color_mediums", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true}
  ]
}

Here's the sample indexer definition, using outputFieldMappings to associate the nested JSON with the string collection fields. Notice that the source field uses the path syntax for enrichment nodes, even though there's no skillset. Enriched documents are created in the system during document cracking, which means you can access nodes in each document tree as long as those nodes exist when the document is cracked.

{
  "name": "my-test-indexer",
  "dataSourceName": "my-test-ds",
  "skillsetName": null,
  "targetIndexName": "my-new-flattened-index",
  "parameters": {  },
  "fieldMappings": [   ],
  "outputFieldMappings": [
    {
       "sourceFieldName": "/document/colors/*/name",
       "targetFieldName": "color_names"
    },
    {
       "sourceFieldName": "/document/colors/*/medium",
       "targetFieldName": "color_mediums"
    }
  ]
}

Results from the above definition are as follows. Simplifying the structure loses context in this case. There are no longer any associations between a given color and the mediums it's available in. However, depending on your scenario, a result similar to the one shown below might be exactly what you need.

{
  "value": [
    {
      "@search.score": 1,
      "id": "240a98f5-90c9-406b-a8c8-f50ff86f116c",
      "palette": "primary colors",
      "color_names": [
        "blue",
        "red",
        "yellow"
      ],
      "color_mediums": [
        "[\"acrylic\",\"oil\",\"pastel\"]",
        "[\"acrylic\",\"pastel\",\"watercolor\"]",
        "[\"acrylic\",\"watercolor\"]"
      ]
    }
  ]
}
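To see why the flattened result looks the way it does, it can help to model the path selection locally. The following Python sketch is a simplified model, not the service's implementation: the functions resolve and flatten_to_strings are hypothetical, and the rule that nested arrays are serialized to JSON text is inferred from the sample result above rather than from a documented specification.

```python
import json


def resolve(node, segments):
    """Collect the values selected by the path segments; '*' fans out over a list."""
    if not segments:
        return [node]
    head, rest = segments[0], segments[1:]
    if head == "*":
        results = []
        for item in node:
            results.extend(resolve(item, rest))
        return results
    return resolve(node[head], rest)


def flatten_to_strings(document, path):
    """Model an output field mapping whose target is a Collection(Edm.String) field."""
    # "/document/colors/*/name" -> ["colors", "*", "name"]
    segments = path.strip("/").split("/")[1:]
    values = resolve(document, segments)
    # Strings pass through unchanged. Non-string values (here, nested arrays)
    # are serialized to compact JSON text, as in the sample result above.
    return [
        value if isinstance(value, str) else json.dumps(value, separators=(",", ":"))
        for value in values
    ]


source_document = {
    "palette": "primary colors",
    "colors": [
        {"name": "blue", "medium": ["acrylic", "oil", "pastel"]},
        {"name": "red", "medium": ["acrylic", "pastel", "watercolor"]},
        {"name": "yellow", "medium": ["acrylic", "watercolor"]},
    ],
}

color_names = flatten_to_strings(source_document, "/document/colors/*/name")
color_mediums = flatten_to_strings(source_document, "/document/colors/*/medium")
print(color_names)
print(color_mediums)
```

Each color contributes one element to each collection, which is why the link between a color name and its mediums survives only through matching positions in the two arrays.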
