Map enriched output to fields in a search index in Azure AI Search

Diagram of the Indexer Stages with Output Field Mappings highlighted.

This article explains how to set up output field mappings, which define a data path between in-memory data generated during skillset processing and target fields in a search index. During indexer execution, skill-generated information exists in memory only. To persist this information in a search index, you need to tell the indexer where to send the data.

An output field mapping is defined in an indexer and has the following elements:

"outputFieldMappings": [
  {
    "sourceFieldName": "/document/path-to-a-node-in-an-enriched-document",
    "targetFieldName": "some-search-field-in-an-index",
    "mappingFunction": null
  }
],

In contrast with a fieldMappings definition that maps a path between verbatim source fields and index fields, an outputFieldMappings definition maps in-memory enrichments to fields in a search index.

Prerequisites

  • Indexer, index, data source, and skillset.

  • Index fields must be simple or top-level fields. You can't output to a complex type. However, if you have a complex type, you can use an output field definition to flatten parts of the complex type and send them to a collection in a search index.

When to use an output field mapping

Output field mappings are required if your indexer has an attached skillset that creates new information that you want in your index. Examples include:

  • Vectors from embedding skills
  • Optical character recognition (OCR) text from image skills
  • Locations, organizations, or people from entity recognition skills

Output field mappings can also be used to:

  • Create multiple copies of your generated content (one-to-many output field mappings).

  • Flatten a source document's complex type. For example, assume source documents have a complex type, such as a multipart address, and you want just the city. You can use an output field mapping to flatten the nested data structure and send the output to a string collection in your search index.

Output field mappings apply to search indexes only. If you're populating a knowledge store, use projections for data path configuration.

Define an output field mapping

Output field mappings are added to the outputFieldMappings array in an indexer definition, typically placed after the fieldMappings array. An output field mapping consists of three parts: sourceFieldName, targetFieldName, and mappingFunction.

You can use the REST API or an Azure SDK to define output field mappings.

Tip

Indexers created by the Import data wizard include output field mappings generated by the wizard. If you need examples, run the wizard over your data source to see the output field mappings in the indexer.
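To inspect the mappings that the wizard generated, one option is to fetch the indexer definition and read its outputFieldMappings array. The following is a minimal sketch, assuming the Get Indexer REST operation and an admin API key. The service name, indexer name, key, and the API_VERSION constant are placeholders, and the helper function names are hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumption: replace with the REST API version you target.
API_VERSION = "2024-07-01"


def build_get_indexer_request(service_name, indexer_name, api_key):
    """Build (but do not send) a GET request for an indexer definition."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexers/{indexer_name}?api-version={API_VERSION}"
    )
    return urllib.request.Request(url, headers={"api-key": api_key}, method="GET")


def extract_output_field_mappings(indexer_json):
    """Return the outputFieldMappings array from an indexer definition.

    Returns an empty list when the property is missing or null.
    """
    indexer = json.loads(indexer_json)
    return indexer.get("outputFieldMappings") or []
```

To run it against a real service, pass the request to urllib.request.urlopen and hand the response body to extract_output_field_mappings.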

  1. Use the Create Indexer or Create or Update Indexer REST API, or an equivalent method in an Azure SDK. Here's an example of an indexer definition.

    {
       "name": "myindexer",
       "description": null,
       "dataSourceName": "mydatasource",
       "targetIndexName": "myindex",
       "schedule": { },
       "parameters": { },
       "fieldMappings": [],
       "outputFieldMappings": [],
       "disabled": false,
       "encryptionKey": { }
     }
    
  2. Fill out the outputFieldMappings array to specify the mappings. Each mapping consists of the three properties described in the following table.

    "outputFieldMappings": [
      {
        "sourceFieldName": "/document/path-to-a-node-in-an-enriched-document",
        "targetFieldName": "some-search-field-in-an-index",
        "mappingFunction": null
      }
    ]
    
    | Property | Description |
    |---|---|
    | sourceFieldName | Required. Specifies a path to enriched content. An example might be /document/content. See Reference enrichments in an Azure AI Search skillset for path syntax and examples. |
    | targetFieldName | Optional. Specifies the search field that receives the enriched content. Target fields must be top-level simple fields or collections. It can't be a path to a subfield in a complex type. If you want to retrieve specific nodes in a complex structure, you can flatten individual nodes in memory, and then send the output to a string collection in your index. |
    | mappingFunction | Optional. Adds extra processing provided by mapping functions supported by indexers. For enrichment nodes, encoding and decoding are the most commonly used functions. |

  3. The targetFieldName is always the name of the field in the search index.

  4. The sourceFieldName is a path to a node in the enriched document. It's the output of a skill. The path always starts with /document, and if you're indexing from a blob, the second element of the path is /content. The third element is the value produced by the skill. For more information and examples, see Reference enrichments in an Azure AI Search skillset.

    This example adds entities and sentiment labels extracted from a blob's content property to fields in a search index.

    {
        "name": "myIndexer",
        "dataSourceName": "myDataSource",
        "targetIndexName": "myIndex",
        "skillsetName": "myFirstSkillSet",
        "fieldMappings": [],
        "outputFieldMappings": [
            {
                "sourceFieldName": "/document/content/organizations/*/description",
                "targetFieldName": "descriptions",
                "mappingFunction": {
                    "name": "base64Decode"
                }
            },
            {
                "sourceFieldName": "/document/content/organizations",
                "targetFieldName": "orgNames"
            },
            {
                "sourceFieldName": "/document/content/sentiment",
                "targetFieldName": "sentiment"
            }
        ]
    }
    
  5. Assign any mapping functions needed to transform the content of a field before it's stored in the index. For enrichment nodes, encoding and decoding are the most commonly used functions.
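The steps above can be sketched in code. The following is a minimal illustration, not an SDK call: it assembles the same indexer definition as a Python dictionary and applies two checks taken from the rules in this section (the source path starts with /document, and the target is a top-level field name rather than a path). The helper names make_output_field_mapping and make_indexer are hypothetical, and treating any "/" in a target name as a subfield path is an assumption of this sketch.

```python
import json


def make_output_field_mapping(source_path, target_field, mapping_function=None):
    """Build one outputFieldMappings entry after basic sanity checks."""
    if not source_path.startswith("/document"):
        raise ValueError(f"source path must start with /document: {source_path!r}")
    # Assumption: a '/' in the target name means a path into a complex type.
    if "/" in target_field:
        raise ValueError(f"target must be a top-level field: {target_field!r}")
    mapping = {"sourceFieldName": source_path, "targetFieldName": target_field}
    if mapping_function is not None:
        mapping["mappingFunction"] = {"name": mapping_function}
    return mapping


def make_indexer(name, data_source, index, skillset, mappings):
    """Assemble an indexer definition that carries the given output field mappings."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "fieldMappings": [],
        "outputFieldMappings": list(mappings),
    }


indexer = make_indexer(
    "myIndexer", "myDataSource", "myIndex", "myFirstSkillSet",
    [
        make_output_field_mapping(
            "/document/content/organizations/*/description", "descriptions", "base64Decode"
        ),
        make_output_field_mapping("/document/content/organizations", "orgNames"),
        make_output_field_mapping("/document/content/sentiment", "sentiment"),
    ],
)

# Serialize the definition, for example as the body of a Create or Update Indexer request.
print(json.dumps(indexer, indent=2))
```

Validating the paths before sending the request surfaces a missing leading slash locally instead of at indexer run time.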

One-to-many output field mapping

You can use an output field mapping to route a single source field to multiple fields in a search index. You might do this for comparison testing or if you want fields with different attributes.

Assume a skillset that generates embeddings for a vector field, and an index that has multiple vector fields that vary by algorithm and compression settings. Within the indexer, map the embedding skill's output to each of the multiple vector fields in a search index.

"outputFieldMappings": [
    { "sourceFieldName" : "/document/content/text_vector", "targetFieldName" : "vector_hnsw" }, 
    { "sourceFieldName" : "/document/content/text_vector", "targetFieldName" : "vector_eknn" },
    { "sourceFieldName" : "/document/content/text_vector", "targetFieldName" : "vector_narrow" }, 
    { "sourceFieldName" : "/document/content/text_vector", "targetFieldName" : "vector_no_stored" },
    { "sourceFieldName" : "/document/content/text_vector", "targetFieldName" : "vector_scalar" }       
  ]

The source field path points to the skill output. In this example, the output is text_vector, which is the targetName assigned in the outputs of the following skillset. In a skill output, targetName is an optional property. If you omit it, the node takes the output's name, embedding, and the path becomes /document/content/embedding.

{
  "name": "test-vector-size-ss",  
  "description": "Generate embeddings using AOAI",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "#1",
      "description": null,
      "context": "/document/content",
      "resourceUri": "https://my-demo-eastus.openai.azure.com",
      "apiKey": null,
      "deploymentId": "text-embedding-ada-002",
      "dimensions": 1536,
      "modelName": "text-embedding-ada-002",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "text_vector"
        }
      ],
      "authIdentity": null
    }
  ]
}
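The relationship between the skill definition and the one-to-many mappings can be sketched as follows. This is an illustration under stated assumptions: the path is taken to be the skill's context followed by the output's targetName, falling back to the output's name when no targetName is set. The helper names skill_output_path and fan_out are hypothetical.

```python
def skill_output_path(skill, output_name):
    """Return the enrichment path for one skill output.

    Assumes path = context + "/" + (targetName if set, else the output name).
    """
    context = skill.get("context") or "/document"
    for output in skill["outputs"]:
        if output["name"] == output_name:
            node = output.get("targetName") or output["name"]
            return f"{context.rstrip('/')}/{node}"
    raise KeyError(f"skill has no output named {output_name!r}")


def fan_out(source_path, target_fields):
    """Map a single source path to several target fields (one-to-many)."""
    return [
        {"sourceFieldName": source_path, "targetFieldName": target}
        for target in target_fields
    ]


# Trimmed copy of the embedding skill shown above.
embedding_skill = {
    "context": "/document/content",
    "outputs": [{"name": "embedding", "targetName": "text_vector"}],
}

source = skill_output_path(embedding_skill, "embedding")
mappings = fan_out(
    source,
    ["vector_hnsw", "vector_eknn", "vector_narrow", "vector_no_stored", "vector_scalar"],
)
```

Deriving the source path from the skill definition keeps the mappings in sync if the targetName is later renamed.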

Flatten complex structures into a string collection

If your source data is composed of nested or hierarchical JSON, you can't use field mappings to set up the data paths. Instead, your search index must mirror the source data structure at each level for a full import.

This section walks you through an import process that produces a one-to-one reflection of a complex document on both the source and target sides. Next, it uses the same source document to illustrate the retrieval and flattening of individual nodes into string collections.

Here's an example of a document in Azure Cosmos DB with nested JSON:

{
   "palette":"primary colors",
   "colors":[
      {
         "name":"blue",
         "medium":[
            "acrylic",
            "oil",
            "pastel"
         ]
      },
      {
         "name":"red",
         "medium":[
            "acrylic",
            "pastel",
            "watercolor"
         ]
      },
      {
         "name":"yellow",
         "medium":[
            "acrylic",
            "watercolor"
         ]
      }
   ]
}

If you wanted to fully index this source document, you'd create an index definition where the field names, levels, and types are reflected as a complex type. Because field mappings aren't supported for complex types in the search index, your index definition must mirror the source document.

{
  "name": "my-test-index",
  "defaultScoringProfile": "",
  "fields": [
    { "name": "id", "type": "Edm.String", "searchable": false, "retrievable": true, "key": true},
    { "name": "palette", "type": "Edm.String", "searchable": true, "retrievable": true },
    { "name": "colors", "type": "Collection(Edm.ComplexType)",
      "fields": [
        {
          "name": "name",
          "type": "Edm.String",
          "searchable": true,
          "retrievable": true
        },
        {
          "name": "medium",
          "type": "Collection(Edm.String)",
          "searchable": true,
          "retrievable": true,
        }
      ]
    }
  ]
}
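The mirroring requirement can be illustrated with a small sketch that derives field names, levels, and types from a sample document. It is a rough aid under stated assumptions, not part of the service or any SDK: it handles only strings, string arrays, and nested objects, and it doesn't generate the key field or attributes such as searchable and retrievable. The function names infer_field and infer_fields are hypothetical.

```python
def infer_field(name, value):
    """Infer one index field definition from a sample value (strings only)."""
    if isinstance(value, str):
        return {"name": name, "type": "Edm.String"}
    if isinstance(value, dict):
        return {"name": name, "type": "Edm.ComplexType", "fields": infer_fields(value)}
    if isinstance(value, list) and value:
        first = value[0]
        if isinstance(first, str):
            return {"name": name, "type": "Collection(Edm.String)"}
        if isinstance(first, dict):
            return {
                "name": name,
                "type": "Collection(Edm.ComplexType)",
                "fields": infer_fields(first),
            }
    raise TypeError(f"unsupported sample value for field {name!r}")


def infer_fields(sample):
    """Infer field definitions for every property of a sample object."""
    return [infer_field(name, value) for name, value in sample.items()]


# Abbreviated copy of the source document shown earlier.
sample_document = {
    "palette": "primary colors",
    "colors": [
        {"name": "blue", "medium": ["acrylic", "oil", "pastel"]},
    ],
}

fields = infer_fields(sample_document)
```

The result has the same shape as the palette and colors fields in the index definition above, minus the attributes.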

Here's a sample indexer definition that executes the import. Notice there are no field mappings and no skillset.

{
  "name": "my-test-indexer",
  "dataSourceName": "my-test-ds",
  "skillsetName": null,
  "targetIndexName": "my-test-index",

  "fieldMappings": [],
  "outputFieldMappings": []
}

The result is the following sample search document, similar to the original in Azure Cosmos DB.

{
  "value": [
    {
      "@search.score": 1,
      "id": "11bb11bb-cc22-dd33-ee44-55ff55ff55ff",
      "palette": "primary colors",
      "colors": [
        {
          "name": "blue",
          "medium": [
            "acrylic",
            "oil",
            "pastel"
          ]
        },
        {
          "name": "red",
          "medium": [
            "acrylic",
            "pastel",
            "watercolor"
          ]
        },
        {
          "name": "yellow",
          "medium": [
            "acrylic",
            "watercolor"
          ]
        }
      ]
    }
  ]
}

An alternative rendering in a search index is to flatten individual nodes in the source's nested structure into a string collection in a search index.

To accomplish this task, you need an outputFieldMappings definition that maps an in-memory node to a string collection in the index. Although output field mappings primarily apply to skill outputs, you can also use them to address nodes after document cracking, when the indexer opens a source document and reads it into memory.

The following sample index definition uses string collections to receive flattened output:

{
  "name": "my-new-flattened-index",
  "defaultScoringProfile": "",
  "fields": [
    { "name": "id", "type": "Edm.String", "searchable": false, "retrievable": true, "key": true },
    { "name": "palette", "type": "Edm.String", "searchable": true, "retrievable": true },
    { "name": "color_names", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true },
    { "name": "color_mediums", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true}
  ]
}

Here's the sample indexer definition, using outputFieldMappings to associate the nested JSON with the string collection fields. Notice that the source field uses the path syntax for enrichment nodes, even though there's no skillset. Enriched documents are created in the system during document cracking, which means you can access nodes in each document tree as long as those nodes exist when the document is cracked.

{
  "name": "my-test-indexer",
  "dataSourceName": "my-test-ds",
  "skillsetName": null,
  "targetIndexName": "my-new-flattened-index",
  "parameters": {  },
  "fieldMappings": [   ],
  "outputFieldMappings": [
    {
       "sourceFieldName": "/document/colors/*/name",
       "targetFieldName": "color_names"
    },
    {
       "sourceFieldName": "/document/colors/*/medium",
       "targetFieldName": "color_mediums"
    }
  ]
}

Results from the definition are as follows. Simplifying the structure loses context in this case. There are no longer any associations between a given color and the mediums it's available in. However, depending on your scenario, a result similar to the following example might be exactly what you need.

{
  "value": [
    {
      "@search.score": 1,
      "id": "11bb11bb-cc22-dd33-ee44-55ff55ff55ff",
      "palette": "primary colors",
      "color_names": [
        "blue",
        "red",
        "yellow"
      ],
      "color_mediums": [
        "[\"acrylic\",\"oil\",\"pastel\"]",
        "[\"acrylic\",\"pastel\",\"watercolor\"]",
        "[\"acrylic\",\"watercolor\"]"
      ]
    }
  ]
}
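To see why color_names contains plain strings while color_mediums contains serialized arrays, the wildcard paths can be simulated locally. The following sketch is an approximation written for this example, not the service's implementation: it walks the path, lets * enumerate array items, and serializes any non-string node to compact JSON before placing it in the string collection. The function names resolve and flatten_to_strings are hypothetical.

```python
import json


def resolve(node, segments):
    """Return every value reached by following the segments; '*' enumerates a list."""
    if not segments:
        return [node]
    head, rest = segments[0], segments[1:]
    if head == "*":
        results = []
        for item in node:
            results.extend(resolve(item, rest))
        return results
    return resolve(node[head], rest)


def flatten_to_strings(document, path):
    """Evaluate a /document/... path and coerce each result to a string."""
    segments = path.strip("/").split("/")
    if segments[0] != "document":
        raise ValueError(f"path must start with /document: {path!r}")
    values = resolve(document, segments[1:])
    # Assumption: non-string nodes are stored as their compact JSON text.
    return [v if isinstance(v, str) else json.dumps(v, separators=(",", ":")) for v in values]


source_document = {
    "palette": "primary colors",
    "colors": [
        {"name": "blue", "medium": ["acrylic", "oil", "pastel"]},
        {"name": "red", "medium": ["acrylic", "pastel", "watercolor"]},
        {"name": "yellow", "medium": ["acrylic", "watercolor"]},
    ],
}

color_names = flatten_to_strings(source_document, "/document/colors/*/name")
color_mediums = flatten_to_strings(source_document, "/document/colors/*/medium")
```

Each /document/colors/*/medium node is itself an array, so every entry in color_mediums is one serialized array per color rather than an individual medium.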