Azure AI Vision multimodal embeddings skill

Important

This feature is in public preview under Supplemental Terms of Use. The 2024-05-01-Preview REST API supports this feature.

The Azure AI Vision multimodal embeddings skill uses Azure AI Vision's multimodal embeddings API to generate embeddings for image or text input.

This skill is supported only in search services located in a region that supports the Azure AI Vision multimodal embeddings API. Currently, those regions are East US, France Central, Korea Central, North Europe, Southeast Asia, West Europe, and West US. Your data is processed in the Geo where your model is deployed.

Note

This skill is bound to Azure AI services and requires a billable resource for transactions that exceed 20 documents per indexer per day. Execution of built-in skills is charged at the existing Azure AI services pay-as-you-go price.

In addition, image extraction is billable by Azure AI Search.

@odata.type

Microsoft.Skills.Vision.VectorizeSkill

Data limits

The input limits for the skill can be found in the Azure AI Vision documentation for images and text respectively. Consider using the Text Split skill if you need data chunking for text inputs.
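As a rough sketch of the chunking suggestion above, the following Python builds two skill definitions as plain dictionaries: a Text Split skill that breaks the content into pages, followed by a vectorize skill that runs once per page. The `MAX_PAGE_LENGTH` value and the `pages` target name are illustrative assumptions, not recommended settings; check the Azure AI Vision limits for the actual text input size.

```python
import json

# Illustrative chunk size in characters; an assumption, not a documented limit.
MAX_PAGE_LENGTH = 500

split_skill = {
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "context": "/document",
    "textSplitMode": "pages",
    "maximumPageLength": MAX_PAGE_LENGTH,
    "inputs": [{"name": "text", "source": "/document/content"}],
    # The chunks are exposed as /document/pages in the enriched document.
    "outputs": [{"name": "textItems", "targetName": "pages"}],
}

vectorize_skill = {
    "@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill",
    # Run once per chunk rather than once per document.
    "context": "/document/pages/*",
    "modelVersion": "2023-04-15",
    "inputs": [{"name": "text", "source": "/document/pages/*"}],
    "outputs": [{"name": "vector"}],
}

# Order matters for readability only; the split skill feeds the vectorize skill.
skills = [split_skill, vectorize_skill]
print(json.dumps(skills, indent=2))
```

With this arrangement, each chunk would get its own embedding at `/document/pages/*/vector`.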

Skill parameters

Parameters are case-sensitive.

| Parameter | Description |
|---|---|
| modelVersion | (Required) The model version to pass to the Azure AI Vision multimodal embeddings API for generating embeddings. All embeddings stored in a given index field must be generated using the same modelVersion. |
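One way to guard against mixed model versions is to check the skill definitions before you create or update a skillset. The following is a minimal sketch, assuming the skills are available as Python dictionaries. The helper names are hypothetical, and the check is stricter than the requirement: it compares every vectorize skill in the list rather than only those that feed the same index field.

```python
VECTORIZE_TYPE = "#Microsoft.Skills.Vision.VectorizeSkill"


def model_versions(skills):
    """Collect the distinct modelVersion values used by vectorize skills."""
    return {
        skill.get("modelVersion")
        for skill in skills
        if skill.get("@odata.type") == VECTORIZE_TYPE
    }


def check_single_model_version(skills):
    """Raise ValueError if the vectorize skills disagree on modelVersion."""
    versions = model_versions(skills)
    if len(versions) > 1:
        raise ValueError(f"Mixed modelVersion values: {sorted(versions, key=str)}")
    return versions


skills = [
    {"@odata.type": VECTORIZE_TYPE, "modelVersion": "2023-04-15"},
    {"@odata.type": VECTORIZE_TYPE, "modelVersion": "2023-04-15"},
]
print(check_single_model_version(skills))  # prints {'2023-04-15'}
```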

Skill inputs

| Input | Description |
|---|---|
| text | The input text to be vectorized. If you're using data chunking, the source might be /document/pages/*. |
| image | Complex Type. Currently works only with the "/document/normalized_images" field, which the Azure blob indexer produces when imageAction is set to a value other than none. |
| url | The URL to download the image to be vectorized. |
| queryString | The query string of the URL to download the image to be vectorized. Useful if you store the URL and SAS token in separate paths. |

Only one of text, image or url/queryString can be configured for a single instance of the skill. If you want to vectorize both images and text within the same skillset, include two instances of this skill in the skillset definition, one for each input type you would like to use.
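The rule above can be checked mechanically. The following is a minimal sketch, assuming a skill definition held as a Python dictionary; `input_kind` is a hypothetical helper, not part of any Azure SDK, and treating a lone queryString as an error is an assumption of the sketch.

```python
def input_kind(skill):
    """Return 'text', 'image', or 'url' for a vectorize skill definition.

    Raises ValueError when the inputs mix kinds or when none is present.
    """
    names = {item["name"] for item in skill.get("inputs", [])}
    kinds = set()
    if "text" in names:
        kinds.add("text")
    if "image" in names:
        kinds.add("image")
    # url and queryString together count as one kind of input.
    if names & {"url", "queryString"}:
        kinds.add("url")
    if len(kinds) != 1:
        raise ValueError(f"Expected exactly one input kind, found: {sorted(kinds)}")
    # Assumption of this sketch: queryString is only meaningful next to url.
    if "queryString" in names and "url" not in names:
        raise ValueError("queryString was supplied without url")
    return kinds.pop()


text_skill = {"inputs": [{"name": "text", "source": "/document/content"}]}
blob_skill = {
    "inputs": [
        {"name": "url", "source": "/document/metadata_storage_path"},
        {"name": "queryString", "source": "/document/metadata_storage_sas_token"},
    ]
}
print(input_kind(text_skill))  # prints text
print(input_kind(blob_skill))  # prints url
```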

Skill outputs

| Output | Description |
|---|---|
| vector | The output embedding, an array of floats, for the input text or image. |

Sample definition

For text input, consider a record that has the following fields:

{
    "content": "Microsoft released Windows 10."
}

Then your skill definition might look like this:

{ 
    "@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill", 
    "context": "/document", 
    "modelVersion": "2023-04-15", 
    "inputs": [ 
        { 
            "name": "text", 
            "source": "/document/content" 
        } 
    ], 
    "outputs": [ 
        { 
            "name": "vector"
        } 
    ] 
} 

For image input, your skill definition might look like this:

{
    "@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill",
    "context": "/document/normalized_images/*",
    "modelVersion": "2023-04-15", 
    "inputs": [
        {
            "name": "image",
            "source": "/document/normalized_images/*"
        }
    ],
    "outputs": [
        {
            "name": "vector"
        }
    ]
}
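To vectorize both text and images in one skillset, the two definitions above can sit side by side as separate skill instances. The following Python sketch assembles such a skillset body as a dictionary. The skillset name and the `vectorize_skill` helper are hypothetical, and the Azure AI services resource that a billable skillset needs is omitted for brevity. The indexer setting shown at the end is what causes the blob indexer to produce `/document/normalized_images`.

```python
import json

MODEL_VERSION = "2023-04-15"
SKILL_TYPE = "#Microsoft.Skills.Vision.VectorizeSkill"


def vectorize_skill(context, input_name, source):
    """Build one vectorize skill instance with a single input."""
    return {
        "@odata.type": SKILL_TYPE,
        "context": context,
        "modelVersion": MODEL_VERSION,
        "inputs": [{"name": input_name, "source": source}],
        "outputs": [{"name": "vector"}],
    }


skillset = {
    "name": "my-multimodal-skillset",  # hypothetical name
    "skills": [
        # Text instance: the output lands at /document/vector.
        vectorize_skill("/document", "text", "/document/content"),
        # Image instance: one vector per normalized image.
        vectorize_skill(
            "/document/normalized_images/*",
            "image",
            "/document/normalized_images/*",
        ),
    ],
}

# Indexer parameters that make the blob indexer emit normalized_images.
indexer_parameters = {"configuration": {"imageAction": "generateNormalizedImages"}}

print(json.dumps(skillset, indent=2))
```

Because the two instances use different contexts, their `vector` outputs do not collide in the enriched document, and both share the same modelVersion.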

If you want to vectorize images directly from your blob storage data source, your skill definition might look like this:

{
    "@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill",
    "context": "/document",
    "modelVersion": "2023-04-15", 
    "inputs": [
        {
            "name": "url",
            "source": "/document/metadata_storage_path"
        },
        {
            "name": "queryString",
            "source": "/document/metadata_storage_sas_token"
        }
    ],
    "outputs": [
        {
            "name": "vector"
        }
    ]
}

Sample output

For the given input text, a vectorized embedding output is produced.

{
  "vector": [
    0.018990106880664825,
    -0.0073809814639389515,
    ...
    0.021276434883475304
  ]
}

The output resides in memory. To send this output to a field in the search index, you must define an outputFieldMapping that maps the vectorized embedding output (which is an array) to a vector field. Assuming the skill output resides in the document's vector node, and content_vector is the field in the search index, the outputFieldMapping in the indexer definition should look like this:

  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/vector/*",
      "targetFieldName": "content_vector"
    }
  ]
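The target field must be a vector field whose dimensions match the embedding length. The following Python sketch pairs the mapping above with a matching index field definition. The `output_field_mapping` helper and the profile name are hypothetical, and the value of 1024 dimensions is an assumption about the multimodal embeddings model that you should verify against the Azure AI Vision documentation for your modelVersion.

```python
import json

# Assumed embedding length; verify it for the model version you use.
EMBEDDING_DIMENSIONS = 1024


def output_field_mapping(source_path, target_field):
    """Build one indexer outputFieldMappings entry."""
    return {"sourceFieldName": source_path, "targetFieldName": target_field}


# Index field that receives the embedding.
vector_field = {
    "name": "content_vector",
    "type": "Collection(Edm.Single)",
    "searchable": True,
    "dimensions": EMBEDDING_DIMENSIONS,
    "vectorSearchProfile": "my-vector-profile",  # hypothetical profile name
}

indexer_fragment = {
    "outputFieldMappings": [
        output_field_mapping("/document/vector/*", vector_field["name"])
    ]
}

print(json.dumps(indexer_fragment, indent=2))
```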

To map image embeddings to the index, use the index projections feature. The indexProjections payload might look something like this:

"indexProjections": {
    "selectors": [
        {
            "targetIndexName": "myTargetIndex",
            "parentKeyFieldName": "ParentKey",
            "sourceContext": "/document/normalized_images/*",
            "mappings": [
                {
                    "name": "content_vector",
                    "source": "/document/normalized_images/*/vector"
                }
            ]
        }
    ]
}

See also