你当前正在访问 Microsoft Azure Global Edition 技术文档网站。 如果需要访问由世纪互联运营的 Microsoft Azure 中国技术文档网站,请访问 https://docs.azure.cn

教程:使用多模式嵌入和文档提取技能为混合内容编制索引

Azure AI 搜索可以从 Azure Blob 存储中存储的 PDF 文档中提取和索引文本和图像。 本教程介绍如何通过将文本和图像嵌入统一语义搜索索引来生成多模式索引管道。

在本教程中,你使用:

  • 一个包含 36 页的 PDF 文档,它将丰富的视觉内容(如图表、信息图和扫描的页面)与传统文本相结合。

  • 用于提取文本和规范化图像的文档提取技能

  • 使用 Azure AI 视觉多模式嵌入技能进行矢量化,该技能为文本和图像生成嵌入。

  • 配置用于存储文本和图像嵌入并支持基于矢量的相似性搜索的搜索索引。

本教程演示了使用文档提取技能和图像标题为多模式内容编制索引的低成本方法。 它支持从 Azure Blob 存储中的文档提取和搜索文本和图像。 但是,它不包括文本的位置元数据,例如页码或边界区域。

如需包含结构化文本布局和空间元数据的更全面解决方案,请参阅使用图像语言描述和文档布局技能为多模 RAG 方案中包含文本和图像的 Blob 编制索引

注释

根据本教程的要求将 imageAction 设置为 generateNormalizedImages 会产生额外图像提取费用,具体费用取决于 Azure AI 搜索定价

使用 REST 客户端和 搜索 REST API ,你将:

  • 设置示例数据并配置 azureblob 数据源
  • 创建支持文本和图像嵌入的索引
  • 使用提取和嵌入步骤定义技能集
  • 创建并运行索引器来处理和索引内容
  • 搜索刚刚创建的索引

先决条件

下载文件

下载以下示例 PDF:

将示例数据上传到 Azure 存储

  1. 在 Azure 存储中,创建一个名为 doc-extract-multimodality-container 的新容器

  2. 上传示例数据文件

  3. 在 Azure 存储中创建角色分配并在连接字符串中指定托管标识

  4. 对于使用系统分配的托管标识建立的连接。 提供包含 ResourceId 且没有帐户密钥或密码的连接字符串。 ResourceId 必须包括存储帐户的订阅 ID、存储帐户的资源组和存储帐户名称。 连接字符串如以下示例所示:

    "credentials" : { 
        "connectionString" : "ResourceId=/subscriptions/00000000-0000-0000-0000-00000000/resourceGroups/MY-DEMO-RESOURCE-GROUP/providers/Microsoft.Storage/storageAccounts/MY-DEMO-STORAGE-ACCOUNT/;" 
    }
    
  5. 对于使用用户分配的托管标识建立的连接。 提供包含 ResourceId 且没有帐户密钥或密码的连接字符串。 ResourceId 必须包括存储帐户的订阅 ID、存储帐户的资源组和存储帐户名称。 使用以下示例中所示的语法提供标识。 将 userAssignedIdentity 设置为用户分配的托管标识 连接字符串类似于以下示例:

    "credentials" : { 
        "connectionString" : "ResourceId=/subscriptions/00000000-0000-0000-0000-00000000/resourceGroups/MY-DEMO-RESOURCE-GROUP/providers/Microsoft.Storage/storageAccounts/MY-DEMO-STORAGE-ACCOUNT/;" 
    },
    "identity" : { 
        "@odata.type": "#Microsoft.Azure.Search.DataUserAssignedIdentity",
        "userAssignedIdentity" : "/subscriptions/00000000-0000-0000-0000-00000000/resourcegroups/MY-DEMO-RESOURCE-GROUP/providers/Microsoft.ManagedIdentity/userAssignedIdentities/MY-DEMO-USER-MANAGED-IDENTITY" 
    }
    

复制搜索服务 URL 和 API 密钥

对于本教程,与 Azure AI 搜索的连接需要终结点和 API 密钥。 可以从 Azure 门户获取这些值。 有关备用连接方法,请参阅托管标识

  1. 登录到 Azure 门户,导航到搜索服务“概述”页,然后复制 URL。 示例终结点可能类似于 https://mydemo.search.windows.net

  2. 在“设置”“密钥”下,复制管理密钥>。 管理密钥用于添加、修改和删除对象。 有两个可互换的管理密钥。 复制其中任意一个。

    Azure 门户中 URL 和 API 密钥的屏幕截图。

配置您的 REST 文件

  1. 启动 Visual Studio Code,并创建一个新文件。

  2. 为请求中使用的变量提供值。

    @baseUrl = PUT-YOUR-SEARCH-SERVICE-ENDPOINT-HERE
    @apiKey = PUT-YOUR-ADMIN-API-KEY-HERE
    @storageConnection = PUT-YOUR-STORAGE-CONNECTION-STRING-HERE
    @cognitiveServicesUrl = PUT-YOUR-COGNITIVE-SERVICES-URL-HERE
    @cognitiveServicesKey= PUT-YOUR-COGNITIVE-SERVICES-URL-KEY-HERE
    @modelVersion = PUT-YOUR-VECTORIZE-MODEL-VERSION-HERE
    @imageProjectionContainer=PUT-YOUR-IMAGE-PROJECTION-CONTAINER-HERE
    
  3. 使用 .rest.http 文件扩展名保存文件。

有关 REST 客户端的帮助,请参阅快速入门:使用 REST 进行关键字搜索

创建数据源

创建数据源 (REST) 会创建数据源连接,用于指定要编制索引的数据。

### Create a data source
POST {{baseUrl}}/datasources?api-version=2025-05-01-preview   HTTP/1.1
  Content-Type: application/json
  api-key: {{apiKey}}

  {
    "name": "doc-extraction-multimodal-embedding-ds",
    "description": null,
    "type": "azureblob",
    "subtype": null,
    "credentials": {
      "connectionString":  "{{storageConnection}}"
    },
    "container": {
      "name": "doc-extraction-multimodality-container",
      "query": null
    },
    "dataChangeDetectionPolicy": null,
    "dataDeletionDetectionPolicy": null,
    "encryptionKey": null,
    "identity": null
  }

发送请求。 响应应如下所示:

HTTP/1.1 201 Created
Transfer-Encoding: chunked
Content-Type: application/json; odata.metadata=minimal; odata.streaming=true; charset=utf-8
Location: https://<YOUR-SEARCH-SERVICE-NAME>.search.windows-int.net:443/datasources('doc-extraction-multimodal-embedding-ds')?api-version=2025-05-01-preview -Preview
Server: Microsoft-IIS/10.0
Strict-Transport-Security: max-age=2592000, max-age=15724800; includeSubDomains
Preference-Applied: odata.include-annotations="*"
OData-Version: 4.0
request-id: 4eb8bcc3-27b5-44af-834e-295ed078e8ed
elapsed-time: 346
Date: Sat, 26 Apr 2025 21:25:24 GMT
Connection: close

{
  "name": "doc-extraction-multimodal-embedding-ds",
  "description": "A test datasource",
  "type": "azureblob",
  "subtype": null,
  "indexerPermissionOptions": [],
  "credentials": {
    "connectionString": null
  },
  "container": {
    "name": "doc-extraction-multimodality-container",
    "query": null
  },
  "dataChangeDetectionPolicy": null,
  "dataDeletionDetectionPolicy": null,
  "encryptionKey": null,
  "identity": null
}

创建索引

创建索引 (REST) 会在搜索服务中创建搜索索引。 索引指定所有参数及其属性。

对于嵌套 JSON,索引字段必须与源字段相同。 目前,Azure AI 搜索不支持将字段映射到嵌套 JSON,因此字段名称和数据类型必须完全匹配。 以下索引与原始内容中的 JSON 元素保持一致。

### Create an index
POST {{baseUrl}}/indexes?api-version=2025-05-01-preview   HTTP/1.1
  Content-Type: application/json
  api-key: {{apiKey}}

{
    "name": "doc-extraction-multimodal-embedding-index",
    "fields": [
        {
            "name": "content_id",
            "type": "Edm.String",
            "retrievable": true,
            "key": true,
            "analyzer": "keyword"
        },
        {
            "name": "text_document_id",
            "type": "Edm.String",
            "searchable": false,
            "filterable": true,
            "retrievable": true,
            "stored": true,
            "sortable": false,
            "facetable": false
        },          
        {
            "name": "document_title",
            "type": "Edm.String",
            "searchable": true
        },
        {
            "name": "image_document_id",
            "type": "Edm.String",
            "filterable": true,
            "retrievable": true
        },
        {
            "name": "content_text",
            "type": "Edm.String",
            "searchable": true,
            "retrievable": true
        },
        {
            "name": "content_embedding",
            "type": "Collection(Edm.Single)",
            "dimensions": 1024,
            "searchable": true,
            "retrievable": true,
            "vectorSearchProfile": "hnsw"
        },
        {
            "name": "content_path",
            "type": "Edm.String",
            "searchable": false,
            "retrievable": true
        },
        {
            "name": "offset",
            "type": "Edm.String",
            "searchable": false,
            "retrievable": true
        },
        {
            "name": "location_metadata",
            "type": "Edm.ComplexType",
            "fields": [
                {
                "name": "page_number",
                "type": "Edm.Int32",
                "searchable": false,
                "retrievable": true
                },
                {
                "name": "bounding_polygons",
                "type": "Edm.String",
                "searchable": false,
                "retrievable": true,
                "filterable": false,
                "sortable": false,
                "facetable": false
                }
            ]
        }         
    ],
    "vectorSearch": {
        "profiles": [
            {
                "name": "hnsw",
                "algorithm": "defaulthnsw",
                "vectorizer": "{{vectorizer}}"
            }
        ],
        "algorithms": [
            {
                "name": "defaulthnsw",
                "kind": "hnsw",
                "hnswParameters": {
                    "m": 4,
                    "efConstruction": 400,
                    "metric": "cosine"
                }
            }
        ],
        "vectorizers": [
            {
                "name": "{{ vectorizer }}",
                "kind": "aiServicesVision",
                "aiServicesVisionParameters": {
                    "resourceUri": "{{cognitiveServicesUrl}}",
                    "apiKey": "{{cognitiveServicesKey}}",
                    "modelVersion": "{{modelVersion}}"
                }
            }
        ]     
    },
    "semantic": {
        "defaultConfiguration": "semanticconfig",
        "configurations": [
            {
                "name": "semanticconfig",
                "prioritizedFields": {
                    "titleField": {
                        "fieldName": "document_title"
                    },
                    "prioritizedContentFields": [
                    ],
                    "prioritizedKeywordsFields": []
                }
            }
        ]
    }
}

要点:

  • 文本和图像嵌入存储在 content_embedding 字段中,并且必须配置适当的大小(例如 1024)和矢量搜索配置文件。

  • location_metadata 捕获每个规范化图像的边界多边形和页码元数据,从而实现精确的空间搜索或 UI 叠加。 location_metadata 仅存在于此应用场景中的图像中。 若要捕获文本的位置元数据,请考虑使用 文档布局技能。 深入教程在页面底部链接。

  • 有关矢量搜索的详细信息,请参阅 Azure AI 搜索中的矢量

  • 有关语义排名的详细信息,请参阅 Azure AI 搜索中的语义排名

创建技能集

创建 Skillset(REST) 会在您的搜索服务上创建一个搜索索引。 索引指定所有参数及其属性。

### Create a skillset
POST {{baseUrl}}/skillsets?api-version=2025-05-01-preview   HTTP/1.1
  Content-Type: application/json
  api-key: {{apiKey}}

{
  "name": "doc-extraction-multimodal-embedding-skillset",
	"description": "A test skillset",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
      "name": "document-extraction-skill",
      "description": "Document extraction skill to exract text and images from documents",
      "parsingMode": "default",
      "dataToExtract": "contentAndMetadata",
      "configuration": {
          "imageAction": "generateNormalizedImages",
          "normalizedImageMaxWidth": 2000,
          "normalizedImageMaxHeight": 2000
      },
      "context": "/document",
      "inputs": [
        {
          "name": "file_data",
          "source": "/document/file_data"
        }
      ],
      "outputs": [
        {
          "name": "content",
          "targetName": "extracted_content"
        },
        {
          "name": "normalized_images",
          "targetName": "normalized_images"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "split-skill",
      "description": "Split skill to chunk documents",
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 200,
      "unit": "characters",
      "inputs": [
        {
          "name": "text",
          "source": "/document/extracted_content",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ]
    },  
  { 
    "@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill", 
    "name": "text-embedding-skill",
    "description": "Vision Vectorization skill for text",
    "context": "/document/pages/*", 
    "modelVersion": "{{modelVersion}}", 
    "inputs": [ 
      { 
        "name": "text", 
        "source": "/document/pages/*" 
      } 
    ], 
    "outputs": [ 
      { 
        "name": "vector",
        "targetName": "text_vector"
      } 
    ] 
  },
  { 
    "@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill", 
    "name": "image-embedding-skill",
    "description": "Vision Vectorization skill for images",
    "context": "/document/normalized_images/*", 
    "modelVersion": "{{modelVersion}}", 
    "inputs": [ 
      { 
        "name": "image", 
        "source": "/document/normalized_images/*" 
      } 
    ], 
    "outputs": [ 
      { 
        "name": "vector",
  "targetName": "image_vector"
      } 
    ] 
  },  
    {
      "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
      "name": "shaper-skill",
      "description": "Shaper skill to reshape the data to fit the index schema"
      "context": "/document/normalized_images/*",
      "inputs": [
        {
          "name": "normalized_images",
          "source": "/document/normalized_images/*",
          "inputs": []
        },
        {
          "name": "imagePath",
          "source": "='{{imageProjectionContainer}}/'+$(/document/normalized_images/*/imagePath)",
          "inputs": []
        },
        {
          "name": "dataUri",
          "source": "='data:image/jpeg;base64,'+$(/document/normalized_images/*/data)",
          "inputs": []
        },
        {
          "name": "location_metadata",
          "sourceContext": "/document/normalized_images/*",
          "inputs": [
            {
              "name": "page_number",
              "source": "/document/normalized_images/*/pageNumber"
            },
            {
              "name": "bounding_polygons",
              "source": "/document/normalized_images/*/boundingPolygon"
            }              
          ]
        }          
      ],
      "outputs": [
        {
          "name": "output",
          "targetName": "new_normalized_images"
        }
      ]
    }  
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.AIServicesByKey",
    "subdomainUrl": "{{cognitiveServicesUrl}}",
    "key": "{{cognitiveServicesKey}}"
  },
  "indexProjections": {
      "selectors": [
        {
          "targetIndexName": "doc-extraction-multimodal-embedding-index",
          "parentKeyFieldName": "text_document_id",
          "sourceContext": "/document/pages/*",
          "mappings": [              
            {
              "name": "content_embedding",
              "source": "/document/pages/*/text_vector"
            },
            {
              "name": "content_text",
              "source": "/document/pages/*"
            },             
            {
              "name": "document_title",
              "source": "/document/document_title"
            }      
          ]
        },
        {
          "targetIndexName": "doc-extraction-multimodal-embedding-index",
          "parentKeyFieldName": "image_document_id",
          "sourceContext": "/document/normalized_images/*",
          "mappings": [                                   
            {
              "name": "content_embedding",
              "source": "/document/normalized_images/*/image_vector"
            },
            {
              "name": "content_path",
              "source": "/document/normalized_images/*/new_normalized_images/imagePath"
            },
            {
              "name": "location_metadata",
              "source": "/document/normalized_images/*/new_normalized_images/location_metadata"
            },                      
            {
              "name": "document_title",
              "source": "/document/document_title"
            }                
          ]
        }
      ],
      "parameters": {
        "projectionMode": "skipIndexingParentDocuments"
      }
  },
  "knowledgeStore": {
    "storageConnectionString": "{{storageConnection}}",
    "projections": [
      {
        "files": [
          {
            "storageContainer": "{{imageProjectionContainer}}",
            "source": "/document/normalized_images/*"
          }
        ]
      }
    ]
  }
}

此技能集提取文本和图像、向量化图像元数据,以便投影到索引中。

要点:

  • content_text字段填充了通过文档提取技能提取的文本,并通过拆分技能对其进行分块处理。

  • content_path 包含指定图像投影容器中图像文件的相对路径。 仅当 imageAction 设置为 generateNormalizedImages 时,才会为从 PDF 中提取的图像生成此字段,并且可以从源字段 /document/normalized_images/*/imagePath 的扩充文档中进行映射。

  • Azure AI 视觉多模式嵌入技能能够通过区分输入类型(文本或图像),使用相同的技能类型进行文本和视觉数据的嵌入。 有关详细信息,请参阅 Azure AI 视觉多模式嵌入技能

创建并运行索引器

创建索引器会在搜索服务上创建索引器。 索引器连接到数据源、加载数据、运行技能集和为扩充数据编制索引。

### Create and run an indexer
POST {{baseUrl}}/indexers?api-version=2025-05-01-preview   HTTP/1.1
  Content-Type: application/json
  api-key: {{apiKey}}

{
  "dataSourceName": "doc-extraction-multimodal-embedding-ds",
  "targetIndexName": "doc-extraction-multimodal-embedding-index",
  "skillsetName": "doc-extraction-multimodal-embedding-skillset",
  "parameters": {
    "maxFailedItems": -1,
    "maxFailedItemsPerBatch": 0,
    "batchSize": 1,
    "configuration": {
      "allowSkillsetToReadFileData": true
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_name",
      "targetFieldName": "document_title"
    }
  ],
  "outputFieldMappings": []
}

运行查询

加载第一个文档后,可立即开始搜索。

### Query the index
POST {{baseUrl}}/indexes/doc-extraction-multimodal-embedding-index/docs/search?api-version=2025-05-01-preview   HTTP/1.1
  Content-Type: application/json
  api-key: {{apiKey}}
  
  {
    "search": "*",
    "count": true
  }

发送请求。 这是一个未指定的全文搜索查询,它返回索引中标记为可检索的所有字段,以及文档计数。 响应应如下所示:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/json; odata.metadata=minimal; odata.streaming=true; charset=utf-8
Content-Encoding: gzip
Vary: Accept-Encoding
Server: Microsoft-IIS/10.0
Strict-Transport-Security: max-age=2592000, max-age=15724800; includeSubDomains
Preference-Applied: odata.include-annotations="*"
OData-Version: 4.0
request-id: 712ca003-9493-40f8-a15e-cf719734a805
elapsed-time: 198
Date: Wed, 30 Apr 2025 23:20:53 GMT
Connection: close

{
  "@odata.count": 100,
  "@search.nextPageParameters": {
    "search": "*",
    "count": true,
    "skip": 50
  },
  "value": [
  ],
  "@odata.nextLink": "https://<YOUR-SEARCH-SERVICE-NAME>.search.windows.net/indexes/doc-extraction-multimodal-embedding-index/docs/search?api-version=2025-05-01-preview "
}

响应中返回 100 个文档。

对于筛选器,还可以使用逻辑运算符(and、or、not)和比较运算符(eq、ne、gt、lt、ge、le)。 字符串比较区分大小写。 有关详细信息和示例,请参阅 简单搜索查询的示例

注释

$filter 参数仅适用于在创建索引期间标记为可筛选的字段。

### Query for only images
POST {{baseUrl}}/indexes/doc-extraction-multimodal-embedding-index/docs/search?api-version=2025-05-01-preview   HTTP/1.1
  Content-Type: application/json
  api-key: {{apiKey}}
  
  {
    "search": "*",
    "count": true,
    "filter": "image_document_id ne null"
  }
### Query for text or images with content related to energy, returning the id, parent document, and text (only populated for text chunks), and the content path where the image is saved in the knowledge store (only populated for images)
POST {{baseUrl}}/indexes/doc-extraction-multimodal-embedding-index/docs/search?api-version=2025-05-01-preview   HTTP/1.1
  Content-Type: application/json
  api-key: {{apiKey}}
  

  {
    "search": "energy",
    "count": true,
    "select": "content_id, document_title, content_text, content_path"
  }

重置并重新运行

可以重置索引器以清除高水位标记,从而实现完全重新运行。 以下 POST 请求用于重置,然后重新运行。

### Reset the indexer
POST {{baseUrl}}/indexers/doc-extraction-multimodal-embedding-indexer/reset?api-version=2025-05-01-preview   HTTP/1.1
  api-key: {{apiKey}}
### Run the indexer
POST {{baseUrl}}/indexers/doc-extraction-multimodal-embedding-indexer/run?api-version=2025-05-01-preview   HTTP/1.1
  api-key: {{apiKey}}
### Check indexer status 
GET {{baseUrl}}/indexers/doc-extraction-multimodal-embedding-indexer/status?api-version=2025-05-01-preview   HTTP/1.1
  api-key: {{apiKey}}

清理资源

在自己的订阅中操作时,最好在项目结束时移除不再需要的资源。 持续运行资源可能会产生费用。 可以逐个删除资源,也可以删除资源组以删除整个资源集。

你可以使用 Azure 门户来删除索引、索引器和数据源。

另请参阅

熟悉多模式索引方案的示例实现后,请查看: