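Azure AI Search can extract and index both text and images from PDF documents stored in Azure Blob Storage. This tutorial shows you how to build a multimodal indexing pipeline by embedding both text and images into a unified semantic search index.
In this tutorial, you use:
- A 36-page PDF document that combines rich visual content, such as charts, infographics, and scanned pages, with traditional text.
- The Document Extraction skill for extracting text and normalized images.
- The Azure AI Vision multimodal embeddings skill for vectorization, which generates embeddings for both text and images.
- A search index configured to store text and image embeddings and support vector-based similarity search.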
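This tutorial demonstrates a lower-cost approach for indexing multimodal content using the Document Extraction skill and multimodal embeddings. It enables extraction and search of both text and images from documents in Azure Blob Storage. However, it doesn't include locational metadata for text, such as page numbers or bounding regions.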
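For a more comprehensive solution that includes structured text layout and spatial metadata, see Index blobs that have text and images for multimodal RAG scenarios using image verbalization and the Document Layout skill.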
Note
Setting imageAction to generateNormalizedImages, as this tutorial requires, incurs an additional charge for image extraction according to Azure AI Search pricing.
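Using a REST client and the Search REST APIs, you will:
- Set up sample data and configure an azureblob data source
- Create an index with support for text and image embeddings
- Define a skillset with extraction and embedding steps
- Create and run an indexer to process and index content
- Search the index you just created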
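Prerequisites
- An Azure account with an active subscription. Create an account for free.
- An Azure AI services multi-service account for image vectorization. Image vectorization requires Azure AI Vision multimodal embeddings. For an updated list of regions, see the Azure AI Vision documentation.
- Azure AI Search with a managed identity. Create a service or find an existing service in your current subscription. The service must be on the Basic tier or higher; this tutorial isn't supported on the Free tier. It must also be in the same region as your multi-service account.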
Download files
Download the following sample PDF:
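Upload sample data to Azure Storage
In Azure Storage, create a new container named doc-extraction-multimodality-container and upload the sample PDF to it.
For connections made using a system-assigned managed identity, provide a connection string that contains a ResourceId, with no account key or password. The ResourceId must include the subscription ID of the storage account, the resource group of the storage account, and the storage account name. The connection string is similar to the following example: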
"credentials" : { "connectionString" : "ResourceId=/subscriptions/00000000-0000-0000-0000-00000000/resourceGroups/MY-DEMO-RESOURCE-GROUP/providers/Microsoft.Storage/storageAccounts/MY-DEMO-STORAGE-ACCOUNT/;" }
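For connections made using a user-assigned managed identity, provide a connection string that contains a ResourceId, with no account key or password. The ResourceId must include the subscription ID of the storage account, the resource group of the storage account, and the storage account name. Provide the identity using the syntax shown in the following example, setting userAssignedIdentity to the user-assigned managed identity. The connection string is similar to the following example: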
"credentials" : { "connectionString" : "ResourceId=/subscriptions/00000000-0000-0000-0000-00000000/resourceGroups/MY-DEMO-RESOURCE-GROUP/providers/Microsoft.Storage/storageAccounts/MY-DEMO-STORAGE-ACCOUNT/;" }, "identity" : { "@odata.type": "#Microsoft.Azure.Search.DataUserAssignedIdentity", "userAssignedIdentity" : "/subscriptions/00000000-0000-0000-0000-00000000/resourcegroups/MY-DEMO-RESOURCE-GROUP/providers/Microsoft.ManagedIdentity/userAssignedIdentities/MY-DEMO-USER-MANAGED-IDENTITY" }
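Copy a search service URL and API key
For this tutorial, connections to Azure AI Search require an endpoint and an API key. You can get these values from the Azure portal. For alternative connection methods, see Managed identities.
Sign in to the Azure portal, navigate to the search service Overview page, and copy the URL. An example endpoint might look like https://mydemo.search.windows.net.
Under Settings > Keys, copy an admin key. Admin keys are used to add, modify, and delete objects. There are two interchangeable admin keys. Copy either one.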
Set up your REST file
Start Visual Studio Code and create a new file.
Provide values for the variables used in the requests.
@baseUrl = PUT-YOUR-SEARCH-SERVICE-ENDPOINT-HERE
@apiKey = PUT-YOUR-ADMIN-API-KEY-HERE
@storageConnection = PUT-YOUR-STORAGE-CONNECTION-STRING-HERE
@cognitiveServicesUrl = PUT-YOUR-COGNITIVE-SERVICES-URL-HERE
@cognitiveServicesKey = PUT-YOUR-COGNITIVE-SERVICES-URL-KEY-HERE
@modelVersion = PUT-YOUR-VECTORIZE-MODEL-VERSION-HERE
@vectorizer = PUT-YOUR-VECTORIZER-NAME-HERE
@imageProjectionContainer = PUT-YOUR-IMAGE-PROJECTION-CONTAINER-HERE
Save the file using a .rest or .http file extension.
For help with the REST client, see Quickstart: Keyword search using REST.
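Before sending any request, it can help to confirm that every {{variable}} a request references is defined in the file and that no placeholder value is left unfilled. The following is a minimal Python sketch, not a feature of the REST client itself. The function name check_rest_file and the rule that unfilled values start with PUT-YOUR- are assumptions made for illustration, based on the placeholders used in this tutorial.

```python
import re

# "@name = value" definitions and "{{name}}" references, as used in .rest/.http files.
VAR_DEF = re.compile(r"^@(\w+)\s*=\s*(.*)$")
VAR_REF = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def check_rest_file(text):
    """Return (unfilled, undefined) variable names found in a REST file's text.

    unfilled:  defined variables whose value still looks like a placeholder
               (assumed here to start with "PUT-YOUR-").
    undefined: variables referenced as {{name}} but never defined.
    """
    defined = {}
    referenced = set()
    for line in text.splitlines():
        match = VAR_DEF.match(line.strip())
        if match:
            defined[match.group(1)] = match.group(2).strip()
        referenced.update(VAR_REF.findall(line))
    unfilled = sorted(name for name, value in defined.items()
                      if value.startswith("PUT-YOUR-"))
    undefined = sorted(referenced - set(defined))
    return unfilled, undefined


# A shortened, hypothetical file used only to exercise the check.
sample = """@baseUrl = https://mydemo.search.windows.net
@apiKey = PUT-YOUR-ADMIN-API-KEY-HERE

POST {{baseUrl}}/indexes?api-version=2025-05-01-preview
api-key: {{apiKey}}

{ "vectorizer": "{{vectorizer}}" }
"""

unfilled, undefined = check_rest_file(sample)
print("Unfilled placeholders:", unfilled)
print("Referenced but not defined:", undefined)
```

Running the check against your own file before the first request catches a missing definition early, instead of surfacing later as a service-side validation error.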
Create a data source
Create Data Source (REST) creates a data source connection that specifies what data to index.
### Create a data source
POST {{baseUrl}}/datasources?api-version=2025-05-01-preview HTTP/1.1
Content-Type: application/json
api-key: {{apiKey}}
{
"name": "doc-extraction-multimodal-embedding-ds",
"description": null,
"type": "azureblob",
"subtype": null,
"credentials": {
"connectionString": "{{storageConnection}}"
},
"container": {
"name": "doc-extraction-multimodality-container",
"query": null
},
"dataChangeDetectionPolicy": null,
"dataDeletionDetectionPolicy": null,
"encryptionKey": null,
"identity": null
}
Send the request. The response should look like:
HTTP/1.1 201 Created
Transfer-Encoding: chunked
Content-Type: application/json; odata.metadata=minimal; odata.streaming=true; charset=utf-8
Location: https://<YOUR-SEARCH-SERVICE-NAME>.search.windows.net:443/datasources('doc-extraction-multimodal-embedding-ds')?api-version=2025-05-01-preview
Server: Microsoft-IIS/10.0
Strict-Transport-Security: max-age=2592000, max-age=15724800; includeSubDomains
Preference-Applied: odata.include-annotations="*"
OData-Version: 4.0
request-id: 4eb8bcc3-27b5-44af-834e-295ed078e8ed
elapsed-time: 346
Date: Sat, 26 Apr 2025 21:25:24 GMT
Connection: close
{
"name": "doc-extraction-multimodal-embedding-ds",
"description": null,
"type": "azureblob",
"subtype": null,
"indexerPermissionOptions": [],
"credentials": {
"connectionString": null
},
"container": {
"name": "doc-extraction-multimodality-container",
"query": null
},
"dataChangeDetectionPolicy": null,
"dataDeletionDetectionPolicy": null,
"encryptionKey": null,
"identity": null
}
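The same data source request can be prepared from a script. The sketch below, in Python with only the standard library, builds a key-free ResourceId connection string in the format shown earlier for managed identity connections and wraps the data source definition in a request object without sending it. The helper names resource_id_connection_string and build_request are hypothetical, and the endpoint, key, and IDs are the placeholder values used in this tutorial.

```python
import json
import urllib.request

API_VERSION = "2025-05-01-preview"


def resource_id_connection_string(subscription_id, resource_group, storage_account):
    """Build a storage connection string that carries a ResourceId and no key."""
    return (
        f"ResourceId=/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{storage_account}/;"
    )


def build_request(base_url, api_key, path, body, method="POST"):
    """Prepare (but do not send) a Search REST API request with a JSON body."""
    url = f"{base_url.rstrip('/')}/{path.lstrip('/')}?api-version={API_VERSION}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method=method,
    )


# Only the properties this tutorial sets to non-null values are included.
data_source = {
    "name": "doc-extraction-multimodal-embedding-ds",
    "type": "azureblob",
    "credentials": {
        "connectionString": resource_id_connection_string(
            "00000000-0000-0000-0000-00000000",  # placeholder subscription ID
            "MY-DEMO-RESOURCE-GROUP",
            "MY-DEMO-STORAGE-ACCOUNT",
        )
    },
    "container": {"name": "doc-extraction-multimodality-container"},
}

request = build_request(
    "https://mydemo.search.windows.net",  # placeholder endpoint
    "PUT-YOUR-ADMIN-API-KEY-HERE",        # placeholder admin key
    "datasources",
    data_source,
)
print(request.get_method(), request.full_url)
# Sending is deliberately left out; urllib.request.urlopen(request) would make the call.
```

Keeping request construction separate from sending makes it easy to inspect the URL and body before anything reaches the service.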
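Create an index
Create Index (REST) creates a search index on your search service. An index specifies all of its fields and their attributes.
For nested JSON, the index fields must be identical to the source fields. Currently, Azure AI Search doesn't support field mappings to nested JSON, so field names and data types must match exactly. The following index aligns with the JSON elements in the raw content.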
### Create an index
POST {{baseUrl}}/indexes?api-version=2025-05-01-preview HTTP/1.1
Content-Type: application/json
api-key: {{apiKey}}
{
"name": "doc-extraction-multimodal-embedding-index",
"fields": [
{
"name": "content_id",
"type": "Edm.String",
"retrievable": true,
"key": true,
"analyzer": "keyword"
},
{
"name": "text_document_id",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"stored": true,
"sortable": false,
"facetable": false
},
{
"name": "document_title",
"type": "Edm.String",
"searchable": true
},
{
"name": "image_document_id",
"type": "Edm.String",
"filterable": true,
"retrievable": true
},
{
"name": "content_text",
"type": "Edm.String",
"searchable": true,
"retrievable": true
},
{
"name": "content_embedding",
"type": "Collection(Edm.Single)",
"dimensions": 1024,
"searchable": true,
"retrievable": true,
"vectorSearchProfile": "hnsw"
},
{
"name": "content_path",
"type": "Edm.String",
"searchable": false,
"retrievable": true
},
{
"name": "offset",
"type": "Edm.String",
"searchable": false,
"retrievable": true
},
{
"name": "location_metadata",
"type": "Edm.ComplexType",
"fields": [
{
"name": "page_number",
"type": "Edm.Int32",
"searchable": false,
"retrievable": true
},
{
"name": "bounding_polygons",
"type": "Edm.String",
"searchable": false,
"retrievable": true,
"filterable": false,
"sortable": false,
"facetable": false
}
]
}
],
"vectorSearch": {
"profiles": [
{
"name": "hnsw",
"algorithm": "defaulthnsw",
"vectorizer": "{{vectorizer}}"
}
],
"algorithms": [
{
"name": "defaulthnsw",
"kind": "hnsw",
"hnswParameters": {
"m": 4,
"efConstruction": 400,
"metric": "cosine"
}
}
],
"vectorizers": [
{
"name": "{{vectorizer}}",
"kind": "aiServicesVision",
"aiServicesVisionParameters": {
"resourceUri": "{{cognitiveServicesUrl}}",
"apiKey": "{{cognitiveServicesKey}}",
"modelVersion": "{{modelVersion}}"
}
}
]
},
"semantic": {
"defaultConfiguration": "semanticconfig",
"configurations": [
{
"name": "semanticconfig",
"prioritizedFields": {
"titleField": {
"fieldName": "document_title"
},
"prioritizedContentFields": [
],
"prioritizedKeywordsFields": []
}
}
]
}
}
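Key points:
- Text and image embeddings are stored in the content_embedding field, which must be configured with appropriate dimensions (for example, 1024) and a vector search profile.
- location_metadata captures bounding polygon and page number metadata for each normalized image, enabling precise spatial search or UI overlays.
- location_metadata only exists for images in this scenario. To capture locational metadata for text, consider using the Document Layout skill. An in-depth tutorial is linked at the bottom of the page.
- For more information on vector search, see Vectors in Azure AI Search.
- For more information on semantic ranking, see Semantic ranking in Azure AI Search.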
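The index definition links names across several sections: each vector field names a profile, and each profile names an algorithm and a vectorizer. The following Python sketch checks those cross-references locally before the definition is sent. It is an illustrative helper (check_vector_config is a hypothetical name), it only inspects top-level fields, and it does not replace the validation the service performs.

```python
def check_vector_config(index):
    """Return a list of problems found in an index definition's vector settings."""
    problems = []
    vector_search = index.get("vectorSearch", {})
    algorithms = {a["name"] for a in vector_search.get("algorithms", [])}
    vectorizers = {v["name"] for v in vector_search.get("vectorizers", [])}
    profiles = {p["name"]: p for p in vector_search.get("profiles", [])}

    # Each profile must point at an algorithm (and vectorizer, if set) that exists.
    for name, profile in profiles.items():
        if profile.get("algorithm") not in algorithms:
            problems.append(f"profile '{name}': unknown algorithm '{profile.get('algorithm')}'")
        vectorizer = profile.get("vectorizer")
        if vectorizer is not None and vectorizer not in vectorizers:
            problems.append(f"profile '{name}': unknown vectorizer '{vectorizer}'")

    # Each vector field needs dimensions and a profile that exists.
    for field in index.get("fields", []):
        if field.get("type") != "Collection(Edm.Single)":
            continue
        if "dimensions" not in field:
            problems.append(f"field '{field['name']}': missing dimensions")
        if field.get("vectorSearchProfile") not in profiles:
            problems.append(f"field '{field['name']}': unknown profile "
                            f"'{field.get('vectorSearchProfile')}'")
    return problems


# A trimmed copy of the tutorial's index; "demo-vectorizer" is a placeholder name.
sample_index = {
    "fields": [
        {"name": "content_embedding", "type": "Collection(Edm.Single)",
         "dimensions": 1024, "vectorSearchProfile": "hnsw"},
    ],
    "vectorSearch": {
        "profiles": [{"name": "hnsw", "algorithm": "defaulthnsw",
                      "vectorizer": "demo-vectorizer"}],
        "algorithms": [{"name": "defaulthnsw", "kind": "hnsw"}],
        "vectorizers": [{"name": "demo-vectorizer", "kind": "aiServicesVision"}],
    },
}

print(check_vector_config(sample_index))
```

A check like this would flag a vectorizer name that differs between the profile and the vectorizers list, for example when a template variable is spelled differently in the two places.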
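Create a skillset
Create Skillset (REST) creates a skillset on your search service. A skillset defines the skills that extract, chunk, and vectorize content during indexing, along with the index projections that shape the output for the index.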
### Create a skillset
POST {{baseUrl}}/skillsets?api-version=2025-05-01-preview HTTP/1.1
Content-Type: application/json
api-key: {{apiKey}}
{
"name": "doc-extraction-multimodal-embedding-skillset",
"description": "A test skillset",
"skills": [
{
"@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
"name": "document-extraction-skill",
"description": "Document extraction skill to extract text and images from documents",
"parsingMode": "default",
"dataToExtract": "contentAndMetadata",
"configuration": {
"imageAction": "generateNormalizedImages",
"normalizedImageMaxWidth": 2000,
"normalizedImageMaxHeight": 2000
},
"context": "/document",
"inputs": [
{
"name": "file_data",
"source": "/document/file_data"
}
],
"outputs": [
{
"name": "content",
"targetName": "extracted_content"
},
{
"name": "normalized_images",
"targetName": "normalized_images"
}
]
},
{
"@odata.type": "#Microsoft.Skills.Text.SplitSkill",
"name": "split-skill",
"description": "Split skill to chunk documents",
"context": "/document",
"defaultLanguageCode": "en",
"textSplitMode": "pages",
"maximumPageLength": 2000,
"pageOverlapLength": 200,
"unit": "characters",
"inputs": [
{
"name": "text",
"source": "/document/extracted_content",
"inputs": []
}
],
"outputs": [
{
"name": "textItems",
"targetName": "pages"
}
]
},
{
"@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill",
"name": "text-embedding-skill",
"description": "Vision Vectorization skill for text",
"context": "/document/pages/*",
"modelVersion": "{{modelVersion}}",
"inputs": [
{
"name": "text",
"source": "/document/pages/*"
}
],
"outputs": [
{
"name": "vector",
"targetName": "text_vector"
}
]
},
{
"@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill",
"name": "image-embedding-skill",
"description": "Vision Vectorization skill for images",
"context": "/document/normalized_images/*",
"modelVersion": "{{modelVersion}}",
"inputs": [
{
"name": "image",
"source": "/document/normalized_images/*"
}
],
"outputs": [
{
"name": "vector",
"targetName": "image_vector"
}
]
},
{
"@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
"name": "shaper-skill",
"description": "Shaper skill to reshape the data to fit the index schema",
"context": "/document/normalized_images/*",
"inputs": [
{
"name": "normalized_images",
"source": "/document/normalized_images/*",
"inputs": []
},
{
"name": "imagePath",
"source": "='{{imageProjectionContainer}}/'+$(/document/normalized_images/*/imagePath)",
"inputs": []
},
{
"name": "dataUri",
"source": "='data:image/jpeg;base64,'+$(/document/normalized_images/*/data)",
"inputs": []
},
{
"name": "location_metadata",
"sourceContext": "/document/normalized_images/*",
"inputs": [
{
"name": "page_number",
"source": "/document/normalized_images/*/pageNumber"
},
{
"name": "bounding_polygons",
"source": "/document/normalized_images/*/boundingPolygon"
}
]
}
],
"outputs": [
{
"name": "output",
"targetName": "new_normalized_images"
}
]
}
],
"cognitiveServices": {
"@odata.type": "#Microsoft.Azure.Search.AIServicesByKey",
"subdomainUrl": "{{cognitiveServicesUrl}}",
"key": "{{cognitiveServicesKey}}"
},
"indexProjections": {
"selectors": [
{
"targetIndexName": "doc-extraction-multimodal-embedding-index",
"parentKeyFieldName": "text_document_id",
"sourceContext": "/document/pages/*",
"mappings": [
{
"name": "content_embedding",
"source": "/document/pages/*/text_vector"
},
{
"name": "content_text",
"source": "/document/pages/*"
},
{
"name": "document_title",
"source": "/document/document_title"
}
]
},
{
"targetIndexName": "doc-extraction-multimodal-embedding-index",
"parentKeyFieldName": "image_document_id",
"sourceContext": "/document/normalized_images/*",
"mappings": [
{
"name": "content_embedding",
"source": "/document/normalized_images/*/image_vector"
},
{
"name": "content_path",
"source": "/document/normalized_images/*/new_normalized_images/imagePath"
},
{
"name": "location_metadata",
"source": "/document/normalized_images/*/new_normalized_images/location_metadata"
},
{
"name": "document_title",
"source": "/document/document_title"
}
]
}
],
"parameters": {
"projectionMode": "skipIndexingParentDocuments"
}
},
"knowledgeStore": {
"storageConnectionString": "{{storageConnection}}",
"projections": [
{
"files": [
{
"storageContainer": "{{imageProjectionContainer}}",
"source": "/document/normalized_images/*"
}
]
}
]
}
}
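This skillset extracts text and images, vectorizes both, and shapes the image metadata for projection into the index.
Key points:
- The content_text field is populated with text extracted by the Document Extraction skill and chunked by the Split skill.
- content_path contains the relative path to the image file within the designated image projection container. This field is generated only for images extracted from PDFs when imageAction is set to generateNormalizedImages, and can be mapped from the enriched document from the source field /document/normalized_images/*/imagePath.
- The Azure AI Vision multimodal embeddings skill embeds both text and visual data with the same skill type, differentiated by input (text or image). For more information, see Azure AI Vision multimodal embeddings skill.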
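To build intuition for the Split skill settings in this skillset (maximumPageLength of 2000 characters and pageOverlapLength of 200), the following Python sketch cuts text into fixed-width windows that share an overlap. It is a deliberately naive stand-in and an assumption for illustration only: it isn't the Split skill's actual algorithm, and the skill may choose different boundaries. The function name chunk_text is hypothetical.

```python
def chunk_text(text, max_length=2000, overlap=200):
    """Split text into character windows of max_length that overlap by `overlap`.

    Naive illustration only: boundaries fall at fixed character offsets.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    chunks = []
    start = 0
    while start < len(text):
        end = start + max_length
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the last window reached the end of the text
        start = end - overlap  # step back so consecutive chunks share `overlap` chars
    return chunks


# 4,000 characters of filler text to show the window arithmetic.
demo_text = "".join(chr(97 + i % 26) for i in range(4000))
demo_chunks = chunk_text(demo_text)
print([len(chunk) for chunk in demo_chunks])
```

Each chunk in the real pipeline becomes one search document with its own text embedding, so the chunk length and overlap directly affect how many documents the index holds.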
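Create and run an indexer
Create Indexer creates an indexer on your search service. An indexer connects to the data source, loads data, runs a skillset, and indexes the enriched data.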
### Create and run an indexer
POST {{baseUrl}}/indexers?api-version=2025-05-01-preview HTTP/1.1
Content-Type: application/json
api-key: {{apiKey}}
{
"name": "doc-extraction-multimodal-embedding-indexer",
"dataSourceName": "doc-extraction-multimodal-embedding-ds",
"targetIndexName": "doc-extraction-multimodal-embedding-index",
"skillsetName": "doc-extraction-multimodal-embedding-skillset",
"parameters": {
"maxFailedItems": -1,
"maxFailedItemsPerBatch": 0,
"batchSize": 1,
"configuration": {
"allowSkillsetToReadFileData": true
}
},
"fieldMappings": [
{
"sourceFieldName": "metadata_storage_name",
"targetFieldName": "document_title"
}
],
"outputFieldMappings": []
}
Run queries
You can start searching as soon as the first document is loaded.
### Query the index
POST {{baseUrl}}/indexes/doc-extraction-multimodal-embedding-index/docs/search?api-version=2025-05-01-preview HTTP/1.1
Content-Type: application/json
api-key: {{apiKey}}
{
"search": "*",
"count": true
}
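Send the request. This is an unspecified full-text search query that returns all of the fields marked as retrievable in the index, along with a document count. The response should look like: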
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/json; odata.metadata=minimal; odata.streaming=true; charset=utf-8
Content-Encoding: gzip
Vary: Accept-Encoding
Server: Microsoft-IIS/10.0
Strict-Transport-Security: max-age=2592000, max-age=15724800; includeSubDomains
Preference-Applied: odata.include-annotations="*"
OData-Version: 4.0
request-id: 712ca003-9493-40f8-a15e-cf719734a805
elapsed-time: 198
Date: Wed, 30 Apr 2025 23:20:53 GMT
Connection: close
{
"@odata.count": 100,
"@search.nextPageParameters": {
"search": "*",
"count": true,
"skip": 50
},
"value": [
],
"@odata.nextLink": "https://<YOUR-SEARCH-SERVICE-NAME>.search.windows.net/indexes/doc-extraction-multimodal-embedding-index/docs/search?api-version=2025-05-01-preview"
}
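The count shows that 100 documents match. By default, a query returns the first 50 results; @search.nextPageParameters and @odata.nextLink point to the next page.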
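When a response carries @search.nextPageParameters, posting that object as the next request body retrieves the following page. The Python sketch below follows the pages until none remain. The search argument is a caller-supplied function that posts a body and returns the parsed JSON response; it and the iter_search_pages name are hypothetical, and a local stand-in replaces the service here so the loop can run offline.

```python
def iter_search_pages(search, body):
    """Yield each page of results, following @search.nextPageParameters."""
    while body is not None:
        response = search(body)
        yield response.get("value", [])
        # The service supplies the body for the next page, or omits it on the last page.
        body = response.get("@search.nextPageParameters")


# Local stand-in for the search endpoint: 100 fake documents, 50 per page.
FAKE_DOCS = [{"content_id": str(n)} for n in range(100)]


def fake_search(body):
    skip = body.get("skip", 0)
    response = {"@odata.count": len(FAKE_DOCS), "value": FAKE_DOCS[skip:skip + 50]}
    if skip + 50 < len(FAKE_DOCS):
        response["@search.nextPageParameters"] = {**body, "skip": skip + 50}
    return response


pages = list(iter_search_pages(fake_search, {"search": "*", "count": True}))
print(len(pages), "pages,", sum(len(page) for page in pages), "documents")
```

Passing the search function in keeps the paging logic independent of how requests are sent, so the same loop works with any HTTP client.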
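For filters, you can also use logical operators (and, or, not) and comparison operators (eq, ne, gt, lt, ge, le). String comparisons are case sensitive. For more information and examples, see Examples of simple search queries.
Note
The $filter parameter only works on fields that were marked filterable during index creation.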
### Query for only images
POST {{baseUrl}}/indexes/doc-extraction-multimodal-embedding-index/docs/search?api-version=2025-05-01-preview HTTP/1.1
Content-Type: application/json
api-key: {{apiKey}}
{
"search": "*",
"count": true,
"filter": "image_document_id ne null"
}
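The filter above works because the two index projection selectors use different parent key fields: text chunks populate text_document_id and images populate image_document_id. The same distinction can be applied on the client to an unfiltered result set, as in this Python sketch. The partition_results name is hypothetical, and the sketch assumes that the unused parent key comes back as null (None after JSON parsing) or is absent.

```python
def partition_results(documents):
    """Split search results into (text_chunks, images, unknown) by parent key."""
    text_chunks, images, unknown = [], [], []
    for doc in documents:
        if doc.get("image_document_id") is not None:
            images.append(doc)
        elif doc.get("text_document_id") is not None:
            text_chunks.append(doc)
        else:
            unknown.append(doc)  # neither parent key populated
    return text_chunks, images, unknown


# Hand-written sample documents with made-up values, shaped like this tutorial's index.
sample_results = [
    {"content_id": "a", "text_document_id": "doc1", "image_document_id": None,
     "content_text": "Example chunk of text.", "content_path": None},
    {"content_id": "b", "text_document_id": None, "image_document_id": "doc1",
     "content_text": None, "content_path": "images/example.jpg"},
]

text_chunks, images, unknown = partition_results(sample_results)
print(len(text_chunks), "text chunks,", len(images), "images,", len(unknown), "unknown")
```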
### Query for text or images with content related to energy, returning the id, parent document, and text (only populated for text chunks), and the content path where the image is saved in the knowledge store (only populated for images)
POST {{baseUrl}}/indexes/doc-extraction-multimodal-embedding-index/docs/search?api-version=2025-05-01-preview HTTP/1.1
Content-Type: application/json
api-key: {{apiKey}}
{
"search": "energy",
"count": true,
"select": "content_id, document_title, content_text, content_path"
}
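Reset and rerun
You can reset an indexer to clear the high-water mark, which allows a full rerun. The following POST requests reset the indexer and then rerun it, and the GET request checks its status.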
### Reset the indexer
POST {{baseUrl}}/indexers/doc-extraction-multimodal-embedding-indexer/reset?api-version=2025-05-01-preview HTTP/1.1
api-key: {{apiKey}}
### Run the indexer
POST {{baseUrl}}/indexers/doc-extraction-multimodal-embedding-indexer/run?api-version=2025-05-01-preview HTTP/1.1
api-key: {{apiKey}}
### Check indexer status
GET {{baseUrl}}/indexers/doc-extraction-multimodal-embedding-indexer/status?api-version=2025-05-01-preview HTTP/1.1
api-key: {{apiKey}}
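Because indexing runs asynchronously, the status request usually needs to be repeated until the run finishes. The Python sketch below polls until the last run reaches a final state. It assumes the status response exposes a lastResult object whose status is one of success, transientFailure, or persistentFailure when a run has ended, along with itemsProcessed and itemsFailed counters; verify these names against the Get Indexer Status reference before relying on them. The get_status argument is a caller-supplied function that performs the GET request and returns parsed JSON, and canned responses stand in for the service here.

```python
import time

# Assumed final states of an indexer run; any other value is treated as "not finished".
FINAL_STATES = {"success", "transientFailure", "persistentFailure"}


def wait_for_indexer(get_status, poll_seconds=10.0, max_polls=60, sleep=time.sleep):
    """Poll get_status() until the last run finishes, then return its lastResult."""
    for _ in range(max_polls):
        status = get_status()
        last_result = status.get("lastResult") or {}
        if last_result.get("status") in FINAL_STATES:
            return last_result
        sleep(poll_seconds)
    raise TimeoutError(f"indexer did not finish within {max_polls} polls")


# Canned responses standing in for successive calls to the status endpoint.
canned = iter([
    {"lastResult": None},
    {"lastResult": {"status": "inProgress"}},
    {"lastResult": {"status": "success", "itemsProcessed": 1, "itemsFailed": 0}},
])

result = wait_for_indexer(lambda: next(canned), sleep=lambda seconds: None)
print(result["status"], result["itemsProcessed"], result["itemsFailed"])
```

The reset state is intentionally not treated as final, so polling that starts right after a reset keeps waiting until the new run reports a result.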
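Clean up resources
When you're working in your own subscription, it's a good idea to remove the resources you no longer need at the end of a project. Resources left running can cost you money. You can delete resources individually or delete the resource group to delete the entire set of resources.
You can use the Azure portal to delete indexes, indexers, and data sources.
See also
Now that you're familiar with a sample implementation of a multimodal indexing scenario, check out: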