Content Understanding analyzers determine how your content is processed and what information is extracted. They ensure consistent processing and a uniform output structure across all content, yielding reliable and predictable results. For common use cases, you can use prebuilt analyzers. This guide explains how to customize these analyzers to better fit your needs.
This guide shows how to use the Content Understanding REST API to create a custom analyzer that extracts structured data from your content.
Prerequisites
- An active Azure subscription. If you don't have an Azure account, you can create one for free.
- A Microsoft Foundry resource created in a supported region.
- The portal lists this resource under Foundry > Foundry.
- Default model deployments set for your Content Understanding resource. Setting the defaults creates the connection to the Microsoft Foundry models used for Content Understanding requests. Choose one of the following methods:
- cURL installed in your development environment.
Define the analyzer schema
To create a custom analyzer, define a field schema that describes the structured data you want to extract. In the following example, you create an analyzer based on the prebuilt document analyzer to process receipts.
Create a JSON file named receipt.json with the following content:
{
"description": "Sample receipt analyzer",
"baseAnalyzerId": "prebuilt-document",
"models": {
"completion": "gpt-4.1",
"embedding": "text-embedding-3-large"
},
"config": {
"returnDetails": true,
"enableFormula": false,
"estimateFieldSourceAndConfidence": true,
"tableFormat": "html"
},
"fieldSchema": {
"fields": {
"VendorName": {
"type": "string",
"method": "extract",
"description": "Vendor issuing the receipt"
},
"Items": {
"type": "array",
"method": "extract",
"items": {
"type": "object",
"properties": {
"Description": {
"type": "string",
"method": "extract",
"description": "Description of the item"
},
"Amount": {
"type": "number",
"method": "extract",
"description": "Amount of the item"
}
}
}
}
}
}
}
If you need to handle a variety of document types but only want to classify and analyze receipts, first create an analyzer that classifies the documents and then routes them to the analyzer you created earlier, using the following schema.
Create a JSON file named categorize.json with the following content:
{
"baseAnalyzerId": "prebuilt-document",
// Use the base analyzer to invoke the document specific capabilities.
// Specify the models the analyzer should use. This is one of the supported completion models and, optionally, one of the supported embedding models. The specific deployment used during analysis is set on the resource or provided in the analyze request.
"models": {
"completion": "gpt-4.1"
},
"config": {
// Enable splitting of the input into segments. Set this property to false if you only expect a single document within the input file. When specified and enableSegment=false, the whole content will be classified into one of the categories.
"enableSegment": false,
"contentCategories": {
// Category name.
"receipt": {
// Description to help with classification and splitting.
"description": "Any images or documents of receipts",
// Define the analyzer that any content classified as a receipt should be routed to
"analyzerId": "receipt"
},
"invoice": {
"description": "Any images or documents of invoice",
"analyzerId": "prebuilt-invoice"
},
"policeReport": {
"description": "A police or law enforcement report detailing the events that lead to the loss."
// Don't perform analysis for this category.
}
},
// Omit original content object and only return content objects from additional analysis.
"omitContent": true
}
//You can use fieldSchema here to define fields that are needed from the entire input content.
}
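The category-to-analyzer routing that categorize.json declares can be sketched in plain Python. This is an illustration only; the `routes` map and `route` helper are hypothetical names mirroring the categories above, not part of the service API.

```python
# Hypothetical map mirroring categorize.json: each category routes to an
# analyzer ID; a category without one (policeReport) is classified only.
routes = {
    "receipt": "receipt",
    "invoice": "prebuilt-invoice",
    "policeReport": None,
}

def route(category: str) -> str:
    """Describe what happens to content classified under `category`."""
    analyzer = routes.get(category)
    if analyzer is None:
        return f"{category}: classified only, no further analysis"
    return f"{category}: analyze with '{analyzer}'"

print(route("receipt"))       # routed to the custom receipt analyzer
print(route("policeReport"))  # classification only
```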
Create the analyzer
PUT request
Create the receipt analyzer first, then the classification analyzer.
curl -i -X PUT "{endpoint}/contentunderstanding/analyzers/{analyzerId}?api-version=2025-11-01" \
-H "Ocp-Apim-Subscription-Key: {key}" \
-H "Content-Type: application/json" \
-d @receipt.json
PUT response
The 201 Created response includes an Operation-Location header containing a URL that you can use to track the status of this asynchronous analyzer creation operation.
201 Created
Operation-Location: {endpoint}/contentunderstanding/analyzers/{analyzerId}/operations/{operationId}?api-version=2025-11-01
When the operation completes, an HTTP GET to the operation-location URL returns "status": "succeeded".
curl -i -X GET "{endpoint}/contentunderstanding/analyzers/{analyzerId}/operations/{operationId}?api-version=2025-11-01" \
-H "Ocp-Apim-Subscription-Key: {key}"
Analyze a file
Submit the file
You can now use the custom analyzer you created to process files and extract the fields defined in your schema.
Before running the cURL command, make the following changes to the HTTP request:
- Replace {endpoint} and {key} with the endpoint and key values from your Foundry instance in the Azure portal.
- Replace {analyzerId} with the name of the custom analyzer you created with the categorize.json file.
- Replace {fileUrl} with a publicly accessible URL of the file to analyze, such as a path to an Azure Storage blob with a shared access signature (SAS), or the sample URL https://github.com/Azure-Samples/azure-ai-content-understanding-python/raw/refs/heads/main/data/receipt.png.
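The substitutions above amount to building the request URL. Here's a small Python sketch with hypothetical placeholder values (the endpoint and analyzer name below are made up):

```python
# Hypothetical values standing in for the {endpoint} and {analyzerId}
# placeholders in the cURL command.
endpoint = "https://my-foundry-resource.services.ai.azure.com"
analyzer_id = "my-categorize-analyzer"
api_version = "2025-11-01"

# Assemble the analyze request URL from its parts.
url = (
    f"{endpoint}/contentunderstanding/analyzers/"
    f"{analyzer_id}:analyze?api-version={api_version}"
)
print(url)
```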
POST request
This example analyzes a receipt using the custom analyzer you created with the categorize.json file.
curl -i -X POST "{endpoint}/contentunderstanding/analyzers/{analyzerId}:analyze?api-version=2025-11-01" \
-H "Ocp-Apim-Subscription-Key: {key}" \
-H "Content-Type: application/json" \
-d '{
"inputs":[
{
"url": "https://github.com/Azure-Samples/azure-ai-content-understanding-python/raw/refs/heads/main/data/receipt.png"
}
]
}'
POST response
The 202 Accepted response includes a {resultId} that you can use to track the status of this asynchronous operation.
{
"id": {resultId},
"status": "Running",
"result": {
"analyzerId": {analyzerId},
"apiVersion": "2025-11-01",
"createdAt": "YYYY-MM-DDTHH:MM:SSZ",
"warnings": [],
"contents": []
}
}
Get the analysis result
Use the Operation-Location value returned in the POST response to retrieve the analysis result.
GET request
curl -i -X GET "{endpoint}/contentunderstanding/analyzerResults/{resultId}?api-version=2025-11-01" \
-H "Ocp-Apim-Subscription-Key: {key}"
GET response
The 200 OK response includes a status field that indicates the progress of the operation.
- If the operation completed successfully, status is Succeeded.
- If the status is running or notStarted, call the API again, manually or through a script. Wait at least one second between requests.
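The retry guidance above can be sketched as a small polling helper. This is an illustration, not part of the API; `get_status` is a hypothetical stand-in for the HTTP GET on the result URL.

```python
import time

def poll_until_done(get_status, interval=1.0, timeout=60.0):
    """Call get_status() until it reports a terminal state, waiting
    `interval` seconds between requests (at least one second in practice)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status.lower() in ("succeeded", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("operation did not complete in time")

# Stub standing in for the GET request: reaches Succeeded on the third call.
statuses = iter(["notStarted", "running", "Succeeded"])
print(poll_until_done(lambda: next(statuses), interval=0.01))  # → Succeeded
```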
Sample response
{
"id": {resultId},
"status": "Succeeded",
"result": {
"analyzerId": {analyzerId},
"apiVersion": "2025-11-01",
"createdAt": "YYYY-MM-DDTHH:MM:SSZ",
"warnings": [],
"contents": [
{
"path": "input1/segment1",
"category": "receipt",
"markdown": "Contoso\n\n123 Main Street\nRedmond, WA 98052\n\n987-654-3210\n\n6/10/2019 13:59\nSales Associate: Paul\n\n\n<table>\n<tr>\n<td>2 Surface Pro 6</td>\n<td>$1,998.00</td>\n</tr>\n<tr>\n<td>3 Surface Pen</td>\n<td>$299.97</td>\n</tr>\n</table> ...",
"fields": {
"VendorName": {
"type": "string",
"valueString": "Contoso",
"spans": [{"offset": 0,"length": 7}],
"confidence": 0.996,
"source": "D(1,774.0000,72.0000,974.0000,70.0000,974.0000,111.0000,774.0000,113.0000)"
},
"Items": {
"type": "array",
"valueArray": [
{
"type": "object",
"valueObject": {
"Description": {
"type": "string",
"valueString": "2 Surface Pro 6",
"spans": [ { "offset": 115, "length": 15}],
"confidence": 0.423,
"source": "D(1,704.0000,482.0000,875.0000,482.0000,875.0000,508.0000,704.0000,508.0000)"
},
"Amount": {
"type": "number",
"valueNumber": 1998,
"spans": [{ "offset": 140,"length": 9}
],
"confidence": 0.957,
"source": "D(1,952.0000,482.0000,1048.0000,482.0000,1048.0000,508.0000,952.0000,509.0000)"
}
}
}, ...
]
}
},
"kind": "document",
"startPageNumber": 1,
"endPageNumber": 1,
"unit": "pixel",
"pages": [
{
"pageNumber": 1,
"angle": -0.0944,
"width": 1743,
"height": 878
}
],
"analyzerId": "{analyzerId}",
"mimeType": "image/png"
}
]
},
"usage": {
"documentPages": 1,
"tokens": {
"contextualization": 1000
}
}
}
This guide shows how to use the Content Understanding Python SDK to create a custom analyzer that extracts structured data from your content. Custom analyzers support document, image, audio, and video content types.
Prerequisites
- An active Azure subscription. If you don't have an Azure account, you can create one for free.
- A Microsoft Foundry resource created in a supported region.
- The resource endpoint and API key (found under Keys and Endpoint in the Azure portal).
- Model deployment defaults configured for your resource. For setup instructions, see Models and deployments or this one-time configuration script.
- Python 3.9 or later.
Setup
Install the Content Understanding client library for Python with pip:
pip install azure-ai-contentunderstanding
(Optional) Install the Azure Identity library for Microsoft Entra authentication:
pip install azure-identity
Set environment variables
To authenticate with the Content Understanding service, set the following environment variables with your own values before running the samples:
- CONTENTUNDERSTANDING_ENDPOINT - The endpoint of your Content Understanding resource.
- CONTENTUNDERSTANDING_KEY - Your Content Understanding API key (optional if you use Microsoft Entra ID DefaultAzureCredential).
Windows
setx CONTENTUNDERSTANDING_ENDPOINT "your-endpoint"
setx CONTENTUNDERSTANDING_KEY "your-key"
Linux/macOS
export CONTENTUNDERSTANDING_ENDPOINT="your-endpoint"
export CONTENTUNDERSTANDING_KEY="your-key"
Create a client
Import the required libraries and models, then create a client with your resource endpoint and credential.
import os
import time
from azure.ai.contentunderstanding import ContentUnderstandingClient
from azure.core.credentials import AzureKeyCredential
endpoint = os.environ["CONTENTUNDERSTANDING_ENDPOINT"]
key = os.environ["CONTENTUNDERSTANDING_KEY"]
client = ContentUnderstandingClient(
endpoint=endpoint,
credential=AzureKeyCredential(key),
)
Create a custom analyzer
The following example creates a custom document analyzer based on the prebuilt document base analyzer. It defines fields that use the three generation methods: extract for verbatim text, generate for AI-generated fields such as summaries or explanations, and classify for classification against a set of categories.
from azure.ai.contentunderstanding.models import (
ContentAnalyzer,
ContentAnalyzerConfig,
ContentFieldSchema,
ContentFieldDefinition,
ContentFieldType,
GenerationMethod,
)
# Generate a unique analyzer ID
analyzer_id = f"my_document_analyzer_{int(time.time())}"
# Define field schema with custom fields
field_schema = ContentFieldSchema(
name="company_schema",
description="Schema for extracting company information",
fields={
"company_name": ContentFieldDefinition(
type=ContentFieldType.STRING,
method=GenerationMethod.EXTRACT,
description="Name of the company",
estimate_source_and_confidence=True,
),
"total_amount": ContentFieldDefinition(
type=ContentFieldType.NUMBER,
method=GenerationMethod.EXTRACT,
description="Total amount on the document",
estimate_source_and_confidence=True,
),
"document_summary": ContentFieldDefinition(
type=ContentFieldType.STRING,
method=GenerationMethod.GENERATE,
description=(
"A brief summary of the document content"
),
),
"document_type": ContentFieldDefinition(
type=ContentFieldType.STRING,
method=GenerationMethod.CLASSIFY,
description="Type of document",
enum=[
"invoice", "receipt", "contract",
"report", "other",
],
),
},
)
# Create analyzer configuration
config = ContentAnalyzerConfig(
enable_formula=True,
enable_layout=True,
enable_ocr=True,
estimate_field_source_and_confidence=True,
return_details=True,
)
# Create the analyzer with field schema
analyzer = ContentAnalyzer(
base_analyzer_id="prebuilt-document",
description=(
"Custom analyzer for extracting company information"
),
config=config,
field_schema=field_schema,
models={
"completion": "gpt-4.1",
"embedding": "text-embedding-3-large",
}, # Required when using field_schema and prebuilt-document base analyzer
)
# Create the analyzer
poller = client.begin_create_analyzer(
analyzer_id=analyzer_id,
resource=analyzer,
)
result = poller.result() # Wait for creation to complete
# Get the full analyzer details after creation
result = client.get_analyzer(analyzer_id=analyzer_id)
print(f"Analyzer '{analyzer_id}' created successfully!")
if result.description:
print(f" Description: {result.description}")
if result.field_schema and result.field_schema.fields:
print(f" Fields ({len(result.field_schema.fields)}):")
for field_name, field_def in result.field_schema.fields.items():
method = field_def.method if field_def.method else "auto"
field_type = field_def.type if field_def.type else "unknown"
print(f" - {field_name}: {field_type} ({method})")
The output looks similar to the following:
Analyzer 'my_document_analyzer_ID' created successfully!
Description: Custom analyzer for extracting company information
Fields (4):
- company_name: ContentFieldType.STRING (GenerationMethod.EXTRACT)
- total_amount: ContentFieldType.NUMBER (GenerationMethod.EXTRACT)
- document_summary: ContentFieldType.STRING (GenerationMethod.GENERATE)
- document_type: ContentFieldType.STRING (GenerationMethod.CLASSIFY)
Tip
This code is based on the create analyzer sample in the SDK repository.
Optionally, you can create a classifier analyzer that categorizes documents and uses the results to route them to the prebuilt or custom analyzers you created. Here's an example of creating a custom analyzer for a classification workflow.
import time
from azure.ai.contentunderstanding.models import (
ContentAnalyzer,
ContentAnalyzerConfig,
ContentCategoryDefinition,
)
# Generate a unique analyzer ID
analyzer_id = f"my_classifier_{int(time.time())}"
print(f"Creating classifier '{analyzer_id}'...")
# Define content categories for classification
categories = {
"Loan_Application": ContentCategoryDefinition(
description="Documents submitted by individuals or businesses to request funding, "
"typically including personal or business details, financial history, "
"loan amount, purpose, and supporting documentation."
),
"Invoice": ContentCategoryDefinition(
description="Billing documents issued by sellers or service providers to request "
"payment for goods or services, detailing items, prices, taxes, totals, "
"and payment terms."
),
"Bank_Statement": ContentCategoryDefinition(
description="Official statements issued by banks that summarize account activity "
"over a period, including deposits, withdrawals, fees, and balances."
),
}
# Create analyzer configuration
config = ContentAnalyzerConfig(
return_details=True,
enable_segment=True, # Enable automatic segmentation by category
content_categories=categories,
)
# Create the classifier analyzer
classifier = ContentAnalyzer(
base_analyzer_id="prebuilt-document",
description="Custom classifier for financial document categorization",
config=config,
models={"completion": "gpt-4.1"},
)
# Create the classifier
poller = client.begin_create_analyzer(
analyzer_id=analyzer_id,
resource=classifier,
)
result = poller.result() # Wait for creation to complete
# Get the full analyzer details after creation
result = client.get_analyzer(analyzer_id=analyzer_id)
print(f"Classifier '{analyzer_id}' created successfully!")
if result.description:
print(f" Description: {result.description}")
Tip
This code is based on the create classifier sample in the SDK repository.
Use the custom analyzer
After you create the analyzer, use it to analyze documents and extract your custom fields. Delete the analyzer when you no longer need it.
# --- Use the custom document analyzer ---
from azure.ai.contentunderstanding.models import AnalysisInput
print("\nAnalyzing document...")
document_url = (
"https://raw.githubusercontent.com/"
"Azure-Samples/"
"azure-ai-content-understanding-assets/"
"main/document/invoice.pdf"
)
poller = client.begin_analyze(
analyzer_id=analyzer_id,
inputs=[AnalysisInput(url=document_url)],
)
result = poller.result()
if result.contents and len(result.contents) > 0:
content = result.contents[0]
if content.fields:
company = content.fields.get("company_name")
if company:
print(f"Company Name: {company.value}")
if company.confidence:
print(
f" Confidence:"
f" {company.confidence:.2f}"
)
total = content.fields.get("total_amount")
if total:
print(f"Total Amount: {total.value}")
summary = content.fields.get(
"document_summary"
)
if summary:
print(f"Summary: {summary.value}")
doc_type = content.fields.get("document_type")
if doc_type:
print(f"Document Type: {doc_type.value}")
else:
print("No content returned from analysis.")
# --- Clean up ---
print(f"\nCleaning up: deleting analyzer '{analyzer_id}'...")
client.delete_analyzer(analyzer_id=analyzer_id)
print(f"Analyzer '{analyzer_id}' deleted successfully.")
The output looks similar to the following:
Analyzing document...
Company Name: CONTOSO LTD.
Confidence: 0.81
Total Amount: 610.0
Summary: This document is an invoice from CONTOSO LTD. to Microsoft Corporation for consulting, document, and printing services provided during the service period. It details line items, subtotal, sales tax, total, previous unpaid balance, and the final amount due.
Document Type: invoice
Cleaning up: deleting analyzer 'my_document_analyzer_ID'...
Analyzer 'my_document_analyzer_ID' deleted successfully.
Tip
See more examples of running analyzers in the SDK samples.
This guide shows how to use the Content Understanding .NET SDK to create a custom analyzer that extracts structured data from your content. Custom analyzers support document, image, audio, and video content types.
Prerequisites
- An active Azure subscription. If you don't have an Azure account, you can create one for free.
- A Microsoft Foundry resource created in a supported region.
- The resource endpoint and API key (found under Keys and Endpoint in the Azure portal).
- Model deployment defaults configured for your resource. For setup instructions, see Models and deployments or this one-time configuration script.
- A current version of .NET.
Setup
Create a new .NET console application:
dotnet new console -n CustomAnalyzerTutorial
cd CustomAnalyzerTutorial
Install the Content Understanding client library for .NET:
dotnet add package Azure.AI.ContentUnderstanding
(Optional) Install the Azure Identity library for Microsoft Entra authentication:
dotnet add package Azure.Identity
Set environment variables
To authenticate with the Content Understanding service, set the following environment variables with your own values before running the samples:
- CONTENTUNDERSTANDING_ENDPOINT - The endpoint of your Content Understanding resource.
- CONTENTUNDERSTANDING_KEY - Your Content Understanding API key (optional if you use Microsoft Entra ID DefaultAzureCredential).
Windows
setx CONTENTUNDERSTANDING_ENDPOINT "your-endpoint"
setx CONTENTUNDERSTANDING_KEY "your-key"
Linux/macOS
export CONTENTUNDERSTANDING_ENDPOINT="your-endpoint"
export CONTENTUNDERSTANDING_KEY="your-key"
Create a client
using Azure;
using Azure.AI.ContentUnderstanding;
string endpoint = Environment.GetEnvironmentVariable(
"CONTENTUNDERSTANDING_ENDPOINT");
string key = Environment.GetEnvironmentVariable(
"CONTENTUNDERSTANDING_KEY");
var client = new ContentUnderstandingClient(
new Uri(endpoint),
new AzureKeyCredential(key)
);
Create a custom analyzer
The following example creates a custom document analyzer based on the prebuilt document analyzer. It defines fields that use the three generation methods: extract for verbatim text, generate for AI-generated summaries, and classify for classification.
string analyzerId =
$"my_document_analyzer_{DateTimeOffset.UtcNow.ToUnixTimeSeconds()}";
var fieldSchema = new ContentFieldSchema(
new Dictionary<string, ContentFieldDefinition>
{
["company_name"] = new ContentFieldDefinition
{
Type = ContentFieldType.String,
Method = GenerationMethod.Extract,
Description = "Name of the company"
},
["total_amount"] = new ContentFieldDefinition
{
Type = ContentFieldType.Number,
Method = GenerationMethod.Extract,
Description =
"Total amount on the document"
},
["document_summary"] = new ContentFieldDefinition
{
Type = ContentFieldType.String,
Method = GenerationMethod.Generate,
Description =
"A brief summary of the document content"
},
["document_type"] = new ContentFieldDefinition
{
Type = ContentFieldType.String,
Method = GenerationMethod.Classify,
Description = "Type of document"
}
})
{
Name = "company_schema",
Description =
"Schema for extracting company information"
};
fieldSchema.Fields["document_type"].Enum.Add("invoice");
fieldSchema.Fields["document_type"].Enum.Add("receipt");
fieldSchema.Fields["document_type"].Enum.Add("contract");
fieldSchema.Fields["document_type"].Enum.Add("report");
fieldSchema.Fields["document_type"].Enum.Add("other");
var config = new ContentAnalyzerConfig
{
EnableFormula = true,
EnableLayout = true,
EnableOcr = true,
EstimateFieldSourceAndConfidence = true,
ShouldReturnDetails = true
};
var customAnalyzer = new ContentAnalyzer
{
BaseAnalyzerId = "prebuilt-document",
Description =
"Custom analyzer for extracting"
+ " company information",
Config = config,
FieldSchema = fieldSchema
};
customAnalyzer.Models["completion"] = "gpt-4.1";
customAnalyzer.Models["embedding"] =
"text-embedding-3-large"; // Required when using field_schema and prebuilt-document base analyzer
var operation = await client.CreateAnalyzerAsync(
WaitUntil.Completed,
analyzerId,
customAnalyzer);
ContentAnalyzer result = operation.Value;
Console.WriteLine(
$"Analyzer '{analyzerId}'"
+ " created successfully!");
// Get the full analyzer details after creation
var analyzerDetails =
await client.GetAnalyzerAsync(analyzerId);
result = analyzerDetails.Value;
if (result.Description != null)
{
Console.WriteLine(
$" Description: {result.Description}");
}
if (result.FieldSchema?.Fields != null)
{
Console.WriteLine(
$" Fields"
+ $" ({result.FieldSchema.Fields.Count}):");
foreach (var kvp
in result.FieldSchema.Fields)
{
var method =
kvp.Value.Method?.ToString()
?? "auto";
var fieldType =
kvp.Value.Type?.ToString()
?? "unknown";
Console.WriteLine(
$" - {kvp.Key}:"
+ $" {fieldType} ({method})");
}
}
The output looks similar to the following:
Analyzer 'my_document_analyzer_ID' created successfully!
Description: Custom analyzer for extracting company information
Fields (4):
- company_name: string (extract)
- total_amount: number (extract)
- document_summary: string (generate)
- document_type: string (classify)
Tip
This code is based on the create analyzer sample in the SDK repository.
Optionally, you can create a classifier analyzer that categorizes documents and uses the results to route them to the prebuilt or custom analyzers you created. Here's an example of creating a custom analyzer for a classification workflow.
// Generate a unique analyzer ID
string classifierId =
$"my_classifier_{DateTimeOffset.UtcNow.ToUnixTimeSeconds()}";
Console.WriteLine(
$"Creating classifier '{classifierId}'...");
// Define content categories for classification
var classifierConfig = new ContentAnalyzerConfig
{
ShouldReturnDetails = true,
EnableSegment = true
};
classifierConfig.ContentCategories
.Add("Loan_Application",
new ContentCategoryDefinition
{
Description =
"Documents submitted by individuals"
+ " or businesses to request"
+ " funding, typically including"
+ " personal or business details,"
+ " financial history, loan amount,"
+ " purpose, and supporting"
+ " documentation."
});
classifierConfig.ContentCategories
.Add("Invoice",
new ContentCategoryDefinition
{
Description =
"Billing documents issued by"
+ " sellers or service providers"
+ " to request payment for goods"
+ " or services, detailing items,"
+ " prices, taxes, totals, and"
+ " payment terms."
});
classifierConfig.ContentCategories
.Add("Bank_Statement",
new ContentCategoryDefinition
{
Description =
"Official statements issued by"
+ " banks that summarize account"
+ " activity over a period,"
+ " including deposits,"
+ " withdrawals, fees,"
+ " and balances."
});
// Create the classifier analyzer
var classifierAnalyzer = new ContentAnalyzer
{
BaseAnalyzerId = "prebuilt-document",
Description =
"Custom classifier for financial"
+ " document categorization",
Config = classifierConfig
};
classifierAnalyzer.Models["completion"] =
"gpt-4.1";
var classifierOp =
await client.CreateAnalyzerAsync(
WaitUntil.Completed,
classifierId,
classifierAnalyzer);
// Get the full classifier details
var classifierDetails =
await client.GetAnalyzerAsync(classifierId);
var classifierResult =
classifierDetails.Value;
Console.WriteLine(
$"Classifier '{classifierId}'"
+ " created successfully!");
if (classifierResult.Description != null)
{
Console.WriteLine(
$" Description:"
+ $" {classifierResult.Description}");
}
Tip
This code is based on the create classifier sample for the classification workflow.
Use the custom analyzer
After you create the analyzer, use it to analyze documents and extract your custom fields. Delete the analyzer when you no longer need it.
var documentUrl = new Uri(
"https://raw.githubusercontent.com/"
+ "Azure-Samples/"
+ "azure-ai-content-understanding-assets/"
+ "main/document/invoice.pdf"
);
var analyzeOperation = await client.AnalyzeAsync(
WaitUntil.Completed,
analyzerId,
inputs: new[] {
new AnalysisInput { Uri = documentUrl }
});
var analyzeResult = analyzeOperation.Value;
if (analyzeResult.Contents?.FirstOrDefault()
is DocumentContent content)
{
if (content.Fields.TryGetValue(
"company_name", out var companyField))
{
var name =
companyField is ContentStringField sf
? sf.Value : null;
Console.WriteLine(
$"Company Name: "
+ $"{name ?? "(not found)"}");
Console.WriteLine(
" Confidence: "
+ (companyField.Confidence?
.ToString("F2") ?? "N/A"));
}
if (content.Fields.TryGetValue(
"total_amount", out var totalField))
{
var total =
totalField is ContentNumberField nf
? nf.Value : null;
Console.WriteLine(
$"Total Amount: {total}");
}
if (content.Fields.TryGetValue(
"document_summary", out var summaryField))
{
var summary =
summaryField is ContentStringField sf
? sf.Value : null;
Console.WriteLine(
$"Summary: "
+ $"{summary ?? "(not found)"}");
}
if (content.Fields.TryGetValue(
"document_type", out var typeField))
{
var docType =
typeField is ContentStringField sf
? sf.Value : null;
Console.WriteLine(
$"Document Type: "
+ $"{docType ?? "(not found)"}");
}
}
// --- Clean up ---
Console.WriteLine(
$"\nCleaning up: deleting analyzer"
+ $" '{analyzerId}'...");
await client.DeleteAnalyzerAsync(analyzerId);
Console.WriteLine(
$"Analyzer '{analyzerId}'"
+ " deleted successfully.");
The output looks similar to the following:
Company Name: CONTOSO LTD.
Confidence: 0.88
Total Amount: 610
Summary: This document is an invoice from CONTOSO LTD. to MICROSOFT CORPORATION for consulting services, document fees, and printing fees, detailing service periods, billing and shipping addresses, itemized charges, and the total amount due.
Document Type: invoice
Cleaning up: deleting analyzer 'my_document_analyzer_ID'...
Analyzer 'my_document_analyzer_ID' deleted successfully.
Tip
See the .NET SDK samples for more examples of running analyzers.
This guide shows how to use the Content Understanding Java SDK to create a custom analyzer that extracts structured data from your content. Custom analyzers support document, image, audio, and video content types.
Prerequisites
- An active Azure subscription. If you don't have an Azure account, you can create one for free.
- A Microsoft Foundry resource created in a supported region.
- The resource endpoint and API key (found under Keys and Endpoint in the Azure portal).
- Model deployment defaults configured for your resource. For setup instructions, see Models and deployments or this one-time configuration script.
- Java Development Kit (JDK) version 8 or later.
- Apache Maven.
Setup
Create a new Maven project:
mvn archetype:generate -DgroupId=com.example \
    -DartifactId=custom-analyzer-tutorial \
    -DarchetypeArtifactId=maven-archetype-quickstart \
    -DinteractiveMode=false
cd custom-analyzer-tutorial
Add the Content Understanding dependency to the <dependencies> section of your pom.xml file:
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-ai-contentunderstanding</artifactId>
    <version>1.0.0</version>
</dependency>
(Optional) Add the Azure Identity library for Microsoft Entra authentication:
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-identity</artifactId>
    <version>1.14.2</version>
</dependency>
Set environment variables
To authenticate with the Content Understanding service, set the following environment variables with your own values before running the samples:
- CONTENTUNDERSTANDING_ENDPOINT - The endpoint of your Content Understanding resource.
- CONTENTUNDERSTANDING_KEY - Your Content Understanding API key (optional if you use Microsoft Entra ID DefaultAzureCredential).
Windows
setx CONTENTUNDERSTANDING_ENDPOINT "your-endpoint"
setx CONTENTUNDERSTANDING_KEY "your-key"
Linux/macOS
export CONTENTUNDERSTANDING_ENDPOINT="your-endpoint"
export CONTENTUNDERSTANDING_KEY="your-key"
Create a client
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import com.azure.core.credential.AzureKeyCredential;
import com.azure.core.util.polling.SyncPoller;
import com.azure.ai.contentunderstanding
.ContentUnderstandingClient;
import com.azure.ai.contentunderstanding
.ContentUnderstandingClientBuilder;
import com.azure.ai.contentunderstanding.models.*;
String endpoint =
System.getenv("CONTENTUNDERSTANDING_ENDPOINT");
String key =
System.getenv("CONTENTUNDERSTANDING_KEY");
ContentUnderstandingClient client =
new ContentUnderstandingClientBuilder()
.endpoint(endpoint)
.credential(new AzureKeyCredential(key))
.buildClient();
Create a custom analyzer
The following example creates a custom document analyzer based on the prebuilt document analyzer. It defines fields that use the three generation methods: extract for verbatim text, generate for AI-generated summaries, and classify for classification.
String analyzerId =
"my_document_analyzer_"
+ System.currentTimeMillis();
Map<String, ContentFieldDefinition> fields =
new HashMap<>();
ContentFieldDefinition companyNameDef =
new ContentFieldDefinition();
companyNameDef.setType(ContentFieldType.STRING);
companyNameDef.setMethod(
GenerationMethod.EXTRACT);
companyNameDef.setDescription(
"Name of the company");
fields.put("company_name", companyNameDef);
ContentFieldDefinition totalAmountDef =
new ContentFieldDefinition();
totalAmountDef.setType(ContentFieldType.NUMBER);
totalAmountDef.setMethod(
GenerationMethod.EXTRACT);
totalAmountDef.setDescription(
"Total amount on the document");
fields.put("total_amount", totalAmountDef);
ContentFieldDefinition summaryDef =
new ContentFieldDefinition();
summaryDef.setType(ContentFieldType.STRING);
summaryDef.setMethod(
GenerationMethod.GENERATE);
summaryDef.setDescription(
"A brief summary of the document content");
fields.put("document_summary", summaryDef);
ContentFieldDefinition documentTypeDef =
new ContentFieldDefinition();
documentTypeDef.setType(ContentFieldType.STRING);
documentTypeDef.setMethod(
GenerationMethod.CLASSIFY);
documentTypeDef.setDescription(
"Type of document");
documentTypeDef.setEnumProperty(
Arrays.asList(
"invoice", "receipt", "contract",
"report", "other"
));
fields.put("document_type", documentTypeDef);
ContentFieldSchema fieldSchema =
new ContentFieldSchema();
fieldSchema.setName("company_schema");
fieldSchema.setDescription(
"Schema for extracting company information");
fieldSchema.setFields(fields);
Map<String, String> models = new HashMap<>();
models.put("completion", "gpt-4.1");
models.put("embedding", "text-embedding-3-large"); // Required when using field_schema and prebuilt-document base analyzer
ContentAnalyzer customAnalyzer =
new ContentAnalyzer()
.setBaseAnalyzerId("prebuilt-document")
.setDescription(
"Custom analyzer for extracting"
+ " company information")
.setConfig(new ContentAnalyzerConfig()
.setOcrEnabled(true)
.setLayoutEnabled(true)
.setFormulaEnabled(true)
.setEstimateFieldSourceAndConfidence(
true)
.setReturnDetails(true))
.setFieldSchema(fieldSchema)
.setModels(models);
SyncPoller<ContentAnalyzerOperationStatus,
ContentAnalyzer> operation =
client.beginCreateAnalyzer(
analyzerId, customAnalyzer, true);
ContentAnalyzer result =
operation.getFinalResult();
System.out.println(
"Analyzer '" + analyzerId
+ "' created successfully!");
if (result.getDescription() != null) {
System.out.println(
" Description: "
+ result.getDescription());
}
if (result.getFieldSchema() != null
&& result.getFieldSchema()
.getFields() != null) {
System.out.println(
" Fields ("
+ result.getFieldSchema()
.getFields().size() + "):");
result.getFieldSchema().getFields()
.forEach((fieldName, fieldDef) -> {
String method =
fieldDef.getMethod() != null
? fieldDef.getMethod()
.toString()
: "auto";
String type =
fieldDef.getType() != null
? fieldDef.getType()
.toString()
: "unknown";
System.out.println(
" - " + fieldName
+ ": " + type
+ " (" + method + ")");
});
}
The output looks similar to the following:
Analyzer 'my_document_analyzer_ID' created successfully!
Description: Custom analyzer for extracting company information
Fields (4):
- total_amount: number (extract)
- company_name: string (extract)
- document_summary: string (generate)
- document_type: string (classify)
Tip
This code is based on the create analyzer sample in the SDK repository.
Optionally, you can create a classifier analyzer that categorizes documents and uses the results to route them to the prebuilt or custom analyzers you created. Here's an example of creating a custom analyzer for a classification workflow.
// Generate a unique analyzer ID
String classifierId =
"my_classifier_" + System.currentTimeMillis();
System.out.println(
"Creating classifier '"
+ classifierId + "'...");
// Define content categories for classification
Map<String, ContentCategoryDefinition>
categories = new HashMap<>();
categories.put("Loan_Application",
new ContentCategoryDefinition()
.setDescription(
"Documents submitted by individuals"
+ " or businesses to request funding,"
+ " typically including personal or"
+ " business details, financial"
+ " history, loan amount, purpose,"
+ " and supporting documentation."));
categories.put("Invoice",
new ContentCategoryDefinition()
.setDescription(
"Billing documents issued by sellers"
+ " or service providers to request"
+ " payment for goods or services,"
+ " detailing items, prices, taxes,"
+ " totals, and payment terms."));
categories.put("Bank_Statement",
new ContentCategoryDefinition()
.setDescription(
"Official statements issued by banks"
+ " that summarize account activity"
+ " over a period, including deposits,"
+ " withdrawals, fees,"
+ " and balances."));
// Create the classifier
Map<String, String> classifierModels =
new HashMap<>();
classifierModels.put("completion", "gpt-4.1");
ContentAnalyzer classifier =
new ContentAnalyzer()
.setBaseAnalyzerId("prebuilt-document")
.setDescription(
"Custom classifier for financial"
+ " document categorization")
.setConfig(new ContentAnalyzerConfig()
.setReturnDetails(true)
.setSegmentEnabled(true)
.setContentCategories(categories))
.setModels(classifierModels);
SyncPoller<ContentAnalyzerOperationStatus,
ContentAnalyzer> classifierOp =
client.beginCreateAnalyzer(
classifierId, classifier, true);
classifierOp.getFinalResult();
// Get the full classifier details
ContentAnalyzer classifierResult =
client.getAnalyzer(classifierId);
System.out.println(
"Classifier '" + classifierId
+ "' created successfully!");
if (classifierResult.getDescription() != null) {
System.out.println(
" Description: "
+ classifierResult.getDescription());
}
Tip
This code is based on the create classifier sample for the classification workflow.
Use the custom analyzer
After you create the analyzer, use it to analyze documents and extract your custom fields. Delete the analyzer when you no longer need it.
String documentUrl =
"https://raw.githubusercontent.com/"
+ "Azure-Samples/"
+ "azure-ai-content-understanding-assets/"
+ "main/document/invoice.pdf";
AnalysisInput input = new AnalysisInput();
input.setUrl(documentUrl);
SyncPoller<ContentAnalyzerAnalyzeOperationStatus,
AnalysisResult> analyzeOperation =
client.beginAnalyze(
analyzerId, Arrays.asList(input));
AnalysisResult analyzeResult =
analyzeOperation.getFinalResult();
if (analyzeResult.getContents() != null
&& !analyzeResult.getContents().isEmpty()
&& analyzeResult.getContents().get(0)
instanceof DocumentContent) {
DocumentContent content =
(DocumentContent) analyzeResult
.getContents().get(0);
ContentField companyField =
content.getFields() != null
? content.getFields()
.get("company_name") : null;
if (companyField
instanceof ContentStringField) {
ContentStringField sf =
(ContentStringField) companyField;
System.out.println(
"Company Name: " + sf.getValue());
System.out.println(
" Confidence: "
+ companyField.getConfidence());
}
ContentField totalField =
content.getFields() != null
? content.getFields()
.get("total_amount") : null;
if (totalField != null) {
System.out.println(
"Total Amount: "
+ totalField.getValue());
}
ContentField summaryField =
content.getFields() != null
? content.getFields()
.get("document_summary") : null;
if (summaryField
instanceof ContentStringField) {
ContentStringField sf =
(ContentStringField) summaryField;
System.out.println(
"Summary: " + sf.getValue());
}
ContentField typeField =
content.getFields() != null
? content.getFields()
.get("document_type") : null;
if (typeField
instanceof ContentStringField) {
ContentStringField sf =
(ContentStringField) typeField;
System.out.println(
"Document Type: " + sf.getValue());
}
}
// --- Clean up ---
System.out.println(
"\nCleaning up: deleting analyzer '"
+ analyzerId + "'...");
client.deleteAnalyzer(analyzerId);
System.out.println(
"Analyzer '" + analyzerId
+ "' deleted successfully.");
The sample output looks like this:
Company Name: CONTOSO LTD.
Confidence: 0.781
Total Amount: 610.0
Summary: This document is an invoice from CONTOSO LTD. to Microsoft Corporation for consulting services, document fees, and printing fees, detailing service dates, itemized charges, taxes, and the total amount due.
Document Type: invoice
Cleaning up: deleting analyzer 'my_document_analyzer_ID'...
Analyzer 'my_document_analyzer_ID' deleted successfully.
Tip
See the run analyzer Java SDK sample for more details.
This guide shows how to use the Content Understanding JavaScript SDK to create a custom analyzer that extracts structured data from your content. Custom analyzers support document, image, audio, and video content types.
Prerequisites
- An active Azure subscription. If you don't have an Azure account, you can create one for free.
- A Microsoft Foundry resource created in a supported region.
- Your resource endpoint and API key (found under Keys and Endpoint in the Azure portal).
- Model deployment defaults configured for your resource. For setup instructions, see Models and deployments or this one-time configuration script.
- The Node.js LTS version.
Set up
Create a new Node.js project:
mkdir custom-analyzer-tutorial
cd custom-analyzer-tutorial
npm init -y
Install the Content Understanding client library:
npm install @azure/ai-content-understanding
(Optional) Install the Azure Identity library for Microsoft Entra authentication:
npm install @azure/identity
Set environment variables
To authenticate with the Content Understanding service, set the following environment variables with your own values before running the samples:
- CONTENTUNDERSTANDING_ENDPOINT - The endpoint of your Content Understanding resource.
- CONTENTUNDERSTANDING_KEY - Your Content Understanding API key (optional if you use Microsoft Entra ID with DefaultAzureCredential).
Windows
setx CONTENTUNDERSTANDING_ENDPOINT "your-endpoint"
setx CONTENTUNDERSTANDING_KEY "your-key"
Linux/macOS
export CONTENTUNDERSTANDING_ENDPOINT="your-endpoint"
export CONTENTUNDERSTANDING_KEY="your-key"
Create a client
const { AzureKeyCredential } =
require("@azure/core-auth");
const {
ContentUnderstandingClient,
} = require("@azure/ai-content-understanding");
const endpoint =
process.env["CONTENTUNDERSTANDING_ENDPOINT"];
const key =
process.env["CONTENTUNDERSTANDING_KEY"];
const client = new ContentUnderstandingClient(
endpoint,
new AzureKeyCredential(key)
);
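The prerequisites mention the optional @azure/identity package for Microsoft Entra authentication. As a hedged sketch of that path — assuming ContentUnderstandingClient accepts a TokenCredential in place of an AzureKeyCredential, which is the common Azure SDK constructor pattern — the API key can be omitted entirely:

```javascript
// Alternative: Microsoft Entra ID authentication via DefaultAzureCredential.
// Assumption: ContentUnderstandingClient accepts a TokenCredential, as most
// Azure SDK clients do. No API key is needed with this approach.
const { DefaultAzureCredential } = require("@azure/identity");
const {
  ContentUnderstandingClient,
} = require("@azure/ai-content-understanding");

const entraClient = new ContentUnderstandingClient(
  process.env["CONTENTUNDERSTANDING_ENDPOINT"],
  new DefaultAzureCredential()
);
```

DefaultAzureCredential tries a chain of credential sources (environment variables, managed identity, Azure CLI login), which makes the same code work locally and when deployed.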
Create a custom analyzer
The following example creates a custom document analyzer based on the prebuilt document analyzer. It defines fields that use three methods: extract for verbatim text, generate for AI-generated summaries, and classify for classification.
const analyzerId =
`my_document_analyzer_${Math.floor(
Date.now() / 1000
)}`;
const analyzer = {
baseAnalyzerId: "prebuilt-document",
description:
"Custom analyzer for extracting"
+ " company information",
config: {
enableFormula: true,
enableLayout: true,
enableOcr: true,
estimateFieldSourceAndConfidence: true,
returnDetails: true,
},
fieldSchema: {
name: "company_schema",
description:
"Schema for extracting company"
+ " information",
fields: {
company_name: {
type: "string",
method: "extract",
description:
"Name of the company",
},
total_amount: {
type: "number",
method: "extract",
description:
"Total amount on the"
+ " document",
},
document_summary: {
type: "string",
method: "generate",
description:
"A brief summary of the"
+ " document content",
},
document_type: {
type: "string",
method: "classify",
description: "Type of document",
enum: [
"invoice", "receipt",
"contract", "report", "other",
],
},
},
},
models: {
completion: "gpt-4.1",
embedding: "text-embedding-3-large", // Required when using field_schema and prebuilt-document base analyzer
},
};
const poller = client.createAnalyzer(
analyzerId, analyzer
);
await poller.pollUntilDone();
const result = await client.getAnalyzer(
analyzerId
);
console.log(
`Analyzer '${analyzerId}' created`
+ ` successfully!`
);
if (result.description) {
console.log(
` Description: ${result.description}`
);
}
if (result.fieldSchema?.fields) {
const fields = result.fieldSchema.fields;
console.log(
` Fields`
+ ` (${Object.keys(fields).length}):`
);
for (const [name, fieldDef]
of Object.entries(fields)) {
const method =
fieldDef.method ?? "auto";
const fieldType =
fieldDef.type ?? "unknown";
console.log(
` - ${name}: `
+ `${fieldType} (${method})`
);
}
}
The sample output looks like this:
Analyzer 'my_document_analyzer_ID' created successfully!
Description: Custom analyzer for extracting company information
Fields (4):
- company_name: string (extract)
- total_amount: number (extract)
- document_summary: string (generate)
- document_type: string (classify)
Tip
This code is based on the create analyzer sample in the SDK repository.
(Optional) You can create a classifier analyzer that categorizes documents and use its results to route documents to the prebuilt or custom analyzers you created. The following example creates a custom analyzer for a classification workflow.
const classifierId =
`my_classifier_${Math.floor(
Date.now() / 1000
)}`;
console.log(
`Creating classifier '${classifierId}'...`
);
const classifierAnalyzer = {
baseAnalyzerId: "prebuilt-document",
description:
"Custom classifier for financial"
+ " document categorization",
config: {
returnDetails: true,
enableSegment: true,
contentCategories: {
Loan_Application: {
description:
"Documents submitted by"
+ " individuals or"
+ " businesses to request"
+ " funding, typically"
+ " including personal or"
+ " business details,"
+ " financial history,"
+ " loan amount, purpose,"
+ " and supporting"
+ " documentation.",
},
Invoice: {
description:
"Billing documents issued"
+ " by sellers or service"
+ " providers to request"
+ " payment for goods or"
+ " services, detailing"
+ " items, prices, taxes,"
+ " totals, and payment"
+ " terms.",
},
Bank_Statement: {
description:
"Official statements"
+ " issued by banks that"
+ " summarize account"
+ " activity over a"
+ " period, including"
+ " deposits, withdrawals,"
+ " fees, and balances.",
},
},
},
models: {
completion: "gpt-4.1",
},
};
const classifierPoller =
client.createAnalyzer(
classifierId, classifierAnalyzer
);
await classifierPoller.pollUntilDone();
const classifierResult =
await client.getAnalyzer(classifierId);
console.log(
`Classifier '${classifierId}' created`
+ ` successfully!`
);
if (classifierResult.description) {
console.log(
` Description: `
+ `${classifierResult.description}`
);
}
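The classifier's categories can then drive routing. The API for reading the predicted category out of an analyze result isn't shown in this sample, so the sketch below assumes you already have the category string in hand; the route table and analyzer IDs are hypothetical placeholders for analyzers you created separately.

```javascript
// Hypothetical routing table: classifier category -> analyzer ID.
// Category names match the contentCategories defined above; the analyzer
// IDs are placeholders, not values the service provides.
const routes = {
  Invoice: "my_document_analyzer",
  Loan_Application: "my_loan_analyzer",
  Bank_Statement: "my_bank_statement_analyzer",
};

// Pick the analyzer for a predicted category, falling back to a
// general-purpose analyzer for anything unrecognized.
function routeByCategory(category, routeTable, fallbackId) {
  return routeTable[category] ?? fallbackId;
}

console.log(
  routeByCategory("Invoice", routes, "prebuilt-document")
);
// After routing, you'd call client.analyze(selectedAnalyzerId, [{ url }]).
```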
Tip
This code is based on the create classifier sample for classification workflows.
Use a custom analyzer
After you create an analyzer, use it to analyze documents and extract custom fields. Delete the analyzer when you no longer need it.
const documentUrl =
"https://raw.githubusercontent.com/"
+ "Azure-Samples/"
+ "azure-ai-content-understanding-assets/"
+ "main/document/invoice.pdf";
const analyzePoller = client.analyze(
analyzerId, [{ url: documentUrl }]
);
const analyzeResult =
await analyzePoller.pollUntilDone();
if (analyzeResult.contents
&& analyzeResult.contents.length > 0) {
const content = analyzeResult.contents[0];
if (content.fields) {
const company =
content.fields["company_name"];
if (company) {
console.log(
`Company Name: `
+ `${company.value}`
);
console.log(
` Confidence: `
+ `${company.confidence}`
);
}
const total =
content.fields["total_amount"];
if (total) {
console.log(
`Total Amount: `
+ `${total.value}`
);
}
const summary =
content.fields["document_summary"];
if (summary) {
console.log(
`Summary: ${summary.value}`
);
}
const docType =
content.fields["document_type"];
if (docType) {
console.log(
`Document Type: `
+ `${docType.value}`
);
}
}
}
// --- Clean up ---
console.log(
`\nCleaning up: deleting analyzer`
+ ` '${analyzerId}'...`
);
await client.deleteAnalyzer(analyzerId);
console.log(
`Analyzer '${analyzerId}' deleted`
+ ` successfully.`
);
The sample output looks like this:
Company Name: CONTOSO LTD.
Confidence: 0.739
Total Amount: 610
Summary: This document is an invoice from CONTOSO LTD. to Microsoft Corporation for consulting, document, and printing services provided during the service period. It details line items, subtotal, sales tax, total, previous unpaid balance, and the final amount due.
Document Type: invoice
Cleaning up: deleting analyzer 'my_document_analyzer_ID'...
Analyzer 'my_document_analyzer_ID' deleted successfully.
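The confidence score in the output (0.739 above) is useful for gating automation. The helper below is a hypothetical sketch, not part of the SDK: the threshold and the split into accepted versus review buckets are assumptions, while the { value, confidence } field shape mirrors the results printed above.

```javascript
// Hypothetical helper: keep only fields whose confidence meets a threshold,
// so low-confidence extractions can be sent to human review instead.
function filterByConfidence(fields, threshold = 0.8) {
  const accepted = {};
  const review = {};
  for (const [name, field] of Object.entries(fields)) {
    // Fields without a confidence score (for example, generated
    // summaries) pass through as accepted.
    if (field.confidence === undefined
        || field.confidence >= threshold) {
      accepted[name] = field;
    } else {
      review[name] = field;
    }
  }
  return { accepted, review };
}

const { accepted, review } = filterByConfidence({
  company_name: { value: "CONTOSO LTD.", confidence: 0.739 },
  document_type: { value: "invoice", confidence: 0.97 },
});
console.log(Object.keys(review)); // company_name falls below 0.8
```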
Tip
See the run analyzer JavaScript SDK sample for more details.
This guide shows how to use the Content Understanding TypeScript SDK to create a custom analyzer that extracts structured data from your content. Custom analyzers support document, image, audio, and video content types.
Prerequisites
- An active Azure subscription. If you don't have an Azure account, you can create one for free.
- A Microsoft Foundry resource created in a supported region.
- Your resource endpoint and API key (found under Keys and Endpoint in the Azure portal).
- Model deployment defaults configured for your resource. For setup instructions, see Models and deployments or this one-time configuration script.
- The Node.js LTS version.
- TypeScript 5.x or later.
Set up
Create a new Node.js project:
mkdir custom-analyzer-tutorial
cd custom-analyzer-tutorial
npm init -y
Install TypeScript and the Content Understanding client library:
npm install typescript ts-node @azure/ai-content-understanding
(Optional) Install the Azure Identity library for Microsoft Entra authentication:
npm install @azure/identity
Set environment variables
To authenticate with the Content Understanding service, set the following environment variables with your own values before running the samples:
- CONTENTUNDERSTANDING_ENDPOINT - The endpoint of your Content Understanding resource.
- CONTENTUNDERSTANDING_KEY - Your Content Understanding API key (optional if you use Microsoft Entra ID with DefaultAzureCredential).
Windows
setx CONTENTUNDERSTANDING_ENDPOINT "your-endpoint"
setx CONTENTUNDERSTANDING_KEY "your-key"
Linux/macOS
export CONTENTUNDERSTANDING_ENDPOINT="your-endpoint"
export CONTENTUNDERSTANDING_KEY="your-key"
Create a client
import { AzureKeyCredential } from "@azure/core-auth";
import {
ContentUnderstandingClient,
} from "@azure/ai-content-understanding";
import type {
ContentAnalyzer,
ContentAnalyzerConfig,
ContentFieldSchema,
} from "@azure/ai-content-understanding";
const endpoint =
process.env["CONTENTUNDERSTANDING_ENDPOINT"]!;
const key =
process.env["CONTENTUNDERSTANDING_KEY"]!;
const client = new ContentUnderstandingClient(
endpoint,
new AzureKeyCredential(key)
);
Create a custom analyzer
The following example creates a custom document analyzer based on the prebuilt document analyzer. It defines fields that use three methods: extract for verbatim text, generate for AI-generated summaries, and classify for classification.
const analyzerId =
`my_document_analyzer_${Math.floor(
Date.now() / 1000
)}`;
const fieldSchema: ContentFieldSchema = {
name: "company_schema",
description:
"Schema for extracting company"
+ " information",
fields: {
company_name: {
type: "string",
method: "extract",
description:
"Name of the company",
},
total_amount: {
type: "number",
method: "extract",
description:
"Total amount on the document",
},
document_summary: {
type: "string",
method: "generate",
description:
"A brief summary of the"
+ " document content",
},
document_type: {
type: "string",
method: "classify",
description: "Type of document",
enum: [
"invoice", "receipt",
"contract", "report", "other",
],
},
},
};
const config: ContentAnalyzerConfig = {
enableFormula: true,
enableLayout: true,
enableOcr: true,
estimateFieldSourceAndConfidence: true,
returnDetails: true,
};
const analyzer: ContentAnalyzer = {
baseAnalyzerId: "prebuilt-document",
description:
"Custom analyzer for extracting"
+ " company information",
config,
fieldSchema,
models: {
completion: "gpt-4.1",
embedding: "text-embedding-3-large", // Required when using field_schema and prebuilt-document base analyzer
},
} as unknown as ContentAnalyzer;
const poller = client.createAnalyzer(
analyzerId, analyzer
);
await poller.pollUntilDone();
const result = await client.getAnalyzer(
analyzerId
);
console.log(
`Analyzer '${analyzerId}' created`
+ ` successfully!`
);
if (result.description) {
console.log(
` Description: ${result.description}`
);
}
if (result.fieldSchema?.fields) {
const fields = result.fieldSchema.fields;
console.log(
` Fields`
+ ` (${Object.keys(fields).length}):`
);
for (const [name, fieldDef]
of Object.entries(fields)) {
const method =
fieldDef.method ?? "auto";
const fieldType =
fieldDef.type ?? "unknown";
console.log(
` - ${name}: `
+ `${fieldType} (${method})`
);
}
}
The sample output looks like this:
Analyzer 'my_document_analyzer_ID' created successfully!
Description: Custom analyzer for extracting company information
Fields (4):
- company_name: string (extract)
- total_amount: number (extract)
- document_summary: string (generate)
- document_type: string (classify)
Tip
This code is based on the create analyzer sample in the SDK repository.
(Optional) You can create a classifier analyzer that categorizes documents and use its results to route documents to the prebuilt or custom analyzers you created. The following example creates a custom analyzer for a classification workflow.
const classifierId =
`my_classifier_${Math.floor(
Date.now() / 1000
)}`;
console.log(
`Creating classifier '${classifierId}'...`
);
const classifierAnalyzer: ContentAnalyzer = {
baseAnalyzerId: "prebuilt-document",
description:
"Custom classifier for financial"
+ " document categorization",
config: {
returnDetails: true,
enableSegment: true,
contentCategories: {
Loan_Application: {
description:
"Documents submitted by"
+ " individuals or"
+ " businesses to request"
+ " funding, typically"
+ " including personal or"
+ " business details,"
+ " financial history,"
+ " loan amount, purpose,"
+ " and supporting"
+ " documentation.",
},
Invoice: {
description:
"Billing documents issued"
+ " by sellers or service"
+ " providers to request"
+ " payment for goods or"
+ " services, detailing"
+ " items, prices, taxes,"
+ " totals, and payment"
+ " terms.",
},
Bank_Statement: {
description:
"Official statements"
+ " issued by banks that"
+ " summarize account"
+ " activity over a"
+ " period, including"
+ " deposits, withdrawals,"
+ " fees, and balances.",
},
},
} as unknown as ContentAnalyzerConfig,
models: {
completion: "gpt-4.1",
},
} as unknown as ContentAnalyzer;
const classifierPoller =
client.createAnalyzer(
classifierId, classifierAnalyzer
);
await classifierPoller.pollUntilDone();
const classifierResult =
await client.getAnalyzer(classifierId);
console.log(
`Classifier '${classifierId}' created`
+ ` successfully!`
);
if (classifierResult.description) {
console.log(
` Description: `
+ `${classifierResult.description}`
);
}
Tip
This code is based on the create classifier sample for classification workflows.
Use a custom analyzer
After you create an analyzer, use it to analyze documents and extract custom fields. Delete the analyzer when you no longer need it.
const documentUrl =
"https://raw.githubusercontent.com/"
+ "Azure-Samples/"
+ "azure-ai-content-understanding-assets/"
+ "main/document/invoice.pdf";
const analyzePoller = client.analyze(
analyzerId, [{ url: documentUrl }]
);
const analyzeResult =
await analyzePoller.pollUntilDone();
if (analyzeResult.contents
&& analyzeResult.contents.length > 0) {
const content = analyzeResult.contents[0];
if (content.fields) {
const company =
content.fields["company_name"];
if (company) {
console.log(
`Company Name: `
+ `${company.value}`
);
console.log(
` Confidence: `
+ `${company.confidence}`
);
}
const total =
content.fields["total_amount"];
if (total) {
console.log(
`Total Amount: `
+ `${total.value}`
);
}
const summary =
content.fields["document_summary"];
if (summary) {
console.log(
`Summary: ${summary.value}`
);
}
const docType =
content.fields["document_type"];
if (docType) {
console.log(
`Document Type: `
+ `${docType.value}`
);
}
}
}
// --- Clean up ---
console.log(
`\nCleaning up: deleting analyzer`
+ ` '${analyzerId}'...`
);
await client.deleteAnalyzer(analyzerId);
console.log(
`Analyzer '${analyzerId}' deleted`
+ ` successfully.`
);
The sample output looks like this:
Company Name: CONTOSO LTD.
Confidence: 0.818
Total Amount: 610
Summary: This document is an invoice from CONTOSO LTD. to MICROSOFT CORPORATION for consulting, document, and printing services provided during the service period 10/14/2019 - 11/14/2019. It details line items, subtotal, sales tax, total, previous unpaid balance, and the final amount due.
Document Type: invoice
Cleaning up: deleting analyzer 'my_document_analyzer_ID'...
Analyzer 'my_document_analyzer_ID' deleted successfully.
Tip
See the run analyzer TypeScript SDK sample for more details.
Related content
- Review the code sample: visual document search.
- Review the code sample: analyzer templates.
- Explore more Python SDK samples
- Explore more .NET SDK samples
- Explore more Java SDK samples
- Explore more JavaScript SDK samples
- Explore more TypeScript SDK samples
- Try the Foundry Content Understanding experience for processing document content.