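Guidelines LLM judges use pass/fail natural language criteria to evaluate GenAI outputs. They excel at evaluating:
- Compliance: "Must not include pricing information"
- Style/tone: "Maintain a professional, empathetic tone"
- Requirements: "Must include specific disclaimers"
- Accuracy: "Use only facts from the provided context"
For detailed documentation and more examples, see the MLflow Guidelines judge documentation.
Guidelines LLM judges provide the following benefits:
- Business-friendly: domain experts write guidelines without writing code
- Flexible: update guidelines without changing code
- Interpretable: clear pass/fail conditions
- Fast iteration: quickly test new guidelines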
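Ways to use guidelines judges
MLflow provides the following guidelines judges:
- Built-in Guidelines() judge: Applies global guidelines uniformly to all rows. Evaluates only the app's inputs and outputs. Works in both offline evaluation and production monitoring.
- Built-in ExpectationsGuidelines() judge: Applies per-row guidelines labeled by domain experts in an evaluation dataset. Evaluates only the app's inputs and outputs. Works in offline evaluation only.
For API details, see the MLflow documentation.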
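How guidelines judges work
Guidelines judges use a specially tuned LLM to evaluate whether text meets the criteria you specify. The judge:
- Receives context: any JSON dictionary containing the data to evaluate (for example, request, response, retrieved_documents, user_preferences). You can reference these keys by name directly in your guidelines. See Reference context variables.
- Applies guidelines: your natural language rules that define the pass/fail conditions.
- Makes a judgment: returns a binary pass/fail score with a detailed rationale.
For more information about the models that power LLM judges, see Information about the models powering LLM judges.
You can change the judge model by using the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.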
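The model string format can be checked before it is passed to a judge. The following is a minimal sketch, not part of MLflow: split_judge_model is a hypothetical helper that only validates the <provider>:/<model-name> shape described above and does not verify that the provider or endpoint exists.

```python
def split_judge_model(model: str) -> tuple[str, str]:
    """Split '<provider>:/<model-name>' into (provider, model_name).

    Hypothetical helper for illustration; it checks the string shape only.
    """
    provider, separator, model_name = model.partition(":/")
    if not separator or not provider or not model_name:
        raise ValueError(f"expected '<provider>:/<model-name>', got {model!r}")
    return provider, model_name


# With the databricks provider, the model name is the serving endpoint name.
provider, endpoint = split_judge_model("databricks:/databricks-gpt-oss-120b")
print(provider)  # databricks
print(endpoint)  # databricks-gpt-oss-120b
```

Failing early on a malformed string keeps the error close to the configuration mistake instead of surfacing it in the middle of an evaluation run.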
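Prerequisites for running the examples
Install MLflow and the required packages.
%pip install --upgrade "mlflow[databricks]>=3.4.0"
dbutils.library.restartPython()
Create an MLflow experiment by following the set up your environment quickstart.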
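Guidelines() judge: global guidelines
The Guidelines judge applies uniform guidelines across all rows in your evaluation or all traces in production monitoring. It automatically extracts the request and response data from your trace and evaluates it against your guidelines.
In your guidelines, refer to the app's inputs as request and the app's outputs as response. The following code creates some simple guidelines.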
from mlflow.genai.scorers import Guidelines
import mlflow
# Example data
data = [
{
"inputs": {"question": "What is the capital of France?"},
"outputs": {"response": "The capital of France is Paris."}
},
{
"inputs": {"question": "What is the capital of Germany?"},
"outputs": {"response": "The capital of Germany is Berlin."}
}
]
# Create scorers with global guidelines
english = Guidelines(
name="english",
guidelines=["The response must be in English"]
)
clarity = Guidelines(
name="clarity",
guidelines=["The response must be clear, coherent, and concise"],
model="databricks:/databricks-gpt-oss-120b", # Optional custom judge model
)
# Evaluate with global guidelines
results = mlflow.genai.evaluate(
data=data,
scorers=[english, clarity]
)
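The following example creates an evaluation dataset with sample inputs and outputs. It then defines and runs a guideline that judges the tone of the response.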
from mlflow.genai.scorers import Guidelines
import mlflow
# Create evaluation dataset with pre-computed outputs
eval_dataset = [
{
"inputs": {
"messages": [{"role": "user", "content": "My order hasn't arrived yet"}]
},
"outputs": {
"choices": [{
"message": {
"content": "I understand your concern about the delayed order. Let me help you track it right away."
}
}]
},
},
{
"inputs": {
"messages": [{"role": "user", "content": "How do I reset my password?"}]
},
"outputs": {
"choices": [{
"message": {
"content": "To reset your password, click 'Forgot Password' on the login page. You'll receive an email within 5 minutes."
}
}]
},
}
]
# Run evaluation on existing outputs
results = mlflow.genai.evaluate(
data=eval_dataset,
scorers=[Guidelines(name="tone", guidelines="The response must maintain a courteous, respectful tone throughout. It must show empathy for customer concerns in the request"),]
)
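Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Name of the judge, displayed in evaluation results |
| guidelines | str \| list[str] | Yes | Guidelines applied uniformly to all rows |
| model | str | No | Custom judge model |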
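How guidelines judges parse your app's inputs and outputs
The guidelines judge automatically extracts data from your trace to build the guidelines context, using the keys request and response.
Request
The request field is extracted from the provided inputs:
- If inputs contains a messages key that holds an array of OpenAI-format chat messages:
  - If there is only one message, request is that message's content.
  - If there are multiple messages, request is the entire messages array serialized as a JSON string.
- Otherwise, request is the entire inputs dictionary serialized as a JSON string.
Request examples
Single message input: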
# Input
inputs = {
"messages": [
{"role": "user", "content": "How can I reset my password?"}
]
}
# Parsed request
"How can I reset my password?"
Multi-turn conversation:
# Input
inputs = {
"messages": [
{"role": "user", "content": "What is MLflow?"},
{"role": "assistant", "content": "MLflow is an open source platform..."},
{"role": "user", "content": "Tell me more about tracing"}
]
}
# Parsed request (JSON string)
'[{"role": "user", "content": "What is MLflow?"}, {"role": "assistant", "content": "MLflow is an open source platform..."}, {"role": "user", "content": "Tell me more about tracing"}]'
Arbitrary dictionary:
# Input
inputs = {"key1": "Explain MLflow evaluation", "key2": "something else"}
# Parsed request
'{"key1": "Explain MLflow evaluation", "key2": "something else"}'
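The request rules above can be sketched in a few lines of Python. This is an illustration of the documented behavior, not MLflow's actual implementation: parse_request is a hypothetical name, and treating a list of dictionaries with role and content keys as "OpenAI-format" is an assumption.

```python
import json


def parse_request(inputs: dict) -> str:
    """Illustrative sketch of how the request field is derived from inputs."""
    messages = inputs.get("messages")
    # Assumption: OpenAI-format means a non-empty list of {"role", "content"} dicts.
    is_chat = (
        isinstance(messages, list)
        and len(messages) > 0
        and all(isinstance(m, dict) and "role" in m and "content" in m for m in messages)
    )
    if is_chat:
        if len(messages) == 1:
            # Single message: use its content directly.
            return messages[0]["content"]
        # Multiple messages: serialize the whole array.
        return json.dumps(messages)
    # Anything else: serialize the entire inputs dictionary.
    return json.dumps(inputs)


print(parse_request({"messages": [{"role": "user", "content": "How can I reset my password?"}]}))
print(parse_request({"key1": "Explain MLflow evaluation", "key2": "something else"}))
```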
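Response
The response field is extracted from the provided outputs:
- If outputs contains an OpenAI-format ChatCompletions object:
  - response is the content of the first choice.
- If outputs contains a messages key with an array of OpenAI-format chat messages:
  - response is the content of the last message.
- Otherwise, response is outputs serialized as a JSON string.
Response examples
ChatCompletion output: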
# Output (simplified)
outputs = {
"choices": [{
"message": {
"content": "MLflow evaluation helps measure GenAI quality..."
}
}]
}
# Parsed response
"MLflow evaluation helps measure GenAI quality..."
Messages format output:
# Output
outputs = {
"messages": [
{"role": "user", "content": "What are the ..."},
{"role": "assistant", "content": "Here are the key features..."}
]
}
# Parsed response
"Here are the key features..."
Arbitrary dictionary:
# Output
outputs = {"key1": "Explain MLflow evaluation", "key2": "something else"}
# Parsed response
'{"key1": "Explain MLflow evaluation", "key2": "something else"}'
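The response rules follow the same pattern and can be sketched the same way. Again this is a hedged illustration rather than MLflow's code: parse_response is a hypothetical name, the structural checks are assumptions, and only dictionary outputs are covered because the rules above do not describe other types.

```python
import json


def parse_response(outputs: dict) -> str:
    """Illustrative sketch of how the response field is derived from outputs."""
    # Rule 1: ChatCompletions-style object -> content of the first choice.
    choices = outputs.get("choices")
    if isinstance(choices, list) and choices and isinstance(choices[0], dict):
        message = choices[0].get("message")
        if isinstance(message, dict) and "content" in message:
            return message["content"]
    # Rule 2: array of chat messages -> content of the last message.
    messages = outputs.get("messages")
    if isinstance(messages, list) and messages:
        last = messages[-1]
        if isinstance(last, dict) and "content" in last:
            return last["content"]
    # Rule 3: anything else -> the outputs dictionary as a JSON string.
    return json.dumps(outputs)


chat_completion = {"choices": [{"message": {"content": "MLflow evaluation helps measure GenAI quality..."}}]}
print(parse_response(chat_completion))
```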
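ExpectationsGuidelines() judge: per-row guidelines
The ExpectationsGuidelines judge evaluates each row against row-specific guidelines written by domain experts. This lets you use different evaluation criteria for each example in your dataset.
When to use
Use this judge when:
- You have domain experts who have labeled specific examples with custom guidelines
- Different rows require different evaluation criteria
Example
In your guidelines, refer to the app's inputs as request and the app's outputs as response.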
from mlflow.genai.scorers import ExpectationsGuidelines
import mlflow
# Dataset with per-row guidelines
data = [
{
"inputs": {"question": "What is the capital of France?"},
"outputs": "The capital of France is Paris.",
"expectations": {
"guidelines": ["The response must be factual and concise"]
}
},
{
"inputs": {"question": "How to learn Python?"},
"outputs": "You can read a book or take a course.",
"expectations": {
"guidelines": ["The response must be helpful and encouraging"]
}
}
]
# Evaluate with per-row guidelines
results = mlflow.genai.evaluate(
data=data,
scorers=[ExpectationsGuidelines()]
)
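Because every row carries its own guidelines, it can help to check the dataset before running the evaluation. The sketch below is an optional pre-flight check, not an MLflow feature: rows_missing_guidelines is a hypothetical helper, and requiring a non-empty list of non-empty strings under expectations.guidelines is an assumption that mirrors the dataset shape used in this example.

```python
def rows_missing_guidelines(rows: list[dict]) -> list[int]:
    """Return the indexes of rows without usable per-row guidelines."""
    missing = []
    for index, row in enumerate(rows):
        expectations = row.get("expectations") or {}
        guidelines = expectations.get("guidelines")
        # Assumption: guidelines should be a non-empty list of non-empty strings.
        is_valid = (
            isinstance(guidelines, list)
            and len(guidelines) > 0
            and all(isinstance(g, str) and g.strip() for g in guidelines)
        )
        if not is_valid:
            missing.append(index)
    return missing


sample_rows = [
    {"inputs": {"question": "What is the capital of France?"},
     "outputs": "The capital of France is Paris.",
     "expectations": {"guidelines": ["The response must be factual and concise"]}},
    {"inputs": {"question": "How to learn Python?"},
     "outputs": "You can read a book or take a course."},  # no expectations
]
print(rows_missing_guidelines(sample_rows))  # [1]
```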
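Return values
Guidelines judges return an mlflow.entities.Feedback object containing:
- value: "yes" (meets the guidelines) or "no" (fails the guidelines)
- rationale: a detailed explanation of why the content passed or failed
- name: the assessment name (provided or auto-generated)
- error: error details if the evaluation failed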
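The fields above are enough to tally results across many judgments. The following sketch uses a stand-in dataclass, FeedbackLike, that only mirrors the four fields listed here. It is not the mlflow.entities.Feedback class, and summarize is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedbackLike:
    """Stand-in with the four fields described above (not the MLflow class)."""
    name: str
    value: Optional[str] = None
    rationale: Optional[str] = None
    error: Optional[str] = None


def summarize(feedbacks: list) -> dict:
    """Count passed, failed, and errored judgments."""
    summary = {"passed": 0, "failed": 0, "errored": 0}
    for feedback in feedbacks:
        if feedback.error is not None:
            # The judge could not produce a verdict for this row.
            summary["errored"] += 1
        elif feedback.value == "yes":
            summary["passed"] += 1
        else:
            summary["failed"] += 1
    return summary


feedbacks = [
    FeedbackLike(name="tone", value="yes", rationale="Courteous and empathetic."),
    FeedbackLike(name="tone", value="no", rationale="Dismissive phrasing."),
    FeedbackLike(name="tone", error="Judge model request timed out."),
]
print(summarize(feedbacks))  # {'passed': 1, 'failed': 1, 'errored': 1}
```

Counting errors separately from failures keeps an unavailable judge from being mistaken for a guideline violation.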
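Best practices for writing effective guidelines
Well-written guidelines are essential for accurate evaluation. This section describes best practices for writing them.
Reference context variables
Reference any key from the context dictionary directly in your guidelines.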
# Example 1: Validate against retrieved documents
context = {
"request": "What is the refund policy?",
"response": "You can return items within 30 days for a full refund.",
"retrieved_documents": ["Policy: Returns accepted within 30 days", "Policy: No refunds after 30 days"]
}
guideline = "The response must only include information from retrieved_documents"
# Example 2: Check user preferences
context = {
"request": "Recommend a restaurant",
"response": "I suggest trying the new steakhouse downtown",
"user_preferences": {"dietary_restrictions": "vegetarian", "cuisine": "Italian"}
}
guideline = "The response must respect user_preferences when making recommendations"
# Example 3: Enforce business rules
context = {
"request": "Can you apply a discount?",
"response": "I've applied a 15% discount to your order",
"max_allowed_discount": 10,
"user_tier": "silver"
}
guideline = "The response must not exceed max_allowed_discount for the user_tier"
# Example 4: Multiple constraints
context = {
"request": "Tell me about product features",
"response": "This product includes features A, B, and C",
"approved_features": ["A", "B", "C", "D"],
"deprecated_features": ["X", "Y", "Z"]
}
guideline = """The response must:
- Only mention approved_features
- Not include deprecated_features"""
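Since guidelines refer to context keys by name, a misspelled key silently weakens a guideline. The sketch below is a hypothetical helper (referenced_keys is not an MLflow function) that reports which context keys a guideline mentions as whole words, so that a missing reference is easy to spot before evaluation.

```python
import re


def referenced_keys(guideline: str, context: dict) -> set:
    """Return the context keys that appear as whole words in the guideline."""
    return {
        key
        for key in context
        if re.search(rf"\b{re.escape(key)}\b", guideline)
    }


context = {
    "request": "Can you apply a discount?",
    "response": "I've applied a 15% discount to your order",
    "max_allowed_discount": 10,
    "user_tier": "silver",
}
guideline = "The response must not exceed max_allowed_discount for the user_tier"

used = referenced_keys(guideline, context)
unused = set(context) - used
print(sorted(used))    # ['max_allowed_discount', 'response', 'user_tier']
print(sorted(unused))  # ['request']
```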
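Additional guidelines
Be specific and measurable
✅ "The response must not include specific pricing amounts or percentages"
❌ "Don't talk about money"
Use clear pass/fail conditions
✅ "If asked about pricing, the response must direct users to the pricing page"
❌ "Handle pricing questions appropriately"
Reference context explicitly
✅ "The response must only use facts present in retrieved_context"
❌ "Be factual"
Structure complex requirements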
guideline = """The response must:
- Include a greeting if first message
- Address the user's specific question
- End with an offer to help further
- Not exceed 150 words"""
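When domain experts maintain requirements as separate rules, the structured string can be assembled programmatically. This is a small sketch under that assumption: build_guideline is a hypothetical helper, and the heading text and dash-prefixed layout simply copy the format shown above.

```python
def build_guideline(rules: list, heading: str = "The response must:") -> str:
    """Join individual rules into one structured, multi-line guideline."""
    cleaned = [rule.strip() for rule in rules if rule.strip()]
    if not cleaned:
        raise ValueError("at least one rule is required")
    return "\n".join([heading, *(f"- {rule}" for rule in cleaned)])


guideline = build_guideline([
    "Include a greeting if first message",
    "Address the user's specific question",
    "End with an offer to help further",
    "Not exceed 150 words",
])
print(guideline)
```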
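Real-world examples
Customer service chatbot
The following are practical guideline examples for evaluating a customer service chatbot across different scenarios:
Global guidelines for all interactions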
from mlflow.genai.scorers import Guidelines
import mlflow
# Define global standards for all customer interactions
tone_guidelines = Guidelines(
name="customer_service_tone",
guidelines="""The response must maintain our brand voice which is:
- Professional yet warm and conversational (avoid corporate jargon)
- Empathetic, acknowledging emotional context before jumping to solutions
- Proactive in offering help without being pushy
Specifically:
- If the customer expresses frustration, anger, or disappointment, the first sentence must acknowledge their emotion
- The response must use "I" statements to take ownership (e.g., "I understand" not "We understand")
- The response must avoid phrases that minimize concerns like "simply", "just", or "obviously"
- The response must end with a specific next step or open-ended offer to help, not generic closings"""
)
compliance_guidelines = Guidelines(
name="policy_compliance",
guidelines=[
"""Refund and return policies:
- The response must not promise refunds beyond the 30-day return window
- The response must mention restocking fees for electronics (15%) if discussing returns
- For items over $500, the response must note that manager approval is required
- The response must not waive shipping fees without checking customer loyalty tier""",
"""Data privacy and security:
- The response must never ask for full credit card numbers, SSN, or passwords
- The response must not reference other customers' orders or information
- When discussing order details, the response must only reference the last 4 digits of payment methods
- The response must direct customers to secure portal for updating payment information""",
"""Commitment limitations:
- The response must not guarantee specific delivery dates without system verification
- The response must include disclaimer about "estimated" times when discussing shipping
- The response must not promise price matches without mentioning the verification process
- For out-of-stock items, the response must not commit to restock dates"""
]
)
# Example customer service data
customer_interactions = [
{
"inputs": {
"messages": [{"role": "user", "content": "I've been waiting 2 hours on hold! This is ridiculous!"}]
},
"outputs": {
"choices": [{
"message": {
"content": "I completely understand your frustration about the long wait time. That's absolutely not the experience we want you to have. I'm here to help you right now - what can I assist you with today?"
}
}]
}
},
{
"inputs": {
"messages": [{"role": "user", "content": "Can I return my laptop? I bought it 45 days ago."}]
},
"outputs": {
"choices": [{
"message": {
"content": "I'd be happy to look into return options for you. Since our standard return window is 30 days and your purchase was 45 days ago, I'll need to check if there are any exceptions we can make. For electronics like laptops, we do have a 15% restocking fee. May I have your order number to review your specific situation?"
}
}]
}
}
]
# Evaluate customer service interactions
results = mlflow.genai.evaluate(
data=customer_interactions,
scorers=[tone_guidelines, compliance_guidelines]
)
Per-row guidelines for specific scenarios
from mlflow.genai.scorers import ExpectationsGuidelines
import mlflow
# Dataset with scenario-specific guidelines
customer_service_data = [
{
"inputs": {
"messages": [{"role": "user", "content": "My package is 3 weeks late and I need it for an event tomorrow!"}]
},
"outputs": {
"choices": [{
"message": {
"content": "I can absolutely understand how stressful this must be, especially with your event tomorrow. I'm so sorry your package has been delayed for 3 weeks - that's completely unacceptable. Let me help you immediately. I'm checking our local store inventory right now for same-day pickup options. I'm also processing a full refund of your shipping costs and applying a 25% discount to your order for this inconvenience. Would you prefer overnight delivery to arrive by 10 AM tomorrow, or would same-day pickup from our downtown location work better for you? I'm also connecting you with our escalation team - you'll receive a call within the next hour at the number on your account."
}
}]
},
"expectations": {
"guidelines": [
"""The response must handle this urgent situation with exceptional care:
- First acknowledge both the delay AND the urgent tomorrow deadline
- Express genuine empathy using phrases like "I can absolutely understand how stressful this must be"
- Must NOT make excuses or blame shipping partners""",
"""The response must provide immediate actionable solutions:
- Offer to check local store availability for same-day pickup
- If available, offer overnight shipping at no charge given the delay
- Provide direct escalation path: manager callback within 2 hours
- Include the executive customer service email for formal complaints""",
"""The response must include compensation options without being asked:
- Mention automatic refund of original shipping costs
- Offer 20-30% discount on current order due to inconvenience
- Provide credit for future purchase if customer prefers"""
]
}
},
{
"inputs": {
"messages": [{"role": "user", "content": "How do I cancel my subscription?"}]
},
"outputs": {
"choices": [{
"message": {
"content": "I can help you cancel your subscription right away. Here's how: 1) Click 'Account Settings' in the top right corner, 2) Select 'Subscription' from the menu, 3) Click 'Cancel Subscription' and confirm. Your access will continue until March 15, 2024 (the end of your current billing period), and you won't be charged again. All your saved data and preferences will be kept for 30 days if you decide to return. You're entitled to a prorated refund of $12.50 for the unused portion of this month. Alternatively, if you'd prefer, I can offer you a 50% discount for the next 3 months or help you switch to our Basic plan which is $4.99/month."
}
}]
},
"expectations": {
"guidelines": [
"""The response must respect the customer's decision to cancel:
- No guilt-inducing language or excessive retention attempts
- No hiding the cancellation process behind multiple steps
- Must not require phone calls if account was created online
- Acknowledge their right to cancel immediately without questions""",
"""The response must provide complete cancellation information:
- State the exact steps (maximum 3) to cancel online
- Clarify the last billing date and when access ends
- Explain what happens to their data, saved preferences, and history
- Mention the 30-day reactivation window where data is preserved
- Note any partial month refunds they're entitled to""",
"""The response may include ONE gentle retention offer:
- Can mention a pause option instead of cancellation
- Can offer a one-time discount or downgrade option
- Must present this as optional, after the cancellation instructions
- Must use soft language like 'alternatively' or 'if you'd prefer'"""
]
}
},
{
"inputs": {
"messages": [{"role": "user", "content": "I was charged twice for my order #12345"}]
},
"outputs": {
"choices": [{
"message": {
"content": "I see the duplicate charge on your order #12345, and I'll resolve this for you right away. I've initiated the refund for the duplicate charge of $157.99. For credit cards, you'll see this reflected in 3-5 business days, and you should see a pending reversal on your account within 24 hours. I'm sending you a confirmation email right now with the refund transaction ID (REF-789012) and all the details. Your case reference number is CS-456789 if you need to follow up. Since this occurred near month-end, if this causes any overdraft fees, please let us know - we'll reimburse up to $35 in bank fees. Our billing team's direct line is 1-800-555-0123 ext 2 if you need immediate assistance. This won't affect your credit or any future orders with us, and we're investigating our payment system to prevent this from happening again."
}
}]
},
"expectations": {
"guidelines": [
"""The response must immediately validate the customer's concern:
- Acknowledge the duplicate charge without skepticism
- Must not ask for proof or screenshots initially
- Express understanding of the inconvenience and potential financial impact
- Take ownership with phrases like 'I'll resolve this for you right away'""",
"""The response must provide specific resolution details:
- State exact refund timeline (e.g., '3-5 business days for credit cards, 5-7 for debit')
- Mention that they'll see a pending reversal within 24 hours
- Offer to send detailed confirmation email with transaction IDs
- Provide a reference number for this billing dispute
- Include the direct billing department contact for follow-up""",
"""The response must address potential concerns proactively:
- If near month-end, acknowledge potential impact on their budget
- Offer to provide a letter for their bank if overdraft fees occurred
- Mention our overdraft reimbursement policy (up to $35)
- Assure that this won't affect their credit or future orders
- Note that we're investigating to prevent future occurrences"""
]
}
}
]
results = mlflow.genai.evaluate(
data=customer_service_data,
scorers=[ExpectationsGuidelines()]
)
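Document extraction application
The following are practical guideline examples for evaluating a document extraction application:
Global guidelines for extraction quality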
from mlflow.genai.scorers import Guidelines
import mlflow
# Define extraction accuracy standards
extraction_accuracy = Guidelines(
name="extraction_accuracy",
guidelines=[
"""Field extraction completeness and accuracy:
- The response must extract ALL requested fields, using exact values from source
- For ambiguous data, the response must extract the most likely value and include a confidence score
- When multiple values exist for one field (e.g., multiple addresses), extract all and label them
- Preserve original formatting for IDs, reference numbers, and codes (including leading zeros)
- For missing fields, use null with reason: {"field": null, "reason": "not_found"} """,
"""Numerical and financial data handling:
- Currency values must preserve exact decimal places as shown in source
- Must differentiate between currencies if multiple are present (USD, EUR, etc.)
- Percentage values must clarify if they're decimals (0.15) or percentages (15%)
- For calculated fields (totals, tax), must match source exactly - no recalculation
- Negative values must be preserved with proper notation (-$100 or ($100))""",
"""Entity recognition and validation:
- Company names must be extracted exactly as written (including suffixes like Inc., LLC)
- Person names must preserve original order and formatting
- Must not merge similar entities (e.g., "John Smith" and "J. Smith" are kept separate)
- Email addresses and phone numbers must be validated for basic format
- Physical addresses must include all components present in source"""
]
)
format_compliance = Guidelines(
name="output_format",
guidelines="""Output structure must meet these enterprise data standards:
JSON Structure Requirements:
- Must be valid JSON that passes strict parsing
- All field names must use snake_case consistently
- Nested objects must maintain hierarchy from source document
- Arrays must be used for multiple values, never concatenated strings
Data Type Standards:
- Dates: ISO 8601 format (YYYY-MM-DD) with timezone if available
- Timestamps: ISO 8601 with time (YYYY-MM-DDTHH:MM:SSZ)
- Currency: {"amount": 1234.56, "currency": "USD", "formatted": "$1,234.56"}
- Phone: {"number": "+14155551234", "formatted": "(415) 555-1234", "type": "mobile"}
- Boolean: true/false (not "yes"/"no" or 1/0)
Metadata Requirements:
- Include extraction_timestamp in UTC
- Include source_page for multi-page documents
- Include confidence_score (0-1) for each ML-extracted field
- Include validation_flags array for any data quality issues detected"""
)
# Example document extraction data
extraction_tasks = [
{
"inputs": {
"document_text": "Invoice #INV-2024-001\nDate: 2024-01-15\nBill To: Acme Corp\n123 Main St, Suite 100\nAnytown, CA 94000\n\nItems:\n- Widget Pro (SKU: WP-100) - Qty: 10 x $50.00 = $500.00\n- Service Fee - $100.00\n\nSubtotal: $600.00\nTax (8.75%): $52.50\nTotal: $652.50\n\nDue Date: 2024-02-15\nPayment Terms: Net 30",
"fields_to_extract": ["invoice_number", "customer", "total_amount", "due_date", "line_items"]
},
"outputs": {
"invoice_number": "INV-2024-001",
"customer": {
"name": "Acme Corp",
"address": {
"street": "123 Main St, Suite 100",
"city": "Anytown",
"state": "CA",
"zip": "94000"
}
},
"total_amount": {
"amount": 652.50,
"currency": "USD",
"formatted": "$652.50"
},
"due_date": "2024-02-15",
"line_items": [
{
"description": "Widget Pro",
"sku": "WP-100",
"quantity": 10,
"unit_price": 50.00,
"total": 500.00
},
{
"description": "Service Fee",
"quantity": 1,
"unit_price": 100.00,
"total": 100.00
}
],
"extraction_timestamp": "2024-01-20T10:30:00Z",
"source_page": 1,
"confidence_score": 0.95
}
},
{
"inputs": {
"document_text": "Contract between TechStart Inc. and CloudProvider LLC\nEffective Date: January 1, 2024\nContract ID: C-2024-789\n\nThis agreement outlines cloud hosting services...\nMonthly Fee: €5,000\nContract Term: 24 months\nCancellation: 90 days written notice required",
"fields_to_extract": ["contract_id", "parties", "monthly_fee", "term_length"]
},
"outputs": {
"contract_id": "C-2024-789",
"parties": [
{"name": "TechStart Inc.", "role": "customer"},
{"name": "CloudProvider LLC", "role": "provider"}
],
"monthly_fee": {
"amount": 5000.00,
"currency": "EUR",
"formatted": "€5,000"
},
"term_length": {
"duration": 24,
"unit": "months"
},
"cancellation_notice": {
"days": 90,
"type": "written"
},
"extraction_timestamp": "2024-01-20T10:35:00Z",
"confidence_score": 0.92
}
}
]
# Evaluate document extractions
results = mlflow.genai.evaluate(
data=extraction_tasks,
scorers=[extraction_accuracy, format_compliance]
)
Per-row guidelines for document types
from mlflow.genai.scorers import ExpectationsGuidelines
import mlflow
# Dataset with document-type specific guidelines
document_extraction_data = [
{
"inputs": {
"document_type": "invoice",
"document_text": "Invoice #INV-2024-001\nBill To: Acme Corp\nAmount: $1,234.56\nDue Date: 2024-03-15"
},
"outputs": {
"invoice_number": "INV-2024-001",
"customer": "Acme Corp",
"total_amount": 1234.56,
"due_date": "2024-03-15"
},
"expectations": {
"guidelines": [
"""Invoice identification and classification:
- Must extract invoice_number preserving exact format including prefixes/suffixes
- Must identify invoice type (standard, credit memo, proforma) if specified
- Must extract both invoice date and due date, calculating days until due
- Must identify if this is a partial, final, or supplementary invoice
- For recurring invoices, must extract frequency and period covered""",
"""Financial data extraction and validation:
- Line items must be extracted as array with: description, quantity, unit_price, total
- Must identify and separate: subtotal, tax amounts (with rates), shipping, discounts
- Currency must be identified explicitly, not assumed to be USD
- For discounts, must specify if percentage or fixed amount and what it applies to
- Payment terms must be extracted (e.g., "Net 30", "2/10 Net 30")
- Must flag any mathematical inconsistencies between line items and totals""",
"""Vendor and customer information:
- Must extract complete billing and shipping addresses as separate objects
- Company names must include any DBA ("doing business as") variations
- Must extract tax IDs, business registration numbers if present
- Contact information must be categorized (billing contact vs. delivery contact)
- Must preserve any customer account numbers or reference codes"""
]
}
},
{
"inputs": {
"document_type": "contract",
"document_text": "This agreement between Party A and Party B commences on January 1, 2024..."
},
"outputs": {
"parties": ["Party A", "Party B"],
"effective_date": "2024-01-01",
"term_length": "Not specified"
},
"expectations": {
"guidelines": [
"""Party identification and roles:
- Must extract all parties with their full legal names and entity types (Inc., LLC, etc.)
- Must identify party roles (buyer/seller, licensee/licensor, employer/employee)
- Must extract any parent company relationships or guarantors mentioned
- Must capture all representatives, their titles, and authority to sign
- Must identify jurisdiction for each party if specified""",
"""Critical dates and terms extraction:
- Must differentiate between: execution date, effective date, and expiration date
- Must extract notice periods for termination (e.g., "30 days written notice")
- Must identify any automatic renewal clauses and their conditions
- Must extract all milestone dates and deliverable deadlines
- For amendments, must note which version/date of original contract is modified""",
"""Obligations and risk analysis:
- Must extract all payment terms, amounts, and schedules
- Must identify liability caps, indemnification clauses, and insurance requirements
- Must flag any non-standard clauses that deviate from typical contracts
- Must extract all conditions precedent and subsequent
- Must identify dispute resolution mechanism (arbitration, litigation, jurisdiction)
- Must extract any non-compete, non-solicitation, or confidentiality periods"""
]
}
},
{
"inputs": {
"document_type": "medical_record",
"document_text": "Patient: John Doe\nDOB: 1985-06-15\nDiagnosis: Type 2 Diabetes\nMedications: Metformin 500mg"
},
"outputs": {
"patient_name": "John Doe",
"date_of_birth": "1985-06-15",
"diagnoses": ["Type 2 Diabetes"],
"medications": [{"name": "Metformin", "dosage": "500mg"}]
},
"expectations": {
"guidelines": [
"""HIPAA compliance and privacy protection:
- Must never extract full SSN (only last 4 digits if needed for matching)
- Must never include full insurance policy numbers or member IDs
- Must redact or generalize sensitive mental health or substance abuse information
- For minors, must flag records requiring additional consent for sharing
- Must not extract genetic testing results without explicit permission flag""",
"""Clinical data extraction standards:
- Diagnoses must use ICD-10 codes when available, with lay descriptions
- Medications must include: generic name, brand name, dosage, frequency, route, start date
- Must differentiate between active medications and discontinued/past medications
- Allergies must specify type (drug, food, environmental) and reaction severity
- Lab results must include: value, unit, reference range, abnormal flags
- Vital signs must include measurement date/time and measurement conditions""",
"""Data quality and medical accuracy:
- Must flag any potentially dangerous drug interactions if multiple meds listed
- Must identify if vaccination records are up-to-date based on CDC guidelines
- Must extract both chief complaint and final diagnosis separately
- For chronic conditions, must note date of first diagnosis vs. most recent visit
- Must preserve clinical abbreviations but also provide expansions
- Must extract provider name, credentials, and NPI number if available"""
]
}
}
]
results = mlflow.genai.evaluate(
data=document_extraction_data,
scorers=[ExpectationsGuidelines()]
)
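Information about the models powering LLM judges
- LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.
- For Azure OpenAI, Databricks has opted out of Abuse Monitoring, so no prompts or responses are stored with Azure OpenAI.
- For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.
- Disabling Partner-powered AI features prevents the LLM judge from calling partner-powered models. You can still use LLM judges by providing your own model.
- LLM judges are intended to help customers evaluate their GenAI agents/applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.
Next steps
- Use built-in LLM judges - Evaluate quality with MLflow's other research-backed built-in LLM judges
- Create custom LLM judges - Build custom judges for your specific needs
- Align judges with human feedback - Improve judge accuracy to match your quality standards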