Tutorial: Create a custom search engine and question-answering system

Article
01/23/2024

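In this tutorial, you learn how to index and query large data loaded from a Spark cluster. You set up a Jupyter Notebook that performs the following actions:

Load various forms (invoices) into a DataFrame in an Apache Spark session

Analyze them to determine their features

Assemble the resulting output into a tabular data structure

Write the output to a search index hosted in Azure Cognitive Search

Explore and query the content you created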

1 - Set up dependencies

We start by importing packages and connecting to the Azure resources used in this workflow.

import os
from pyspark.sql import SparkSession
from synapse.ml.core.platform import running_on_synapse, find_secret

# Bootstrap Spark Session
spark = SparkSession.builder.getOrCreate()

cognitive_key = find_secret("cognitive-api-key") # replace with your cognitive api key
cognitive_location = "eastus"

translator_key = find_secret("translator-key") # replace with your translator api key
translator_location = "eastus"

search_key = find_secret("azure-search-key") # replace with your azure search api key
search_service = "mmlspark-azure-search"
search_index = "form-demo-index-5"

openai_key = find_secret("openai-api-key") # replace with your open ai api key
openai_service_name = "synapseml-openai"
openai_deployment_name = "gpt-35-turbo"
openai_url = f"https://{openai_service_name}.openai.azure.com/"

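2 - Load data into Spark

This code loads a few external files from an Azure storage account that is used for demo purposes. The files are various invoices, and they are read into a DataFrame.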

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


def blob_to_url(blob):
    [prefix, postfix] = blob.split("@")
    container = prefix.split("/")[-1]
    split_postfix = postfix.split("/")
    account = split_postfix[0]
    filepath = "/".join(split_postfix[1:])
    return "https://{}/{}/{}".format(account, container, filepath)


df2 = (
    spark.read.format("binaryFile")
    .load("wasbs://ignite2021@mmlsparkdemo.blob.core.windows.net/form_subset/*")
    .select("path")
    .limit(10)
    .select(udf(blob_to_url, StringType())("path").alias("url"))
    .cache()
)

display(df2)
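To see what `blob_to_url` does, the following standalone sketch applies the same splitting logic to a single `wasbs://` path outside Spark. The file name `invoice.pdf` is a made-up placeholder, not a file known to exist in the demo container.

```python
def blob_to_url(blob):
    # "wasbs://<container>@<account-host>/<path>"
    #   becomes "https://<account-host>/<container>/<path>"
    prefix, postfix = blob.split("@")
    container = prefix.split("/")[-1]
    account, *path_parts = postfix.split("/")
    return "https://{}/{}/{}".format(account, container, "/".join(path_parts))


# Hypothetical file name, used only for illustration.
example_path = (
    "wasbs://ignite2021@mmlsparkdemo.blob.core.windows.net/form_subset/invoice.pdf"
)
example_url = blob_to_url(example_path)
print(example_url)
```

The conversion matters because the form-analysis step in the next section reads each document from an HTTPS URL column rather than from a `wasbs://` path.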

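3 - Apply form recognition

This code loads the AnalyzeInvoices transformer and passes it a reference to the DataFrame containing the invoices. It calls the pre-built invoice model of Azure Forms Analyzer.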

from synapse.ml.cognitive import AnalyzeInvoices

analyzed_df = (
    AnalyzeInvoices()
    .setSubscriptionKey(cognitive_key)
    .setLocation(cognitive_location)
    .setImageUrlCol("url")
    .setOutputCol("invoices")
    .setErrorCol("errors")
    .setConcurrency(5)
    .transform(df2)
    .cache()
)

display(analyzed_df)

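4 - Simplify form recognition output

This code uses FormOntologyLearner, a transformer that analyzes the output of Form Recognizer transformers (for Azure AI Document Intelligence) and infers a tabular data structure. The output of AnalyzeInvoices is dynamic and varies based on the features detected in your content.

FormOntologyLearner extends the utility of the AnalyzeInvoices transformer by looking for patterns that can be used to create a tabular data structure. Organizing the output into multiple columns and rows makes downstream analysis simpler.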

from synapse.ml.cognitive import FormOntologyLearner

organized_df = (
    FormOntologyLearner()
    .setInputCol("invoices")
    .setOutputCol("extracted")
    .fit(analyzed_df)
    .transform(analyzed_df)
    .select("url", "extracted.*")
    .cache()
)

display(organized_df)
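As a rough intuition for why a learned structure helps, the sketch below turns records with differing fields into one fixed set of columns. This is a conceptual illustration in plain Python only; it is not how FormOntologyLearner is implemented, and the helper names and sample records are hypothetical.

```python
def infer_columns(records):
    """Collect every field name seen across records, in first-seen order."""
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)
    return columns


def to_table(records):
    """Project each record onto the shared column list, filling gaps with None."""
    columns = infer_columns(records)
    rows = [[record.get(name) for name in columns] for record in records]
    return columns, rows


# Hypothetical per-invoice outputs with differing fields.
records = [
    {"CustomerName": "A", "InvoiceTotal": 10.0},
    {"CustomerName": "B", "VendorAddress": "1 Main St"},
]
columns, rows = to_table(records)
print(columns)
print(rows)
```

Once every row shares the same columns, ordinary DataFrame operations such as `select` and `where` apply uniformly.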

With our nice tabular DataFrame, we can flatten the nested tables found in the forms with some SparkSQL:

from pyspark.sql.functions import explode, col

itemized_df = (
    organized_df.select("*", explode(col("Items")).alias("Item"))
    .drop("Items")
    .select("Item.*", "*")
    .drop("Item")
)

display(itemized_df)
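The flattening step can be pictured without Spark. The sketch below is a plain-Python analogue, assuming each invoice row holds a list of line items under `Items`; the sample field names and values are hypothetical.

```python
def explode_items(rows, items_key="Items"):
    """Emit one flat row per line item; rows with no items are dropped."""
    flat_rows = []
    for row in rows:
        parent = {k: v for k, v in row.items() if k != items_key}
        for item in row.get(items_key) or []:
            # Item fields come first, followed by the invoice-level fields.
            flat_rows.append({**item, **parent})
    return flat_rows


# Hypothetical invoices: one with two line items, one with none.
invoices = [
    {
        "url": "a.pdf",
        "Items": [
            {"Description": "Door", "Quantity": 1},
            {"Description": "Hinge", "Quantity": 4},
        ],
    },
    {"url": "b.pdf", "Items": []},
]
flat = explode_items(invoices)
print(flat)
```

Like Spark's `explode`, this drops invoices whose item list is empty or missing. One difference: if an item and its invoice share a field name, the dictionary merge keeps only one value, whereas the Spark `select` keeps both columns.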

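5 - Add translations

This code loads Translate, a transformer that calls the Azure AI Translator service in Azure AI services. The original text, which is in English in the "Description" column, is machine-translated into various languages. All of the output is consolidated into the "output.translations" array.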

from synapse.ml.cognitive import Translate

translated_df = (
    Translate()
    .setSubscriptionKey(translator_key)
    .setLocation(translator_location)
    .setTextCol("Description")
    .setErrorCol("TranslationError")
    .setOutputCol("output")
    .setToLanguage(["zh-Hans", "fr", "ru", "cy"])
    .setConcurrency(5)
    .transform(itemized_df)
    .withColumn("Translations", col("output.translations")[0])
    .drop("output", "TranslationError")
    .cache()
)

display(translated_df)
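The Translator REST API returns, for each input text, a `translations` list whose entries carry the translated `text` and the target language code `to`. Assuming the `Translations` column mirrors that shape (worth confirming with `translated_df.printSchema()`), a lookup by language might be sketched as follows. The helper name is hypothetical, and placeholder strings stand in for real translations.

```python
def translations_by_language(translations):
    """Map each target language code to its translated text."""
    return {entry["to"]: entry["text"] for entry in translations}


# Placeholder strings stand in for real translated text.
sample_translations = [
    {"text": "<zh-Hans text>", "to": "zh-Hans"},
    {"text": "<fr text>", "to": "fr"},
]
lookup = translations_by_language(sample_translations)
print(lookup["fr"])
```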

6 - Translate products to emojis with OpenAI 🤯

from synapse.ml.cognitive.openai import OpenAIPrompt
from pyspark.sql.functions import trim, split

emoji_template = """ 
  Your job is to translate item names into emoji. Do not add anything but the emoji and end the translation with a comma
  
  Two Ducks: 🦆🦆,
  Light Bulb: 💡,
  Three Peaches: 🍑🍑🍑,
  Two kitchen stoves: ♨️♨️,
  A red car: 🚗,
  A person and a cat: 🧍🐈,
  A {Description}: """

prompter = (
    OpenAIPrompt()
    .setSubscriptionKey(openai_key)
    .setDeploymentName(openai_deployment_name)
    .setUrl(openai_url)
    .setMaxTokens(5)
    .setPromptTemplate(emoji_template)
    .setErrorCol("error")
    .setOutputCol("Emoji")
)

emoji_df = (
    prompter.transform(translated_df)
    .withColumn("Emoji", trim(split(col("Emoji"), ",").getItem(0)))
    .drop("error", "prompt")
    .cache()
)

display(emoji_df.select("Description", "Emoji"))
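The prompt above is a few-shot template: each row's `Description` value fills the placeholder, and the completion is cut at the first comma. The sketch below reproduces those two steps in plain Python with a shortened template. The `str.format` call only approximates how the transformer fills the template, and the sample completion is invented for illustration.

```python
# Shortened version of the few-shot template, for illustration only.
template = "Light Bulb: 💡,\nA red car: 🚗,\nA {Description}: "


def build_prompt(row):
    """Fill the template placeholders from a row's column values."""
    return template.format(**row)


def parse_completion(completion):
    """Keep only the text before the first comma, without surrounding spaces."""
    return completion.split(",")[0].strip()


prompt = build_prompt({"Description": "wooden door"})
# Invented completion: the model may keep generating after the first answer.
emoji = parse_completion(" 🚪,\nA window: 🪟,")
print(prompt)
print(emoji)
```

Ending every example with a comma gives the parser a reliable delimiter, which is why the template asks the model to end its answer the same way.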

7 - Infer vendor address continent with OpenAI

continent_template = """
Which continent does the following address belong to? 

Pick one value from Europe, Australia, North America, South America, Asia, Africa, Antarctica. 

Dont respond with anything but one of the above. If you don't know the answer or cannot figure it out from the text, return None. End your answer with a comma.

Address: "6693 Ryan Rd, North Whales",
Continent: Europe,
Address: "6693 Ryan Rd",
Continent: None,
Address: "{VendorAddress}",
Continent:"""

continent_df = (
    prompter.setOutputCol("Continent")
    .setPromptTemplate(continent_template)
    .transform(emoji_df)
    .withColumn("Continent", trim(split(col("Continent"), ",").getItem(0)))
    .drop("error", "prompt")
    .cache()
)

display(continent_df.select("VendorAddress", "Continent"))
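A model can still answer with something outside the requested list. One possible safeguard, sketched here as a hypothetical helper that is not part of the original notebook, is to accept only the seven allowed values and treat everything else (including the literal string "None") as missing.

```python
ALLOWED_CONTINENTS = {
    "Europe",
    "Australia",
    "North America",
    "South America",
    "Asia",
    "Africa",
    "Antarctica",
}


def normalize_continent(raw):
    """Return the continent if it is one of the allowed values, else None."""
    if raw is None:
        return None
    candidate = raw.strip().strip('"')
    return candidate if candidate in ALLOWED_CONTINENTS else None


print(normalize_continent(" Europe"))
print(normalize_continent("None"))
```

In Spark, the same check could be wrapped in a `udf` and applied to the `Continent` column.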

8 - Create an Azure Search index for the forms

from synapse.ml.cognitive import *
from pyspark.sql.functions import monotonically_increasing_id, lit

(
    continent_df.withColumn("DocID", monotonically_increasing_id().cast("string"))
    .withColumn("SearchAction", lit("upload"))
    .writeToAzureSearch(
        subscriptionKey=search_key,
        actionCol="SearchAction",
        serviceName=search_service,
        indexName=search_index,
        keyCol="DocID",
    )
)
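For orientation, the Azure Search REST operation for adding documents accepts a JSON body with a `value` array in which each document carries an `@search.action` field alongside its key and data fields. The sketch below builds such a body from plain dictionaries. Whether `writeToAzureSearch` produces exactly this payload is an assumption; the helper name and sample rows are hypothetical.

```python
import json


def build_index_batch(rows, key_col="DocID", action="upload"):
    """Wrap rows in the batch format used by the documents indexing endpoint."""
    documents = []
    for i, row in enumerate(rows):
        # Sequential string keys, used here only to keep the sketch simple.
        documents.append({"@search.action": action, **row, key_col: str(i)})
    return {"value": documents}


# Hypothetical rows.
sample_rows = [
    {"Description": "Door", "CustomerName": "A"},
    {"Description": "Hinge", "CustomerName": "B"},
]
batch = build_index_batch(sample_rows)
print(json.dumps(batch, indent=2))
```

The notebook uses `monotonically_increasing_id()` for keys instead, which yields unique but not necessarily consecutive values. The key is cast to a string because Azure Search document keys are strings.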

9 - Try out a search query

import requests

search_url = "https://{}.search.windows.net/indexes/{}/docs/search?api-version=2019-05-06".format(
    search_service, search_index
)
requests.post(
    search_url, json={"search": "door"}, headers={"api-key": search_key}
).json()
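The search response is a JSON object whose `value` list holds the matching documents, each with an `@search.score` relevance field. A small helper such as the hypothetical one below can trim the response to a few fields for quick inspection. The sample response, including its scores, is invented.

```python
def summarize_hits(response, fields=("Description", "CustomerName"), limit=3):
    """Return the score and selected fields for the first few hits."""
    hits = response.get("value", [])
    return [
        {"score": hit.get("@search.score"), **{f: hit.get(f) for f in fields}}
        for hit in hits[:limit]
    ]


# Invented response in the shape returned by the search endpoint.
sample_response = {
    "value": [
        {"@search.score": 1.5, "Description": "Door", "CustomerName": "A", "DocID": "0"},
        {"@search.score": 0.7, "Description": "Door frame", "CustomerName": "B", "DocID": "1"},
    ]
}
summary = summarize_hits(sample_response, limit=1)
print(summary)
```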

10 - Build a chatbot that can use Azure Search as a tool 🧠🔧

import json
import openai

openai.api_type = "azure"
openai.api_base = openai_url
openai.api_key = openai_key
openai.api_version = "2023-03-15-preview"

chat_context_prompt = f"""
You are a chatbot designed to answer questions with the help of a search engine that has the following information:

{continent_df.columns}

If you dont know the answer to a question say "I dont know". Do not lie or hallucinate information. Be brief. If you need to use the search engine to solve the please output a json in the form of {{"query": "example_query"}}
"""


def search_query_prompt(question):
    return f"""
Given the search engine above, what would you search for to answer the following question?

Question: "{question}"

Please output a json in the form of {{"query": "example_query"}}
"""


def search_result_prompt(query):
    search_results = requests.post(
        search_url, json={"search": query}, headers={"api-key": search_key}
    ).json()
    return f"""

You previously ran a search for "{query}" which returned the following results:

{search_results}

You should use the results to help you answer questions. If you dont know the answer to a question say "I dont know". Do not lie or hallucinate information. Be Brief and mention which query you used to solve the problem. 
"""


def prompt_gpt(messages):
    response = openai.ChatCompletion.create(
        engine=openai_deployment_name, messages=messages, max_tokens=None, top_p=0.95
    )
    return response["choices"][0]["message"]["content"]


def custom_chatbot(question):
    # First, ask the model which search query would help answer the question.
    query = json.loads(
        prompt_gpt(
            [
                {"role": "system", "content": chat_context_prompt},
                {"role": "user", "content": search_query_prompt(question)},
            ]
        )
    )["query"]

    # Then answer the question with the search results added to the context.
    return prompt_gpt(
        [
            {"role": "system", "content": chat_context_prompt},
            {"role": "system", "content": search_result_prompt(query)},
            {"role": "user", "content": question},
        ]
    )
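The chatbot parses the model's reply with `json.loads`, which fails if the model wraps the JSON in extra words. A more tolerant parser could be sketched as below. `extract_query` is a hypothetical helper, not part of the original notebook, and its simple pattern only handles flat objects without nested braces.

```python
import json
import re


def extract_query(reply):
    """Return the "query" value from the first JSON object in reply, or None."""
    match = re.search(r"\{.*?\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return payload.get("query")


print(extract_query('Sure! {"query": "door"} Hope that helps.'))
print(extract_query("I dont know"))
```

When the helper returns `None`, the caller can decide whether to retry the prompt or answer without search results.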

11 - Ask the chatbot a question

custom_chatbot("What did Luke Diaz buy?")

12 - A quick double check

display(
    continent_df.where(col("CustomerName") == "Luke Diaz")
    .select("Description")
    .distinct()
)
