Tutorial: Build machine learning applications with Synapse Machine Learning

06/01/2023

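In this article, you learn how to use Synapse Machine Learning (SynapseML) to create machine learning applications. SynapseML extends the distributed machine learning solution of Apache Spark by adding many deep learning and data science tools, such as Azure AI services, OpenCV, and LightGBM. With SynapseML, you can build effective and highly scalable predictive and analytical models from a variety of Spark data sources. Synapse Spark provides built-in SynapseML libraries, including: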

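Vowpal Wabbit - A machine learning library that enables text analytics, such as sentiment analysis of tweets.
MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales - Combines the features of Azure AI services in SparkML pipelines to derive solution designs for cognitive data modeling services such as anomaly detection.
LightGBM - A gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient.
Conditional KNN - Scalable KNN models with conditional queries.
HTTP on Spark - Enables distributed microservices orchestration by integrating Spark with HTTP protocol-based accessibility.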

This tutorial covers samples that use Azure AI services in SynapseML:

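Text Analytics - Get the sentiment (or mood) of a set of sentences.
Computer Vision - Get the tags (one-word descriptions) associated with a set of images.
Bing Image Search - Search the web for images related to a natural language query.
Anomaly Detector - Detect anomalies in time series data.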

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites

An Azure Synapse Analytics workspace with an Azure Data Lake Storage Gen2 storage account configured as the default storage. You need to be the Storage Blob Data Contributor of the Data Lake Storage Gen2 file system that you work with.
A Spark pool in your Azure Synapse Analytics workspace. For details, see the article on creating a Spark pool in Azure Synapse.
The pre-configuration steps described in the tutorial on configuring Azure AI services in Azure Synapse.

Get started

To get started, import SynapseML and configure the service keys.

import synapse.ml
from synapse.ml.cognitive import *
from notebookutils import mssparkutils

# An Azure AI services multi-service resource key for Text Analytics and Computer Vision (or use separate keys that belong to each service)
ai_service_key = mssparkutils.credentials.getSecret("ADD_YOUR_KEY_VAULT_NAME", "ADD_YOUR_SERVICE_KEY","ADD_YOUR_KEY_VAULT_LINKED_SERVICE_NAME") 
# A Bing Search v7 subscription key
bingsearch_service_key = mssparkutils.credentials.getSecret("ADD_YOUR_KEY_VAULT_NAME", "ADD_YOUR_BING_SEARCH_KEY","ADD_YOUR_KEY_VAULT_LINKED_SERVICE_NAME")
# An Anomaly Detector subscription key
anomalydetector_key = mssparkutils.credentials.getSecret("ADD_YOUR_KEY_VAULT_NAME", "ADD_YOUR_ANOMALY_KEY","ADD_YOUR_KEY_VAULT_LINKED_SERVICE_NAME")
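The `getSecret` calls above contain `ADD_YOUR_...` placeholders that must be replaced before the notebook runs. As a small optional sketch, a helper like the one below could flag any value that was left unedited. The function name `unfilled_placeholders` and the keyword names in the usage lines are hypothetical and not part of SynapseML or `mssparkutils`; the helper only inspects plain strings.

```python
# Hypothetical helper: report which settings still hold an "ADD_YOUR_" placeholder.
PLACEHOLDER_PREFIX = "ADD_YOUR_"

def unfilled_placeholders(**settings: str) -> list:
    """Return the names of settings whose value still starts with the placeholder prefix."""
    return sorted(
        name for name, value in settings.items()
        if value.startswith(PLACEHOLDER_PREFIX)
    )

# Example with made-up values: only the secret name was left unedited.
missing = unfilled_placeholders(
    key_vault_name="my-key-vault",             # assumed example value
    secret_name="ADD_YOUR_SERVICE_KEY",        # placeholder copied from the tutorial
    linked_service_name="my-linked-service",   # assumed example value
)
print(missing)
```

Running a check like this before calling `getSecret` gives a clearer message than a failed Key Vault lookup.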

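Text Analytics sample

The Text Analytics service provides several algorithms for extracting intelligent insights from text. For example, it can find the sentiment of a given input text. The service returns a score between 0.0 and 1.0, where low scores indicate negative sentiment and high scores indicate positive sentiment. This sample uses three simple sentences and returns the sentiment label (positive, neutral, or negative) of each.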

from pyspark.sql.functions import col

# Create a dataframe that's tied to its column names
df_sentences = spark.createDataFrame([
  ("I am so happy today, its sunny!", "en-US"), 
  ("this is a dog", "en-US"), 
  ("I am frustrated by this rush hour traffic!", "en-US") 
], ["text", "language"])

# Run the Text Analytics service with options
sentiment = (TextSentiment()
    .setTextCol("text")
    .setLocation("eastasia") # Set the location of your Azure AI services resource
    .setSubscriptionKey(ai_service_key)
    .setOutputCol("sentiment")
    .setErrorCol("error")
    .setLanguageCol("language"))

# Show the results of your text query in a table format

display(sentiment.transform(df_sentences).select("text", col("sentiment")[0].getItem("sentiment").alias("sentiment")))

Expected results

text	sentiment
I am frustrated by this rush hour traffic!	negative
this is a dog	neutral
I am so happy today, its sunny!	positive
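The service description mentions scores between 0.0 and 1.0, while the table shows labels. As a rough illustration of how a score could be read as a label, the sketch below maps a score to one of three labels. The cutoffs `low=0.4` and `high=0.6` and the function name `label_from_score` are assumptions chosen for illustration; they are not the thresholds the service uses.

```python
def label_from_score(score: float, low: float = 0.4, high: float = 0.6) -> str:
    """Map a sentiment score in [0.0, 1.0] to a coarse label.

    The cutoffs are hypothetical; the service's own labeling may differ.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score < low:
        return "negative"
    if score > high:
        return "positive"
    return "neutral"

# Low scores read as negative, high scores as positive, the middle band as neutral.
for s in (0.05, 0.5, 0.95):
    print(s, label_from_score(s))
```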

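Computer Vision sample

Computer Vision analyzes images to identify structure such as faces, objects, and natural-language descriptions. This sample tags the image at the URL below. Tags are one-word descriptions of things in the image, such as recognizable objects, people, scenery, and actions.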

# Create a dataframe with the image URL
df_images = spark.createDataFrame([
        ("https://raw.githubusercontent.com/Azure-Samples/cognitive-services-sample-data-files/master/ComputerVision/Images/objects.jpg", )
    ], ["image", ])

# Run the Computer Vision service. Analyze Image extracts information from/about the images.
analysis = (AnalyzeImage()
    .setLocation("eastasia") # Set the location of your Azure AI services resource
    .setSubscriptionKey(ai_service_key)
    .setVisualFeatures(["Categories","Color","Description","Faces","Objects","Tags"])
    .setOutputCol("analysis_results")
    .setImageUrlCol("image")
    .setErrorCol("error"))

# Show the results of what you wanted to pull out of the images.
display(analysis.transform(df_images).select("image", "analysis_results.description.tags"))

Expected results

image	tags
`https://raw.githubusercontent.com/Azure-Samples/cognitive-services-sample-data-files/master/ComputerVision/Images/objects.jpg`	[skating, person, man, outdoor, riding, sport, skateboard, young, board, shirt, air, park, boy, side, jumping, ramp, trick, doing, flying]

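Bing Image Search sample

Bing Image Search searches the web to retrieve images related to a user's natural language query. This sample uses a text query that looks for images with quotes. It returns a list of image URLs that contain photos related to the query.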

from pyspark.ml import PipelineModel

# Number of images Bing will return per query
imgsPerBatch = 2
# A list of offsets, used to page into the search results
offsets = [(i*imgsPerBatch,) for i in range(10)]
# Since web content is our data, we create a dataframe with options on that data: offsets
bingParameters = spark.createDataFrame(offsets, ["offset"])

# Run the Bing Image Search service with our text query
bingSearch = (BingImageSearch()
    .setSubscriptionKey(bingsearch_service_key)
    .setOffsetCol("offset")
    .setQuery("Martin Luther King Jr. quotes")
    .setCount(imgsPerBatch)
    .setOutputCol("images"))

# Transformer that extracts and flattens the richly structured output of Bing Image Search into a simple URL column
getUrls = BingImageSearch.getUrlTransformer("images", "url")
pipeline_bingsearch = PipelineModel(stages=[bingSearch, getUrls])

# Show the results of your search: image URLs
res_bingsearch = pipeline_bingsearch.transform(bingParameters)
display(res_bingsearch.dropDuplicates())
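The `offsets` list in the code above controls which slice of the search results each row requests. The plain-Python sketch below (no Spark session or Bing call involved) reproduces the same list comprehension and shows the result positions each row would cover. The names `num_pages`, `page_ranges`, and `max_results` are introduced here for illustration only.

```python
# Same paging arithmetic as the sample, without Spark.
imgs_per_batch = 2   # images requested per query, as in the sample
num_pages = 10       # number of offsets generated, as in range(10)

# One single-element tuple per row, matching the ["offset"] column schema.
offsets = [(i * imgs_per_batch,) for i in range(num_pages)]

# Inclusive range of result positions each row asks for.
page_ranges = [(start, start + imgs_per_batch - 1) for (start,) in offsets]

# Upper bound on the number of URLs before duplicates are dropped.
max_results = imgs_per_batch * num_pages

print(offsets[:3])      # [(0,), (2,), (4,)]
print(page_ranges[-1])  # (18, 19)
print(max_results)      # 20
```

With these settings the pipeline can return at most 20 URLs. The table of expected results can be shorter, since the sample calls `dropDuplicates()` and the service may return fewer images than requested.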

Expected results

image
`http://everydaypowerblog.com/wp-content/uploads/2014/01/Martin-Luther-King-Jr.-Quotes-16.jpg`
`http://www.scrolldroll.com/wp-content/uploads/2017/06/6-25.png`
`http://abettertodaymedia.com/wp-content/uploads/2017/01/86783bd7a92960aedd058c91a1d10253.jpg`
`https://weneedfun.com/wp-content/uploads/2016/05/martin-luther-king-jr-quotes-11.jpg`
`http://www.sofreshandsogreen.com/wp-content/uploads/2012/01/martin-luther-king-jr-quote-sofreshandsogreendotcom.jpg`
`https://cdn.quotesgram.com/img/72/57/1104209728-martin_luther_king_jr_quotes_16.jpg`
`http://comicbookandbeyond.com/wp-content/uploads/2019/05/Martin-Luther-King-Jr.-Quotes.jpg`
`https://exposingthepain.files.wordpress.com/2015/01/martin-luther-king-jr-quotes-08.png`
`https://topmemes.me/wp-content/uploads/2020/01/Top-10-Martin-Luther-King-jr.-Quotes2-1024x538.jpg`
`http://img.picturequotes.com/2/581/580286/dr-martin-luther-king-jr-quote-1-picture-quote-1.jpg`
`http://parryz.com/wp-content/uploads/2017/06/Amazing-Martin-Luther-King-Jr-Quotes.jpg`
`http://everydaypowerblog.com/wp-content/uploads/2014/01/Martin-Luther-King-Jr.-Quotes1.jpg`
`https://lessonslearnedinlife.net/wp-content/uploads/2020/05/Martin-Luther-King-Jr.-Quotes-2020.jpg`
`https://quotesblog.net/wp-content/uploads/2015/10/Martin-Luther-King-Jr-Quotes-Wallpaper.jpg`

Anomaly Detector sample

Anomaly Detector is well suited to detecting irregularities in time series data. This sample uses the service to find anomalies across an entire time series.

from pyspark.sql.functions import lit

# Create a dataframe with the point data that Anomaly Detector requires
df_timeseriesdata = spark.createDataFrame([
    ("1972-01-01T00:00:00Z", 826.0),
    ("1972-02-01T00:00:00Z", 799.0),
    ("1972-03-01T00:00:00Z", 890.0),
    ("1972-04-01T00:00:00Z", 900.0),
    ("1972-05-01T00:00:00Z", 766.0),
    ("1972-06-01T00:00:00Z", 805.0),
    ("1972-07-01T00:00:00Z", 821.0),
    ("1972-08-01T00:00:00Z", 20000.0), # anomaly
    ("1972-09-01T00:00:00Z", 883.0),
    ("1972-10-01T00:00:00Z", 898.0),
    ("1972-11-01T00:00:00Z", 957.0),
    ("1972-12-01T00:00:00Z", 924.0),
    ("1973-01-01T00:00:00Z", 881.0),
    ("1973-02-01T00:00:00Z", 837.0),
    ("1973-03-01T00:00:00Z", 9000.0) # anomaly
], ["timestamp", "value"]).withColumn("group", lit("series1"))

# Run the Anomaly Detector service to look for irregular data
anomaly_detector = (SimpleDetectAnomalies()
  .setSubscriptionKey(anomalydetector_key)
  .setLocation("eastasia")
  .setTimestampCol("timestamp")
  .setValueCol("value")
  .setOutputCol("anomalies")
  .setGroupbyCol("group")
  .setGranularity("monthly"))

# Show the full results of the analysis with the anomalies marked as "True"
display(anomaly_detector.transform(df_timeseriesdata).select("timestamp", "value", "anomalies.isAnomaly"))

Expected results

timestamp	value	isAnomaly
1972-01-01T00:00:00Z	826.0	false
1972-02-01T00:00:00Z	799.0	false
1972-03-01T00:00:00Z	890.0	false
1972-04-01T00:00:00Z	900.0	false
1972-05-01T00:00:00Z	766.0	false
1972-06-01T00:00:00Z	805.0	false
1972-07-01T00:00:00Z	821.0	false
1972-08-01T00:00:00Z	20000.0	true
1972-09-01T00:00:00Z	883.0	false
1972-10-01T00:00:00Z	898.0	false
1972-11-01T00:00:00Z	957.0	false
1972-12-01T00:00:00Z	924.0	false
1973-01-01T00:00:00Z	881.0	false
1973-02-01T00:00:00Z	837.0	false
1973-03-01T00:00:00Z	9000.0	true
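To build intuition for why the two marked points stand out, the sketch below applies a simple local check to the same values: a modified z-score based on the median and the median absolute deviation (MAD). This is not the algorithm the Anomaly Detector service uses; it is a swapped-in textbook technique, and the function name `robust_outlier_flags` and the threshold of 3.5 are assumptions chosen for illustration.

```python
from statistics import median

# The same monthly values as in the sample dataframe.
values = [826.0, 799.0, 890.0, 900.0, 766.0, 805.0, 821.0, 20000.0,
          883.0, 898.0, 957.0, 924.0, 881.0, 837.0, 9000.0]

def robust_outlier_flags(series, threshold=3.5):
    """Flag points whose modified z-score (median/MAD based) exceeds the threshold."""
    center = median(series)
    mad = median([abs(x - center) for x in series])
    if mad == 0:
        # All points sit at the median; nothing can be flagged with this method.
        return [False] * len(series)
    return [abs(0.6745 * (x - center) / mad) > threshold for x in series]

flags = robust_outlier_flags(values)
flagged = [v for v, is_outlier in zip(values, flags) if is_outlier]
print(flagged)  # [20000.0, 9000.0]
```

On this small series the local check flags the same two points as the table above. A hosted service is still the better choice for real data, where seasonality and trend make a single global threshold unreliable.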

Speech-to-text sample

The Speech-to-text service converts streams or files of spoken audio to text. This sample transcribes one audio file to text.

# Create a dataframe with our audio URLs, tied to the column called "url"
df = spark.createDataFrame([("https://mmlspark.blob.core.windows.net/datasets/Speech/audio2.wav",)
                           ], ["url"])

# Run the Speech-to-text service to transcribe the audio into text
# Assumes the multi-service key from the Get started section also covers Speech;
# otherwise, substitute the key of your Speech resource.
speech_to_text = (SpeechToTextSDK()
    .setSubscriptionKey(ai_service_key)
    .setLocation("northeurope") # Set the location of your Azure AI services resource
    .setOutputCol("text")
    .setAudioDataCol("url")
    .setLanguage("en-US")
    .setProfanity("Masked"))

# Show the results of the transcription
display(speech_to_text.transform(df).select("url", "text.DisplayText"))

Expected results

url	DisplayText
`https://mmlspark.blob.core.windows.net/datasets/Speech/audio2.wav`	Custom Speech provides tools that allow you to visually inspect the recognition quality of a model by comparing audio data with the corresponding recognition result from the Custom Speech portal. You can playback uploaded audio and determine if the provided recognition result is correct. This tool allows you to quickly inspect quality of Microsoft's baseline speech to text model or a trained custom model without having to transcribe any audio data.

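Clean up resources

To ensure the Spark instance is shut down, end any connected sessions (notebooks). The pool shuts down when the idle time specified for the Apache Spark pool is reached. You can also select Stop session from the status bar at the upper right of the notebook.

Screenshot showing how to stop the session.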
