Build a model with SynapseML

2024-06-09

This article describes how to build a machine learning model by using SynapseML, and demonstrates how SynapseML can simplify complex machine learning tasks. You use SynapseML to create a small machine learning training pipeline that includes a featurization stage and a LightGBM regression stage. The pipeline predicts ratings based on review text from a dataset of book reviews. You also see how SynapseML can simplify the use of prebuilt models to solve machine learning problems.

Prerequisites

Get a Microsoft Fabric subscription. Or, sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
Use the experience switcher on the bottom left side of your home page to switch to Fabric.

Prepare resources

Create the tools and resources you need to build the model and pipeline.

Create a new notebook.
Attach your notebook to a lakehouse. To add an existing lakehouse or create a new one, expand Lakehouses in the Explorer pane on the left, and then select Add.
Get an Azure AI services key by following the instructions in Quickstart: Create a multi-service resource for Azure AI services.
Create an Azure Key Vault instance and add your Azure AI services key to the key vault as a secret.
Make a note of your key vault name and secret name. You need this information to run the one-step transform later in this article.

Set up the environment

In your notebook, import SynapseML libraries and initialize your Spark session.

from pyspark.sql import SparkSession
from synapse.ml.core.platform import *

spark = SparkSession.builder.getOrCreate()

Load a dataset

Load the book review dataset, limit it to 1,000 rows, and randomly split it into training (about 80%) and test (about 20%) sets.

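train, test = (
    # Read the public book review dataset from Azure Blob Storage
    spark.read.parquet(
        "wasbs://publicwasb@mmlspark.blob.core.windows.net/BookReviewsFromAmazon10K.parquet"
    )
    .limit(1000)  # Keep the example small
    .cache()  # Reuse the loaded rows for both splits
    .randomSplit([0.8, 0.2])  # Approximate 80/20 train/test split
)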

display(train)
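
Because randomSplit assigns rows at random, the training and test sets can differ between runs. The following is a minimal sketch of a seeded split, assuming a fixed seed suits your experiment. The helper name split_train_test and the seed value 42 are illustrative and aren't part of the original walkthrough. A seed alone doesn't guarantee identical splits if the input rows arrive in a different order or partitioning.

```python
def split_train_test(df, train_fraction=0.8, seed=42):
    """Split a Spark DataFrame into (train, test) using a fixed seed.

    `df` is expected to be a pyspark.sql.DataFrame; only its
    randomSplit(weights, seed) method is used here.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # The two weights describe the approximate share of rows in each split.
    weights = [train_fraction, 1.0 - train_fraction]
    train, test = df.randomSplit(weights, seed=seed)
    return train, test
```

In the notebook, you might call split_train_test on the cached DataFrame in place of the direct randomSplit call.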

Create the training pipeline

Create a pipeline that featurizes the review text by using TextFeaturizer from the synapse.ml.featurize.text library, and then predicts a rating by using the LightGBMRegressor estimator.

from pyspark.ml import Pipeline
from synapse.ml.featurize.text import TextFeaturizer
from synapse.ml.lightgbm import LightGBMRegressor

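model = Pipeline(
    stages=[
        # Stage 1: convert the review text into a numeric feature vector
        TextFeaturizer(inputCol="text", outputCol="features"),
        # Stage 2: train a LightGBM regressor on those features to predict the rating
        LightGBMRegressor(featuresCol="features", labelCol="rating", dataTransferMode="bulk")
    ]
).fit(train)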
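
The featurization stage exists because LightGBMRegressor needs numeric feature vectors, not raw text. The following pure-Python sketch illustrates the general idea of hashed term-frequency featurization. It's a conceptual illustration only and not SynapseML's implementation: the tokenizer, the CRC-32 hash, the function name, and the 16-bucket vector size are all assumptions chosen for readability.

```python
import re
import zlib
from collections import Counter


def hashed_term_frequencies(text, num_features=16):
    """Map text to a sparse {bucket_index: count} term-frequency vector."""
    if num_features < 1:
        raise ValueError("num_features must be at least 1")
    # Simplified tokenizer (an assumption): lowercase runs of letters, digits, apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for token in tokens:
        # Hash each token into one of num_features buckets; distinct tokens can collide.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] += 1
    return dict(counts)
```

Each review becomes a fixed-width vector regardless of vocabulary size, which is what lets a downstream regressor consume free-form text.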

Predict the output of the test data

Call the transform method on the fitted model to generate predictions for the test data, and display the output as a DataFrame.

display(model.transform(test))
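
Displaying the predictions doesn't tell you how close they are to the actual ratings. One common way to summarize that is the root mean squared error (RMSE). The following is a minimal sketch that computes RMSE from collected rows, which is reasonable here because the test set is small. The function name is illustrative, and the commented usage assumes the prediction column keeps the default name "prediction".

```python
import math


def root_mean_squared_error(pairs):
    """Compute RMSE over an iterable of (label, prediction) pairs."""
    squared_errors = [(float(label) - float(pred)) ** 2 for label, pred in pairs]
    if not squared_errors:
        raise ValueError("need at least one (label, prediction) pair")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# In the notebook (assumes the `model` and `test` variables defined earlier):
# rows = model.transform(test).select("rating", "prediction").collect()
# print(root_mean_squared_error((r["rating"], r["prediction"]) for r in rows))
```

For larger datasets, Spark's built-in pyspark.ml.evaluation.RegressionEvaluator can compute the same metric without collecting rows to the driver.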

Use Azure AI services to transform data in one step

For tasks like this one that have a prebuilt solution, you can instead use SynapseML's integration with Azure AI services to transform your data in one step. Run the following code with these replacements:

Replace <secret-name> with the name of your Azure AI services key secret.
Replace <key-vault-name> with the name of your key vault.

from synapse.ml.services import TextSentiment
from synapse.ml.core.platform import find_secret

model = TextSentiment(
    textCol="text",
    outputCol="sentiment",
    subscriptionKey=find_secret("<secret-name>", "<key-vault-name>")
).setLocation("eastus")

display(model.transform(test))
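
It's easy to run the cell above with a placeholder still in place. A small guard can catch that before any call to the key vault is made. The following is a minimal sketch; the helper require_replaced is hypothetical and isn't part of SynapseML.

```python
def require_replaced(name, value):
    """Return the trimmed value, or raise if it is empty or still a <placeholder>."""
    stripped = value.strip()
    if not stripped or (stripped.startswith("<") and stripped.endswith(">")):
        raise ValueError(f"{name} is unset or still a placeholder: {value!r}")
    return stripped


# Hypothetical usage in the notebook, before calling find_secret:
# secret_name = require_replaced("secret name", "<secret-name>")  # raises ValueError
# key_vault_name = require_replaced("key vault name", "my-key-vault")
```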
