Build a model with SynapseML
This article describes how to build a machine learning model by using SynapseML, and demonstrates how SynapseML can simplify complex machine learning tasks. You use SynapseML to create a small machine learning training pipeline that includes a featurization stage and a LightGBM regression stage. The pipeline predicts ratings based on review text from a dataset of book reviews. You also see how SynapseML can simplify the use of prebuilt models to solve machine learning problems.
Prerequisites
Get a Microsoft Fabric subscription. Or, sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
Use the experience switcher on the left side of your home page to switch to the Synapse Data Science experience.
Prepare resources
Create the tools and resources you need to build the model and pipeline.
- Create a new notebook.
- Attach your notebook to a lakehouse. To add an existing lakehouse or create a new one, expand Lakehouses under Explorer on the left, and then select Add.
- Get an Azure AI services key by following the instructions in Quickstart: Create a multi-service resource for Azure AI services.
- Create an Azure Key Vault instance and add your Azure AI services key to the key vault as a secret.
- Make a note of your key vault name and secret name. You need this information to run the one-step transform later in this article.
Set up the environment
In your notebook, import SynapseML libraries and initialize your Spark session.
from pyspark.sql import SparkSession
from synapse.ml.core.platform import *
spark = SparkSession.builder.getOrCreate()
Load a dataset
Load your dataset and split it into train and test sets.
train, test = (
    spark.read.parquet(
        "wasbs://publicwasb@mmlspark.blob.core.windows.net/BookReviewsFromAmazon10K.parquet"
    )
    .limit(1000)
    .cache()
    .randomSplit([0.8, 0.2])
)
display(train)
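The 0.8/0.2 weights describe proportions, not exact row counts. A weighted random split of this kind places each row independently, so 1,000 rows yield roughly 800 training rows and 200 test rows, and the counts can vary between runs unless a seed is fixed. The following plain Python sketch illustrates that behavior. The function random_split and its seed parameter are hypothetical names for this illustration only; this is not Spark's implementation.

```python
import random


def random_split(rows, weights, seed):
    """Illustrative stand-in for a weighted random split; not Spark's code."""
    total = sum(weights)
    # Normalize the weights and turn them into cumulative upper bounds,
    # for example [0.8, 0.2] -> [0.8, 1.0].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        # Place the row in the first split whose upper bound exceeds the draw;
        # the last split catches any floating-point rounding at the boundary.
        index = next(
            (i for i, upper in enumerate(bounds) if draw < upper),
            len(bounds) - 1,
        )
        splits[index].append(row)
    return splits


train_rows, test_rows = random_split(range(1000), [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))  # near 800 and 200, not guaranteed exact
```

In PySpark, randomSplit also accepts an optional seed argument, for example randomSplit([0.8, 0.2], seed=42), which helps make the split repeatable over the same data.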
Create the training pipeline
Create a pipeline that featurizes the data by using TextFeaturizer from the synapse.ml.featurize.text library and predicts a rating by using the LightGBMRegressor estimator.
from pyspark.ml import Pipeline
from synapse.ml.featurize.text import TextFeaturizer
from synapse.ml.lightgbm import LightGBMRegressor
model = Pipeline(
    stages=[
        TextFeaturizer(inputCol="text", outputCol="features"),
        LightGBMRegressor(featuresCol="features", labelCol="rating", dataTransferMode="bulk")
    ]
).fit(train)
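The featurization stage exists because a regressor needs numeric vectors, not raw review text. One common way to get there is the hashing trick: tokenize the text, hash each token into one of a fixed number of buckets, and count the hits per bucket. The sketch below shows that idea in plain Python. It is a simplified illustration under assumptions, not TextFeaturizer's implementation: the function hashed_term_counts, the 16-bucket size, the tokenizer pattern, and the use of crc32 are all choices made for this example.

```python
import re
import zlib


def hashed_term_counts(text, num_features=16):
    """Map a string to a fixed-length vector of hashed token counts."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    vector = [0] * num_features
    for token in tokens:
        # crc32 is used only because it is deterministic across runs;
        # it is an assumption of this sketch, not the hash Spark uses.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        vector[bucket] += 1
    return vector


vector = hashed_term_counts("A great book, a great story")
print(vector)
```

Every review maps to a vector of the same length regardless of how many words it contains, which is what lets the second pipeline stage consume the features column directly.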
Predict the output of the test data
Call the transform method on the model to predict ratings for the test data, and display the output as a dataframe.
display(model.transform(test))
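Displaying the output shows individual predictions, but a single error metric makes it easier to judge them overall. Root mean squared error (RMSE) is one common choice for a regression task like this one. The sketch below computes it in plain Python over (actual, predicted) pairs. The function rmse and the sample values are made up for illustration and are not results from this pipeline. In Spark, the RegressionEvaluator class in pyspark.ml.evaluation can compute the same metric directly on a dataframe.

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (actual, predicted) pair")
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up (rating, prediction) pairs for illustration only.
sample = [(5.0, 4.0), (3.0, 3.0), (1.0, 2.0)]
score = rmse(sample)
print(round(score, 4))
```

A lower value means the predicted ratings sit closer to the actual ratings, measured in the same units as the rating itself.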
Use Azure AI services to transform data in one step
Alternatively, for tasks like this one that have a prebuilt solution, you can use SynapseML's integration with Azure AI services to transform your data in one step. Run the following code with these replacements:
- Replace <secret-name> with the name of your Azure AI services key secret.
- Replace <key-vault-name> with the name of your key vault.
from synapse.ml.services import TextSentiment
from synapse.ml.core.platform import find_secret
model = TextSentiment(
    textCol="text",
    outputCol="sentiment",
    subscriptionKey=find_secret("<secret-name>", "<key-vault-name>")
).setLocation("eastus")
display(model.transform(test))
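The code above only works after both placeholders are replaced with real values. A small guard can catch a forgotten replacement before the secret lookup runs. The following is a minimal sketch: the helper require_replaced and the example names my-ai-services-key and my-key-vault are hypothetical and are not part of SynapseML.

```python
import re

# Matches strings that still look like "<some-placeholder>".
_PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def require_replaced(value, description):
    """Return value unchanged, or raise if it is empty or still a placeholder."""
    if not value or _PLACEHOLDER.match(value):
        raise ValueError(
            f"Replace the placeholder for the {description} before running: {value!r}"
        )
    return value


# Hypothetical names, used here only to show the check passing.
secret_name = require_replaced("my-ai-services-key", "secret name")
key_vault_name = require_replaced("my-key-vault", "key vault name")
print(secret_name, key_vault_name)
```

The checked values can then be passed to find_secret in place of the literal placeholder strings.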