SynapseML extends the Apache Spark distributed computing framework with deep learning and data science tools, including:
- LightGBM for gradient boosting
- OpenCV for image processing
- Seamless integration with Spark ML pipelines
These tools enable powerful, scalable predictive and analytical models for many data sources.
In this article, you train a binary classification model on the Adult Census Income dataset by using the SynapseML TrainClassifier wrapper, and then evaluate the model with ComputeModelStatistics.
Prerequisites
- Get a Microsoft Fabric subscription. Or, sign up for a free Microsoft Fabric trial.
- Sign in to Microsoft Fabric.
- Switch to Fabric by using the experience switcher on the lower-left side of your home page.
- Create a new notebook.
- Attach your notebook to a lakehouse. On the left side of your notebook, select Add to add an existing lakehouse or create a new one.
Import libraries
Note
Fabric notebooks provide a pre-configured Spark session (the spark variable) and an IPython kernel with common libraries pre-imported. You don't need to install SynapseML, PySpark, NumPy, or pandas separately.
In your Fabric notebook, create a new cell and import the required libraries:
import numpy as np
import pandas as pd
Expected output: The cell completes with no errors. If you see an ImportError, verify that your notebook is attached to a lakehouse and running on a Fabric Spark runtime.
Download and load the data
Download the Adult Census Income dataset and load it into a Spark DataFrame. This dataset contains census features like education, marital status, and hours worked per week, along with an income label column.
import os
import urllib.request
dataFile = "AdultCensusIncome.csv"
if not os.path.isfile(dataFile):
    urllib.request.urlretrieve(
        "https://mmlspark.azureedge.net/datasets/" + dataFile, dataFile
    )
data = spark.createDataFrame(
    pd.read_csv(dataFile, dtype={" hours-per-week": np.float64})
)
data.show(5)
Note
The column names in this dataset include a leading space (for example, " income" rather than "income"). The code samples in this article preserve those names as-is to match the source CSV.
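To see why the leading space matters, the following sketch parses a tiny, made-up CSV sample with the Python standard library. The two data rows and the `strip_names` helper are illustrative assumptions. They aren't part of the dataset or of SynapseML.

```python
import csv
import io

# Hypothetical two-row sample that mimics the header style of the source CSV:
# a comma followed by a space before every column name after the first.
sample = "age, education, income\n39, Bachelors, <=50K\n50, Masters, >50K\n"

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)
columns = reader.fieldnames

print(columns)               # the names keep their leading space
print(" income" in columns)  # lookup with the leading space
print("income" in columns)   # lookup without the leading space


def strip_names(names):
    """Return the column names with surrounding whitespace removed (illustrative helper)."""
    return [name.strip() for name in names]


print(strip_names(columns))
```

If you choose to rename columns along these lines in your own notebook, update every later reference, such as the label column name, to match.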
Expected output: A table showing five rows with 15 columns, including age, workclass, education, and income. To confirm that the full dataset loaded, check that the row count is 32,561:
# Verification: confirm row count
print(f"Row count: {data.count()}") # Expected: 32561
Select features and split the data
Select the feature columns and the label column (" income", with its leading space), then split the data into training (75%) and test (25%) sets:
data = data.select([" education", " marital-status", " hours-per-week", " income"])
train, test = data.randomSplit([0.75, 0.25], seed=123)
Verification - confirm the split produced the expected proportions:
print(f"Training rows: {train.count()}") # Expected: about 24,400
print(f"Test rows: {test.count()}") # Expected: about 8,100
The exact counts might vary slightly, but training should contain approximately 75% of the total rows.
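One way to picture why the counts are approximate: in a random split, each row can be assigned to a set independently, so the proportions hold only on average. The following plain-Python sketch illustrates that idea with the standard `random` module. It's a conceptual illustration rather than Spark's implementation, and the `random_split` helper is hypothetical.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test independently (conceptual sketch)."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(32561))  # stand-in for the 32,561 dataset rows
train_rows, test_rows = random_split(rows, 0.75, seed=123)

# The sizes land close to 75/25 without being forced to match exactly.
print(len(train_rows), len(test_rows))

# Every row ends up in exactly one of the two sets.
print(len(train_rows) + len(test_rows) == len(rows))
```

Rerunning the sketch with the same seed reproduces the same split, which is the reason the article's code passes a seed.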
Train the model
Use the TrainClassifier class from synapse.ml.train to train a logistic regression classifier. TrainClassifier wraps a base SparkML classifier, handles string-valued feature columns automatically, and binarizes the label column.
from synapse.ml.train import TrainClassifier
from pyspark.ml.classification import LogisticRegression
model = TrainClassifier(model=LogisticRegression(), labelCol=" income").fit(train)
Expected output: The cell completes without errors. The model variable contains a fitted TrainedClassifierModel.
Verification:
print(f"Model type: {type(model).__name__}") # Expected: TrainedClassifierModel
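To build intuition for what it means to handle string-valued columns and turn a two-valued string label into 0 and 1, the following sketch maps each distinct string to an integer index. It's a conceptual illustration only and doesn't reproduce SynapseML's internals. The `index_values` helper, the sample labels, and the choice to sort the distinct values are all assumptions made for this example.

```python
def index_values(values):
    """Map each distinct string to an integer index (conceptual sketch)."""
    # Sorting the distinct values is an assumption made here so that the
    # mapping is deterministic; a real library may order values differently.
    mapping = {value: index for index, value in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


# Hypothetical label values, written with the leading space used in the CSV.
labels = [" <=50K", " >50K", " <=50K", " <=50K"]
indexed, mapping = index_values(labels)

print(mapping)  # one integer per distinct label string
print(indexed)  # the labels rewritten as integers
```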
Score and evaluate the model
Score the model against the test set, then use the ComputeModelStatistics class to compute accuracy, Area Under the Curve (AUC), precision, and recall:
from synapse.ml.train import ComputeModelStatistics
prediction = model.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.select("accuracy").show()
Verification - view all computed metrics:
metrics.show()
You should see columns for accuracy, precision, recall, and AUC.
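As a reminder of what accuracy, precision, and recall measure, the following sketch computes them from a small set of made-up 0/1 labels by using their standard definitions. The `binary_metrics` helper and the sample data are hypothetical, and the sketch isn't how ComputeModelStatistics is implemented. AUC is left out because it requires predicted scores rather than hard labels.

```python
def binary_metrics(actual, predicted):
    """Compute accuracy, precision, and recall for 0/1 labels (positive class = 1)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

print(binary_metrics(actual, predicted))
```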
Explore SynapseML classes
Use the Python help() function to view documentation for SynapseML classes:
from synapse.ml.train import TrainClassifier
help(TrainClassifier)
Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| AttributeError: module 'urllib' has no attribute 'request' | Running code outside a Fabric notebook (plain Python script) | Change import os, urllib to import os, urllib.request |
| KeyError when selecting columns | Column names in this dataset include leading spaces | Ensure you use " income" (with the leading space), not "income" |
| AnalysisException: cannot resolve column | Column name mismatch | Run data.columns to inspect exact column names |
| TrainClassifier or ComputeModelStatistics import fails | Incorrect import path | Use from synapse.ml.train import TrainClassifier, not from synapse.ml import TrainClassifier |
| Spark session not available (NameError: name 'spark' is not defined) | Notebook not attached to lakehouse or Spark not started | Attach your notebook to a lakehouse and restart the session |
| Download fails with timeout or URLError | Network restrictions in your workspace | Upload the CSV to your lakehouse manually and read with spark.read.csv("Files/AdultCensusIncome.csv", header=True, inferSchema=True) |
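If the download step is the one that fails, it can help to surface the error without leaving a partial file behind. The following sketch shows one possible approach that uses only the standard library. The `fetch_if_missing` helper and its default timeout are hypothetical and aren't part of SynapseML or Fabric.

```python
import os
import urllib.request


def fetch_if_missing(url, path, timeout=30):
    """Download url to path unless the file exists; return True if the file is available."""
    if os.path.isfile(path):
        return True
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = response.read()
    except OSError as error:
        # URLError and socket timeouts are both subclasses of OSError.
        print(f"Download failed: {error}")
        return False
    # Write only after the whole payload arrived, so a failure leaves no partial file.
    with open(path, "wb") as handle:
        handle.write(payload)
    return True


# Example call (commented out so that the sketch makes no network request on its own):
# fetch_if_missing(
#     "https://mmlspark.azureedge.net/datasets/AdultCensusIncome.csv",
#     "AdultCensusIncome.csv",
# )
```

When the helper returns False, the manual upload described in the table remains the fallback.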
Clean up resources
The notebook doesn't create any persistent Azure resources beyond the lakehouse files. To clean up the downloaded CSV:
import os
if os.path.isfile("AdultCensusIncome.csv"):
    os.remove("AdultCensusIncome.csv")
    print("Cleaned up AdultCensusIncome.csv")