This tutorial presents an end-to-end example of a Synapse Data Science workflow in Microsoft Fabric. The scenario builds a model to predict whether bank customers churn. The churn rate, also called the rate of attrition, is the rate at which bank customers end their business with the bank.
This tutorial covers these steps:
- Install custom libraries
- Load the data
- Understand and process the data through exploratory data analysis
- Use scikit-learn and LightGBM to train machine learning models
- Evaluate and save the final machine learning model
- Show the model performance with Power BI visualizations
Prerequisites
Get a Microsoft Fabric subscription. Or, sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
Use the experience switcher on the bottom left side of your home page to switch to Fabric.
- If necessary, create a Microsoft Fabric lakehouse as described in Create a lakehouse in Microsoft Fabric.
Follow along in a notebook
You can choose one of these options to follow along in a notebook:
- Open and run the built-in notebook in the Synapse Data Science experience.
- Upload your notebook from GitHub to the Synapse Data Science experience.
Open the built-in notebook
The sample Customer churn notebook accompanies this tutorial.
To open the sample notebook for this tutorial, follow the instructions in Prepare your system for data science tutorials.
Make sure to attach a lakehouse to the notebook before you start running code.
Import the notebook from GitHub
The AIsample - R Bank Customer Churn.ipynb notebook accompanies this tutorial.
To open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science tutorials to import the notebook to your workspace.
If you'd rather copy and paste the code from this page, you can create a new notebook.
Be sure to attach a lakehouse to the notebook before you start running code.
Step 1: Install custom libraries
For machine learning model development or ad-hoc data analysis, you might need to quickly install a custom library for your Apache Spark session. You have two options to install libraries.
- Use inline installation resources, such as `install.packages` and `devtools::install_version`, to install libraries in your current notebook only.
- Alternatively, create a Fabric environment, install libraries from public sources or upload custom libraries to it, and then have your workspace admin attach the environment as the default for the workspace. All the libraries in the environment then become available for use in any notebooks and Spark job definitions in the workspace. For more information on environments, see Create, configure, and use an environment in Microsoft Fabric.
In this tutorial, use `install.packages()` to install the `imbalance` and `randomForest` libraries. Set `quiet` to `TRUE` to make the output more concise:
# Install imbalance for SMOTE
install.packages("imbalance", quiet = TRUE)
# Install the random forest algorithm
install.packages("randomForest", quiet = TRUE)
Step 2: Load the data
The dataset in churn.csv contains the churn status of 10,000 customers, along with 14 attributes that include:
- Credit score
- Geographical location (Germany, France, Spain)
- Gender (male, female)
- Age
- Tenure (number of years the person was a customer at that bank)
- Account balance
- Estimated salary
- Number of products that a customer purchased through the bank
- Credit card status (whether or not a customer has a credit card)
- Active member status (whether or not the person is an active bank customer)
The dataset also includes row number, customer ID, and customer surname columns. Values in these columns shouldn't influence a customer's decision to leave the bank.
A customer bank account closure event defines the churn for that customer. The dataset `Exited` column records this status. Because we have little context about these attributes, we don't need background information about the dataset; we only want to understand how the attributes contribute to the `Exited` status.
Out of the 10,000 customers, only 2,037 (around 20%) left the bank. Because of this class imbalance, we recommend generating synthetic data.
This table shows a preview of the `churn.csv` data:
| CustomerID | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
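Once the data is loaded in Step 2, a quick tabulation confirms this imbalance. The following is a minimal sketch on a hypothetical miniature dataset; in the tutorial itself you'd run the same lines against the `rdf` data frame read later:

```r
# Hypothetical miniature of churn.csv: 8 customers, 2 of whom exited (25% churn)
toy <- data.frame(
  CustomerId = 1:8,
  Exited     = c(0, 0, 1, 0, 0, 1, 0, 0)
)

print(table(toy$Exited))                                # counts per class
churn_rate <- mean(toy$Exited == 1)                     # fraction of exited customers
cat(sprintf("Churn rate: %.1f%%\n", churn_rate * 100))  # prints "Churn rate: 25.0%"
```

Running `mean(rdf$Exited == 1)` on the real data should return roughly 0.20, matching the 2,037 exited customers described above.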
Download the dataset and upload to the lakehouse
Important
Add a lakehouse to the notebook before you run it. Failure to do so will result in an error.
This code downloads a publicly available version of the dataset, and then stores that data in a Fabric lakehouse:
library(fs)
library(httr)
remote_url <- "https://sdkstorerta.blob.core.windows.net/churnblob"
file_list <- c("churn.csv")
download_path <- "/lakehouse/default/Files/churn/raw"
if (!dir_exists("/lakehouse/default")) {
stop("Default lakehouse not found, please add a lakehouse and restart the session.")
}
dir_create(download_path, recurse = TRUE)
for (fname in file_list) {
if (!file_exists(paste0(download_path, "/", fname))) {
r <- GET(paste0(remote_url, "/", fname), timeout(30))
writeBin(content(r, "raw"), paste0(download_path, "/", fname))
}
}
print("Downloaded demo data files into lakehouse.")
Start recording the time needed to run this notebook:
# Record the notebook running time
ts <- as.numeric(Sys.time())
Read raw data from the lakehouse
This code reads raw data from the Files section of the lakehouse:
fname <- "churn.csv"
download_path <- "/lakehouse/default/Files/churn/raw"
rdf <- readr::read_csv(paste0(download_path, "/", fname))
Step 3: Perform exploratory data analysis
Display raw data
Use the `head()` or `str()` commands to perform a preliminary exploration of the raw data:
head(rdf)
Perform initial data cleaning
You must convert the R DataFrame to a Spark DataFrame. These operations on the Spark DataFrame clean the raw dataset:
- Drop the rows that have missing data across all columns
- Drop the duplicate rows across the columns `RowNumber` and `CustomerId`
- Drop the columns `RowNumber`, `CustomerId`, and `Surname`
# Transform the R DataFrame to a Spark DataFrame
df <- as.DataFrame(rdf)
clean_data <- function(df) {
sdf <- df %>%
# Drop rows that have missing data across all columns
na.omit() %>%
# Drop duplicate rows in columns: 'RowNumber', 'CustomerId'
dropDuplicates(c("RowNumber", "CustomerId")) %>%
# Drop columns: 'RowNumber', 'CustomerId', 'Surname'
SparkR::select("CreditScore", "Geography", "Gender", "Age", "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary", "Exited")
return(sdf)
}
df_clean <- clean_data(df)
Explore the Spark DataFrame with the `display` command:
display(df_clean)
This code determines the categorical, numerical, and target attributes:
# Determine the dependent (target) attribute
dependent_variable_name <- "Exited"
print(dependent_variable_name)
# Obtain the distinct values for each column
exprs <- lapply(names(df_clean), function(x) alias(countDistinct(df_clean[[x]]), x))
# Use do.call to splice the aggregation expressions to aggregate function
distinct_value_number <- SparkR::collect(do.call(agg, c(x = df_clean, exprs)))
# Determine the categorical attributes
categorical_variables <- names(df_clean)[sapply(names(df_clean), function(col) col %in% c("0") || distinct_value_number[[col]] <= 5 && !(col %in% c(dependent_variable_name)))]
print(categorical_variables)
# Determine the numerical attributes
numeric_variables <- names(df_clean)[sapply(names(df_clean), function(col) coltypes(SparkR::select(df_clean, col)) == "numeric" && distinct_value_number[[col]] > 5)]
print(numeric_variables)
For easier processing and visualization, convert the cleaned Spark DataFrame to an R DataFrame:
# Transform the Spark DataFrame to an R DataFrame
rdf_clean <- SparkR::collect(df_clean)
Show the five-number summary
Use box plots to show the five-number summary (minimum, first quartile, median, third quartile, maximum) for the numerical attributes:
# Set the overall layout of the graphics window
par(mfrow = c(2, 1),
mar = c(2, 1, 2, 1)) # Margin size
for(item in numeric_variables[1:2]){
# Create a box plot
boxplot(rdf_clean[, item],
main = item,
col = "darkgreen",
cex.main = 1.5, # Title size
cex.lab = 1.3, # Axis label size
cex.axis = 1.2, # Axis text size
horizontal = TRUE) # Horizontal orientation
}
# Set the overall layout of the graphics window
par(mfrow = c(3, 1),
mar = c(2, 1, 2, 1)) # Margin size
for(item in numeric_variables[3:5]){
# Create a box plot
boxplot(rdf_clean[, item],
main = item,
col = "darkgreen",
cex.main = 1.5, # Title size
cex.lab = 1.3, # Axis label size
cex.axis = 1.2, # Axis text size
horizontal = TRUE) # Horizontal orientation
}
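The five numbers the box plots display can also be computed directly with base R's `fivenum()`. A toy example on hypothetical credit scores:

```r
# Hypothetical credit scores
scores <- c(350, 480, 600, 619, 650, 700, 720, 850)

# Returns: minimum, lower hinge, median, upper hinge, maximum
fivenum(scores)  # 350.0 540.0 634.5 710.0 850.0
```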
Show the distribution of exited and non-exited customers
Show the distribution of exited versus non-exited customers, across the categorical attributes:
attr_list <- c('Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'NumOfProducts', 'Tenure')
par(mfrow = c(2, 1),
mar = c(2, 1, 2, 1)) # Margin size
for (item in attr_list[1:2]) {
counts <- table(rdf_clean$Exited, rdf_clean[,item])
barplot(counts, main=item, col=c("darkblue","yellow"),
cex.main = 1.5, # Title size
cex.axis = 1.2,
legend = rownames(counts), beside=TRUE)
}
par(mfrow = c(2, 1),
mar = c(2, 2, 2, 1)) # Margin size
for (item in attr_list[3:4]) {
counts <- table(rdf_clean$Exited, rdf_clean[,item])
barplot(counts, main=item, col=c("darkblue","yellow"),
cex.main = 1.5, # Title size
cex.axis = 1.2,
legend = rownames(counts), beside=TRUE)
}
par(mfrow = c(2, 1),
mar = c(2, 1, 2, 1)) # Margin size
for (item in attr_list[5:6]) {
counts <- table(rdf_clean$Exited, rdf_clean[,item])
barplot(counts, main=item, col=c("darkblue","yellow"),
cex.main = 1.5, # Title size
cex.axis = 1.2,
legend = rownames(counts), beside=TRUE)
}
Show the distribution of numerical attributes
Use a histogram to show the frequency distribution of numerical attributes:
# Set the overall layout of the graphics window
par(mfrow = c(2, 1),
mar = c(2, 4, 2, 4) + 0.1) # Margin size
# Create a histogram
for (item in numeric_variables[1:2]) {
hist(rdf_clean[, item],
main = item,
col = "darkgreen",
xlab = item,
cex.main = 1.5, # Title size
cex.axis = 1.2,
breaks = 20) # Number of bins
}
# Set the overall layout of the graphics window
par(mfrow = c(3, 1),
mar = c(2, 4, 2, 4) + 0.1) # Margin size
# Create a histogram
for (item in numeric_variables[3:5]) {
hist(rdf_clean[, item],
main = item,
col = "darkgreen",
xlab = item,
cex.main = 1.5, # Title size
cex.axis = 1.2,
breaks = 20) # Number of bins
}
Perform feature engineering
This feature engineering generates new attributes based on current attributes:
rdf_clean$NewTenure <- rdf_clean$Tenure / rdf_clean$Age
rdf_clean$NewCreditsScore <- as.numeric(cut(rdf_clean$CreditScore, breaks=quantile(rdf_clean$CreditScore, probs=seq(0, 1, by=1/6)), include.lowest=TRUE, labels=c(1, 2, 3, 4, 5, 6)))
rdf_clean$NewAgeScore <- as.numeric(cut(rdf_clean$Age, breaks=quantile(rdf_clean$Age, probs=seq(0, 1, by=1/8)), include.lowest=TRUE, labels=c(1, 2, 3, 4, 5, 6, 7, 8)))
rdf_clean$NewBalanceScore <- as.numeric(cut(rank(rdf_clean$Balance), breaks=quantile(rank(rdf_clean$Balance, ties.method = "first"), probs=seq(0, 1, by=1/5)), include.lowest=TRUE, labels=c(1, 2, 3, 4, 5)))
rdf_clean$NewEstSalaryScore <- as.numeric(cut(rdf_clean$EstimatedSalary, breaks=quantile(rdf_clean$EstimatedSalary, probs=seq(0, 1, by=1/10)), include.lowest=TRUE, labels=c(1:10)))
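The pattern used above, `cut()` with `quantile()` breaks, assigns each value to an equal-frequency bin. A self-contained toy example with hypothetical values and four bins:

```r
# Hypothetical numeric attribute
x <- c(10, 20, 30, 40, 50, 60, 70, 80)

# Split at the quartiles so each bin holds roughly the same number of values
bins <- as.numeric(cut(x,
                       breaks = quantile(x, probs = seq(0, 1, by = 1/4)),
                       include.lowest = TRUE,
                       labels = 1:4))
bins  # 1 1 2 2 3 3 4 4
```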
Perform one-hot encoding
Use one-hot encoding to convert the categorical attributes to numerical attributes, to feed them into the machine learning model:
rdf_clean <- cbind(rdf_clean, model.matrix(~Geography+Gender-1, data=rdf_clean))
rdf_clean <- subset(rdf_clean, select = - c(Geography, Gender))
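To see what `model.matrix` produces, here's a toy example with hypothetical rows. Note that with the `-1` (no intercept) formula, only the first factor is fully dummy-coded; the second factor keeps a reference level:

```r
# Hypothetical categorical data
toy <- data.frame(Geography = c("France", "Spain", "Germany"),
                  Gender    = c("Female", "Male", "Female"))

# Columns: GeographyFrance, GeographyGermany, GeographySpain, GenderMale
m <- model.matrix(~ Geography + Gender - 1, data = toy)
print(m)
```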
Create a delta table to generate the Power BI report
table_name <- "rdf_clean"
# Create a Spark DataFrame from an R DataFrame
sparkDF <- as.DataFrame(rdf_clean)
write.df(sparkDF, paste0("Tables/", table_name), source = "delta", mode = "overwrite")
cat(paste0("Spark DataFrame saved to delta table: ", table_name))
Summary of observations from the exploratory data analysis
- Most of the customers are from France. Spain has the lowest churn rate, compared to France and Germany.
- Most customers have credit cards.
- Some customers are both over the age of 60 and have credit scores below 400. However, these values can't be considered outliers.
- Few customers have more than two bank products.
- Inactive customers have a higher churn rate.
- Gender and tenure years have little impact on a customer's decision to close a bank account.
Step 4: Perform model training
With the data in place, you can now define the model. Apply random forest and LightGBM models; the `randomForest` and `lightgbm` libraries implement these models with only a few lines of code.
Load the delta table from the lakehouse. You can use other delta tables with the lakehouse as the source.
SEED <- 12345
rdf_clean <- read.df("Tables/rdf_clean", source = "delta")
df_clean <- as.data.frame(rdf_clean)
Import randomForest and LightGBM:
library(randomForest)
library(lightgbm)
Prepare the training and test datasets:
set.seed(SEED)
y <- factor(df_clean$Exited)
X <- df_clean[, !(colnames(df_clean) %in% c("Exited"))]
split <- base::sample(c(TRUE, FALSE), nrow(df_clean), replace = TRUE, prob = c(0.8, 0.2))
X_train <- X[split,]
X_test <- X[!split,]
y_train <- y[split]
y_test <- y[!split]
train_df <- cbind(X_train, y_train)
Apply SMOTE to the training dataset
Imbalanced classification poses a problem: the minority class offers too few examples for a model to effectively learn the decision boundary. To handle this, the Synthetic Minority Oversampling Technique (SMOTE) is the most widely used way to synthesize new samples for the minority class. Access SMOTE with the `imbalance` library that you installed in step 1.
Apply SMOTE only to the training dataset. You must leave the test dataset in its original imbalanced distribution, to get a valid approximation of model performance on the original data. This experiment represents the situation in production.
First, show the distribution of classes in the dataset, to learn which class is the minority class. The ratio of minority class to majority class is computed by `imbalanceRatio` in the `imbalance` library.
original_ratio <- imbalance::imbalanceRatio(train_df, classAttr = "y_train")
message(sprintf("Original imbalance ratio is %.2f%% as {Size of minority class}/{Size of majority class}.", original_ratio * 100))
message(sprintf("Positive class(Exited) takes %.2f%% of the dataset.", round(sum(train_df$y_train == 1)/nrow(train_df) * 100, 2)))
message(sprintf("Negative class(Non-Exited) takes %.2f%% of the dataset.", round(sum(train_df$y_train == 0)/nrow(train_df) * 100, 2)))
In the training dataset:

- `Positive class (Exited)` refers to the minority class, which takes 20.34% of the dataset.
- `Negative class (Non-Exited)` refers to the majority class, which takes 79.66% of the dataset.
The next cell rewrites the oversample function of the `imbalance` library to generate a balanced dataset:
binary_oversample <- function(train_df, X_train, y_train, class_Attr = "Class"){
negative_num <- sum(y_train == 0) # Compute the number of the negative class
positive_num <- sum(y_train == 1) # Compute the number of the positive class
difference_num <- abs(negative_num - positive_num) # Compute the difference between the negative and positive classes
originalShape <- imbalance:::datasetStructure(train_df, class_Attr) # Get the original dataset schema
new_samples <- smotefamily::SMOTE(X_train, y_train, dup_size = ceiling(max(negative_num, positive_num)/min(negative_num, positive_num))) # Use SMOTE to oversample
new_samples <- new_samples$syn_data # Get the synthetic data
new_samples <- new_samples[base::sample(1:nrow(new_samples), size = difference_num), ] # Sample and shuffle the synthetic data
new_samples <- new_samples[, -ncol(new_samples)] # Remove the class column
new_samples <- imbalance:::normalizeNewSamples(originalShape, new_samples) # Normalize the synthetic data
new_train_df <- rbind(train_df, new_samples) # Concatenate original and synthetic data by row
new_train_df <- new_train_df[base::sample(nrow(new_train_df)), ] # Shuffle the training dataset
new_train_df
}
To learn more about SMOTE, see the *Package imbalance* reference manual and the *Working with imbalanced datasets* vignette on the CRAN website.
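The core idea of SMOTE, creating a synthetic point by interpolating between a minority sample and one of its nearest minority-class neighbors, can be sketched in a few lines of base R. This is illustrative only; the tutorial itself relies on `smotefamily::SMOTE`:

```r
# Three hypothetical minority-class samples with two features each
minority <- matrix(c(1.0, 1.0,
                     2.0, 1.0,
                     1.5, 2.0), ncol = 2, byrow = TRUE)

smote_one <- function(X) {
  i   <- sample(nrow(X), 1)          # pick a random minority sample
  d   <- as.matrix(dist(X))[i, ]     # distances to all minority samples
  j   <- order(d)[2]                 # nearest neighbor (order(d)[1] is i itself)
  gap <- runif(1)                    # random interpolation factor in [0, 1]
  X[i, ] + gap * (X[j, ] - X[i, ])   # synthetic sample on the segment from i to j
}

set.seed(0)
print(smote_one(minority))           # a new point inside the minority region
```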
Oversample the training dataset
Use the newly defined oversample function to perform oversampling on the training dataset:
library(dplyr)
new_train_df <- binary_oversample(train_df, X_train, y_train, class_Attr="y_train")
smote_ratio <- imbalance::imbalanceRatio(new_train_df, classAttr = "y_train")
message(sprintf("Imbalance ratio after using SMOTE is %.2f%%\n", smote_ratio * 100))
Train the model
Use random forest to train the model, with `mtry = 4` (four features considered at each split):
set.seed(1)
rfc1_sm <- randomForest(y_train ~ ., data = new_train_df, ntree = 500, mtry = 4, nodesize = 3)
y_pred <- predict(rfc1_sm, X_test, type = "response")
cr_rfc1_sm <- caret::confusionMatrix(y_pred, y_test)
cm_rfc1_sm <- table(y_pred, y_test)
roc_auc_rfc1_sm <- pROC::auc(pROC::roc(as.numeric(y_test), as.numeric(y_pred)))
print(paste0("The auc is ", roc_auc_rfc1_sm))
Use random forest to train the model, with `mtry = 6` (six features considered at each split):
rfc2_sm <- randomForest(y_train ~ ., data = new_train_df, ntree = 500, mtry = 6, nodesize = 3)
y_pred <- predict(rfc2_sm, X_test, type = "response")
cr_rfc2_sm <- caret::confusionMatrix(y_pred, y_test)
cm_rfc2_sm <- table(y_pred, y_test)
roc_auc_rfc2_sm <- pROC::auc(pROC::roc(as.numeric(y_test), as.numeric(y_pred)))
print(paste0("The auc is ", roc_auc_rfc2_sm))
Train the model with LightGBM:
set.seed(42)
X_train <- new_train_df[, !(colnames(new_train_df) %in% c("y_train"))]
y_train <- as.numeric(as.character(new_train_df$y_train))
y_test <- as.numeric(as.character(y_test))
lgbm_sm_model <- lgb.train(
    params = list(objective = "binary", learning_rate = 0.1, max_delta_step = 2, max_depth = 10, metric = "binary_logloss"),
    data = lgb.Dataset(as.matrix(X_train), label = as.vector(y_train)),
    nrounds = 100,
    valids = list(test = lgb.Dataset(as.matrix(X_test), label = as.vector(as.numeric(y_test))))
)
y_pred <- as.numeric(predict(lgbm_sm_model, as.matrix(X_test)) > 0.5)
accuracy <- mean(y_pred == as.vector(y_test))
cr_lgbm_sm <- caret::confusionMatrix(as.factor(y_pred), as.factor(as.vector(y_test)))
cm_lgbm_sm <- table(y_pred, as.vector(y_test))
roc_auc_lgbm_sm <- pROC::auc(pROC::roc(as.vector(y_test), y_pred))
print(paste0("The auc is ", roc_auc_lgbm_sm))
Step 5: Evaluate and save the final machine learning model
Assess the performance of the saved models on the test dataset:
ypred_rfc1_sm <- predict(rfc1_sm, X_test, type = "response")
ypred_rfc2_sm <- predict(rfc2_sm, X_test, type = "response")
ypred_lgbm1_sm <- as.numeric(predict(lgbm_sm_model, as.matrix(X_test)) > 0.5)
A confusion matrix shows the true/false positives and negatives. Develop a script that plots the confusion matrix, to evaluate the classification accuracy:
plot_confusion_matrix <- function(cm, classes, normalize=FALSE, title='Confusion matrix', cmap=heat.colors(10)) {
if (normalize) {
cm <- cm / rowSums(cm)
}
op <- par(mar = c(6,6,3,1))
image(1:nrow(cm), 1:ncol(cm), t(cm[nrow(cm):1,]), col = cmap, xaxt = 'n', yaxt = 'n', main = title, xlab = "Prediction", ylab = "Reference")
axis(1, at = 1:nrow(cm), labels = classes, las = 2)
axis(2, at = 1:ncol(cm), labels = rev(classes))
for (i in seq_len(nrow(cm))) {
for (j in seq_len(ncol(cm))) {
text(i, ncol(cm) - j + 1, cm[j,i], cex = 0.8)
}
}
par(op)
}
Create a confusion matrix for the random forest classifier, with four features:
cfm <- table(y_test, ypred_rfc1_sm)
plot_confusion_matrix(cfm, classes=c('Non Churn','Churn'), title='Random Forest with features of 4')
tn <- cfm[1,1]
fp <- cfm[1,2]
fn <- cfm[2,1]
tp <- cfm[2,2]
Create a confusion matrix for the random forest classifier, with six features:
cfm <- table(y_test, ypred_rfc2_sm)
plot_confusion_matrix(cfm, classes=c('Non Churn','Churn'), title='Random Forest with features of 6')
tn <- cfm[1,1]
fp <- cfm[1,2]
fn <- cfm[2,1]
tp <- cfm[2,2]
Create a confusion matrix for LightGBM:
cfm <- table(y_test, ypred_lgbm1_sm)
plot_confusion_matrix(cfm, classes=c('Non Churn','Churn'), title='LightGBM')
tn <- cfm[1,1]
fp <- cfm[1,2]
fn <- cfm[2,1]
tp <- cfm[2,2]
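The `tn`, `fp`, `fn`, and `tp` counts extracted above feed the standard classification metrics. A minimal sketch with hypothetical counts (substitute the values from your own confusion matrix):

```r
# Hypothetical confusion-matrix counts
tn <- 1500; fp <- 90; fn <- 110; tp <- 300

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)    # how many predicted churns were real
recall    <- tp / (tp + fn)    # how many real churns were caught
f1        <- 2 * precision * recall / (precision + recall)
cat(sprintf("Accuracy %.3f, Precision %.3f, Recall %.3f, F1 %.3f\n",
            accuracy, precision, recall, f1))
# Accuracy 0.900, Precision 0.769, Recall 0.732, F1 0.750
```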
Save results for Power BI
Save the delta frame to the lakehouse, to move the model prediction results to a Power BI visualization:
df_pred <- X_test
df_pred$y_test <- y_test
df_pred$ypred_rfc1_sm <- ypred_rfc1_sm
df_pred$ypred_rfc2_sm <- ypred_rfc2_sm
df_pred$ypred_lgbm1_sm <- ypred_lgbm1_sm
table_name <- "df_pred_results"
sparkDF <- as.DataFrame(df_pred)
write.df(sparkDF, paste0("Tables/", table_name), source = "delta", mode = "overwrite", overwriteSchema = "true")
cat(paste0("Spark DataFrame saved to delta table: ", table_name))
Step 6: Access visualizations in Power BI
Access your saved table in Power BI:
- On the left, select OneLake data hub.
- Select the lakehouse that you added to this notebook.
- In the Open this Lakehouse section, select Open.
- On the ribbon, select New semantic model. Select `df_pred_results`, and then select Continue to create a new Power BI semantic model linked to the predictions.
- At the top of the dataset page, select New report to open the Power BI report authoring page.
The following screenshot shows some example visualizations. The data panel shows the delta tables and the columns you can select from a table. After you choose an appropriate category (x) axis and value (y) axis, you can pick filters and aggregation functions, for example, a sum or average of a table column.
Note
The screenshot is an illustrated example that shows the analysis of the saved prediction results in Power BI. For a real customer churn use case, platform users might need a more thorough set of visualizations, based on subject matter expertise and on the metrics that the organization and its business analytics team have standardized.
The Power BI report shows that customers who use more than two of the bank products have a higher churn rate. However, few customers had more than two products. (See the plot in the lower-left panel.) The bank should collect more data but also investigate other features that correlate with more products.
Bank customers in Germany have a higher churn rate compared to customers in France and Spain. (See the plot in the lower-right panel.) Based on the report results, an investigation into the factors that encouraged customers to leave might help.
Most customers are middle-aged (between 25 and 45), but customers between 45 and 60 tend to exit more often.
Finally, customers with lower credit scores would most likely leave the bank for other financial institutions. The bank should explore ways to encourage customers with lower credit scores and account balances to stay with the bank.
# Determine the entire runtime
cat(paste0("Full run cost ", as.integer(as.numeric(Sys.time()) - ts), " seconds.\n"))