Getting started with Machine Learning–Wisconsin Breast Cancer Dataset
In this post I will show you, step by step, how to create a basic two-class machine learning experiment using breast cancer data. This post is part of a series of two-class prediction examples to help you learn how to create experiments using Azure Machine Learning Studio.
For a more comprehensive introduction to data science and Azure Machine Learning Studio, check out Data Science and Machine Learning Essentials on MVA.
Creating a Machine Learning Workspace
To use Azure Machine Learning Studio from your Azure account, you need a Machine Learning workspace. This workspace contains the tools you need to create, manage, and publish machine learning experiments.
To create a workspace, sign in to your Microsoft Azure account.
How do I get access to Azure Machine Learning Studio?
Azure Machine Learning Studio is part of Microsoft Azure. Microsoft Azure is a paid service, but there are a number of programs and trials you can use to explore its capabilities:
- You can try the guest sign in for Azure ML Studio to explore many (but not all) features of Azure Machine Learning Studio
- Sign up for a one month free trial of Microsoft Azure
- Students get some Azure features for free through DreamSpark. Learn how to sign up for student Azure (unfortunately Azure ML Studio is not included in the DreamSpark/Azure offer)
- Will you use Machine Learning in your start-up? Check out Microsoft BizSpark, a program for startups which includes Azure benefits through MSDN
- If you work at a company, ask if you have an MSDN subscription. You may already have access to Azure
- If you want to use it in a course at school, faculty can apply for Azure Education grants, which provide all students in the class with a 6-month Azure pass
- If you want to use it for academic research, you can apply for Azure Research grants at azure4research.com
1. Navigate to the Microsoft Azure portal (portal.azure.com) and log in using your Microsoft account credentials
2. In the Microsoft Azure portal create a new machine learning workspace. Select + New | Data + Analytics | Machine Learning
You will be redirected to the original Azure portal to enter the details for your machine learning workspace.
1. Enter a WORKSPACE NAME for your workspace
NOTE: Later, you can share the experiments you're working on by inviting others to your workspace. You can do this in Machine Learning Studio on the SETTINGS page. You just need the Microsoft account or organizational account for each user.
2. Specify the Azure LOCATION
3. Select an existing Azure STORAGE ACCOUNT or select Create a new storage account to create a new one and give your new storage account a name.
4. Select CREATE AN ML WORKSPACE.
Creating a new experiment in Azure Machine Learning Studio
After your Machine Learning workspace is created, you will see it listed on the portal under MACHINE LEARNING. At the time this post was written, Machine Learning workspaces are always displayed in the Azure Classic portal (even if you select the menu option from the new portal to create them). At some point the new portal will be updated so you can list them without going to the Classic view.
Once your Machine Learning workspace is created, select your workspace from the list and then select Sign-in to ML Studio to access the Machine Learning Studio so you can create your first experiment!
When prompted to take a tour select Not Now. You may want to take a tour later when you are exploring this tool on your own.
At the bottom of the screen select +NEW
Change the title at the top of the experiment to read “Breast Cancer Experiment”
Loading the data set
The Wisconsin Breast Cancer data set is not one of the sample data sets already loaded in Azure Machine Learning Studio. The data used in this example comes from the University of Wisconsin Hospitals and was provided by Dr. William H. Wolberg. You can download the dataset file breast-cancer-wisconsin.data here.
Once you have downloaded the file you will need to create a dataset in Azure Machine Learning Studio for the breast-cancer-wisconsin.data file.
Select + NEW at the bottom of the screen
Select DATASET | FROM LOCAL FILE
1. Select the DATA TO UPLOAD by browsing to select the csv file you downloaded containing the breast cancer data.
2. Enter NAME FOR THE NEW DATASET
3. Specify the TYPE FOR THE NEW DATASET as Generic CSV File with a header (.csv). This indicates that we have a csv file and that the first row of the file contains the headers for the data columns
4. Enter a description of the dataset to help you remember the dataset contents
5. Select the checkmark to start uploading the data into a dataset
Expand Saved Datasets | My Datasets and drag your newly created Breast cancer dataset to the experiment
Right click on the dataset on your worksheet and select dataset | visualize from the pop-up menu. Explore the dataset by clicking on different columns. It’s essential in Machine Learning to be familiar with your data. This dataset contains information about the characteristics of tumours and whether those tumours were benign or malignant.
- Sample Code number is an id number assigned to the sample
- Clump Thickness
- Uniformity of Cell Size
- Uniformity of Cell Shape
- Marginal Adhesion
- Single Epithelial Cell Size
- Bare Nuclei
- Bland Chromatin
- Normal Nucleoli
- Mitoses
- Class: 2 for benign, 4 for malignant
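If you would like to look at the same columns outside the Studio, the following is a minimal Python sketch rather than part of the Studio workflow. It assumes a CSV with a header row using the column names listed above. The three sample rows are illustrative values I made up, and the assumption that a missing value shows up as a non-numeric marker such as ? should be checked against your own copy of the file.

```python
import csv
import io

# Column names follow the list above; the header text in your file may differ.
COLUMNS = [
    "Sample code number", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion",
    "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
    "Normal Nucleoli", "Mitoses", "Class",
]

# Illustrative rows only (not copied from the real file).
SAMPLE = "\n".join([
    ",".join(COLUMNS),
    "1001,5,1,1,1,2,1,3,1,1,2",
    "1002,8,10,10,8,7,10,9,7,1,4",
    "1003,4,2,1,1,2,?,3,1,1,2",
]) + "\n"


def parse_rows(text):
    """Read CSV text with a header row into dicts of ints (None if not numeric)."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            name: int(value) if value.strip().isdigit() else None
            for name, value in record.items()
        })
    return rows


rows = parse_rows(SAMPLE)

# Count how many samples fall into each class (2 = benign, 4 = malignant).
class_counts = {}
for row in rows:
    class_counts[row["Class"]] = class_counts.get(row["Class"], 0) + 1
print(class_counts)
```

To try this on your downloaded file, read the file contents into a string and pass it to parse_rows in place of SAMPLE.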
We are going to use Machine Learning to create a model that predicts whether a tumour is benign or malignant
Selecting Features for the Machine Learning Experiment
Some of the columns in the dataset are not meaningful for predicting whether a tumour is malignant. For example, Sample Code number is just a number assigned to each sample.
Let’s select only the significant features in our dataset to use in our machine learning experiment.
Type “Select” into the search bar and drag the Select Columns in Dataset task to the workspace. Connect the output of your dataset to the input of the Select Columns in Dataset task.
The Select Columns in Dataset task allows you to specify which columns in the data set you think are significant to a prediction (i.e. your features). You need to look at the data in the dataset and decide which columns represent data that you think will affect whether or not a tumour is malignant. You also need to select the column you want to predict. In this case we are going to try to predict the value of Class. This will contain a value of 2 if the tumour is benign and a value of 4 if the tumour is malignant.
Click on the Select Columns in Dataset task. On the properties pane on the right hand side, select Launch column selector. Select the columns you think affect whether or not a tumour is malignant as well as the column we want to predict: Class. In the following screenshot, I selected all the columns except Sample Code number.
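Conceptually, selecting columns just means dropping the fields you do not want the model to see. The sketch below shows that idea in plain Python. It is not how the Select Columns in Dataset task is implemented, and the function names and the shortened example row are hypothetical.

```python
ID_COLUMN = "Sample code number"
LABEL_COLUMN = "Class"


def select_columns(row, excluded=(ID_COLUMN,)):
    """Return a copy of the row without the excluded columns."""
    return {name: value for name, value in row.items() if name not in excluded}


def features_and_label(row):
    """Split a selected row into its feature values and the Class label."""
    kept = select_columns(row)
    label = kept.pop(LABEL_COLUMN)
    return kept, label


# Shortened, illustrative row (a real row has nine feature columns).
example_row = {
    "Sample code number": 1001,
    "Clump Thickness": 5,
    "Uniformity of Cell Size": 1,
    "Mitoses": 1,
    "Class": 2,
}

features, label = features_and_label(example_row)
print(sorted(features), label)
```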
Setting aside data for testing
Whenever we execute machine learning experiments, we use some of our data to train the model and we put some data aside to test the model. In Azure Machine Learning Studio, we use the Split Data task to put aside data for testing.
The Split Data task allows us to divide up our data. We need some of the data to find patterns, and we need to save some of the data to test whether the model we create makes successful predictions. Typically you will split the data 80/20 or 70/30.
Type “split” into the search bar and drag the Split Data task to the workspace. Connect the output of the Select Columns in Dataset task to the input of the Split Data task.
Click on the Split Data task to bring up its properties and specify 0.8 as the Fraction of rows in the first output.
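To make the 80/20 idea concrete, here is a small Python sketch of a randomized split. It assumes the rows are shuffled before being divided. The function name, the seed parameter, and the rounding rule are my own choices for illustration, not a description of what the Split Data task does internally.

```python
import random


def split_rows(rows, fraction=0.8, seed=42):
    """Shuffle the rows and return (first_output, second_output)."""
    shuffled = list(rows)
    # A fixed seed makes the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


# 100 stand-in records: 80 go to training, 20 are held back for testing.
train, test = split_rows(range(100), fraction=0.8)
print(len(train), len(test))
```

Every record ends up in exactly one of the two outputs, so the model is never tested on a row it was trained on.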
Training the model
Now we can get Azure Machine Learning Studio to train the model so we can find the patterns in the historical data to make predictions for new records.
Type “train model” into the search bar. Drag the Train Model task to the workspace. Connect the first output (the one on the left) of the Split Data task to the rightmost input of the Train Model task. This will take 80% of our data and use it to train our model to make predictions.
We need to tell the train model task which column we are trying to predict with our model. In our case we are trying to predict the value of the column Class which indicates if a tumour is malignant or benign.
Click on the Train Model task. In the properties window select Launch Column Selector. Select the column Class.
If you are a data scientist who creates your own algorithms, you could now import your own R code to try and analyze the patterns. But, we can also use one of the existing built-in standard algorithms.
Different types of machine learning use different algorithms. Since we are trying to predict whether an output has one of two values, we want to use a two-class algorithm to train our model. Two-class algorithms are used to predict outcomes that can only have two possible values. In our case that is a Class value of 2 or 4, which indicates whether the tumour is benign or malignant.
Type “two-class” into the search bar. You will see a number of different classification algorithms listed. Each algorithm has its advantages and disadvantages. Check out the Azure Machine Learning Studio Cheat Sheet for a quick reference guide to algorithm selection. I am going to select the Two-Class Decision Forest to train my model. Select one of the two-class algorithms and drag it to the workspace.
Connect the output of the Algorithm task to the leftmost input of the train model task.
Testing your model
After the model is trained, we need to see how well it makes predictions, so we need to score the model by having it test against the 20% of the data we split to our second output using the Split Data task.
Type “score” into the search bar and drag the Score Model task to the workspace. Connect the output of Train Model to the left input of the Score model task. Connect the right output of the Split Data task to the right input of the Score Model task as shown in the following screenshot.
Now we need a report on our test results.
Type “evaluate” into the search bar and drag the Evaluate Model task to the bottom of the workspace. Connect the output of the Score model task to the left input of the Evaluate Model task.
You are now ready to run your experiment!
Press Run on the bottom toolbar. You will see green checkmarks appear on each task as it completes. When the entire experiment is completed you can check how well your model makes predictions.
How to interpret your results
To see your test results, right click on the Evaluate Model task and select “Evaluation results | Visualize”.
The closer the graph is to a straight diagonal line the more your model is guessing randomly. You want your line to get as close to the upper left corner as possible.
If you scroll down you can see the detailed results. AUC (Area Under Curve) is a great overall indicator of your model performance. The closer AUC is to 1, the better your model is making predictions.
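For intuition about what AUC measures, the sketch below computes it in plain Python using the pairwise interpretation: the fraction of (malignant, benign) pairs in which the malignant sample received the higher score, with ties counting as half. It assumes the score is the model's estimated probability of the malignant class (Class 4). The labels and scores are made-up illustrative values, and this is not the code the Studio runs.

```python
def auc(labels, scores, positive=4):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for label, s in zip(labels, scores) if label == positive]
    neg = [s for label, s in zip(labels, scores) if label != positive]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # malignant sample ranked above benign sample
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Illustrative values only: actual classes and assumed malignancy scores.
labels = [4, 4, 2, 2, 4, 2]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(auc(labels, scores))
```

A model that ranks every malignant sample above every benign one scores 1.0, while random guessing lands near 0.5.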
You can also see the number of false and true positive and negative predictions
- True positives are how often your model correctly predicted a tumour was a class 4 (malignant)
- False positives are how often your model predicted a tumour was a class 4 (malignant) when it was a class 2 (benign) (i.e. your model predicted incorrectly)
- True negatives indicate how often your model correctly predicted a tumour was a class 2 (benign)
- False negatives indicate how often your model predicted a tumour was class 2 (benign) when in fact it was class 4 (malignant) (i.e. your model predicted incorrectly)
You want high values for true positives and true negatives, and low values for false positives and false negatives.
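The four counts above can be reproduced by hand from the actual and predicted classes. The following Python sketch treats class 4 (malignant) as the positive class, as in the list above. The two short lists are illustrative values, not results from the experiment.

```python
def confusion_counts(actual, predicted, positive=4):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["TP"] += 1   # predicted malignant, was malignant
        elif p == positive:
            counts["FP"] += 1   # predicted malignant, was benign
        elif a == positive:
            counts["FN"] += 1   # predicted benign, was malignant
        else:
            counts["TN"] += 1   # predicted benign, was benign
    return counts


def accuracy(counts):
    """Fraction of all predictions that were correct."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total


# Illustrative values only.
actual = [4, 4, 2, 2, 4, 2, 2, 4]
predicted = [4, 2, 2, 2, 4, 4, 2, 4]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(counts))
```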
Creating a web service for your trained model
Once you have trained a model with a satisfactory level of accuracy, how do you use it? One of the great things about Azure Machine Learning Studio is how easy it is to take your model and deploy it as a web service. Then you can simply have a website or app call the web service and pass in a set of values for the selected columns, and the web service will return the predicted value and the confidence of the result.
Convert the training experiment to a predictive experiment
Once you've trained your model, you're ready to use it to make predictions for new data. To do this, you convert your training experiment into a predictive experiment. By converting to a predictive experiment, you're getting your trained model ready to be deployed as a web service. Users of the web service will send input data to your model and your model will send back the prediction results.
To convert your training experiment to a predictive experiment, click Run at the bottom of the experiment canvas, then select Set Up Web Service
Select Set Up Web Service, then select Predictive Web Service.
This will create a new predictive experiment for your web service. The predictive experiment doesn’t have as many components as your original experiment. You will notice a few differences:
- You don’t need the data set because when someone calls the web service they will pass in the data to use for the prediction.
- You still need to identify which columns will be used for predictions if you pass in a full record of data.
- Your algorithm and Train Model tasks have now become a single trained model which will be used to analyze the data passed in and make a prediction
- We don’t need to evaluate the model to test its accuracy. All we need is a Score Model task to return a result from our trained model.
- Two new tasks are added to indicate how the data from the web service is input to the experiment, and how the data from the experiment is returned to the web service.
Delete the connection from the Web input to the Select Columns in Dataset task and redraw the connection from the Web input to the Score Model task. If you leave the web input connected to Select Columns in Dataset, the web service will prompt you for values for all the data columns even though we don’t use all of them to make our prediction. If you connect the web input directly to Score Model, the web service will only expect the data columns we selected in our Select Columns in Dataset task, which we determined are relevant for making predictions.
For more details on how to do this conversion, see Convert a Machine Learning training experiment to a predictive experiment.
Deploy the predictive experiment as a web service
Now that the predictive experiment has been sufficiently prepared, you can deploy it as an Azure web service. Using the web service, users can send data to your model and the model will return its predictions.
To deploy your predictive experiment, click Run at the bottom of the experiment canvas. After it runs successfully, select Deploy Web Service. The web service is set up and you are placed in the web service dashboard.
Test the web service
Select the Test link in the web service dashboard. A dialog pops up to ask you for the input data for the service. These are the columns expected by the scoring experiment. Enter a set of data and then select OK. The results generated by the web service are displayed at the bottom of the dashboard.
You may have to scroll down to see all the fields you need to enter
The results of the test will appear at the bottom of the screen.
Select Details to see the full record returned
You will see the record you entered followed by the predicted output and the probability (columns Scored Labels and Scored Probabilities respectively). In the screenshot below, the scored probability for my imaginary tumour is .375 (37.5%) and the predicted outcome for Class is 2 (benign). The value you see returned will vary depending on the data you specified.
Calling the web service from your code
Once you deploy your web service from Machine Learning Studio, you can send data to the service and receive responses programmatically.
The dashboard provides all the information you need to access your web service. For example, the API key is provided to allow authorized access to the service, and API help pages are provided to help you get started writing your code. Select Request/Response if you are going to call the web service passing one record at a time. Select Batch Execution if you are going to pass multiple records to the web service at a time.
On the API help page select Sample code
You will be presented with code samples for calling the web service from C#, Python and R
Replace the apiKey of abc123 with the API key displayed in the dashboard of your web service.
Replace the values with the values you wish to pass into the web service and you can now call the web service from your code to retrieve predictions!
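As a rough idea of what a Request/Response call can look like in Python 3, here is a sketch using only the standard library. Treat the sample code generated on your own API help page as authoritative: the request body shape below (Inputs, input1, ColumnNames, Values, GlobalParameters) and the Bearer authorization header are my assumptions about what that sample typically contains, SERVICE_URL is a hypothetical placeholder, and the column names must match whatever your service expects.

```python
import json
import urllib.request

# Hypothetical placeholders: copy the real values from your web service dashboard.
SERVICE_URL = "https://example.invalid/your-request-response-url"
API_KEY = "abc123"

# Assumed input columns; use the names listed on your API help page.
FEATURE_COLUMNS = [
    "Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape",
    "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei",
    "Bland Chromatin", "Normal Nucleoli", "Mitoses",
]


def build_request(values, url=SERVICE_URL, api_key=API_KEY):
    """Build (but do not send) a POST request carrying one record."""
    payload = {
        "Inputs": {
            "input1": {
                "ColumnNames": FEATURE_COLUMNS,
                "Values": [[str(v) for v in values]],
            }
        },
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers)


def call_service(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Illustrative feature values for one imaginary tumour.
request = build_request([5, 1, 1, 1, 2, 1, 3, 1, 1])
# result = call_service(request)  # uncomment once SERVICE_URL and API_KEY are real
```

Building the request separately from sending it makes it easy to inspect the payload before you spend a call against the live service.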
For more information about accessing a Machine Learning web service, see How to consume a deployed Azure Machine Learning web service.
Congratulations, you have created a machine learning experiment and a web service to make predictions based on your trained model!