Leveraging Microsoft R and in database analytics of SQL Server with R Services through Alteryx Designer
This post is authored by Bharath Sankaranarayan, Principal Program Manager at Microsoft.
Being part of the Microsoft R Product team I get to use Microsoft R regularly, but not often with a drag-and-drop editor that simplifies the effort needed to build a solution using machine learning. In this blog, I will take you through a tour of how Microsoft R Server and SQL Server 2016 with R Services work in Alteryx, something that the Microsoft R product team and Alteryx worked on. The release of Alteryx Designer 11 added capabilities of in-database analytics that was made available in SQL Server 2016. I am going to assume you have Alteryx set up. If not, you can get a trial version of the designer through Alteryx.
I wanted to try how to model training and scoring (prediction) work with Alteryx Designer without writing code, and see how in-database analytics work with SQL Server using the in-DB tools that Alteryx has built. I also wanted to do a comparison with the local Microsoft R version based on Open R. As I already had Microsoft R Server 9.0.1 installed locally, when I installed the Alteryx Predictive Analytics Tools it detected that I had Microsoft R Server and enabled the features. You can follow the instructions here for your needs.
The scenario that I had was a simple one in that I wanted to target customers who are likely to respond to a new marketing campaign as it related closely to the Marketing Campaign Optimization solution that we released. The dataset used was synthetic but nevertheless was good for this. For model training, I used the data from customers who responded to a direct mail and this was labelled data. I also had the data with past transaction data of the customer spend and frequency. Lastly, I used the demographic data that had the geography and the age band. I used these datasets to augment the training dataset. For scoring (prediction) I used the customers with spend and demographic data. As the goal was to determine the likelihood of a customer who is likely to respond, the problem fell into a classification problem.
Alteryx Designer uses workflows and you can think of this as the canvas where you drop the widgets and then you connect the widgets. I hardly had to write much code! All my data was in CSV files but it was not a problem, I loaded the data over to a SQL Server database in Azure. While configuring the SQL Server 2016 VM on Azure I enabled it for R Services, an option that you can do on the Azure Portal. You also have the option of using a Data Science VM that has all the tools ready to go for you including SQL Server, which is packed with a slew of other goodies. I configured the connection to SQL Server through Alteryx and used the ODBC Native Client, but you could use SQL OLEDB or SQL Server – all of them work and this is fantastic.
As I was focused on leveraging in-database capabilities I pushed all the data to SQL Server. The loading of the data was a breeze using the Data Stream-in widget. I used the same widget to load the demographic and the customer spend data. I could quickly review the data via the Browse in-DB widget and filtered to browse the first 100 rows, something that you can configure. Likewise, it was easy to do filtering, joining and de-duping.
The next step was to determine the best algorithm and as this being a binary classification problem I decided to use the Logistic Regression model. As soon as I dropped the Logistic Regression widget on the canvas, the designer automatically knew I was connecting to a SQL Server data source and provided me the in-DB version of the predictive tool. All I had to do was to connect to the right nodes on the widget for modeling and then the right data for scoring. Once I did this I picked the target variable (respond) and the predictor variables (spend and visits) all using the Alteryx Designer.
Once I verified I had the right connections and verified them after fixing the errors which were primarily due to lack of intimate knowledge, I got the workflow going. The Alteryx Designer visually shows which components are executing and provides a summary upon completion (Figure 3). Once I had the basic model working I started adding additional features such as reporting to validate my model all the way to exporting the data from the scoring over to a variety of formats including writing to the database.
While I did the classic Logistic regression, I also tried a few more models like Forest and Decision Trees and it was equally easy to implement and the results were comparable. To compare how the in-DB version performed against the Logistic Regression model that is included with Microsoft R Server I included it in my canvas. Both performed nearly the same in terms of the results but the local version seemed to be a bit faster and as my data size was not very large this was the reason.
Having worked with R, not writing code felt a bit weird but I soon started to get comfortable using a visual editor and started to enjoy using it. I will admit that things did not work on first try akin to coding, but I did not feel that I could not complete the work. I barely knew how to use the Alteryx tool, but within a few days I was being productive. I have only scratched the surface, yet Alteryx designer together with Microsoft R Server combined with in-database analytics with R Services is packed with features that will meet your entire enterprise needs!