R tutorial: Predict NYC taxi fares with binary classification

Applies to: SQL Server 2016 (13.x) and later Azure SQL Managed Instance

In this five-part tutorial series for SQL programmers, you'll learn about R integration in SQL Server Machine Learning Services or on Big Data Clusters.

In this five-part tutorial series for SQL programmers, you'll learn about R integration in SQL Server Machine Learning Services.

In this five-part tutorial series for SQL programmers, you'll learn about R integration in SQL Server 2016 R Services.

In this five-part tutorial series for SQL programmers, you'll learn about R integration in Machine Learning Services in Azure SQL Managed Instance.

You'll build and deploy an R-based machine learning solution using a sample database on SQL Server. You'll use T-SQL, Azure Data Studio or SQL Server Management Studio, and a database engine instance with SQL machine learning and R language support

This tutorial series introduces you to R functions used in a data modeling workflow. Parts include data exploration, building and training a binary classification model, and model deployment. You'll use sample data from the New York City Taxi and Limousine Commission. The model you'll build predicts whether a trip is likely to result in a tip based on the time of day, distance traveled, and pick-up location.

In the first part of this series, you'll install the prerequisites and restore the sample database. In parts two and three, you'll develop some R scripts to prepare your data and train a machine learning model. Then, in parts four and five, you'll run those R scripts inside the database using T-SQL stored procedures.

In this article, you'll:

  • Install prerequisites
  • Restore the sample database

In part two, you'll explore the sample data and generate some plots.

In part three, you'll learn how to create features from raw data by using a Transact-SQL function. You'll then call that function from a stored procedure to create a table that contains the feature values.

In part four, you'll load the modules and call the necessary functions to create and train the model using a SQL Server stored procedure.

In part five, you'll learn how to operationalize the models that you trained and saved in part four.

Note

This tutorial is available in both R and Python. For the Python version, see Python tutorial: Predict NYC taxi fares with binary classification.

Prerequisites

All tasks can be done using Transact-SQL stored procedures in Azure Data Studio or Management Studio.

This tutorial assumes familiarity with basic database operations such as creating databases and tables, importing data, and writing SQL queries. It does not assume you know R and all R code is provided.

Background for SQL developers

The process of building a machine learning solution is a complex one that can involve multiple tools, and the coordination of subject matter experts across several phases:

  • obtaining and cleaning data
  • exploring the data and building features useful for modeling
  • training and tuning the model
  • deployment to production

Development and testing of the actual code is best performed using a dedicated R development environment. However, after the script is fully tested, you can easily deploy it to SQL Server using Transact-SQL stored procedures in the familiar environment of Azure Data Studio or Management Studio. Wrapping external code in stored procedures is the primary mechanism for operationalizing code in SQL Server.

After the model has been saved to the database, you can call the model for predictions from Transact-SQL by using stored procedures.

Whether you're a SQL programmer new to R, or an R developer new to SQL, this five-part tutorial series introduces a typical workflow for conducting in-database analytics with R and SQL Server.

Next steps

In this article, you:

  • Installed prerequisites
  • Restored the sample database