R tutorial: Predict NYC taxi fares with binary classification
Applies to: SQL Server 2016 (13.x) and later Azure SQL Managed Instance
In this five-part tutorial series for SQL programmers, you'll learn about R integration in SQL Server Machine Learning Services or on Big Data Clusters.
In this five-part tutorial series for SQL programmers, you'll learn about R integration in SQL Server Machine Learning Services.
In this five-part tutorial series for SQL programmers, you'll learn about R integration in SQL Server 2016 R Services.
In this five-part tutorial series for SQL programmers, you'll learn about R integration in Machine Learning Services in Azure SQL Managed Instance.
You'll build and deploy an R-based machine learning solution using a sample database on SQL Server. You'll use T-SQL, Azure Data Studio or SQL Server Management Studio, and a database engine instance with SQL machine learning and R language support
This tutorial series introduces you to R functions used in a data modeling workflow. Parts include data exploration, building and training a binary classification model, and model deployment. You'll use sample data from the New York City Taxi and Limousine Commission. The model you'll build predicts whether a trip is likely to result in a tip based on the time of day, distance traveled, and pick-up location.
In the first part of this series, you'll install the prerequisites and restore the sample database. In parts two and three, you'll develop some R scripts to prepare your data and train a machine learning model. Then, in parts four and five, you'll run those R scripts inside the database using T-SQL stored procedures.
In this article, you'll:
- Install prerequisites
- Restore the sample database
In part two, you'll explore the sample data and generate some plots.
In part three, you'll learn how to create features from raw data by using a Transact-SQL function. You'll then call that function from a stored procedure to create a table that contains the feature values.
In part four, you'll load the modules and call the necessary functions to create and train the model using a SQL Server stored procedure.
In part five, you'll learn how to operationalize the models that you trained and saved in part four.
Note
This tutorial is available in both R and Python. For the Python version, see Python tutorial: Predict NYC taxi fares with binary classification.
Prerequisites
- Install SQL Server 2016 R Services
- Starting in SQL Server 2019, the isolation mechanism requires you to give appropriate permissions to the directory where the plot file is stored. For more information on how to set these permissions, see the File permissions section in SQL Server 2019 on Windows: Isolation changes for Machine Learning Services.
- Restore the NYC Taxi demo database
All tasks can be done using Transact-SQL stored procedures in Azure Data Studio or Management Studio.
This tutorial assumes familiarity with basic database operations such as creating databases and tables, importing data, and writing SQL queries. It does not assume you know R and all R code is provided.
Background for SQL developers
The process of building a machine learning solution is a complex one that can involve multiple tools, and the coordination of subject matter experts across several phases:
- obtaining and cleaning data
- exploring the data and building features useful for modeling
- training and tuning the model
- deployment to production
Development and testing of the actual code is best performed using a dedicated R development environment. However, after the script is fully tested, you can easily deploy it to SQL Server using Transact-SQL stored procedures in the familiar environment of Azure Data Studio or Management Studio. Wrapping external code in stored procedures is the primary mechanism for operationalizing code in SQL Server.
After the model has been saved to the database, you can call the model for predictions from Transact-SQL by using stored procedures.
Whether you're a SQL programmer new to R, or an R developer new to SQL, this five-part tutorial series introduces a typical workflow for conducting in-database analytics with R and SQL Server.
Next steps
In this article, you:
- Installed prerequisites
- Restored the sample database