Python tutorial: Predict NYC taxi fares with binary classification

Applies to: SQL Server 2017 (14.x) and later Azure SQL Managed Instance

In this five-part tutorial series for SQL programmers, you'll learn about Python integration in SQL Server Machine Learning Services or on Big Data Clusters.

In this five-part tutorial series for SQL programmers, you'll learn about Python integration in SQL Server Machine Learning Services.

In this five-part tutorial series for SQL programmers, you'll learn about Python integration in Machine Learning Services in Azure SQL Managed Instance.

You'll build and deploy a Python-based machine learning solution using a sample database on SQL Server. You'll use T-SQL, Azure Data Studio or SQL Server Management Studio, and a database instance with SQL machine learning and Python language support.

This tutorial series introduces you to Python functions used in a data modeling workflow. Parts include data exploration, building and training a binary classification model, and model deployment. You'll use sample data from the New York City Taxi and Limousine Commission. The model you'll build predicts whether a trip is likely to result in a tip based on the time of day, distance traveled, and pick-up location.

In the first part of this series, you'll install the prerequisites and restore the sample database. In parts two and three, you'll develop some Python scripts to prepare your data and train a machine learning model. Then, in parts four and five, you'll run those Python scripts inside the database using T-SQL stored procedures.

In this article, you'll:

  • Install prerequisites
  • Restore the sample database

In part two, you'll explore the sample data and generate some plots.

In part three, you'll learn how to create features from raw data by using a Transact-SQL function. You'll then call that function from a stored procedure to create a table that contains the feature values.

In part four, you'll load the modules and call the necessary functions to create and train the model using a SQL Server stored procedure.

In part five, you'll learn how to operationalize the models that you trained and saved in part four.

Note

This tutorial is available in both R and Python. For the R version, see R tutorial: Predict NYC taxi fares with binary classification.

Prerequisites

All tasks can be done using Transact-SQL stored procedures in Azure Data Studio or Management Studio.

This tutorial series assumes familiarity with basic database operations such as creating databases and tables, importing data, and writing SQL queries. It does not assume you know Python and all Python code is provided.

Background for SQL developers

The process of building a machine learning solution is a complex one that can involve multiple tools, and the coordination of subject matter experts across several phases:

  • obtaining and cleaning data
  • exploring the data and building features useful for modeling
  • training and tuning the model
  • deployment to production

Development and testing of the actual code is best performed using a dedicated development environment. However, after the script is fully tested, you can easily deploy it to SQL Server using Transact-SQL stored procedures in the familiar environment of Azure Data Studio or Management Studio. Wrapping external code in stored procedures is the primary mechanism for operationalizing code in SQL Server.

After the model has been saved to the database, you can call the model for predictions from Transact-SQL by using stored procedures.

Whether you're a SQL programmer new to Python, or a Python developer new to SQL, this five-part tutorial series introduces a typical workflow for conducting in-database analytics with Python and SQL Server.

Next steps

In this article, you:

  • Installed prerequisites
  • Restored the sample database