NYC Taxi demo data for SQL Server Python and R tutorials
Applies to: SQL Server 2016 (13.x) and later Azure SQL Managed Instance
This article explains how to set up a sample database consisting of public data from the New York City Taxi and Limousine Commission. This data is used in several R and Python tutorials for in-database analytics on SQL Server. To make the sample code run quicker, we created a representative 1% sampling of the data. On your system, the database backup file is slightly over 90 MB, providing 1.7 million rows in the primary data table.
To complete this exercise, you should have SQL Server Management Studio (SSMS) or another tool that can restore a database backup file and run T-SQL queries.
Tutorials and quickstarts using this data set include the following:
- Learn in-database analytics using R in SQL Server
- Learn in-database analytics using Python in SQL Server
The sample database is a SQL Server 2016 BAK file hosted by Microsoft. You can restore it on SQL Server 2016 and later. File download begins immediately when you open the link.
File size is approximately 90 MB.
To restore the sample database on SQL Server Big Data Clusters, download NYCTaxi_Sample.bak and follow the directions in Restore a database into the SQL Server big data cluster master instance.
To restore the sample database on Machine Learning Services in Azure SQL Managed Instance, follow the instructions in Quickstart: Restore a database to Azure SQL Managed Instance using the NYC Taxi demo database .bak file: https://aka.ms/sqlmldocument/NYCTaxi_Sample.bak.
Download the NYCTaxi_Sample.bak database backup file.
Copy the file to
C:\Program files\Microsoft SQL Server\MSSQL-instance-name\MSSQL\Backupor similar path, for your instance's default
In SSMS, right-click Databases and select Restore Files and File Groups.
NYCTaxi_Sampleas the database name.
Select From device and then open the file selection page to select the
NYCTaxi_Sample.bakbackup file. Select Add to select
Select the Restore checkbox and select OK to restore the database.
Review database objects
Confirm the database objects exist on the SQL Server instance using SQL Server Management Studio. You should see the database, tables, functions, and stored procedures.
Objects in NYCTaxi_Sample database
The following table summarizes the objects created in the NYC Taxi demo database.
|Object name||Object type||Description|
|NYCTaxi_Sample||database||Creates a database and two tables:
|fnCalculateDistance||scalar-valued function||Calculates the direct distance between pickup and dropoff locations. This function is used in Create data features, Train and save a model and Operationalize the R model.|
|fnEngineerFeatures||table-valued function||Creates new data features for model training. This function is used in Create data features and Operationalize the R model.|
Stored procedures are created using R and Python script found in various tutorials. The following table summarizes the stored procedures that you can optionally add to the NYC Taxi demo database when you run script from various lessons.
|RxPlotHistogram||R||Calls the RevoScaleR
|RPlotRHist||R||Creates a graphic using the
|RxTrainLogitModel||R||Trains a logistic regression model by calling an R package. The model predicts the value of the
|RxPredictBatchOutput||R||Calls the trained model to create predictions using the model. The stored procedure accepts a query as its input parameter and returns a column of numeric values containing the scores for the input rows. This stored procedure is used in Predict potential outcomes.|
|RxPredictSingleRow||R||Calls the trained model to create predictions using the model. This stored procedure accepts a new observation as input, with individual feature values passed as in-line parameters, and returns a value that predicts the outcome for the new observation. This stored procedure is used in Predict potential outcomes.|
Query the data
As a validation step, run a query to confirm the data was uploaded.
In Object Explorer, under Databases, right-click the NYCTaxi_Sample database, and start a new query.
Run some simple queries:
SELECT TOP(10) * FROM dbo.nyctaxi_sample; SELECT COUNT(*) FROM dbo.nyctaxi_sample;
The database contains 1.7 million rows.
Within the database is a
dbo.nyctaxi_sampletable that contains the data set. The table has been optimized for set-based calculations with the addition of a columnstore index. Run this statement to generate a quick summary on the table.
SELECT DISTINCT [passenger_count] , ROUND (SUM ([fare_amount]),0) as TotalFares , ROUND (AVG ([fare_amount]),0) as AvgFares FROM [dbo].[nyctaxi_sample] GROUP BY [passenger_count] ORDER BY AvgFares DESC
Results should be similar to those showing in the following screenshot.
NYC Taxi sample data is now available for hands-on learning.