Python Tutorial: Prepare data to train a linear regression model with SQL machine learning
Applies to: SQL Server 2017 (14.x) and later Azure SQL Managed Instance
In part two of this four-part tutorial series, you'll prepare data from a database using Python. Later in this series, you'll use this data to train and deploy a linear regression model in Python with SQL Server Machine Learning Services or on SQL Server 2019 Big Data Clusters.
In part two of this four-part tutorial series, you'll prepare data from a database using Python. Later in this series, you'll use this data to train and deploy a linear regression model in Python with SQL Server Machine Learning Services.
In part two of this four-part tutorial series, you'll prepare data from a database using Python. Later in this series, you'll use this data to train and deploy a linear regression model in Python with Azure SQL Managed Instance Machine Learning Services.
In this article, you'll learn how to:
- Load the data from the database into a pandas data frame
- Prepare the data in Python by removing some columns
In part one, you learned how to restore the sample database.
In part three, you'll learn how to train a linear regression machine learning model in Python.
In part four, you'll learn how to store the model in a database, and then create stored procedures from the Python scripts you developed in parts two and three. The stored procedures will run on the server to make predictions based on new data.
Prerequisites
- Part two of this tutorial assumes you have completed part one and its prerequisites, including installing the necessary Python packages,
pandas
andpyodbc
.
Explore and prepare the data
To use the data in Python, you'll load the data from the database into a pandas data frame.
Create a new Python notebook in Azure Data Studio and run the script below.
The Python script below imports the dataset from the dbo.rental_data
table in your database to a pandas data frame df.
In the connection string, replace connection details as needed. To use Windows authentication with an ODBC connection string, specify Trusted_Connection=Yes;
instead of the UID
and PWD
parameters.
import pyodbc
import pandas
# Connection string to your SQL Server instance
conn_str = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server}; SERVER=<server>; DATABASE=TutorialDB;UID=<username>;PWD=<password>')
query_str = 'SELECT Year, Month, Day, Rentalcount, Weekday, Holiday, Snow FROM dbo.rental_data'
df = pandas.read_sql(sql=query_str, con=conn_str)
print("Data frame:", df)
You should see results similar to the following.
Data frame: Year Month Day Rentalcount WeekDay Holiday Snow
0 2014 1 20 445 2 1 0
1 2014 2 13 40 5 0 0
2 2013 3 10 456 1 0 0
3 2014 3 31 38 2 0 0
4 2014 4 24 23 5 0 0
.. ... ... ... ... ... ... ...
448 2013 2 19 57 3 0 1
449 2015 3 18 26 4 0 0
450 2015 3 24 29 3 0 1
451 2014 3 26 50 4 0 1
452 2015 12 6 377 1 0 1
[453 rows x 7 columns]
Filter the columns from the dataframe to remove ones we don't want to use in the training. Rentalcount
should not be included as it is the target of the predictions.
columns = df.columns.tolist()
columns = [c for c in columns if c not in ["Year", "Rentalcount"]]
print("Training set:", df[columns])
Note the data the training set will have access to:
Training set: Month Day Weekday Holiday Snow
1 2 13 5 0 0
3 3 31 2 0 0
7 3 8 7 0 0
15 3 4 2 0 1
22 1 18 1 0 0
.. ... ... ... ... ...
416 4 13 1 0 1
421 1 21 3 0 1
438 2 19 4 0 1
441 2 3 3 0 1
447 1 4 6 0 1
[91 rows x 5 columns]
Next steps
In part two of this tutorial series, you completed these steps:
- Load the data from the database into a pandas data frame
- Prepare the data in Python by removing some columns
To train a machine learning model that uses data from the TutorialDB
database, follow part three of this tutorial series: