Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 4
The previous posts in this series (part 1, part 2, and part 3) focused on using publicly available crime report data to build a "real world" database in Azure, then showed how you can use some of the Power BI tools in Microsoft Excel to create interesting visualizations with that data.
In this post, I'll walk through how to manually update the database with the latest data from the City of Chicago data portal so that your visualizations stay current. In future posts we'll discuss how to automate this process, but for now we'll use a simple manual process.
Obtaining the Latest Data
The first step in updating the database is to determine the date and time of the most recent police report you have loaded into your database. To do this, use SQL Server Management Studio to connect to your database (if you need a refresher on how to do this, see the earlier posts in this series) and execute the following query:
SELECT MAX([Date]) FROM ChicagoCrimes;
This will return a single row with a DATETIME value such as: 2015-07-01 23:58:00.000
The City of Chicago Data Portal makes crime data available in 24-hour cycles, and complete cycles are posted to the portal on a 7-day lag. (In other words, generally speaking, the most recent data in the portal is the last complete day from one week ago.) Based on the date above (7/1/2015 at 2 minutes to midnight – Iron Maiden reference not intended) we can assume that all police reports through July 1, 2015 have been loaded into our database, so we'll need to start with police reports from July 2.
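If you'd rather have the database compute the starting date for you, a minimal T-SQL sketch (assuming, as above, that the fact table's [Date] column is a DATETIME) looks like this:

-- Return the day after the most recent report in the fact table;
-- this is the first day we still need to download from the portal.
SELECT DATEADD(DAY, 1, CAST(MAX([Date]) AS DATE)) AS StartDate
FROM ChicagoCrimes;

For the example above, this returns 2015-07-02.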
To download the latest police report data, browse to the City of Chicago's data portal which is available at https://data.cityofchicago.org/ and select the Crimes – 2001 to present link:
This will open the dataset in a browser window with several options available to you:
In the upper-right-hand section of the window, you will see a Filter option. Select that option to open the filter pane:
Select the Add a New Filter Condition button to add a filter. Choose the Date field, select the Is After option, and then set the target date (in my example, July 2) as shown:
After a few moments, the window will refresh and you will have a much smaller dataset than you started with. Once the screen refreshes, select the Export button in the upper-right-hand side of the screen:
This will open the Download dialog, where you will want to choose the CSV for Excel option and then save the file to your local hard drive.
Once the file is saved, you will be ready to upload it to your Azure database environment.
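As an aside, the portal also exposes this dataset through the Socrata SODA API, so the same filtered extract can be downloaded programmatically. A hypothetical example is shown below – the ijzp-q8t2 dataset identifier and the exact $where syntax are assumptions you should verify against the portal's API documentation:

https://data.cityofchicago.org/resource/ijzp-q8t2.csv?$where=date>'2015-07-02T00:00:00'

Keep in mind that the API pages its results, so for a large date range the manual export described above is the simpler option.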
Updating the Database
The first post in this series discussed downloading the initial data from the City of Chicago and uploading it to a temporary table in Azure SQL Database. The update process begins exactly the same way: we'll upload the new data to a temporary table (for the purposes of this post, we'll reuse the table we created during that process) and then use T-SQL to update the fact table in our database.
Preparing the Temporary Table
Before we upload the new data, we will want to truncate the temporary table. This step is not strictly necessary, since the T-SQL query we'll run will not insert duplicates into the fact table; however, for efficiency's sake we'll want to start with a blank table.
Connect to your Azure SQL Database instance and execute the following T-SQL command:
TRUNCATE TABLE [dbo].[Crimes_New];
This will ensure that the temporary table that we used during the initial creation of the database is ready to use again. If for some reason you deleted the initial temporary table, you can create it with the following T-SQL script:
CREATE TABLE [dbo].[Crimes_New](
    [ID] [varchar](50) NOT NULL PRIMARY KEY,
    [Case Number] [varchar](50) NULL,
    [Date] [varchar](50) NULL,
    [Block] [varchar](50) NULL,
    [IUCR] [varchar](50) NULL,
    [Primary Type] [varchar](50) NULL,
    [Description] [varchar](max) NULL,
    [Location Description] [varchar](50) NULL,
    [Arrest] [varchar](50) NULL,
    [Domestic] [varchar](50) NULL,
    [Beat] [varchar](50) NULL,
    [District] [varchar](50) NULL,
    [Ward] [varchar](50) NULL,
    [Community Area] [varchar](50) NULL,
    [FBI Code] [varchar](50) NULL,
    [X Coordinate] [varchar](50) NULL,
    [Y Coordinate] [varchar](50) NULL,
    [Year] [varchar](50) NULL,
    [Updated On] [varchar](50) NULL,
    [Latitude] [varchar](50) NULL,
    [Longitude] [varchar](50) NULL,
    [Location] [varchar](50) NULL
);

(Note that the [ID] column must be declared NOT NULL to carry the PRIMARY KEY constraint; SQL Server rejects a primary key on a nullable column.)
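Either way, a quick sanity check (a minimal sketch, using the same staging table name as above) will confirm that the staging table is empty before you upload:

SELECT COUNT(*) AS StagingRows FROM [dbo].[Crimes_New];

This should return 0.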
Once the table is empty, we're ready to use the Import and Export Wizard to upload the new data.
Uploading New Data
Start the SQL Server Import and Export Wizard (from the Start menu, search for SQL Server Import and Export Data), and then choose Flat File Source from the Data source drop-down menu. Browse for the file that you downloaded earlier, and add a double quote (") to the Text qualifier textbox:
Select the Columns option on the left-hand side and verify that the columns are aligned properly:
Then select the Advanced option on the left-hand side and select the Description column. Change the Data Type to text stream [DT_TEXT] to ensure that the full width of the Description column will be uploaded.
Click Next and then select the SQL Server Native Client x.x option from the Destination drop-down (if you have the SQL Server 2014 tools installed, your Native Client version will be 11.0). Enter your Azure SQL Database server name in the Server name: textbox and choose the Use SQL Server Authentication radio button. Enter your User name and Password to connect to the Azure SQL Database instance, and then select the ChicagoCrime database in the Database drop-down menu.
Choose Next, and then select the [dbo].[Crimes_New] table in the Destination:<servername> field:
Click Next, and ensure the Run immediately checkbox is selected and then click Finish to begin the import.
Once the import is complete (depending on how much time has elapsed between the initial load and the update, this process will either be very quick or take a few minutes), you will see the following:
Click Close to close the Import and Export Wizard. Once the temporary table has been loaded, you can connect to the database and transfer the data to the fact table in the database.
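Before moving on, you can verify the load by comparing the staging table's row count with the number the wizard reported, and by checking the date range it covers. This sketch assumes the portal's US-style date strings convert cleanly under your server's default language settings:

SELECT COUNT(*) AS RowsLoaded,
       MIN(CAST([Date] AS DATETIME)) AS EarliestReport,
       MAX(CAST([Date] AS DATETIME)) AS LatestReport
FROM [dbo].[Crimes_New];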
Updating the Fact Table with the New Data
Adding the new data to the ChicagoCrimes fact table is a relatively straightforward process. You could easily modify the process above to load the data directly into the fact table from the text file; however, by separating the upload from the update, you ensure that any errors that occur can be corrected easily.
To update the fact table, connect to your database instance using SQL Server Management Studio and execute the following script:
INSERT ChicagoCrimes
SELECT [ID]
,[Case Number]
,[Date]
,[Block]
,[IUCR]
,[Primary Type]
,[Description]
,[Location Description]
,[Arrest]
,[Domestic]
,[Beat]
,[District]
,[Ward]
,[Community Area]
,[FBI Code]
,[X Coordinate]
,[Y Coordinate]
,[Year]
,[Updated On]
,[Latitude]
,[Longitude]
,[Location]
,CAST([Date] AS DATE)
FROM [dbo].[Crimes_New]
WHERE [Date] > (SELECT MAX([Date]) FROM ChicagoCrimes);
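The WHERE clause is what keeps duplicates out of the fact table: only reports newer than anything already loaded are inserted. If you're concerned about re-running the script after a partial load, a more defensive variant (a sketch, assuming [ID] uniquely identifies a report in both tables) keys on the report ID instead:

INSERT ChicagoCrimes
SELECT n.[ID], n.[Case Number], n.[Date], n.[Block], n.[IUCR],
       n.[Primary Type], n.[Description], n.[Location Description],
       n.[Arrest], n.[Domestic], n.[Beat], n.[District], n.[Ward],
       n.[Community Area], n.[FBI Code], n.[X Coordinate],
       n.[Y Coordinate], n.[Year], n.[Updated On], n.[Latitude],
       n.[Longitude], n.[Location], CAST(n.[Date] AS DATE)
FROM [dbo].[Crimes_New] AS n
WHERE NOT EXISTS (SELECT 1 FROM ChicagoCrimes AS c WHERE c.[ID] = n.[ID]);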
When the script finishes, you should see the same number of rows affected as you saw transferred in the Import and Export Wizard:
Once the script is complete, the fact table in your database will be updated with the latest information available. You can verify that the data was updated properly by re-running the following T-SQL query and comparing the date returned with the initial date from above:
SELECT MAX([Date]) FROM ChicagoCrimes;
You should see a newer date returned.
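For a closer look at what was added, you can count the new reports by day (a quick sketch; substitute the day after the date you recorded at the start of this process for '2015-07-02'):

SELECT CAST([Date] AS DATE) AS ReportDay, COUNT(*) AS Reports
FROM ChicagoCrimes
WHERE [Date] >= '2015-07-02'
GROUP BY CAST([Date] AS DATE)
ORDER BY ReportDay;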
Updating the Excel Workbook
In Part 3 of this series we used an Excel workbook, a Power Pivot model, and Power View to create a simple visualization of crime data. Once the fact table has been updated, you'll want to return to the workbook and refresh the data to ensure the new rows are loaded into the visualizations.
Start Excel and open the workbook you created in Part 3. When the workbook opens, click the Enable Content button:
Select Power Pivot from the ribbon and choose the Manage command to open the Power Pivot window:
From the ribbon, select the Refresh drop-down and choose Refresh All. When prompted, enter the password for the database. When the refresh is complete, you should see the following dialog:
Note that the refresh will repopulate the entire Power Pivot model, not just the added rows.
Conclusion
In this post, we downloaded the latest information from the City of Chicago data portal, manually updated the database with the new information, and refreshed the Excel workbook visualizations we created earlier. This process can be re-run any time you want to add updated data to your database.
Future posts in this series will detail how to use Azure Data Factory to automate the process of updating the data. I will also be creating a new series of posts that uses the data from this series to build an Azure Machine Learning predictive analytics model.