Compartir a través de


Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 3

In the previous posts of this series (Part 1 and Part 2) I walked through the creation and population of an Azure SQL Database that contains "real world" crime statistics (with granularity down to the individual police report). This is a very simple database design, but it does contain a fair amount of data (over 5M rows in the fact table if you used the entire dataset) and provides enough diversity to allow for some very interesting analytics queries.

In this post, I'll detail creating some very specific indexes that will support basic analytics, and will also walk through some analysis of the data using Microsoft Excel and Power Pivot, along with Excel Power View. If you don't already have these products installed, you can obtain them by signing up for an Office365 ProPlus subscription (which gives you a ton of additional features as well). In many cases you can obtain these products through your companies "Home Use" program if they have a Microsoft Enterprise Agreement.

Create Indexes to Support Queries

Because the data model that we are working with is a relatively simple star schema, we don't have to worry too much about complicated index analysis to create indexes that will support our queries.

As a review, this is what the data model looks like:

Each of the tables in the schema has a Primary Key constraint and the fact table has Foreign Key constraints that reference each of the Dimension Tables. Each of the Primary Keys are implemented by creating a Clustered Index which means that each table is automatically sorted by the Primary Key column. For the most part, this is good if we are doing a lot of targeted searches of our data and looking up crime statistics by the primary key (for example, in the ChicagoCrimes fact table, the primary key is defined on the ID column. We likely will never use the ID column to search for a crime, so we will want to create additional indexes to support how we actually use the data), but it is not very efficient when you execute queries that look up data (or summarize data) by a different field.

In order to understand what additional indexes you will need, it's important to understand what types of questions that you will be asking of your data. For the purposes of this blog post, we will limit these questions to the following:

  • How many murders have been committed? How many murders occur each year?
    • For this question, we will be searching the data by the Primary Type field in the fact table.
  • What type of location (street, sidewalk, etc.) has the majority of crimes?
    • For this question, we will be searching the data by the Location Description field.
  • What part of the city has the most crime?
    • For this question, we will be searching the data by the Community Area field.
  • Which police station is the busiest?
    • For this question, we will be searching the data by the District field.

In SQL Server (and Azure SQL DB) we can only have a single Clustered Index. This Index defines the storage for the table and the data will always be ordered according to the Clustered Index. In order to create additional indexes, we will need to make them Non-Clustered, which means that additional storage will be used for the Index keys.

For each of the above questions, we'll create an additional non-clustered index on the fields in the fact table that will be used. Since we'll also be interested in summarizing reports by the date that they occurred, we'll also create an index on the DateOnly field.

Creating the Indexes

In order to create the additional indexes, you'll use the SQL Server Management Studio and connect directly to the database instance in Azure. Start SQL Management Studio and connect to your database in Azure (if you need a reminder on how to do this, see Part 1 of this blog series). Ensure that you have selected the ChicagoCrime database and execute the following T-SQL script for each index you need to create (note that each script will take several minutes to run):

Primary Type Index

CREATE NONCLUSTERED INDEX [IDX_PrimaryType] ON [dbo].[ChicagoCrimes]

(

    [Primary Type] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

 

Location Description Index

CREATE NONCLUSTERED INDEX [IDX_LocationDescription] ON [dbo].[ChicagoCrimes]

(

    [Location Description] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

Community Area Index

CREATE NONCLUSTERED INDEX [IDX_CommunityArea] ON [dbo].[ChicagoCrimes]

(

    [Community Area] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

District Index

CREATE NONCLUSTERED INDEX [idx_District] ON [dbo].[ChicagoCrimes]

(

    [District] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

Date Index

CREATE NONCLUSTERED INDEX [idx_Date] ON [dbo].[ChicagoCrimes]

(

    [DateOnly] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);

 

Now that the indexes are created, we can run some simple queries to answer some questions about the data.

Asking Questions of the Data

If we understand the structure of the data and know some specific questions that we want to ask, we can use T-SQL to write a query that will return a dataset that answers the question. For example, to implement the questions above, we can use the following T-SQL queries from SQL Management Studio while connected to the database:

Murders by Year

SELECT

    DATEPART(yy,[DateOnly]) AS [Year]

    ,COUNT([Primary Type]) AS NumberMurders

FROM ChicagoCrimes

WHERE [Primary Type] = 'Homicide'

GROUP BY DATEPART(yy,[DateOnly])

ORDER BY [Year];

Top 10 Crime Locations

SELECT TOP 10

    [Primary Type] AS CrimeType

    ,[Location Description] AS [Location]

    ,COUNT(*) AS NumCrimes

FROM ChicagoCrimes

GROUP BY [Location Description],[Primary Type]

ORDER BY NumCrimes DESC;

Top 10 Worst Crime Areas

SELECT TOP 10

    [Community]

    ,[Primary Type] AS CrimeType

    ,COUNT(*) AS NumCrimes

FROM ChicagoCrimes cc

JOIN Community c

ON cc.[Community Area] = c.Community_id

GROUP BY cc.[Primary Type],c.[Community]

ORDER BY NumCrimes DESC;

 

(note: This query is a little misleading because some of the crime reports do not include a community, therefore they are assigned to "Chicago")

Busiest Police Stations

SELECT

    [Primary Type] AS CrimeType

    ,ps.[Address] AS PoliceDistrict

    ,COUNT(*) AS NumCrimes

FROM ChicagoCrimes cc

JOIN PoliceStations ps

ON cc.[District] = ps.[District]

GROUP BY ps.[Address],cc.[Primary Type]

ORDER BY NumCrimes DESC;

 

These queries will give you a good idea of the types of questions that you can ask of your data with T-SQL, however as you can see, for every question that you ask you need to execute a new query and wait for the results. For the most part, when analyzing data, it is a good idea to visualize the data in a tool that end-users are comfortable with and that can support ad-hoc analysis of the data. For the purposes of this blog series, we'll use Microsoft Excel for the analysis.

Connecting Microsoft Excel to the Data

Even though the database we've created is relatively large (if you pulled in the full dataset, you'll have more than 5.5 million records in the fact table) Microsoft Excel is a viable tool to visualize the data. For the purposes of this blog series, we'll use a technology embedded in Excel called Power Pivot, which is based on an in-memory column-oriented database technology called xVelocity that makes use of very efficient compression techniques to store massive amounts of data in relatively little memory. As mentioned above, if you do not have a copy of Excel from Office 2013 installed on your machine, you can sign up for an Office 365 subscription to obtain it.

Using Power Pivot to Download Data

The first step to analyzing data with Excel is to connect Power Pivot to the data. Power Pivot should be an option on the Excel Ribbon when you start Excel as shown below (I have a few extra plug-ins enabled, so don't expect to see all of the ribbon items):

If you do not see the Power Pivot Ribbon item, make sure that it is enabled by selecting File, then Options, then select COM Add-Ins and select Go. Make sure that Power Pivot (and while you are here, ensure that Power View is enabled as well) is enabled:

Once you have Power Pivot enabled, select the Power Pivot ribbon item and then select the Manage option:

Select the From Database option, select SQL Server, and then enter your database connection information:

Select Next, and then select the Select from a list of …. option:

Select the ChicagoCrimes table, and then select the Select Related Tables option (note my database has additional tables that have not been discussed yet):

Click Finish to begin the import of data:

The import operation will take several minutes to complete. When it's done press Close and you will see the data represented in the Power Pivot management interface:

At the bottom of the screen you'll see the various tables represented. Click through each of them to see the data stored in Power Pivot, then click on Diagram View to view a data diagram. (Which is automatically created with the relationships inherited from the data model)

Close the Power Pivot window (select the X in the upper right) and then save the Excel workbook. Note that even with more 5 million rows in the fact table the total size of the workbook is under 500MB.

From the blank Excel worksheet, choose the Insert command on the ribbon and then select Power View:

Note that you will see a blank report canvas and a list of fields available. (if you do not have Silverlight installed on your machine, you will be prompted to install it. Silverlight is required for the embedded version of Power View at this time)

For this example, we will implement a simple dashboard that will answer the same questions that we used above. To start the process, expand the ChicagoCrimes field on the right side and select the Primary Type field. In the FIELDS box, change the drop-down option to Count (Not Blank)

Then scroll down and expand the DateDimension field. Select the Year item and then in the FIELDS area, drag the Year item above the # Count of Primary Type. Then select the Year item and change the drop down to Do Not Summarize. This results in a table that represents the total number of crimes per year and should look like this:

In the Power View Fields list, scroll back to the ChicagoCrimes item and drag the Primary Type field to the Filters area. Select Homicide from the list of Primary Type fields. This will result in a view that shows the total number of Homicides per year.

Once you have the table showing the data you want to visualize, click any of the rows in the table, and then from the Ribbon, select Other Chart and then Line. This will change the table to a line chart representing the number of Homicides per year. Drag the corners of the window to size the graph to fit your canvas.

On the report canvas, click outside of the graph that you just created, and then from the Power View Fields area, select the Location Description item in the ChicagoCrimes table. Also from the Power View Fields area select the Primary Type field and then from the Values area, choose the Primary Type field and change the drop down to Count (Not Blank). The result should look like:

Click any row in the new table and then from the Ribbon, select Bar Chart and then select Stacked Bar. This will change the table to a stacked bar chart. Above the chart, select the Count of Primary Type field in the sort by area, and then select desc to sort in descending order.

Click on a blank area of the canvas, and from the Power View Fields area, select Primary Type and then expand the Community table and select the Communityad field. In the Values area, change the Primary Type field to Count (Not Blank) .

Click on any row in the new table, and from the Ribbon, select Map. When the map renders, drag the corners of the window to resize the map accordingly.

You can also click the arrow in the Filters area to collapse the filter, and then type a title into the Title area.

The result is a very nice dashboard-like visualization of Chicago homicide data that is easy to interpret. You can save the Excel workbook at this time to ensure that the dashboard is saved. You can add additional dashboards by selecting the + item at the bottom of the page and then repeating the steps above to add different fields to the canvas.

Conclusion

In this post, we continued the construction of our demonstration database by adding indexes to improve query efficiency and then we used Microsoft Excel, Power Pivot and Power View to create compelling visualizations from the connected data.

In the next post in this series, I will explain how to update the database with new data and then refresh the Excel workbook to include the new data. In future posts we will also use Microsoft Power BI to visualize the data and will also connect the data to Microsoft Azure Machine Learning to perform predictive analytics on the Crime Data.