Volume 34 Number 6
Exploring Data with R
By Frank La Vigne | June 2019
Since the very first Artificially Intelligent column, all the code samples I’ve provided have been in Python. That’s because Python currently reigns as the language of data science and AI. But it’s not alone—languages like Scala and R hold a place of prominence in this field. For developers wondering why they must learn yet another programming language, R has unique aspects that I’ve not encountered elsewhere in a career that’s spanned Java, C#, Visual Basic, Python and Perl. With R being a language that readers are likely to encounter in the data science field, I think it’s worth exploring here.
R itself is an implementation of the S programming language, which was created in the 1970s for statistical processing at Bell Labs. S was designed to provide an interactive experience for developers who at the time worked with Fortran for statistical processing. While we take interactive programming environments for granted today, it was revolutionary at the time.
R was conceived in 1992 by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and derives its name from the first initial of its creators, while also playing on the name of S. Version 1.0.0 was released in 2000, and R has since enjoyed wide adoption in research departments, thanks in part to its wide array of built-in statistical algorithms. It’s also easily extensible via functions and extension packages.
A robust developer community has emerged around R, with the most popular repository for R packages being the Comprehensive R Archive Network (CRAN). CRAN has various packages that cover anything from Bayesian Accrual Prediction to Spectral Processing for High Resolution Flow Infusion Mass Spectrometry. A complete list of R packages available in CRAN is online at bit.ly/2DGjuEJ. Suffice it to say that R and CRAN provide robust tools for any data science or scientific research project.
Getting Started with R
Perhaps the fastest way to run R code is through a Jupyter Notebook on the Azure Notebook service. For details on Jupyter Notebooks, refer to my February 2018 article on the topic at msdn.com/magazine/mt829269. However, this time make sure to choose R as the language when creating a new notebook. The R logo should appear on the top right of the browser window. In a blank cell, enter the following code and execute it:
# My first R code
print("hello world")
x <- 3.14
y = 1.21
x
y
The output should read the traditional “hello world” greeting, as well as the values 3.14 and 1.21. None of this should come as novel or unique to any software developer. Note that the assignment operator can be “<-” and not just the equals sign that’s more common in other languages. At the top level of a script, both forms perform the same assignment. Also take note that the # character introduces a comment and applies to the rest of the line.
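To make the relationship between the two operators concrete, here is a minimal sketch. The names a, b and rounded are illustrative and not part of the original sample. At the top level the two forms behave the same, but inside a function call the equals sign names an argument rather than creating a variable.

```r
# At the top level, both operators bind a value to a name.
a <- 3.14
b = 3.14
identical(a, b)   # TRUE

# Inside a call, "=" matches an argument by name. It does not create
# a variable called "digits" in the calling environment.
rounded <- round(2.567, digits = 1)
rounded           # 2.6
exists("digits", inherits = FALSE)
```

Many R style guides prefer “<-” for assignment so that “=” is reserved for naming arguments, but that is a convention rather than a language rule.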
Vectors are one-dimensional arrays that can hold numeric data, character data or logical data. They’re created with the c function. The c stands for “combine.” Enter the following into a new cell and execute it:
num_vec <- c(1,2,3.14) # numeric vector
char_vec <- c("blog","podcast","livestream") # character vector
bool_vec <- c(TRUE,TRUE,FALSE) # logical vector
# print out values
num_vec
char_vec
bool_vec
The values displayed should match the values set in the code. You may now be wondering if vectors can contain mixed types. Enter the following code into a new cell:
mix_vec <- c(1,"lorem ispum",FALSE)
mix_vec
While the code does run, sharp-eyed readers will notice single quotes around each element in the vector. This indicates that the values were converted to character values. R has the typeof function to check the type of any given variable. Enter the following code to inspect the vectors already created:
typeof(num_vec)
typeof(char_vec)
typeof(bool_vec)
typeof(mix_vec)
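The conversion seen in mix_vec follows R’s coercion order for atomic vectors: logical values are promoted to integer, integers to double, and everything to character as soon as a string is present. A short sketch, using throwaway literal values and an illustrative variable named mixed:

```r
typeof(c(TRUE, FALSE))             # "logical"
typeof(c(TRUE, 2L))                # "integer"
typeof(c(TRUE, 2L, 3.14))          # "double"
typeof(c(TRUE, 2L, 3.14, "text"))  # "character"

# The mixed vector from above: every element ends up as a string.
mixed <- c(1, "lorem ispum", FALSE)
mixed                              # "1" "lorem ispum" "FALSE"
```

This is also why typeof(num_vec) reports “double” rather than “numeric”: double is R’s default storage type for numbers.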
One other useful function to know is ls, which displays all the objects in the current working environment. Enter “ls()” into a new cell, execute it, and observe that the output contains the four vectors just defined, along with the x and y variables defined in the first cell.
Working with Data
The best way to experience the true power and elegance of the R language is by using it to explore and manipulate data. R makes it easy to load datasets and quickly get an understanding of their dimensions, structure and statistical properties. For the next few examples, I’ll use a dataset that’s near and dear to me: basic statistics on my blogging activity. I’ve run and maintained a technology blog since 2004 and have kept basic statistics on how frequently I posted each month. Additionally, I have added the number of days in each month and the average post per day value (PPD). PPD is the number of posts in a given month divided by the number of days in that month. I have placed the CSV file in the project library on the Azure Notebook Service at bit.ly/2V76d2G.
Enter the following code into a new cell to load the data into an R data frame, a tabular data structure with columns for variables and rows for observations, and display the first six and the last three records using the head and tail functions, respectively, like so:
postData <- read.csv(file="franksworldposts.csv", header=TRUE, sep=",")
head(postData)
tail(postData, 3)
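For readers without the CSV at hand, the same loading steps can be tried on inline text. This is a sketch under assumptions: the header names are guessed from the column names mentioned in this article, the four rows are invented values rather than the actual blog data, and sample_posts is an illustrative name. The text argument of read.csv stands in for the file.

```r
# Hypothetical stand-in for franksworldposts.csv (invented rows).
csv_text <- "Month,Posts,Days in Month,PPD
Month A,31,31,1.0
Month B,58,29,2.0
Month C,62,31,2.0
Month D,15,30,0.5"

sample_posts <- read.csv(text = csv_text, header = TRUE, sep = ",")

# read.csv turns the header "Days in Month" into the valid name Days.in.Month.
names(sample_posts)

head(sample_posts, 2)   # first two rows
tail(sample_posts, 1)   # last row

# PPD is the number of posts divided by the number of days in the month.
all.equal(sample_posts$PPD, sample_posts$Posts / sample_posts$Days.in.Month)
```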
Using the str function, I can view the basic structure and data types of the DataFrame. Enter the following code into a new cell:
str(postData)
The output should reveal that the DataFrame has 183 observations, or rows, and consists of four variables, or columns. The Posts and Days.in.Month variables are integers, while PPD is a numeric type. The Month variable is a factor with 183 levels. A factor is a data type that corresponds to categorical variables in statistics. Factors are functionally equivalent to the categorical type in Python’s pandas library, and their values can be either strings or integers. They’re ideal for variables with a limited number of unique values, or, in R terms, levels. In this DataFrame, the Month field represents a month between February 2004 and April 2019. As dates do not repeat, there are no duplicate categorical values.
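Whether a text column is read as a factor depends on the R version. Since R 4.0.0, read.csv and data.frame default to stringsAsFactors = FALSE, so on a recent installation str may report Month as chr rather than as a factor. The sketch below uses three invented month labels (the real file’s label format is not shown here) to illustrate both behaviors:

```r
month_labels <- c("Feb-2004", "Mar-2004", "Apr-2004")  # invented labels

as_text   <- data.frame(Month = month_labels, stringsAsFactors = FALSE)
as_factor <- data.frame(Month = month_labels, stringsAsFactors = TRUE)

class(as_text$Month)      # "character"
class(as_factor$Month)    # "factor"
nlevels(as_factor$Month)  # 3, one level per distinct value
levels(as_factor$Month)   # levels are sorted alphabetically by default
```

Passing stringsAsFactors = TRUE to read.csv should reproduce the factor output described above on newer versions of R.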
Now that my data is loaded, I can sort and query it to explore it further. Perhaps I can glean some insights. For instance, if I wanted to view the top-10 months where I was most productive on my blog, I could perform a descending sort on the Posts column. To do so, enter the following code into a new cell and execute it:
sortedPostData <- postData[order(-postData$Posts),]
head(sortedPostData, 10)
The top-10 most active months have all been within the last three years. To explore the data set further, I can perform a filtering operation to determine which months have had 100 or more posts. In R, the subset function does just that. Enter the following code to apply this filter and assign the output to a new DataFrame called over100, like so:
over100 <- subset(postData, subset = Posts >= 100)
over100
The results look similar to the previous output of the top 10. To count the rows in the DataFrame, use the nrow function, like this:
nrow(over100)
The output indicates that there are 11 rows where there were 100 or more blog posts in a given month. With 100 posts, May 2005 just missed the top-10 most active months, falling into 11th place. Clearing the 100-posts-per-month threshold wasn’t a milestone I would reach again for 11 years. Is there a pattern of starting the blog with intensity only to have it fade out and then pick it up again? Let’s examine the data further.
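The sort, filter and count steps above can be tried end to end on a small stand-in. This is only a sketch: the five rows are invented, and names such as sample_posts, by_posts and busy_a are illustrative. It also shows two equivalent spellings, order with decreasing = TRUE and plain logical indexing, in case the negation trick or subset feel unfamiliar.

```r
# Invented stand-in for postData; only the column names follow the article.
sample_posts <- data.frame(
  Month = c("Month A", "Month B", "Month C", "Month D", "Month E"),
  Posts = c(40L, 120L, 100L, 15L, 225L)
)

# Descending sort: order() returns row positions, and negating a numeric
# column reverses the ordering.
by_posts <- sample_posts[order(-sample_posts$Posts), ]
by_posts$Posts                     # 225 120 100 40 15

# The same ordering without negation.
same_order <- sample_posts[order(sample_posts$Posts, decreasing = TRUE), ]

# Two equivalent filters: subset() and logical indexing.
busy_a <- subset(sample_posts, subset = Posts >= 100)
busy_b <- sample_posts[sample_posts$Posts >= 100, ]

nrow(busy_a)                       # 3
```

The decreasing = TRUE form also works for character columns, where negation is not available.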
Now would be a good time to explore how to view individual rows and columns in a DataFrame. For example, to view the contents of the entire first row, enter the following code:
postData[1,]
Note that the index for the DataFrame starts at 1, not at 0 as it does in most other programming languages. To view just the Posts field for the first row, enter the following code:
postData[1,2]
To view all the values in the Posts field, use the following line of code:
postData[,2]
Alternatively, you may also use the following syntax to display a column by its name. Enter the following line of code and confirm that its output matches the output from the line prior:
postData$Posts
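The row and column forms described above can be compared side by side on a tiny data frame. This is a sketch with invented values. It assumes, as the column indexes used in this article suggest, that Posts is the second column.

```r
# Invented stand-in; Posts is deliberately the second column.
sample_posts <- data.frame(
  Month = c("Month A", "Month B", "Month C"),
  Posts = c(40L, 120L, 15L),
  Days.in.Month = c(31L, 28L, 31L)
)

sample_posts[1, ]          # first row, all columns
sample_posts[1, 2]         # first row, second column: 40
sample_posts[1, "Posts"]   # same cell, addressed by column name
sample_posts[, 2]          # whole second column as a vector
sample_posts$Posts         # same column by name
sample_posts[["Posts"]]    # another by-name form
```

Addressing columns by name tends to survive changes in column order better than addressing them by position.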
As R has its roots in statistical processing, there are many built-in functions to view the basic shape and properties of the data. Use the following code to get a better understanding of the data in the Posts column:
mean(postData$Posts)
max(postData$Posts)
min(postData$Posts)
summary(postData$Posts)
Now, compare this to the PPD column, like so:
mean(postData$PPD)
max(postData$PPD)
min(postData$PPD)
summary(postData$PPD)
From the data we see that the number of posts varies from one per month all the way to 225 over the course of 15 years. What if I wanted to explore only the first year? Enter the following code to display only the records for the first year of blogging, along with statistical summaries for the Posts and PPD fields:
postData[1:12,]
summary(postData[1:12,2]) # Posts
summary(postData[1:12,4]) # PPD
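To read the summary output, it helps to see it on a vector small enough to check by hand. The five values below are invented, with one deliberate outlier, and the names posts and stats are illustrative.

```r
posts <- c(1, 2, 3, 4, 100)   # invented values with one large outlier

stats <- summary(posts)
stats            # Min. 1, 1st Qu. 2, Median 3, Mean 22, 3rd Qu. 4, Max. 100

mean(posts)      # 22, pulled upward by the outlier
median(posts)    # 3, unaffected by it

# A range of positions works on plain vectors too: the first three values only.
summary(posts[1:3])
```

A large gap between the mean and the median, as here, is a quick hint that a few extreme values are skewing the average.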
While the numbers here tell a story, very often a graph will reveal more about trends and patterns. Fortunately, R has rich graph plotting capabilities built in. Let’s explore those.
Creating plots in R is very simple and can be done with a single line of code. Let’s start by using the post counts and PPD values for the first year. Here’s the code to do that:
plot(postData[1:12,2], xlab="Month Index", ylab="Posts", main="Posts in the 1st Year")
plot(postData[1:12,4], xlab="Month Index", ylab="PPD", main="PPD in the 1st Year")
The output should resemble Figure 1.
Figure 1 Plotting the Posts and PPD Columns
The graphs show that post activity grew steadily during the first year of blogging, with a steep growth curve between the third and sixth months. After a late summer dip, 2004 finished up strong. Additionally, the graphs reveal that there’s a high correlation between the number of posts in a month and the number of posts per day. While this may be intuitive, it’s interesting to see it displayed in graph form.
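One way to make this comparison easier is to draw both plots next to each other. The following is a sketch rather than the code behind Figure 1: the post counts are invented, the day counts are simply the month lengths from February 2004 through January 2005, and par(mfrow = ...) and type = "b" are additions not used in the listing above.

```r
# Invented first-year counts, not the actual blog data.
posts <- c(5, 12, 30, 55, 70, 90, 60, 45, 80, 95, 100, 110)
days  <- c(29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31, 31)
ppd   <- posts / days

old_par <- par(mfrow = c(1, 2))   # one row, two columns of plots
plot(posts, type = "b", xlab = "Month Index", ylab = "Posts",
     main = "Posts in the 1st Year")
plot(ppd, type = "b", xlab = "Month Index", ylab = "PPD",
     main = "PPD in the 1st Year")
par(old_par)                      # restore the previous layout
```

The type = "b" argument draws both points and connecting lines, which can make a month-to-month trend easier to follow than points alone.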
Now, I would like to see a graph of blog posts over the entire 15-year span and see if a pattern emerges over a longer period of time. Enter the following code to graph the entire timespan:
plot(postData[, 2], xlab="Month Index", ylab="Posts", main="All Posts")
The results, shown in Figure 2, do show a clear trend, if not a well-defined pattern. Blogging activity started out fairly strong but declined steadily, picking up again around 30 months ago. The trend of late is decidedly upward. There’s also the one significant outlier.
Figure 2 Posts Over 15 Years
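An outlier like the one in the graph can also be located programmatically. The sketch below uses invented monthly counts and illustrative names. which.max returns the position of the largest value, and points marks it on the plot.

```r
posts <- c(40, 35, 20, 225, 10, 15, 60, 90)   # invented monthly counts

peak_index <- which.max(posts)   # position of the largest value
peak_index                       # 4
posts[peak_index]                # 225

plot(posts, type = "l", xlab = "Month Index", ylab = "Posts",
     main = "All Posts")
points(peak_index, posts[peak_index], col = "red", pch = 19)  # mark the peak
```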
Earlier, I noted a correlation between the Posts and PPD columns. R has a built-in function to display a correlation matrix, which is a table displaying correlation coefficients between variables. Each cell in the table shows the correlation between two variables.
A correlation matrix quickly summarizes data and reveals relationships between variables. Values close to 1 indicate a strong positive correlation, values close to 0 indicate little or no linear correlation, and negative values indicate a negative correlation. To view the correlation matrix for the postData DataFrame, it’s first necessary to isolate the numeric fields into their own DataFrame and then call the cor function. Enter the following code into a new cell and execute it:
postsCor <- postData[, c(2, 3, 4)]
cor(postsCor)
The output reveals a near-perfect correlation between Posts and PPD, while Days.in.Month has a slightly negative correlation to PPD.
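The near-perfect correlation follows from how PPD is defined: it’s Posts divided by a number that only ranges from 28 to 31. A minimal sketch on invented values shows the same effect. Here sample_posts and cor_matrix are illustrative names, and the columns are selected by name rather than by position.

```r
# Invented values; PPD is derived exactly as the article defines it.
sample_posts <- data.frame(
  Posts = c(10, 40, 80, 120, 200, 60),
  Days.in.Month = c(31, 28, 31, 30, 31, 30)
)
sample_posts$PPD <- sample_posts$Posts / sample_posts$Days.in.Month

# Select the numeric columns by name instead of by position.
cor_matrix <- cor(sample_posts[, c("Posts", "Days.in.Month", "PPD")])
round(cor_matrix, 3)

# Close to 1, because PPD is Posts over a nearly constant divisor.
cor_matrix["Posts", "PPD"]
```

The diagonal of the matrix is always 1, since every variable correlates perfectly with itself, and the matrix is symmetric, so only one triangle needs to be read.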
While R’s syntax and approach may differ from traditional programming languages, I find it an elegant solution for data wrangling and mathematical processing. For software engineers serious about building a career in data science, R is an important skill to develop.
In this article, I explored some of the fundamentals of the R programming language. I showed how to use built-in functions to load and explore data within DataFrames, to gain insights through statistics, and to plot graphs. In fact, everything in this article was written in what would be referred to as “base” R, as it doesn’t rely on any third-party packages. However, some R users prefer the “tidyverse” suite of packages, which uses a different style. I’ll explore that in an upcoming column.
Frank La Vigne works at Microsoft as an AI Technology Solutions Professional where he helps companies achieve more by getting the most out of their data with analytics and AI. He also co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).
Thanks to the following technical experts for reviewing this article: Andy Leonard, David Smith
David Smith is a Cloud Advocate for Microsoft, specializing in the topics of artificial intelligence and machine learning. Since 2009 he has been the editor of the Revolutions blog (blog.revolutionanalytics.com), where he writes regularly about applications of data science with a focus on the R programming language. He is also a founding member of the R Consortium. Follow David on Twitter as @revodavid.