Enrollment Nerdery

A place to collect my thoughts on data analysis within Enrollment Management. Dare I call it Enrollment Science?


Use R to Analyze Google Analytics

I have long wanted to post about the things I am doing with R and Google Analytics for my division, so I was very excited to see that tonight’s #emchat is on web analytics. This post will be more on the technical end, but I wanted to highlight what’s possible beyond the basic point-and-click interface in Google Analytics.

Why do this post?

Google, as well as other vendors, provides APIs for our web data. Basically, this just means that we can write queries against the API and get the data back in a nice, usable form, just like we can with any other database on campus. By passing different combinations of parameters (dimensions, metrics, filters), we can ask Google to return pretty detailed information about our web traffic. Pretty sweet, right?

In full disclosure, this won’t be the most detailed post, but it should show you just enough of what’s possible when we pull data straight from the API.

One last thing before I dive in. This post assumes that your school uses Google Analytics, since I will be using an R package built specifically for GA. If you don’t use GA, no worries; the R community is amazing. Check out this site if you use Adobe SiteCatalyst.

Setup

First, nearly all of my posts will talk about R and Python. If you are on a Mac, Python is already installed. Just fire up a terminal and type python. Voila, you are now ready to code!

R, however, is not pre-installed. There are millions of tutorials on how to get up and running. Maybe at some point I will provide a more comprehensive overview of how to turn your computer into a lean, mean, data-crunching machine (for free!), but for the time being, here is my R setup:

  1. Install R here
  2. Download RStudio - an amazing IDE that really eases the process of learning R

OK, enough blabbing.

In this brief post, I want to highlight the R package rga, which provides a super simple way to connect to your Google Analytics account and to pull data down from the API.

Before you can begin, you will need to install a few packages if you don’t already have them in R. The following code will install the packages that you need for this post.

install.packages("devtools")
library(devtools)
install_github("rga", "skardhamar")
install.packages("ggplot2")
install.packages("forecast")

Now let’s load the packages that we will use to crunch our data.

library(rga)
library(ggplot2)
library(forecast)

The next code segment is necessary if you are on Windows. In full disclosure, this is not an ideal workaround, but when using secure connections (i.e., the Analytics API) you need a certificate, and Windows doesn’t handle this well out of the box.

options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", 
    "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))

Let’s connect to Google.

rga.open(instance = "ga")

This will fire up a browser tab and show a screen similar to the one below.

[Screenshot: Google authorization screen]

Click Accept. The next screen will show you a very long string of characters. This is our token that lets us connect to GA. Copy this string, as we need to paste it back in at the R prompt.

[Screenshot: authorization token screen]

Simply paste the string at the prompt.

[Screenshot: pasting the token at the R prompt]

That’s it! I know that might seem like way too many steps, but authenticating with APIs around the web can be a real PITA.

Finally, the analysis

Take a look at the profiles that you have access to. Pretty cool huh?

ga$getProfiles()

Keep note of the profile you want to query. The value you need is found in the id column.
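
If it helps, the id is just a column in the data.frame that getProfiles() returns, so you can save the table and grab it directly.

# save the profile table and pull out the id column mentioned above
profiles = ga$getProfiles()
profiles$id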

The next part will appear pretty technical, but Google has created a Query Explorer to help navigate the API. I suggest you go play around with the tool. Above all else, it will help you find the values for the dimensions and metrics that you want to include in your queries.

Below I am going to set some values that will be passed into a function that will query GA.

ids = "ga:XXXXXXXXXXX"
start.date = "2013-01-01"
end.date = "2013-10-31"
DIM = "ga:date"
MET = "ga:pageviews,ga:uniquePageviews,ga:entrances,ga:exits,ga:bounces,ga:timeOnPage"

The value for ids is my GA profile id. start.date and end.date define the timeframe for our query. DIM defines the dimensions, which are simply the rows of our query. The columns, or metrics in Google-speak, are defined in the variable MET.

Now let’s get the data!

stats = ga$getData(ids, start.date, end.date, walk = TRUE, batch = TRUE, metrics = MET, 
    dimensions = DIM, sort = "", filters = "", segment = "")

Above, all I am doing is saving the results of our API call to a variable called stats. I will just touch on this briefly, but the variable stats is a data.frame in R. At the end of the day, the structure of a data.frame is nearly identical to a dataset that you might have in Excel.
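
If you want to poke around in what came back, a few base R functions (nothing specific to rga) make that easy.

dim(stats)   # rows (days) by columns (metrics)
str(stats)   # column names and types
head(stats)  # the first few rows, much like the top of an Excel sheet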

I probably should comment on the data that was just retrieved:

  1. I am grabbing daily data for our website from January 1, 2013 to October 31, 2013.
  2. I am telling Google to give us daily stats for pageviews, unique pageviews, entrances, exits, bounces, and total time on page, which is returned in seconds.

Below is a (time series) plot of my daily pageviews. It’s pretty easy to see that there are patterns in the data.
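
The plotting code isn’t shown here, but a minimal ggplot2 sketch along these lines would produce a similar chart. The conversion assumes the date column comes back in Google’s YYYYMMDD format; the guard skips it if rga already returns a proper Date.

# make sure the date column is a Date, then plot daily pageviews over time
if (!inherits(stats$date, "Date")) {
    stats$date = as.Date(as.character(stats$date), format = "%Y%m%d")
}
ggplot(stats, aes(x = date, y = pageviews)) + geom_line() + 
    labs(x = "Date", y = "Pageviews")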

[Time series plot of daily pageviews]

I am sure that the group will talk about this tonight, but this type of chart is nothing special and can easily be found in GA.

However, the benefit of R is that we can now create a forecast from this data. We can’t do this in Google Analytics.

fit = auto.arima(stats$pageviews)
plot(forecast(fit), axes = F)

[Forecast plot of daily pageviews with confidence intervals]

The chart above leverages a built-in function of the forecast package called auto.arima, which uses an algorithm that attempts to fit the best model to our daily pageviews. The forecast is in blue, and the chart shows the confidence intervals around these estimates.

Who cares?

Why should we care about forecasting? Let’s say your team is about to launch a redesign of some, maybe all, of the pages on your site. You might want to forecast the pageviews for your site over the next month. When the data starts pouring in, you can compare the actual pageview stats to the predicted values in order to see how well your new site is performing. If you are constantly beating estimates, you probably can conclude that the redesign worked!
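
As a rough sketch of what that comparison could look like, here is one way to line up the forecast against a hypothetical vector of post-launch actuals (the actuals would come from a follow-up GA query for the new period).

fc = forecast(fit, h = 30)        # forecast the next 30 days of pageviews
predicted = as.numeric(fc$mean)   # the point forecasts
# actuals = ...                   # daily pageviews pulled once the data is in
# mean(actuals - predicted)       # positive means you are beating the forecast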

By leveraging R, we were able to pull down data from GA and not only replicate the reports found online, but add additional value by forecasting what we think performance should be.

Next Steps

Stepping back, let’s dive into some other reasons why you might want to consider using the API.

  1. Depending on how much traffic your site gets, Google might return sampled data. Normally this isn’t the end of the world, as the story of your data won’t change. However, we usually can get around this by using the batch and walk parameters above. This splits our request into a number of daily queries, and if necessary, collects all of the data in batches. Google has a limit of 10,000 rows per page. The batch command intelligently handles this for us if our query returns more than 10k rows.

  2. You can answer some really deep questions that are impossible, or at least not easy, to answer using the online reporting. For example, you can pull down the stats for each page on your website by geography (there is a sketch of this after the list). If you wanted to, you could rank the pages by key geographic markets and evaluate how the top pages perform across these markets. Is the rank order the same? Are some markets more likely to view your Financial Aid page? Are other markets more interested in campus life?

  3. By leveraging the API, you can store the data you get in a database (a sketch of this also follows the list). I love Tableau. It’s a huge part of my workflow. To build my monthly web reports, I query the data using R and save the results into MySQL. I then connect Tableau to the MySQL database and just refresh my report pages. That’s it! A pretty detailed report takes me all of 15 minutes to complete.

  4. Speaking of API requests, you should refer to Google’s documentation. More than likely, you won’t hit the daily API limit, but it’s something to keep in mind.
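
To make the geography idea in item 2 concrete, here is a rough sketch of the kind of query you might run. The dimension and metric names come straight from the Query Explorer, and the profile id and date range reuse the values we set earlier.

geo = ga$getData(ids, start.date, end.date, metrics = "ga:pageviews", 
    dimensions = "ga:country,ga:pagePath", sort = "-ga:pageviews")
# peek at the top country/page combinations by pageviews
head(geo)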

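And here is a rough sketch of the MySQL step from item 3, using the RMySQL package. The connection details and table name are just placeholders, not my actual setup.

library(RMySQL)

# connect to a local MySQL database (swap in your own credentials)
con = dbConnect(MySQL(), host = "localhost", dbname = "webstats", 
    user = "youruser", password = "yourpassword")
# append the daily stats we pulled from GA to a table
dbWriteTable(con, "daily_pageviews", stats, append = TRUE, row.names = FALSE)
dbDisconnect(con)
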
My Crazy Idea

I have long hoped that the #emchat community would unite and start to share comparable web data on a monthly basis. The lack of benchmark data makes it extremely difficult to assess website performance. In a perfect world, we would even go as far as tracking things like info request conversion rates. Think about it: what if you could use benchmark data to show that your institution’s lead form is underperforming relative to your peers? Collecting and storing this data doesn’t have to be difficult. It can be scripted with R!

Hopefully you found this post valuable. As always, please let me know what you think. Happy data nerdery!
