I have long wanted to post about the things I am doing with R and Google Analytics for my division, so I was very excited to see that tonight's #emchat is on web analytics. This post will be more on the technical end, but I wanted to highlight what's possible beyond the basic point-and-click interface in Google Analytics.
Google, like other web analytics vendors, provides an API for our web data. Basically, this means we can write queries against the API and get the results back in a tidy form, just as we can with any other database on campus. By passing different combinations of parameters (dimensions, metrics, filters), we can ask Google to return pretty detailed information about our web traffic. Pretty sweet, right?
In full disclosure, this won't be the most detailed post, but it should show you just enough of what's possible when we pull data straight from the API.
One last thing before I dive in. This post assumes that your school is using Google Analytics; I will be using an R package written specifically for GA. If you don't use GA, no worries, the R community is amazing. Check out this site if you use Adobe SiteCatalyst.
First, nearly all of my posts will talk about R and Python. If you are on a Mac, Python is already installed. Just fire up a terminal and type python. Voila, you are now ready to code!
R, however, is not pre-installed. There are millions of tutorials on how to get up and running. Maybe at some point I will provide a more comprehensive overview of how to turn your computer into a lean, mean, data-crunching machine (for free!), but for the time being, here is my R Setup.
In this brief post, I want to highlight the R package rga, which provides a super simple way to connect to your Google Analytics account and pull data down from the API.
Before you can begin, you will need to install a few packages if you don't already have them in R. The following code installs the packages that you need for this post.
# install devtools so we can grab rga from GitHub
install.packages("devtools")
library(devtools)

# rga is not on CRAN; install it from skardhamar's GitHub repo
install_github("skardhamar/rga")

# plotting and forecasting packages used later in the post
install.packages("ggplot2")
install.packages("forecast")
Now let’s load the packages that we will use to crunch our data.
library(rga)
library(ggplot2)
library(forecast)
The next code segment is necessary if you are on Windows. In full disclosure, this is not an ideal workaround, but when using secure connections (i.e., HTTPS calls to the Analytics API) you need a CA certificate, and Windows doesn't handle this well out of the box.
options(RCurlOptions = list(verbose = FALSE,
                            capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"),
                            ssl.verifypeer = FALSE))
Let’s connect to Google.
rga.open(instance = "ga")
This will fire up a browser tab and show a Google authorization screen.
Click Accept. The next screen will show you a very long string of characters. This is the token that lets us connect to GA. Copy this string, as we need to paste it back into R. Simply paste the string at the prompt.
That’s it! I know that might seem like way too many steps, but authenticating with APIs around the web can be a real PITA.
Take a look at the profiles that you have access to. Pretty cool huh?
ga$getProfiles()
Keep note of the profile you want to query. The value you need is found in the id column.
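As a small aside, you don't have to copy the id by hand. Here is a hedged sketch of pulling it straight out of that profile table, assuming you want the first profile listed and that the column is named id, as noted above:
# grab the id of the first profile and add the "ga:" prefix the API expects
profiles <- ga$getProfiles()
paste0("ga:", profiles$id[1])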
The next part will appear pretty technical, but Google has created a Query Explorer to help navigate the API. I suggest you go play around with the tool. Above all else, it will help you find the values for the dimensions and metrics that you want to include in your queries.
Below I am going to set some values that will be passed into a function that will query GA.
ids = "ga:XXXXXXXXXXX"
start.date = "2013-01-01"
end.date = "2013-10-31"
DIM = "ga:date"
MET = "ga:pageviews,ga:uniquePageviews,ga:entrances,ga:exits,ga:bounces,ga:timeOnPage"
The value for ids is my GA profile id. start.date and end.date define the timeframe for our query. DIM defines the dimensions, which are simply the rows of our query. The columns, or metrics in Google-speak, are defined in the variable MET.
Now let’s get the data!
stats = ga$getData(ids, start.date, end.date, walk = TRUE, batch = TRUE,
                   metrics = MET, dimensions = DIM,
                   sort = "", filters = "", segment = "")
Above, all I am doing is saving the results of our API call to a variable called stats. I will just highlight this topic, but the variable stats is a data.frame in R. At the end of the day, the structure of a data.frame is nearly identical to a dataset that you might have in Excel.
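If you want a quick peek at what came back, a couple of base R commands will do:
str(stats)   # the column names and types for each dimension and metric
head(stats)  # the first few rows, much like the top of an Excel sheet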
I probably should comment on the data that was just retrieved. I am grabbing daily data for our website from January 1, 2013 to October 31, 2013.
Below is a (time series) plot of my daily pageviews. It’s pretty easy to see that there are patterns in the data.
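The plot itself is easy to recreate. Here is a minimal sketch using ggplot2; it assumes the date column needs converting from Google's YYYYMMDD text format (rga may already hand you proper dates, in which case skip that first line):
# convert Google's YYYYMMDD date strings into Date objects if needed
stats$date <- as.Date(as.character(stats$date), format = "%Y%m%d")

# daily pageviews as a simple line chart
ggplot(stats, aes(x = date, y = pageviews)) +
  geom_line() +
  labs(x = "Date", y = "Pageviews")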
I am sure that the group will talk about this tonight, but this type of chart is nothing special and can easily be found in GA.
However, the benefit of R is that we can now create a forecast from this data. We can’t do this in Google Analytics.
fit = auto.arima(stats$pageviews)
plot(forecast(fit), axes = F)
The chart above leverages a built-in function of the forecast package called auto.arima. This uses an algorithm that attempts to fit the best model to our daily pageviews. The forecast is in blue, and the chart shows us the confidence intervals for these estimates.
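If you are curious which model auto.arima settled on, printing the fit shows the selected order and coefficients, and the forecast package's accuracy function reports the training-set error measures:
fit            # prints the selected ARIMA order and its coefficients
accuracy(fit)  # training-set error measures (ME, RMSE, MAE, ...)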
Why should we care about forecasting? Let’s say your team is about to launch a redesign of some, maybe all, of the pages on your site. You might want to forecast the pageviews for your site over the next month. When the data starts pouring in, you can compare the actual pageview stats to the predicted values in order to see how well your new site is performing. If you are constantly beating estimates, you probably can conclude that the redesign worked!
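As a rough sketch of what that comparison could look like (the 30-day horizon and the new_actuals vector below are made-up placeholders, not real numbers):
# forecast the next 30 days of pageviews
fc <- forecast(fit, h = 30)

# once real numbers arrive, line them up against the predictions;
# new_actuals is a placeholder vector for illustration only
new_actuals <- c(1520, 1483, 1610)
comparison <- data.frame(predicted = as.numeric(fc$mean)[seq_along(new_actuals)],
                         actual = new_actuals)
comparison$beat_forecast <- comparison$actual > comparison$predicted
comparison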
By leveraging R, we were able to pull down data from GA and not only replicate the reports found online, but add additional value by forecasting what we think performance should be.
Stepping back, let’s dive into some other reasons as to why you might want to consider using the API.
Depending on how much traffic your site gets, Google might return sampled data. Normally this isn't the end of the world, as the story of your data won't change. However, we usually can get around this by using the batch and walk parameters above. This splits our request into a number of daily queries and, if necessary, collects all of the data in batches. Google has a limit of 10,000 rows per page; the batch parameter intelligently handles this for us if our query returns more than 10k rows.
You can answer some really deep questions that are impossible to answer using the online reporting, at least easily. For example, you can pull down the stats for each page on your website by geography. If you wanted to, you could rank the pages by key geographic markets and evaluate how the top pages perform across these markets. Is the rank order the same? Are some markets more likely to view your Financial Aid page? Are other markets more interested in campus life?
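For instance, a query along these lines would give you pageviews broken out by region and page, ready to rank and compare. The ga:region and ga:pagePath names are my guesses at "geography" and "page"; check the Query Explorer for the exact dimensions you want:
# pageviews by region and page, busiest pages first
geo <- ga$getData(ids, start.date, end.date, batch = TRUE,
                  metrics = "ga:pageviews",
                  dimensions = "ga:region,ga:pagePath",
                  sort = "-ga:pageviews", filters = "", segment = "")
head(geo)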
By leveraging the API, you can store the data you get in a database. I love Tableau. It’s a huge part of my workflow. To build my monthly web reports, I query the data using R, and save my queries into MySQL. I then connect Tableau to the MySQL database, and just refresh my report pages. That’s it! A pretty detailed report takes me all of 15 minutes to complete.
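If you are curious what that storage step looks like, here is a minimal sketch using the DBI and RMySQL packages; the host, database, table, and credentials are placeholders, not my actual setup:
library(DBI)
library(RMySQL)

# connection details are placeholders
con <- dbConnect(MySQL(), host = "localhost", dbname = "webstats",
                 user = "analyst", password = "********")

# append the latest pull to the table that Tableau points at
dbWriteTable(con, "ga_daily", stats, append = TRUE, row.names = FALSE)
dbDisconnect(con)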
Speaking of API requests, you should refer to Google’s documentation. More than likely, you won’t hit the daily API limit, but it’s something to keep in mind.
I have long hoped that the #emchat community would unite and start to share comparable web data on a monthly basis. The lack of benchmark data makes it extremely difficult to assess website performance. In a perfect world, we would even go as far as tracking things like info-request conversion rates. Think about it. What if you could use benchmark data to show that your institution's lead form is underperforming your peers'? Collecting and storing this data doesn't have to be difficult. It can be scripted with R!
Hopefully you found this post valuable. As always, please let me know what you think. Happy data nerdery!