A place to collect my thoughts on data analysis within Enrollment Management. Dare I call it Enrollment Science?
Just like in the previous entry, we will be using R to access our school’s Google Analytics data through their API. In this post, I want to highlight how we can figure out when a visitor to our website completes a goal on our site. In my case, I am interested in learning more about how, and when, prospective students (and/or parents) complete our information request form. This could be any goal on your site, but our recruit pool data tend to confirm that self-initiated actions are strong predictors of interest. This is why I tend to emphasize these actions over “soft-interest” conversions like a simple click-through on a random email. Before we begin, I assume that you are relatively familiar with Google Analytics, what data are available, and that you have goals set up for your website. In my case, we told Google that one of our “goals” was the completion page of the web request form. I won’t talk about why goals are massively awesome things to have set up in GA, but if this concept is new to you, check out this link for an overview.
In the context of R, I am going to make one assumption. If you have been playing around with the rga package, you have probably figured out that it’s really helpful to save the connection object for later sessions. This prevents us from having to authenticate each time we want data. For help on the package, look here. After firing up R, let’s set up our environment and reconnect to the API for our undergraduate account. Below, I am using the where argument to reference the uga.rga file in my current directory. This file contains my saved credentials.
## load the R package we use to access Google Analytics
library(rga)
## not ideal, but a setting that we need to apply if using Windows
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL",
"cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
## the token for GA
rga.open(instance = "ga", where = "uga.rga")
Now that we’ve connected to the API, we can start to have some fun. Before going too crazy, let’s answer the basic question of who. Simply put, of the people that convert, are they New or Returning visitors? We are going to count the visits by New and Returning visitors from January through November 2013.
start.date = "2013-01-01"
end.date = "2013-11-30"
DIM = "ga:visitorType"
MET = "ga:visits"
## get the data
type = ga$getData("ga:XXXXXXXX", start.date, end.date, walk = TRUE, metrics = MET,
dimensions = DIM, sort = "", filters = "", segment = "dynamic::ga:goal1Completions>=1",
start = 1, max = 10000)
One thing real quick: I want to point out how we can define segments “on-the-fly” in the API. If you use the web reporting tool for GA, you can define Advanced Segments, which allow you to put your traffic into buckets. While you can access those stored segments using the API as well, we can also generate them programmatically by using dynamic::. This feature is pretty helpful, in my opinion.
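As an aside (my own example, not a pull from this post), the same dynamic:: syntax lets you chain conditions with a semicolon, so you could narrow the segment further, say to converting visits that arrived via organic search:
## (sketch) converting visits that came in through organic search;
## reuses the start.date, end.date, DIM, and MET objects defined above
organic_type = ga$getData("ga:XXXXXXXX", start.date, end.date, walk = TRUE, metrics = MET,
    dimensions = DIM, sort = "", filters = "",
    segment = "dynamic::ga:goal1Completions>=1;ga:medium==organic", start = 1, max = 10000)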
Also, we were able to avoid sampled data by using the walk argument above, but it means that we now have to aggregate the data by visitorType.
type_summary = aggregate(visits ~ visitorType, data = type, FUN = "sum")
type_summary$pct = type_summary$visits/sum(type_summary$visits)
type_summary
visitorType visits pct
1 New Visitor xxxx 0.6085
2 Returning Visitor xxx 0.3915
After printing out the data, we can see that about 61% of our information request form conversions were from New Visitors between January and November 2013.
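If a picture helps, a one-line barplot of that split (my aside) works too:
## (aside) visualize the New vs. Returning split of converting visits
barplot(type_summary$pct, names.arg = type_summary$visitorType, col = "red",
    ylab = "Share of converting visits", main = "Conversions by Visitor Type")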
Now let’s dig a bit deeper and try to answer the question of when they convert. In this case, I am defining when as the number of visits it takes before someone completes the form. These data will be pulled into a data frame called basic.
## use http://ga-dev-tools.appspot.com/explorer/ to explore query strings
start.date = "2013-01-01"
end.date = "2013-11-30"
DIM = "ga:date,ga:visitCount"
MET = "ga:visits"
## get the data
basic = ga$getData("ga:XXXXXXXX", start.date, end.date, walk = TRUE, metrics = MET,
dimensions = DIM, sort = "", filters = "", segment = "dynamic::ga:goal1Completions>=1",
start = 1, max = 10000)
First, we should take a peek at what we pulled down to ensure that our dataset looks as expected.
class(basic)
[1] "data.frame"
dim(basic)
[1] 893 3
head(basic)
date visitCount visits
1 2013-01-01 x x
2 2013-01-01 x x
3 2013-01-02 x x
4 2013-01-02 x x
5 2013-01-03 x x
6 2013-01-03 xx x
At a very high level, how many visits does it take to convert a suspect?
round(mean(basic$visitCount), 2)
[1] 5.25
We see that our info request conversions typically take between 5 and 6 visits. But wait, didn’t we just point out that 61% of our conversions were from New Visitors? Because averages are easily influenced by extreme values, we should visualize the distribution.
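As a quick aside (my addition, not part of the original post), the median and quartiles are less sensitive to those extreme values and give a more robust sense of the “typical” conversion:
## (aside) robust summary of visits-to-convert; like the mean above, this is
## computed over the rows of 'basic' and is not weighted by the visits column
median(basic$visitCount)
quantile(basic$visitCount, probs = c(0.25, 0.5, 0.75, 0.95))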
hist(basic$visitCount, main = "Distribution of Visits required to Convert",
xlab = "# Visits", col = "red", breaks = 100)
Now things are starting to make sense. We have some very large values. Let’s standardize the data and remove these outliers.
## copy our data
basic2 = basic
## create a new variable that is the standardized value
basic2$z = scale(basic2$visitCount)
## keep only scaled values +/- 3 (in reality, only '+' values exist)
basic2 = subset(basic2, z >= -3 & z <= 3)
## re-plot the distribution
hist(basic2$visitCount, main = "Distribution of Visits required to Convert",
xlab = "# Visits", col = "red", breaks = 100)
After removing the very large values, our distribution starts to take shape. The chart confirms that the large majority convert on their first visit, but we can see that there are a decent number of conversions that happen well after that. To me, these are the lurkers that we should attempt to learn more about in the future. Now, I am curious how many visits it takes for the people who don’t convert right away. Below, I am going to group (or bin) the data.
## cut our data into bands. (0,1] = 1 visit, (1, 2] = 2 visits, (7, 14] =
## 8-14 visits
basic2 = transform(basic2, bins = cut(visitCount, breaks = c(0:7, 14, 21, 100)))
## put our data into a summary table using the plyr package
library(plyr)
visit_summary = ddply(basic2, .(bins), summarise, visits = sum(visits))
visit_summary = transform(visit_summary, pct_total = round(visits/sum(visits),
3))
visit_summary
bins visits pct_total
1 (0,1] xxxx 0.609
2 (1,2] xxx 0.187
3 (2,3] xxx 0.069
4 (3,4] xx 0.038
5 (4,5] xx 0.026
6 (5,6] xx 0.015
7 (6,7] xx 0.012
8 (7,14] xx 0.031
9 (14,21] xx 0.007
10 (21,100] xx 0.008
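As a quick aside (my addition), a cumulative sum over pct_total makes it easy to see how fast conversions pile up across the bins:
## (aside) running share of converting visits by number of visits
transform(visit_summary, cum_pct = cumsum(pct_total))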
We can see that the large majority of visitors will go on to request information within the first 3 visits to our site. I know that this is a stretch, but to me this suggests that we only have about 3 chances to influence lurkers, or those that are window shopping our institution. Just because I can’t help myself, one last cut of the data. I am going to manually classify our data into New/Returning visitors and explore if the Month impacts who converts.
## extract the month from our date variable (which is stored as a date)
## using the lubridate package
library(lubridate)
basic2 = transform(basic2, month = month(date, label = TRUE))
## manually classify visits as New/Returning
basic2 = transform(basic2, visit_type = ifelse(visitCount == 1, "New", "Returning"))
## summarize the data before we plot it
basic2_summ = ddply(basic2, .(month, visit_type), summarise, visits = sum(visits))
## plot the distributions for each month using the ggplot2 plotting library
library(ggplot2)
ggplot(basic2_summ, aes(x = month, y = visits, fill = factor(visit_type))) +
geom_bar(position = "fill", stat = "identity")
Visually, I am not sure there is a strong pattern in our data. However, there might be some evidence to suggest that our conversions increasingly come from New Visits during the fall months; senior year if you are looking at this at the undergraduate level.
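If you want numbers to go with the chart, something like the summary below (my aside, reusing the basic2_summ table from above) would give each month’s share of conversions that happened on a first visit:
## (aside) share of each month's converting visits that were first visits
library(plyr)
new_share = ddply(basic2_summ, .(month), summarise,
    pct_new = round(sum(visits[visit_type == "New"])/sum(visits), 3))
new_share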
Above, I ran through some quick code to determine the number of visits it takes before a suspect will request more information from our institution. In addition, we were able to figure out whether our conversions are coming from New or Returning visitors. Stepping back, you could have used the web reporting interface to answer a few of the questions above, but where is the fun in that? All kidding aside, this is only a fraction of what we could have done. For example, we could have isolated conversions with a visitCount > 1 and then studied how that traffic came to our site (a rough sketch of that query follows below). In addition, we could also explore whether we have longer conversion cycles based on visitor geography, or even evaluate the conversion impact of mobile devices.
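As a starting point for that first idea, a query along these lines would pull source/medium alongside visitCount for converting visits. Treat it as a sketch rather than something run for this post; ga:sourceMedium is a standard GA dimension, but the exact slicing is my assumption.
## (sketch) which sources drive conversions that happen after the first visit?
start.date = "2013-01-01"
end.date = "2013-11-30"
DIM = "ga:sourceMedium,ga:visitCount"
MET = "ga:visits"
sources = ga$getData("ga:XXXXXXXX", start.date, end.date, walk = TRUE, metrics = MET,
    dimensions = DIM, sort = "", filters = "", segment = "dynamic::ga:goal1Completions>=1",
    start = 1, max = 10000)
## keep only the conversions that happened after the first visit
returning = subset(sources, visitCount > 1)
## total converting visits by traffic source, largest first
src_summary = aggregate(visits ~ sourceMedium, data = returning, FUN = "sum")
head(src_summary[order(-src_summary$visits), ])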