Just like in the previous post,
we will be using
R to access our school’s Google Analytics data
through their API. In this post, I want to highlight how we can figure
out when a visitor to our website completes a goal on our site. In
my case, I am interested in learning more about how, and when,
prospective students (and/or parents) complete our information request
form. This could be any goal on your site, but our recruit pool data
tend to confirm that self-initiated actions are strong predictors of
interest. This is why I tend to emphasize these actions over
“soft-interest” conversions like a simple click-through on a random
email. Before we begin, I assume that you are relatively familiar with
Google Analytics, what data are available, and that you have goals
set up for your website. In my case, we told Google that one of our
“goals” was the completion page of the web request form. I won’t talk
about why goals are massively awesome things to have set up in GA, but
if this concept is new to you, check out this
for an overview.
In the context of R, I am going to make one assumption. If you have
been playing around with the rga package, you probably have figured
out that it’s really helpful to save our connection object for later
sessions. This prevents us from having to authenticate each time we want
data. For help on the package, look here. After firing up R, let’s set
up our environment and reconnect to the API for our undergraduate
account. Below, I am using the where argument to reference the
uga.rga file in my current directory. This file contains my saved
token for GA.
## load the R package we use to access Google Analytics
library(rga)

## not ideal, but a setting that we need to apply if using Windows
options(RCurlOptions = list(verbose = FALSE,
    capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"),
    ssl.verifypeer = FALSE))

## the token for GA
rga.open(instance = "ga", where = "uga.rga")
Now that we are connected to the API, we can start to have some fun. Before going too crazy, let’s answer the basic question of who. Simply, of the people that convert, are they New or Returning visitors? We are going to count the visits by New and Returning visitors from January through November 2013.
start.date = "2013-01-01"
end.date = "2013-11-30"
DIM = "ga:visitorType"
MET = "ga:visits"

## get the data
type = ga$getData("ga:XXXXXXXX", start.date, end.date, walk = TRUE,
    metrics = MET, dimensions = DIM, sort = "", filters = "",
    segment = "dynamic::ga:goal1Completions>=1", start = 1, max = 10000)
One thing real quick. I want to point out how we can define segments
“on-the-fly” in the API. If you use the web reporting tool for GA, you
can define Advanced Segments. These segments allow you to put your
traffic into buckets. While you can access these using the API as well,
we can also generate these programmatically by using dynamic:: in the
segment argument. This feature is pretty helpful in my opinion. Also, we
were able to avoid sampled data by using the walk argument above, but it
means that we now have to aggregate the data by visitorType.
type_summary = aggregate(visits ~ visitorType, data = type, FUN = "sum")
type_summary$pct = type_summary$visits/sum(type_summary$visits)
type_summary

        visitorType visits    pct
1       New Visitor   xxxx 0.6085
2 Returning Visitor    xxx 0.3915
After printing out the data, we can see that about 61% of our information request form conversions were from New Visitors between January and November 2013.
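Since the on-the-fly segment is just a string passed to the segment argument, it can be assembled in code rather than typed by hand. Here is a minimal sketch. Note that goal_segment() is a hypothetical helper of my own, not something the rga package provides, and goals 2 and 3 are placeholders that assume you have more than one goal configured.

```r
## hypothetical helper (not part of the rga package): build a dynamic
## segment string for visits with at least `min_completions` of a goal
goal_segment = function(goal_id, min_completions = 1) {
  sprintf("dynamic::ga:goal%dCompletions>=%d",
          as.integer(goal_id), as.integer(min_completions))
}

## reproduces the segment string used in the queries in this post
goal_segment(1)

## one segment string per goal; goals 2 and 3 are placeholders here
segments = sapply(1:3, goal_segment)
segments
```

Each element of segments could then be passed as the segment argument in its own query, which makes it easy to repeat the same analysis for a different goal.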
Now let’s dig a bit deeper and try to answer the question of when they
convert. In this case, I am defining when as the number of visits
it takes for someone to complete the form. These data will be pulled into
a data frame called basic.
## use http://ga-dev-tools.appspot.com/explorer/ to explore query strings
start.date = "2013-01-01"
end.date = "2013-11-30"
DIM = "ga:date,ga:visitCount"
MET = "ga:visits"

## get the data
basic = ga$getData("ga:XXXXXXXX", start.date, end.date, walk = TRUE,
    metrics = MET, dimensions = DIM, sort = "", filters = "",
    segment = "dynamic::ga:goal1Completions>=1", start = 1, max = 10000)
First, we should take a peek at what we pulled down to ensure that our dataset looks as expected.
class(basic)
[1] "data.frame"

dim(basic)
[1] 893   3

head(basic)
        date visitCount visits
1 2013-01-01          x      x
2 2013-01-01          x      x
3 2013-01-02          x      x
4 2013-01-02          x      x
5 2013-01-03          x      x
6 2013-01-03         xx      x
At a very high level, how many visits does it take to convert a suspect?
round(mean(basic$visitCount), 2)
[1] 5.25
We see that our info request conversions typically take between 5 and 6 visits. But wait, didn’t we just point out that 61% of our conversions were from New Visitors? Because averages are easily influenced by extreme values, we should visualize the distribution.
hist(basic$visitCount, main = "Distribution of Visits required to Convert", xlab = "# Visits", col = "red", breaks = 100)
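Part of the gap between the 5.25 average and the 61% New Visitor share may also come from how the mean is computed. Each row of basic is a date and visitCount combination, so mean(basic$visitCount) gives every row equal weight no matter how many visits it holds. As a cross-check, here is a sketch of a visits-weighted mean and a median. The toy data frame is hypothetical and only mimics the shape of basic; its numbers are made up for illustration.

```r
## hypothetical data shaped like `basic`: one row per date / visitCount pair
toy = data.frame(
  date = as.Date(c("2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02")),
  visitCount = c(1, 40, 1, 2),
  visits = c(6, 1, 2, 1)
)

## unweighted: every row counts once, regardless of how many visits it holds
mean(toy$visitCount)

## weighted: each row counts in proportion to its visits
weighted.mean(toy$visitCount, w = toy$visits)

## median over individual visits, far less sensitive to the long tail
median(rep(toy$visitCount, times = toy$visits))
```

On this toy data the unweighted mean is 11, the weighted mean is 5, and the median is 1, which shows how a single row with a large visitCount can dominate the simple average.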
Now things are starting to make sense. We have some very large values. Let’s standardize the data and remove these outliers.
## copy our data
basic2 = basic

## create a new variable that is the standardized value
basic2$z = scale(basic2$visitCount)

## keep only scaled values +/- 3 (in reality, only '+' values exist)
basic2 = subset(basic2, z >= -3 & z <= 3)

## re-plot the distribution
hist(basic2$visitCount, main = "Distribution of Visits required to Convert",
    xlab = "# Visits", col = "red", breaks = 100)
After removing very large values, our distribution starts to take shape. The chart confirms that the large majority are new visitors, but we can see that there are a decent number of conversions that happen well after the first visit. To me, these are the lurkers that we should attempt to learn more about in the future. Now, I am curious as to how many visits it takes after the first visit. Below, I am going to group (or bin) the data.
## cut our data into bands. (0,1] = 1 visit, (1,2] = 2 visits,
## (7,14] = 8-14 visits
basic2 = transform(basic2, bins = cut(visitCount, breaks = c(0:7, 14, 21, 100)))

## put our data into a summary table using the plyr package
library(plyr)
visit_summary = ddply(basic2, .(bins), summarise, visits = sum(visits))
visit_summary = transform(visit_summary, pct_total = round(visits/sum(visits), 3))
visit_summary

       bins visits pct_total
1     (0,1]   xxxx     0.609
2     (1,2]    xxx     0.187
3     (2,3]    xxx     0.069
4     (3,4]     xx     0.038
5     (4,5]     xx     0.026
6     (5,6]     xx     0.015
7     (6,7]     xx     0.012
8    (7,14]     xx     0.031
9   (14,21]     xx     0.007
10 (21,100]     xx     0.008
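A running total of the pct_total column makes the cumulative picture explicit. The sketch below simply retypes the shares printed in the summary table above, so it runs without the API connection. The name running is my own.

```r
## pct_total values copied from the summary table above
bins = c("(0,1]", "(1,2]", "(2,3]", "(3,4]", "(4,5]", "(5,6]", "(6,7]",
         "(7,14]", "(14,21]", "(21,100]")
pct_total = c(0.609, 0.187, 0.069, 0.038, 0.026, 0.015, 0.012, 0.031,
              0.007, 0.008)

## running share of conversions by number of visits
running = data.frame(bins, pct_total, cum_pct = cumsum(pct_total))
running
```

The third row works out to 0.865, so roughly 86.5% of conversions happen by the third visit. The last value lands slightly above 1 because the printed shares are rounded.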
We can see that the large majority of visitors will go on to request information within the first 3 visits to our site. I know that this is a stretch, but to me this suggests that we only have about 3 chances to influence lurkers, or those that are window shopping our institution. Just because I can’t help myself, one last cut of the data. I am going to manually classify our data into New/Returning visitors and explore if the Month impacts who converts.
## month() with label = TRUE comes from the lubridate package
library(lubridate)

## clean up the month from our date variable (which is stored as a date)
basic2 = transform(basic2, month = month(date, label = TRUE))

## manually classify visits as New/Returning
basic2 = transform(basic2, visit_type = ifelse(visitCount == 1, "New", "Returning"))

## summarize the data before we plot it
basic2_summ = ddply(basic2, .(month, visit_type), summarise, visits = sum(visits))

## plot the distributions for each month using the ggplot2 plotting library
library(ggplot2)
ggplot(basic2_summ, aes(x = month, y = visits, fill = factor(visit_type))) +
    geom_bar(position = "fill", stat = "identity")
Visually, I am not sure there is a strong pattern in our data. However, there might be some evidence to suggest that our conversions increasingly come from New Visits during the fall months; senior year if you are looking at this at the undergraduate level.
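To put numbers behind a visual impression like this, the same shares can be computed as a table. This is only a sketch: toy_summ is hypothetical data that mimics the shape of basic2_summ, and its values are made up for illustration.

```r
## hypothetical data shaped like `basic2_summ`
toy_summ = data.frame(
  month = factor(c("Jan", "Jan", "Oct", "Oct"), levels = c("Jan", "Oct")),
  visit_type = c("New", "Returning", "New", "Returning"),
  visits = c(50, 50, 75, 25)
)

## cross-tabulate visits, then convert each month's row to proportions
counts = xtabs(visits ~ month + visit_type, data = toy_summ)
shares = prop.table(counts, margin = 1)
round(shares, 3)
```

Each row of shares sums to 1, which is the same quantity that position = "fill" draws in the stacked bar chart.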
Above, I ran through some quick code to determine the number of visits
it takes before a suspect will request more information from our
institution. In addition, we were able to figure out if our conversions
are coming from New or Returning visitors. Stepping back, you could have
used the web reporting interface to answer a few of the questions above,
but where is the fun in that? All kidding aside, this is only a fraction
of what we could have done. For example, we could have isolated
conversions with a
visitCount > 1 and then studied how the traffic
came to our site. In addition, we could also explore if we have longer
conversion cycles based on visitor geography or even evaluate the
conversion impact of mobile devices.