A place to collect my thoughts on data analysis within Enrollment Management. Dare I call it Enrollment Science?

A simple example of predictive analytics for Enrollment
Managers using **FREE** tools.

Using `R`, we can fit all sorts of complex models in Enrollment
Management, quickly, and for no cost. In truth, data modeling can help
uncover complex relationships at your school that are not easily visible
in our usual tables and charts. However, predictive analytics is not the
golden ticket to enrollment success. You will need to understand not
only what the model is telling you, but also the risks associated with
being incorrect. Lastly, once you have your “model”, how do you actually
use it? You will need to think about how you incorporate your new model
into your current decision-making processes.

Over the last few weeks, I feel like the discussion around “Predictive
Analytics” within Enrollment Management has really picked up steam.
There are a ton of great vendors out there, but the aim of this post is
to show you how simple it can be to build predictive models internally,
for the price of “on-the-house”. I don’t mean to imply that machine
learning is easy by any stretch, but I do intend to highlight how
quickly models **can** be built. If you know your data, and understand
various techniques, model building isn’t the hard part. More than likely
though, you will want to take some time to think about your data and the
output you see, not to mention how you would actually operationalize
your model so that it runs quietly behind the scenes.

My goal is to try to write a post on how we can do
`predictive analytics` in Enrollment Management using `R`. In this
example, we will fit a model to predict if an applicant is admitted. In
full disclosure, I am going to avoid the technical details as much as
possible, although understanding **how** these models work is critically
important.

Let the debate around predictive analytics begin! I am just kidding, but there has been quite a bit of press recently on the usage of predictive analytics within higher ed and Enrollment Management. Here are a few (self-edited, sometimes snarky) headlines.

- Colleges are using Big Data to predict which students will do well
- The Future of Predictive Analytics in Higher Ed
- Political Style Targeting
- FAFSA data

I do think it’s worth noting that `predictive analytics` is actually not
a new concept. Technology is making it much easier to do, although the
underlying methodologies have been applied to higher ed for some time
now. Below are just a few journal articles.

- Enrollment Models Using Data Mining
- Data Mining: A Magic Technology for College Recruitment
- Differential Pricing in Undergraduate Education

I included the last link above because “pricing” is a pretty hot topic
at the moment as well. On one hand, you have schools blocking College
Abacus, which is basically Kayak for college pricing. On the other,
institutions are required to report all sorts of data to the government
through IPEDS, where it is displayed on a number of sites including the
College Affordability and Transparency Center. My point? There is an
academic argument for each side of the debate, whether it’s predictive
analytics or transparency. Outside of the financial reporting of public companies,
what other industry has to openly report their *performance* at this
level of detail to the public? As such, the trends of our industry are
forcing us to think differently about how we do things. Now that it’s
here, we need to start to get comfortable with what *it* can do. More
importantly though, we need to understand the risks associated with
modeling our enrollment data.

As mentioned a few times above, I am going to use the open-source
statistical programming language, `R`, to download and model our data.
Here is our workflow:

- Grab a dataset from the web
- Fit a predictive model (logistic regression)
- Assess the accuracy of the model

If you are reading this post and are a regular `SPSS` user, this next
step is pretty cool. `R` allows us to grab data from the web. If you
were just using `SPSS`, it would require that you scrape (or download)
the data, and then fire up the software to read in the external dataset.
That’s way too much effort! The code below grabs a very small admissions
dataset. If you are an analyst, you should check out UCLA’s website.
It’s a great resource for analytical methods and code examples. Below,
we will define the URL for the dataset, and then use this value to read
in the CSV file from the web into a `data.frame` object called `df`.

```
URL = "https://stats.idre.ucla.edu/stat/data/binary.csv"
df = read.csv(URL)
```

Let’s confirm that the data are in our `R` session.

```
dim(df)
[1] 400   4
summary(df)
     admit            gre           gpa            rank
 Min.   :0.000   Min.   :220   Min.   :2.26   Min.   :1.00
 1st Qu.:0.000   1st Qu.:520   1st Qu.:3.13   1st Qu.:2.00
 Median :0.000   Median :580   Median :3.40   Median :2.00
 Mean   :0.318   Mean   :588   Mean   :3.39   Mean   :2.48
 3rd Qu.:1.000   3rd Qu.:660   3rd Qu.:3.67   3rd Qu.:3.00
 Max.   :1.000   Max.   :800   Max.   :4.00   Max.   :4.00
```

The command `dim(df)` simply asks `R` to print out the dimensions of our
dataset. In this case, we have 400 rows and 4 columns. The `head`
command prints the first few rows of the data, so we can see what we
have.

```
head(df)
  admit gre  gpa rank
1     0 380 3.61    3
2     1 660 3.67    3
3     1 800 4.00    1
4     1 640 3.19    4
5     0 520 2.93    4
6     1 760 3.00    2
```

I should have done this by now, so let’s talk about the dataset. The
first column, `admit`, is the variable that we want to predict. In this
case, our variable represents a Yes/No decision. Yes is coded as a `1`,
No is coded as a `0`. This type of variable is prevalent in Enrollment
Management. To name a few …

- Does a suspect respond to our search campaign?
- Does a recruit apply?
- Do we retain a student?
- Will the student graduate in 4 years?
- Does the student pay a deposit?
- Does the student melt (between May and September)?
- Does the recruit open up the next email we send them?

Even if the variable doesn’t exist in a natural Yes/No state, we can
usually force our data into this format. The other 3 variables are our
`features`, or `predictor` variables. We will be using `gre`, `gpa`, and
`rank` to predict the applicant’s admission status for graduate school.
The variable `gre` is numeric and on an 800 scale, `gpa` is also numeric
on a 4.0 scale, and `rank` appears to be categorical, with values
ranging 1-4 based on the admissions counselor’s read of the student.
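
To make the idea of forcing data into a Yes/No format concrete, here is a minimal sketch. The `status` values and the `deposit` column are hypothetical, made up purely for illustration; your own student system will have its own codes.

```r
# Hypothetical funnel statuses; these are NOT part of the UCLA dataset
applicants <- data.frame(
  id     = 1:5,
  status = c("Deposited", "Withdrawn", "Admitted", "Deposited", "Denied")
)

# Recode the multi-level status into a 0/1 outcome: did the student deposit?
applicants$deposit <- as.integer(applicants$status == "Deposited")

applicants
mean(applicants$deposit)  # share of 1s, i.e. the deposit rate
```

Once the outcome is a `0/1` column like this, it can be modeled the same way as `admit`.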

Now let’s fit our predictive model. `R` is really flexible. All I have
to do below is tell `R` to fit a model where I am trying to predict
`admit` given every other variable in the dataset. Below, I indicate
this concept using the syntax `admit ~ .`

```
yield_model = glm(admit ~ ., data = df, family = binomial())
summary(yield_model)

Call:
glm(formula = admit ~ ., family = binomial(), data = df)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.580  -0.885  -0.638   1.157   2.173

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.44955    1.13285   -3.05   0.0023 **
gre          0.00229    0.00109    2.10   0.0356 *
gpa          0.77701    0.32748    2.37   0.0177 *
rank        -0.56003    0.12714   -4.40  1.1e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 459.44  on 396  degrees of freedom
AIC: 467.4

Number of Fisher Scoring iterations: 4
```

When we use the `summary` command above, we print out the “fit” of the
model. In the section called `Coefficients:`, we get the estimated
weights, or effects, of each variable on the admission status.
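
If you want to read those weights on a more familiar scale, one common approach is to exponentiate them into odds ratios. The sketch below copies the estimates from the summary output above; the applicant values (`gre = 600`, `gpa = 3.5`, `rank = 2`) are made up for illustration.

```r
# Coefficients copied from the summary output above
coefs <- c(intercept = -3.44955, gre = 0.00229, gpa = 0.77701, rank = -0.56003)

# Exponentiating a logistic coefficient gives an odds ratio: the
# multiplicative change in the odds of admission per one-unit increase
odds_ratios <- exp(coefs[c("gre", "gpa", "rank")])
round(odds_ratios, 3)

# Predicted probability for a hypothetical applicant (made-up values)
log_odds <- coefs["intercept"] + coefs["gre"] * 600 +
  coefs["gpa"] * 3.5 + coefs["rank"] * 2
prob <- unname(plogis(log_odds))  # plogis() is the inverse logit
round(prob, 3)
```

The odds ratio for `gpa` comes out a bit above 2, so each additional GPA point roughly doubles the odds of admission, holding the other variables fixed. The odds ratio for `rank` is about 0.57, so each step up in `rank` cuts the odds by a little over 40%.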

Now that we have fit a model, let’s “score” our data. Imagine that you were using last year’s applicant pool to predict the admission status of this year’s class. In the code below, we are going to append the probability of being admitted. We can then use this score to assess how “accurate” our predicted value truly is.

```
df = transform(df, score = predict(yield_model, newdata = df, type = "response"))
```
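
In practice you would fit the model on one year and score another, rather than scoring the same rows you trained on. As a rough sketch of that workflow, the block below simulates an applicant pool, fits on one part, and scores the rest. The simulated data, the 300/100 split, and the names `last_year` and `this_year` are all assumptions for illustration.

```r
set.seed(42)

# Simulated applicant pool with the same column names as the UCLA data
n <- 400
sim <- data.frame(
  gre  = round(rnorm(n, mean = 588, sd = 115)),
  gpa  = round(runif(n, min = 2.3, max = 4.0), 2),
  rank = sample(1:4, n, replace = TRUE)
)
p_true <- plogis(-3.45 + 0.0023 * sim$gre + 0.777 * sim$gpa - 0.56 * sim$rank)
sim$admit <- rbinom(n, size = 1, prob = p_true)

# Pretend 300 rows are last year's pool and the other 100 are this year's
train_idx <- sample(seq_len(n), size = 300)
last_year <- sim[train_idx, ]
this_year <- sim[-train_idx, ]

# Fit on last year, then score this year's applicants
fit <- glm(admit ~ gre + gpa + rank, data = last_year, family = binomial())
this_year$score <- predict(fit, newdata = this_year, type = "response")

summary(this_year$score)
```

Scoring rows the model has never seen gives a more honest read on accuracy than scoring the training data.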

When we used the `summary` command earlier, we printed out some basic
stats on the variables in our dataset. Because `admit` is coded as
`0/1`, the average of this variable is equivalent to the proportion of
`admit = Yes` in the dataset. In this case, 32% of the applicants were
admitted. This is important because our model will calibrate the scores
relative to this proportion. If our new data are wildly different, the
model will not perform that well. Let’s print out the distribution of
predicted scores.
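
One quick way to look at that distribution is `summary()` on the score column. The sketch below uses a small simulated dataset (an assumption, so that it runs on its own) to also show the calibration point: for a logistic regression with an intercept, the average predicted score on the training data matches the observed rate.

```r
set.seed(1)

# Simulated data; the coefficients here are made up for illustration
n <- 500
gpa <- runif(n, min = 2, max = 4)
admit <- rbinom(n, size = 1, prob = plogis(-4 + 1.1 * gpa))

fit <- glm(admit ~ gpa, family = binomial())
score <- fitted(fit)  # predicted probabilities on the training rows

# Distribution of the predicted scores
summary(score)

# The mean score lines up with the observed admit rate
c(observed = mean(admit), predicted = mean(score))
```

If next year's pool has a very different admit rate, scores from this model will be centered in the wrong place.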

Now let’s look at the distribution of the scores based on the *actual*
admission status. If you do not already have the library `ggplot2`
installed, simply use the command `install.packages("ggplot2")` before
executing the code below.

```
library(ggplot2)
ggplot(df, aes(x = score, fill = factor(admit))) + geom_density(alpha = 0.3)
```

It’s nice to see that the peak of the predicted scores for admitted
students is higher than for those who were rejected, but I am not
thrilled by this plot. Early on, it looks like the model was not able to
accurately differentiate between admits and rejects. Below, we are going
to use another package, `ROCR`, for some other “goodness-of-fit”
metrics. For help on this package, go here. I highly recommend reviewing
the `Powerpoint` file that is included on the site.

```
library(ROCR)
pred <- prediction(df$score, df$admit)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize = TRUE, main = "ROC Curve")
```

This plot of the true positive rate against the false positive rate is
an ROC curve. If you view it from left to right, ideally the line would
have spiked “early” in the chart. In general, you typically think of a
45-degree line, which is what random guessing would produce, and the
more “lift” above this line the better. Finally, I am going to compute a
metric, `AUC`, the area under this curve. The higher the number, the
better. To learn more about `AUC`, check out this page. For a
rule-of-thumb interpretation of the score, look here.

```
auc = performance(pred, "auc")
auc@y.values[[1]]
[1] 0.6921
```
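
One way to build intuition for `AUC`: it is the probability that a randomly chosen admit gets a higher score than a randomly chosen reject. The base-`R` sketch below computes it from ranks on a tiny made-up example. The helper name `auc_by_rank` and the toy scores are hypothetical, and this is not how `ROCR` computes it internally.

```r
# AUC via the rank-sum (Mann-Whitney) identity; label must be coded 0/1
auc_by_rank <- function(score, label) {
  r <- rank(score)  # average ranks, so ties count as half
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 3 admits and 2 rejects with made-up scores
score <- c(0.90, 0.80, 0.40, 0.35, 0.20)
label <- c(1, 1, 0, 1, 0)

auc_by_rank(score, label)  # 5 of the 6 admit/reject pairs are ordered correctly
```

A value of 0.5 means the scores order admits and rejects no better than a coin flip, and 1.0 means every admit outscores every reject.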

Real quick. You may have noticed that I usually access values using the
`$` operator, but needed to use `@` above. This is because the object
returned from `performance` is of `S4` class in `R`. The more you play
around, you will see this object class appear from time to time, but you
can usually access your data using `$`. From above, we see that the
`AUC` for our model is 0.6921. In truth, the model doesn’t fit that
well. Intuitively, we can confirm this by binning our scores into ten
equal-width bands and looking at the actual admit rate within each band.

```
library(plyr)
## add a new variable, band, which puts the score into 10 groups
df = transform(df, band = cut(score, breaks = seq(0, 1, 0.1), right = FALSE))
## create a summary table, by group, that looks at some summary stats for
## each band
ddply(df, .(band), summarise, applicants = length(admit), admits = sum(admit),
      admit_rate = mean(admit))
       band applicants admits admit_rate
1   [0,0.1)         15      1    0.06667
2 [0.1,0.2)         88     15    0.17045
3 [0.2,0.3)         84     22    0.26190
4 [0.3,0.4)        105     31    0.29524
5 [0.4,0.5)         59     29    0.49153
6 [0.5,0.6)         32     19    0.59375
7 [0.6,0.7)         15      9    0.60000
8 [0.7,0.8)          2      1    0.50000
```

For example, there were 2 applicants with a predicted probability of admission between 70% and 80%. Of these 2 applicants, only 1 was admitted. In a perfect world, higher scores would always come with higher “true” admit rates.
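
Another way to read the same table is as a cumulative gains table: start from the highest-scoring band and ask what share of all admits you have captured so far. The sketch below re-enters the counts from the table above by hand. The column names `cum_pct_applicants`, `cum_pct_admits`, and `lift` are my own labels, not output from any package.

```r
# Counts copied from the band summary above
bands <- data.frame(
  band       = c("[0,0.1)", "[0.1,0.2)", "[0.2,0.3)", "[0.3,0.4)",
                 "[0.4,0.5)", "[0.5,0.6)", "[0.6,0.7)", "[0.7,0.8)"),
  applicants = c(15, 88, 84, 105, 59, 32, 15, 2),
  admits     = c(1, 15, 22, 31, 29, 19, 9, 1)
)

# Order from the highest-scoring band down, then accumulate
gains <- bands[rev(seq_len(nrow(bands))), ]
gains$cum_pct_applicants <- cumsum(gains$applicants) / sum(gains$applicants)
gains$cum_pct_admits     <- cumsum(gains$admits) / sum(gains$admits)

# Lift: how much better than picking applicants at random
gains$lift <- gains$cum_pct_admits / gains$cum_pct_applicants

print(gains, row.names = FALSE)
```

Reading down to the `[0.4,0.5)` row, applicants scoring 0.4 or higher make up 27% of the pool but about 46% of the admits. That is some lift, though not a lot.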

Hopefully this was a fairly gentle introduction to how quickly you can fit a predictive model for your EM team. Conceptually, it doesn’t have to be hard, although interpreting the results can be tricky. Regardless, you can explore what is possible for free with open-source statistical software. Hey, you might even have some fun writing code!
