Enrollment Nerdery

A place to collect my thoughts on data analysis within Enrollment Management. Dare I call it Enrollment Science?



Predictive Analytics in Enrollment Management

Abstract:

A simple example of predictive analytics for Enrollment Managers using FREE tools.

TL;DR:

Using R we can fit all sorts of complex models in Enrollment Management, quickly, and for no cost. In truth, data modeling can help uncover complex relationships at your school that are not easily visible in our usual tables and charts. However, predictive analytics is not the golden ticket to enrollment success. You will need to understand not only what the model is telling you, but also the risks associated with being incorrect. Lastly, once you have your "model", how do you actually use it? You will need to think about how you incorporate your new model into your current decision-making processes.

Why Write this Post?

Over the last few weeks, I feel like the discussion around "Predictive Analytics" within Enrollment Management has really picked up steam. There are a ton of great vendors out there, but the aim of this post is to show you how simple it can be to build predictive models internally, for the price of "on-the-house". I don't mean to imply that machine learning is easy by any stretch, but I do intend to highlight how quickly models can be built. If you know your data and understand the various techniques, model building isn't the hard part. More than likely, though, you will want to take some time to think about your data and the output you see. Not to mention how you would actually operationalize your model so that it runs quietly behind the scenes.

Quick Overview

My goal is to show how we can do predictive analytics in Enrollment Management using R. In this example, we will fit a model to predict whether an applicant is admitted. In full disclosure, I am going to avoid the technical details as much as possible, although understanding how these models work is critically important.

Previous Work and Discussion

Let the debate around predictive analytics begin! I am just kidding, but there has been quite a bit of press recently on the usage of predictive analytics within higher ed and Enrollment Management. Here are a few (self-edited, sometimes snarky) headlines.

I do think it's worth noting that predictive analytics is actually not a new concept. Technology is making it much easier to do, although the underlying methodologies have been applied in higher ed for some time now. Below are just a few journal articles.

I included the last link above because "pricing" is a pretty hot topic at the moment as well. On one hand, you have schools blocking College Abacus, which is basically Kayak for college pricing. On the other, institutions are required to report all sorts of data to the government through IPEDS, where it is displayed on a number of sites including the College Affordability and Transparency Center. My point? There is an academic argument for each side of the debate, whether it's predictive analytics or transparency. Outside of the financial reporting of public companies, what other industry has to openly report its performance at this level of detail to the public? As such, the trends of our industry are forcing us to think differently about how we do things. Now that predictive analytics is here, we need to start to get comfortable with what it can do. More importantly though, we need to understand the risks associated with modeling our enrollment data.

The Process

As mentioned a few times above, I am going to use the open-source statistical programming language R to download and model our data. Here is our workflow:

  1. Grab a dataset from the web
  2. Fit a predictive model (logistic regression)
  3. Assess the accuracy of the model

1) Let's grab the data

If you are reading this post and are a regular SPSS user, this next step is pretty cool. R allows us to grab data from the web. If you were just using SPSS, it would require that you scrape (or download) the data, and then fire up the software to read in the external dataset. That’s way too much effort! The code below grabs a very small admissions dataset. If you are an analyst, you should check out UCLA’s website. It’s a great resource for analytical methods and code examples. Below, we will define the URL for the dataset, and then use this value to read in the CSV file from the web into a data.frame object called df.

URL = "https://stats.idre.ucla.edu/stat/data/binary.csv"
df = read.csv(URL)

Let’s confirm that the data are in our R session.

dim(df)

[1] 400   4

summary(df)

     admit            gre           gpa            rank     
 Min.   :0.000   Min.   :220   Min.   :2.26   Min.   :1.00  
 1st Qu.:0.000   1st Qu.:520   1st Qu.:3.13   1st Qu.:2.00  
 Median :0.000   Median :580   Median :3.40   Median :2.00  
 Mean   :0.318   Mean   :588   Mean   :3.39   Mean   :2.48  
 3rd Qu.:1.000   3rd Qu.:660   3rd Qu.:3.67   3rd Qu.:3.00  
 Max.   :1.000   Max.   :800   Max.   :4.00   Max.   :4.00  

The command dim(df) simply asks R to print out the dimensions of our dataset. In this case, we have 400 rows and 4 columns. The head command prints the first few rows of the data, so we can see what we have.

head(df)

  admit gre  gpa rank
1     0 380 3.61    3
2     1 660 3.67    3
3     1 800 4.00    1
4     1 640 3.19    4
5     0 520 2.93    4
6     1 760 3.00    2

I should have done this by now, so let's talk about the dataset. The first column, admit, is the variable we want to predict. In this case, our variable represents a Yes/No decision: Yes is coded as 1, No is coded as 0. This type of variable is prevalent in Enrollment Management. To name a few: applied or not, admitted or not, enrolled or not, retained or not.

Even if the variable doesn't exist in a natural Yes/No state, we can usually force our data into this format. The other 3 variables are our features, or predictor variables. We will be using gre, gpa, and rank to predict the applicant's admission status for graduate school. The variable gre is numeric on an 800-point scale, gpa is also numeric on a 4.0 scale, and rank is categorical, with values ranging 1-4 that reflect the prestige of the applicant's undergraduate institution (1 being the highest).
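
If you want to confirm how R has typed each column, a quick check is cheap. Both commands below are base R; the commented ifelse() line is just a sketch of how you might binarize a variable, using a hypothetical status column that this dataset does not actually have.

## compact view of each column's type and first few values
str(df)
## frequency of each rank category
table(df$rank)
## hypothetical example: forcing a Yes/No field into 0/1 form
## (assumes a column called status, which df does not contain)
# ifelse(df$status == "Enrolled", 1, 0)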

2) Fit a Model

Now let's fit our predictive model. R is really flexible. All I have to do below is tell R to fit a model where I am trying to predict admit given every other variable in the dataset. Below, I indicate this concept using the syntax admit ~ . (which is equivalent to spelling out admit ~ gre + gpa + rank).

yield_model = glm(admit ~ ., data = df, family = binomial())
summary(yield_model)

Call:
glm(formula = admit ~ ., family = binomial(), data = df)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.580  -0.885  -0.638   1.157   2.173  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.44955    1.13285   -3.05   0.0023 ** 
gre          0.00229    0.00109    2.10   0.0356 *  
gpa          0.77701    0.32748    2.37   0.0177 *  
rank        -0.56003    0.12714   -4.40  1.1e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 459.44  on 396  degrees of freedom
AIC: 467.4

Number of Fisher Scoring iterations: 4

When we use the summary command above, we print out the “fit” of the model. In the section called Coefficients:, we get the estimated weights, or effects, of each variable on the admission status.
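
Because logistic regression operates on the log-odds scale, the raw coefficients can be hard to explain. One common trick is to exponentiate them into odds ratios, which tend to be easier to communicate to a non-technical audience. A minimal sketch:

## convert the log-odds coefficients into odds ratios
exp(coef(yield_model))
## e.g., exp(0.777) is about 2.17: each additional GPA point roughly
## doubles the odds of admission, holding gre and rank constant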

3) Assess the Model

Now that we have fit a model, let's "score" our data. Imagine that you were using last year's applicant pool to predict the admission status of this year's class. In the code below, we are going to append the probability of being admitted to each row. We can then use this score to assess how "accurate" our predicted value truly is.

df = transform(df, score = predict(yield_model, newdata = df, type = "response"))
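
To mimic scoring a brand-new cycle, you can hand predict() a fresh data.frame with the same columns. The two applicants below are made up purely for illustration:

## hypothetical applicants -- the values are invented for this example
new_apps = data.frame(gre = c(700, 520), gpa = c(3.8, 3.05), rank = c(2, 3))
predict(yield_model, newdata = new_apps, type = "response")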

When we used the summary command earlier, we printed out some basic stats on the variables in our dataset. Because admit is coded as 0/1, the average of this variable is equivalent to the proportion of admit = Yes in the dataset. In this case, 32% of the applicants were admitted. This is important because our model will calibrate the scores relative to this proportion. If our new data are wildly different, the model will not perform that well. Let's print out the distribution of predicted scores.
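
A plain base R histogram is one quick way to do this (ggplot2 would work just as well):

## distribution of the predicted probabilities of admission
hist(df$score, breaks = 20, main = "Distribution of Predicted Scores",
     xlab = "predicted probability of admission")
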
Now let's look at the distribution of the scores based on the actual admission status. If you do not already have the ggplot2 library installed, simply run install.packages("ggplot2") before executing the code below.

library(ggplot2)
ggplot(df, aes(x = score, fill = factor(admit))) + geom_density(alpha = 0.3)

[Density plot of predicted scores, colored by actual admission status]

It's nice to see that the peak of the predicted scores for admitted students sits higher than for those who were rejected, but I am not thrilled by this plot. Early on, it looks like the model was not able to accurately differentiate between admits and rejects. Below, we are going to use another package, ROCR, for some other "goodness-of-fit" metrics. For help on this package, go here. I highly recommend reviewing the PowerPoint file that is included on the site.

library(ROCR)
pred <- prediction(df$score, df$admit)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize = T, main = "ROC Curve")

[ROC curve for the admission model]

If you view this plot from left to right, ideally the line would spike "early" in the chart. A model with no predictive power traces the 45-degree line, and the more "lift" above this line, the better. Finally, I am going to compute a single summary metric, AUC (the area under the ROC curve). The higher the number, the better. To learn more about AUC, check out this page. For a rule-of-thumb interpretation of the score, look here.

auc = performance(pred, "auc")
auc@y.values[[1]]

[1] 0.6921

Real quick: you may have noticed that I usually access our key values with the $ operator, but needed to use @ above. This is because the object returned from performance is an S4 class in R. The more you play around, the more you will see this object class appear from time to time, but you can usually access your data with $.
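
If you are curious what an S4 object actually exposes, you can inspect it directly; both commands below are available in a stock R session:

class(auc)      ## "performance" -- an S4 class from the ROCR package
slotNames(auc)  ## lists the slots you can reach with the @ operator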

From above, we see that the AUC for our model is 0.6921. In truth, the model doesn't fit that well. Intuitively, we can confirm this by binning our scores into deciles and looking at the actual admit rate within each band.

library(plyr)

## add a new variable, band, which puts the score into 10 groups
df = transform(df, band = cut(score, breaks = seq(0, 1, 0.1), right = FALSE))

## create a summary table, by group, that looks at some summary stats for
## each band
ddply(df, .(band), summarise, applicants = length(admit), admits = sum(admit), 
    admit_rate = mean(admit))

       band applicants admits admit_rate
1   [0,0.1)         15      1    0.06667
2 [0.1,0.2)         88     15    0.17045
3 [0.2,0.3)         84     22    0.26190
4 [0.3,0.4)        105     31    0.29524
5 [0.4,0.5)         59     29    0.49153
6 [0.5,0.6)         32     19    0.59375
7 [0.6,0.7)         15      9    0.60000
8 [0.7,0.8)          2      1    0.50000

For example, there were 2 applicants with a predicted probability of admission between 70-79%. Of these 2 applicants, only 1 was admitted. In a perfect world, higher score bands would show progressively larger "true" admit rates.
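
A related sanity check is to compare the average predicted score in each band to the actual admit rate; for a well-calibrated model, the two columns should track each other closely. A minimal sketch reusing plyr:

## average predicted score vs. actual admit rate, by band
ddply(df, .(band), summarise, avg_score = mean(score), admit_rate = mean(admit))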

4) Summary

Hopefully this was a fairly gentle introduction to how quickly you can fit a predictive model for your EM team. Conceptually, it doesn't have to be hard, although interpreting the results can be tricky. Regardless, you can explore what is possible for free with open-source statistical software. Hey, you might even have some fun writing code!
