Improving Student Recruitment at the USC Suzanne Dworak-Peck School of Social Work

Programming Language: Python

I will briefly explain the business problem and my recommendations, then show how to build several predictive models that support those recommendations.


  • Business Problem: 45% of students who accepted an offer didn’t end up joining
  • Recommendations: Manage the most influential factors and promote diversity more

Business Problem

  1. Students do not join the school after accepting their offers
  2. Waitlisted candidates are denied the chance to fill the vacant seats
  3. The school needs a data-based approach


Recommendations

  1. Run regular surveys to understand students' pain points when choosing a university
  2. Shorten the turnaround time for applications by increasing the number of reviewers
  3. Nominate student ambassadors from diverse backgrounds and ask them to share their stories with new admits

Note: Data cleaning and feature engineering were performed before building the models. The code below focuses on building predictive models using Logistic Regression, Decision Tree, Boosted Tree, and Random Forest.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("data_master2.csv")
X = data.loc[:,'Gap':'residency']
y = data.loc[:,'response']

Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
print("Training features/target:", X_train.shape, y_train.shape)
print("Testing features/target:", X_test.shape, y_test.shape)
Training features/target: (3054, 28) (3054,)
Testing features/target: (2037, 28) (2037,)
logreg.fit(X_train, y_train)
logreg.score(X_train, y_train)
y_pred = logreg.predict(X_test)
from sklearn.metrics import roc_curve
y_pred_prob = logreg.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.legend()
plt.show()
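An ROC curve is often summarized by the area under it (AUC). The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for `data_master2.csv` (an assumption, since that file is not reproduced here), and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the admissions data (assumption: binary target, 28 features).
X_demo, y_demo = make_classification(n_samples=1000, n_features=28, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=42)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob_pos = demo_model.predict_proba(X_te)[:, 1]  # probability of class 1

# AUC computed directly from the scores ...
auc_direct = roc_auc_score(y_te, prob_pos)

# ... and from the same (fpr, tpr) points that are used to draw the curve.
fpr_demo, tpr_demo, _ = roc_curve(y_te, prob_pos)
auc_from_curve = auc(fpr_demo, tpr_demo)

print(f"AUC: {auc_direct:.3f}")
```

The same two calls could be applied to `y_test` and `y_pred_prob` from the real data; an AUC of 0.5 corresponds to the dashed diagonal and 1.0 to a perfect ranking.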

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
array([[ 772,   77],
       [  12, 1176]])
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
Predicted    0     1   All
True
0          772    77   849
1           12  1176  1188
All        784  1253  2037
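The counts in the confusion matrix above can be turned into the usual summary metrics. This is a small worked sketch using only the reported logistic regression numbers; it treats class 1 as the positive class, which is an assumption about how `response` is coded.

```python
# Counts copied from the logistic regression confusion matrix above
# (rows = true class, columns = predicted class).
tn, fp = 772, 77
fn, tp = 12, 1176

total = tn + fp + fn + tp

accuracy = (tp + tn) / total       # share of all test cases predicted correctly
precision = tp / (tp + fp)         # of the cases predicted as 1, share that are truly 1
recall = tp / (tp + fn)            # of the true 1 cases, share that were predicted as 1
specificity = tn / (tn + fp)       # of the true 0 cases, share that were predicted as 0

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} specificity={specificity:.4f}")
```

Recall is noticeably higher than specificity here, meaning this model misses very few class 1 cases but misclassifies more of the class 0 cases.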

Random Forest

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)
y_pred2 = rf.predict(X_test)
pd.crosstab(y_test, y_pred2, rownames=['True'], colnames=['Predicted'], margins=True)
Predicted    0     1   All
True
0          791    58   849
1           28  1160  1188
All        819  1218  2037
y_pred_prob2 = rf.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob2)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Random Forest')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC Curve')
plt.legend()
plt.show()

Decision Tree

from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=7, random_state=0)
tree.fit(X_train, y_train)
tree.score(X_test, y_test)
y_pred3 = tree.predict(X_test)
pd.crosstab(y_test, y_pred3, rownames=['True'], colnames=['Predicted'], margins=True)
Predicted    0     1   All
True
0          762    87   849
1           22  1166  1188
All        784  1253  2037
y_pred_prob3 = tree.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob3)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Decision Tree')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Decision Tree ROC Curve')
plt.legend()
plt.show()

Boosted Tree

from sklearn.ensemble import AdaBoostClassifier
boost = AdaBoostClassifier()
boost.fit(X_train, y_train)
boost.score(X_test, y_test)
y_pred4 = boost.predict(X_test)
pd.crosstab(y_test, y_pred4, rownames=['True'], colnames=['Predicted'], margins=True)
Predicted    0     1   All
True
0          778    71   849
1           30  1158  1188
All        808  1229  2037
y_pred_prob4 = boost.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob4)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Boosted Tree')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Boosted Tree ROC Curve')
plt.legend()
plt.show()
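To compare the four models side by side, the crosstab counts reported above can be collected in one place. The sketch below uses only those reported test-set numbers; the helper `summarize` and the dictionary `reported` are hypothetical names introduced for illustration.

```python
# (tn, fp, fn, tp) copied from the four crosstab outputs above.
reported = {
    "Logistic Regression": (772, 77, 12, 1176),
    "Random Forest": (791, 58, 28, 1160),
    "Decision Tree": (762, 87, 22, 1166),
    "Boosted Tree": (778, 71, 30, 1158),
}

def summarize(tn, fp, fn, tp):
    """Return accuracy, precision and recall (class 1 as positive) for one model."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

summary = {name: summarize(*counts) for name, counts in reported.items()}

# Print the models from highest to lowest test accuracy.
for name, m in sorted(summary.items(), key=lambda kv: kv[1]["accuracy"], reverse=True):
    print(f"{name:<20} acc={m['accuracy']:.3f} "
          f"prec={m['precision']:.3f} rec={m['recall']:.3f}")

best_by_accuracy = max(summary, key=lambda n: summary[n]["accuracy"])
best_by_recall = max(summary, key=lambda n: summary[n]["recall"])
```

On these counts the random forest has the highest test accuracy and the logistic regression the highest recall for class 1, although the top two models differ by only a few cases out of 2,037, so the ranking may not hold on a different split.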

Feature Importances

%matplotlib inline
pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=True).plot.barh(figsize=(18,7));
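Besides the bar chart, the importances can be read as a ranked table. The following sketch shows the idea on synthetic data, since the real feature columns are not reproduced here; `X_demo`, `demo_tree`, and the `feature_i` column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered features (assumption: 8 numeric columns).
X_arr, y_demo = make_classification(n_samples=500, n_features=8,
                                    n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

demo_tree = DecisionTreeClassifier(max_depth=7, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, one value per column; for a tree with at least
# one split they are normalized to sum to 1.
importances = pd.Series(demo_tree.feature_importances_, index=X_demo.columns)

# The five most influential features, largest first.
top5 = importances.sort_values(ascending=False).head(5)
print(top5)
```

Impurity-based importances describe what this particular tree used to split the data; they are a starting point for deciding which factors to investigate, not proof that a factor causes students to join or not.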

from sklearn.tree import export_graphviz
import sys, subprocess
from IPython.display import Image

export_graphviz(tree, feature_names=X.columns, class_names=['failure','success'],
                out_file='', impurity=False, filled=True)
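For reference, `export_graphviz` can also return the DOT description as a string when `out_file=None`, which makes it easy to inspect before rendering. The sketch below demonstrates this on a small synthetic tree; the data, `demo_tree`, and the feature names are hypothetical, and the class names are reused from the call above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Small synthetic problem so the exported tree stays readable (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# With out_file=None the function returns the DOT source instead of writing a file.
dot_source = export_graphviz(demo_tree, out_file=None,
                             feature_names=feature_names,
                             class_names=['failure', 'success'],
                             impurity=False, filled=True)

print(dot_source[:200])  # first few lines of the DOT text
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` command-line tool if it is installed, for example `dot -Tpng tree.dot -o tree.png` (file names here are placeholders).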