### Let's Talk Applied Data Science - Student Retention Modeling - Time to Step Up Your Predictive Game!

### Introduction

Applied data science is about everything that goes before and after your model, and it's critically important! Join me for a walkthrough of a great but often ignored skill set.

MORE: Blog or code: http://www.viralml.com/video-content.html?fm=yt&v=4rDYw0LcBcI Sign up for my newsletter and more: http://www.viralml.com Connect on Twitter: https://twitter.com/amunategui

Let's walk through a student retention model. We all love to tinker with XGBoost or TensorFlow, and that's OK: knowing how to model is important, but it is only a small sliver of the applied data science picture. The rest is a great skill that a lot of people tend to ignore or learn much later in their careers.

In the case of a student retention model, looking at the full applied data science picture means that before doing any modeling, we need to talk to the professionals (teachers, in this case) about the problems they see with at-risk students and the data points they use to help them. This will be invaluable information for our models.
After the modeling phase, it is about delivering a usable and actionable model, integrating it into their workflow, and making sure it is what they expected, that it is as accurate as expected, and that it is easy to use. It also means checking the accuracy against live data over time to make sure it stays on target and doesn't drift.

This isn't a full applied data science pipeline, as the data has already been collected for us, meaning that the work with the initial stakeholders and educational professionals has already begun. So we'll take what we have and feed it into the modeling and workflow portions of the task.

The UCI dataset we will use here offers interesting social and demographic features such as family life, social settings, and alcohol consumption that can be used to model and predict a student's aptitude before the class has begun. Of the two datasets available, we will work with the Portuguese one, as it contains more data: 649 students and 33 attributes (including the final grade). We'll use all available features, and this requires a bit of feature engineering, as some of them are categorical. After binarizing and pivoting all of them, we end up with 59 features.

The outcome variable is the final grade for the class, which ranges between 0 and 20. XGBoost does a great job learning the students' behavior and returns an RMSE of 1.16, meaning that predictions are typically off by a little over one grade point on the 0 to 20 scale. I won't cover XGBoost much here, as we are more interested in what comes after the modeling phase.

According to the variable importance chart (which sorts the features in order of importance according to the model), past grades are the strongest predictor of future performance. This highlights the importance of intervening as early as possible to help a student break out of a destructive pattern.
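As a rough illustration of the monitoring step mentioned above, the sketch below computes RMSE on batches of new predictions and flags any batch whose error grows noticeably beyond a baseline. The function names, the 25% tolerance, and the toy grades are assumptions made for this example; they are not part of the original notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted grades."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def has_drifted(baseline_rmse, live_rmse, tolerance=0.25):
    """Flag drift when the live error exceeds the baseline by more than `tolerance` (a fraction)."""
    return live_rmse > baseline_rmse * (1.0 + tolerance)


# toy data: actual vs. predicted final grades for two hypothetical batches
batch_ok = rmse([10, 12, 14, 8], [11, 12, 13, 9])
batch_bad = rmse([10, 12, 14, 8], [14, 9, 10, 12])

baseline = 1.16  # test-set RMSE quoted in the text
print('batch_ok drifted:', has_drifted(baseline, batch_ok))
print('batch_bad drifted:', has_drifted(baseline, batch_bad))
```

In practice the tolerance would be agreed on with the people using the model, since a drift alarm that fires too often is as unhelpful as one that never fires.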
An easy way of discovering how a particular at-risk student could benefit from extra support is to compare that student with his or her peers. This is trivial to automate. We simply gather the predictive features (for simplicity, we'll only use those that showed up in the variable importance chart above) for the at-risk students and compare them against the 25th and 75th percentiles of the class.

### Code

# Student Retention Model - Start to Finish, From Modeling Student Behavior to Helping At-Risk Cases


## Student Performance Data Set

https://archive.ics.uci.edu/ml/datasets/student+performance

### Data Set Information

This data approaches student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the paper for more details).
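Following the note above, one way to set up the harder but more useful task is to leave G1 and G2 out of the feature list. The sketch below shows the idea on a tiny made-up frame; the helper name `select_features` and the toy values are assumptions for illustration only.

```python
import pandas as pd


def select_features(df, outcome='G3', drop_prior_grades=False):
    """Return the model's input columns, optionally excluding the earlier period grades."""
    excluded = {outcome}
    if drop_prior_grades:
        excluded |= {'G1', 'G2'}
    return [col for col in df.columns if col not in excluded]


# toy frame with a few of the dataset's column names and made-up values
toy_df = pd.DataFrame({
    'age': [15, 17],
    'absences': [2, 10],
    'G1': [12, 8],
    'G2': [13, 7],
    'G3': [13, 8],
})

with_grades = select_features(toy_df)
without_grades = select_features(toy_df, drop_prior_grades=True)
print(with_grades)
print(without_grades)
```

A model trained on the second list can be used before any grades exist for the class, at the cost of a higher expected error.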

```
# Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2 sex - student's sex (binary: 'F' - female or 'M' - male)
3 age - student's age (numeric: from 15 to 22)
4 address - student's home address type (binary: 'U' - urban or 'R' - rural)
5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
# these grades are related with the course subject, Math or Portuguese:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
```

Download the data found under the Data Folder and Data Set links and store it in the same folder where you intend to run the model.

```
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")
```

```
student_por_df = pd.read_csv('student-por.csv', sep=';')
student_por_df.head()
```

```
student_por_df['G3'].describe()
```

```
student_por_df.shape
```

```
plt.plot(sorted(student_por_df['G3']))
plt.title('Final Grade Distribution')
plt.grid()
```

```
student_math_df = pd.read_csv('student-mat.csv', sep=';')
student_math_df.head()
```

```
student_por_df.describe()
```

```
student_math_df.shape
```

### Exploration and feature engineering

```
# find all non-numerical columns and list the distinct values each one takes
non_mueric_features = [feat for feat in list(student_por_df) if feat not in list(student_por_df._get_numeric_data())]
for feat in non_mueric_features:
    print(feat, ':', set(student_por_df[feat]))
```

```
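# one-hot encode each categorical column, prefixing the new columns with the original name
for feat in non_mueric_features:
    # dtype=int keeps the dummy columns numeric (0/1) on recent pandas versions
    dummies = pd.get_dummies(student_por_df[feat], dtype=int).rename(columns=lambda x: feat + '_' + str(x))
    student_por_df = pd.concat([student_por_df, dummies], axis=1)

# drop the original categorical columns now that they are encoded
student_por_df = student_por_df[[feat for feat in list(student_por_df) if feat not in non_mueric_features]]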
```
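For reference, pandas can also encode several columns in one call through the `columns` argument of `pd.get_dummies`, which produces the same `column_value` naming by default. This is an alternative sketch on made-up rows, not the code used above.

```python
import pandas as pd

# toy rows using two of the dataset's categorical columns (values are made up)
toy_df = pd.DataFrame({
    'school': ['GP', 'MS', 'GP'],
    'sex': ['F', 'M', 'M'],
    'age': [15, 16, 17],
})

# encode both categorical columns at once; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(toy_df, columns=['school', 'sex'], dtype=int)
print(list(encoded.columns))
print(encoded)
```

The untouched numeric columns come first in the result, followed by the new indicator columns.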

```
student_por_df.shape
```

```
student_por_df.head()
```

```
# create an XGBoost regression model to predict the final grade
import xgboost as xgb
from sklearn.model_selection import train_test_split

outcome = 'G3'
features = [feat for feat in list(student_por_df) if feat != outcome]

# the outcome column stays in X_train/X_test so it remains available for inspection;
# only the columns listed in `features` are passed to the model
X_train, X_test, y_train, y_test = train_test_split(student_por_df,
                                                    student_por_df[outcome],
                                                    test_size=0.3,
                                                    random_state=42)

xgb_params = {
    'eta': 0.01,
    'max_depth': 3,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'objective': 'reg:squarederror',  # replaces the deprecated 'reg:linear'
    'seed': 0
}

dtrain = xgb.DMatrix(X_train[features], y_train, feature_names=features)
dtest = xgb.DMatrix(X_test[features], y_test, feature_names=features)

# report RMSE on both sets every 50 rounds and stop if the eval set stops improving
evals = [(dtrain, 'train'), (dtest, 'eval')]
xgb_model = xgb.train(params=xgb_params,
                      dtrain=dtrain,
                      num_boost_round=2000,
                      verbose_eval=50,
                      early_stopping_rounds=500,
                      evals=evals,
                      maximize=False)
```

```
# find poor performing students and find out why they are so compared to their peers
# plot the important features
fig, ax = plt.subplots(figsize=(6,9))
xgb.plot_importance(xgb_model, height=0.8, ax=ax, max_num_features=20)
plt.show()
```

```
# get dataframe version of important feature for model
xgb_fea_imp=pd.DataFrame(list(xgb_model.get_fscore().items()),
columns=['feature','importance']).sort_values('importance', ascending=False)
xgb_fea_imp.head(10)
```

```
print(xgb_model.predict(dtest)[0:10])
```

```
key_features = list(xgb_fea_imp['feature'].values[0:40])
key_features
```

```
# Take students with a predicted final score of less than 10 out of 20
predicted_grades = xgb_model.predict(dtest)
predicted_students_in_trouble = X_test[predicted_grades < 10]

# See on which features they land well below or well above their peers
for index, row in predicted_students_in_trouble.iterrows():
    print('Student ID:', index)
    for feat in key_features:
        if row[feat] < student_por_df[feat].quantile(0.25):
            print('\t', 'Below:', feat, row[feat], 'Class:',
                  np.round(np.mean(student_por_df[feat]), 2))
        if row[feat] > student_por_df[feat].quantile(0.75):
            print('\t', 'Above:', feat, row[feat], 'Class:',
                  np.round(np.mean(student_por_df[feat]), 2))
```
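Printed output is hard to hand over to teachers. As one possible next step, the same comparison could be collected into a table that can be saved or filtered. The function name `compare_to_peers`, its column labels, and the toy data below are assumptions for this sketch.

```python
import pandas as pd


def compare_to_peers(at_risk_df, class_df, features):
    """List, per student, the features falling outside the class's 25th-75th percentile range."""
    low = class_df[features].quantile(0.25)
    high = class_df[features].quantile(0.75)
    rows = []
    for student_id, row in at_risk_df.iterrows():
        for feat in features:
            if row[feat] < low[feat]:
                position = 'below'
            elif row[feat] > high[feat]:
                position = 'above'
            else:
                continue  # within the typical range, nothing to report
            rows.append({'student_id': student_id, 'feature': feat,
                         'value': row[feat], 'position': position,
                         'class_mean': round(class_df[feat].mean(), 2)})
    return pd.DataFrame(rows, columns=['student_id', 'feature', 'value',
                                       'position', 'class_mean'])


# toy class of five students (made-up values)
class_df = pd.DataFrame({'absences': [0, 2, 4, 6, 20],
                         'studytime': [1, 2, 2, 3, 4]})
at_risk = class_df.loc[[4]]  # pretend the model flagged the last student

report = compare_to_peers(at_risk, class_df, ['absences', 'studytime'])
print(report)
```

A table like this can be exported with `report.to_csv(...)` or grouped by student, which fits more naturally into a teacher's workflow than console output.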
