
Let's Talk Applied Data Science - Student Retention Modeling - Time to Step Up Your Predictive Game!

Introduction

Applied data science is about everything that goes before and after your model, and it is critically important! Join me for a walkthrough of a great but often ignored skill set.

MORE: Blog or code: http://www.viralml.com/video-content.html?fm=yt&v=4rDYw0LcBcI
Sign up for my newsletter and more: http://www.viralml.com
Connect on Twitter: https://twitter.com/amunategui

My books on Amazon:
The Little Book of Fundamental Indicators: Hands-On Market Analysis with Python: Find Your Market Bearings with Python, Jupyter Notebooks, and Freely Available Data: https://amzn.to/2DERG3d
Monetizing Machine Learning: Quickly Turn Python ML Ideas into Web Applications on the Serverless Cloud: https://amzn.to/2PV3GCV
Grow Your Web Brand, Visibility & Traffic Organically: 5 Years of amunategui.github.io and the Lessons I Learned from Growing My Online Community from the Ground Up: https://amzn.to/2JDEU91
Fringe Tactics - Finding Motivation in Unusual Places: Alternative Ways of Coaxing Motivation Using Raw Inspiration, Fear, and In-Your-Face Logic: https://amzn.to/2DYWQas
Create Income Streams with Online Classes: Design Classes That Generate Long-Term Revenue: https://amzn.to/2VToEHK
Defense Against The Dark Digital Attacks: How to Protect Your Identity and Workflow in 2019: https://amzn.to/2Jw1AYS

Let's walk through a student retention model. We all love to tinker with XGBoost or TensorFlow, and that's OK; knowing how to model is important, but it is only a small sliver of the applied data science picture. This is a great skill that a lot of people tend to ignore or learn much later in their careers. In the case of a student retention model, looking at the full applied data science picture means that before doing any modeling, we need to talk to the professionals, teachers in this case, about the problems they see with at-risk students and the data points they use to help them. This will be invaluable information for our models.
After the modeling phase, it is about delivering a usable and actionable model: integrating it into the teachers' workflow, making sure it is what they expected, that it is as accurate as promised, and that it is easy to use. It also means checking its accuracy against live data over time to make sure it stays on target and doesn't drift. This isn't a full applied data science pipeline, as the data has already been collected for us, meaning the work with the initial stakeholders and educational professionals has already begun. So, we'll take what we have and feed it into the modeling and workflow portions of the task. The UCI dataset we will use here offers interesting social and demographic features, such as family life, social settings, and alcohol consumption, that can be used to model and predict a student's aptitude before the class has begun. Of the two datasets available, we will work with the Portuguese one as it contains more data: 649 students and 33 features. We'll use all available features, and this requires a bit of feature engineering as some of them are categorical. After binarizing and pivoting all of them, we end up with 59 features. Some of the most predictive features, according to the model, are shown in the variable importance chart in the code section below. The outcome variable is the final grade for the class, which ranges between 0 and 20. XGBoost does a great job learning the students' behavior and returns an RMSE score of 1.16 (meaning a prediction will typically fall within +/- 1.16 of the actual grade). I won't cover XGBoost much here, as we are more interested in what comes after the modeling phase. The variable importance chart (which sorts each feature in order of importance according to the model) confirms that past grades are the strongest predictor of future performance, which highlights the importance of intervening as early as possible to break a student out of a destructive pattern.
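The drift check mentioned above can be sketched with a rolling window of live-data errors. A minimal sketch, assuming we log per-student prediction errors over time; the `window` and `threshold` values here are hypothetical choices, not part of the notebook:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, the same metric XGBoost reports as 'rmse'."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def drifted(live_errors, baseline_rmse, window=50, threshold=1.5):
    """Flag drift when the RMSE of the most recent errors exceeds the
    holdout baseline by a chosen multiple (hypothetical threshold)."""
    recent = np.asarray(live_errors, float)[-window:]
    recent_rmse = float(np.sqrt(np.mean(recent ** 2)))
    return recent_rmse > threshold * baseline_rmse

# toy check: live errors similar in size to a 1.16 baseline should not flag drift
errors = np.full(100, 1.0)
print(drifted(errors, baseline_rmse=1.16))  # False
```

In practice the window size and threshold would be tuned with the teachers, since a false drift alarm costs their trust.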
An easy way of discovering how a particular at-risk student could benefit from extra support is to compare that student with his or her peers. This is trivial to automate: we simply gather all the predictive features (for simplicity, we'll only use those that showed up in the variable importance chart) for the at-risk students and compare them against the 25th and 75th percentiles of the class.
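The peer comparison described above can be sketched on a toy frame; the values here are illustrative, not the actual dataset:

```python
import pandas as pd

# toy class data: two of the predictive features (illustrative values)
class_df = pd.DataFrame({'absences': [0, 1, 2, 2, 3, 4, 6, 8, 10, 26],
                         'G1':       [7, 8, 9, 10, 11, 11, 12, 13, 14, 15]})

# an at-risk student to compare against the class
student = {'absences': 22, 'G1': 8}

# flag each feature where the student falls outside the middle 50% of peers
for feat, value in student.items():
    q25, q75 = class_df[feat].quantile(0.25), class_df[feat].quantile(0.75)
    if value < q25:
        print('Below:', feat, value)
    elif value > q75:
        print('Above:', feat, value)
# prints: Above: absences 22
#         Below: G1 8
```

The notebook's final cell applies exactly this idea across all key features for every flagged student.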




Code

Student Retention Model

Student Retention Model - Start to Finish, From Modeling Student Behavior to Helping At-Risk Cases

Student Performance Data Set

https://archive.ics.uci.edu/ml/datasets/student+performance

Data Set Information

This data describes student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final-year grade (issued in the 3rd period), while G1 and G2 correspond to the 1st- and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the paper source for more details).

# Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets: 
1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira) 
2 sex - student's sex (binary: 'F' - female or 'M' - male) 
3 age - student's age (numeric: from 15 to 22) 
4 address - student's home address type (binary: 'U' - urban or 'R' - rural) 
5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3) 
6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart) 
7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education) 
8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education) 
9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') 
10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') 
11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other') 
12 guardian - student's guardian (nominal: 'mother', 'father' or 'other') 
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour) 
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours) 
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4) 
16 schoolsup - extra educational support (binary: yes or no) 
17 famsup - family educational support (binary: yes or no) 
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no) 
19 activities - extra-curricular activities (binary: yes or no) 
20 nursery - attended nursery school (binary: yes or no) 
21 higher - wants to take higher education (binary: yes or no) 
22 internet - Internet access at home (binary: yes or no) 
23 romantic - with a romantic relationship (binary: yes or no) 
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent) 
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high) 
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high) 
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high) 
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) 
29 health - current health status (numeric: from 1 - very bad to 5 - very good) 
30 absences - number of school absences (numeric: from 0 to 93) 

# these grades are related with the course subject, Math or Portuguese: 
31 G1 - first period grade (numeric: from 0 to 20) 
32 G2 - second period grade (numeric: from 0 to 20) 
33 G3 - final grade (numeric: from 0 to 20, output target)

Download the data found via the Data Folder and Data Set links and store it in the same folder where you intend to run the model.

In [344]:
import time
import random
import sys
import datetime
import warnings
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import numpy as np
import seaborn as sns
warnings.filterwarnings("ignore")
In [345]:
student_por_df = pd.read_csv('student-por.csv', sep=';')
student_por_df.head()
Out[345]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 4 0 11 11
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 2 9 11 11
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 6 12 13 12
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 0 14 14 14
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 0 11 13 13

5 rows × 33 columns

In [346]:
student_por_df['G3'].describe()
Out[346]:
count    649.000000
mean      11.906009
std        3.230656
min        0.000000
25%       10.000000
50%       12.000000
75%       14.000000
max       19.000000
Name: G3, dtype: float64
In [335]:
student_por_df.shape
Out[335]:
(649, 33)
In [347]:
plt.plot(sorted(student_por_df['G3']))
plt.title('Final Grade Distribution')
plt.grid()
In [337]:
student_math_df = pd.read_csv('student-mat.csv', sep=';')
student_math_df.head()
Out[337]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 6 5 6 6
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 4 5 5 6
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 10 7 8 10
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 2 15 14 15
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 4 6 10 10

5 rows × 33 columns

In [338]:
student_por_df.describe()
Out[338]:
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc Walc health absences G1 G2 G3
count 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000
mean 16.744222 2.514638 2.306626 1.568567 1.930663 0.221880 3.930663 3.180277 3.184900 1.502311 2.280431 3.536210 3.659476 11.399076 11.570108 11.906009
std 1.218138 1.134552 1.099931 0.748660 0.829510 0.593235 0.955717 1.051093 1.175766 0.924834 1.284380 1.446259 4.640759 2.745265 2.913639 3.230656
min 15.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000
25% 16.000000 2.000000 1.000000 1.000000 1.000000 0.000000 4.000000 3.000000 2.000000 1.000000 1.000000 2.000000 0.000000 10.000000 10.000000 10.000000
50% 17.000000 2.000000 2.000000 1.000000 2.000000 0.000000 4.000000 3.000000 3.000000 1.000000 2.000000 4.000000 2.000000 11.000000 11.000000 12.000000
75% 18.000000 4.000000 3.000000 2.000000 2.000000 0.000000 5.000000 4.000000 4.000000 2.000000 3.000000 5.000000 6.000000 13.000000 13.000000 14.000000
max 22.000000 4.000000 4.000000 4.000000 4.000000 3.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 32.000000 19.000000 19.000000 19.000000
In [339]:
student_math_df.shape
Out[339]:
(395, 33)

Exploration and feature engineering

In [318]:
# find all non-numerical features
non_numeric_features = [feat for feat in list(student_por_df) if feat not in list(student_por_df._get_numeric_data())]
for feat in non_numeric_features:
    print(feat, ':', set(student_por_df[feat]))


school : {'GP', 'MS'}
sex : {'M', 'F'}
address : {'R', 'U'}
famsize : {'LE3', 'GT3'}
Pstatus : {'T', 'A'}
Mjob : {'other', 'at_home', 'teacher', 'services', 'health'}
Fjob : {'other', 'at_home', 'services', 'teacher', 'health'}
reason : {'course', 'other', 'reputation', 'home'}
guardian : {'other', 'father', 'mother'}
schoolsup : {'yes', 'no'}
famsup : {'yes', 'no'}
paid : {'yes', 'no'}
activities : {'yes', 'no'}
nursery : {'yes', 'no'}
higher : {'yes', 'no'}
internet : {'yes', 'no'}
romantic : {'yes', 'no'}
In [348]:
# binarize each categorical feature into one dummy column per level
for feat in non_numeric_features:
    dummies = pd.get_dummies(student_por_df[feat]).rename(columns=lambda x: feat + '_' + str(x))
    student_por_df = pd.concat([student_por_df, dummies], axis=1)

student_por_df = student_por_df[[feat for feat in list(student_por_df) if feat not in non_numeric_features]]
In [349]:
student_por_df.shape
Out[349]:
(649, 59)
In [351]:
student_por_df.head()
Out[351]:
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc ... activities_no activities_yes nursery_no nursery_yes higher_no higher_yes internet_no internet_yes romantic_no romantic_yes
0 18 4 4 2 2 0 4 3 4 1 ... 1 0 0 1 0 1 1 0 1 0
1 17 1 1 1 2 0 5 3 3 1 ... 1 0 1 0 0 1 0 1 1 0
2 15 1 1 1 2 0 4 3 2 2 ... 1 0 0 1 0 1 0 1 1 0
3 15 4 2 1 3 0 3 2 2 1 ... 0 1 0 1 0 1 0 1 0 1
4 16 3 3 1 2 0 4 3 2 1 ... 1 0 0 1 0 1 1 0 1 0

5 rows × 59 columns

In [352]:
# create an XGBoost model
# run a simple XGBoost regression model and check results
# prep modeling code
outcome = 'G3'
features = [feat for feat in list(student_por_df) if feat != outcome]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(student_por_df, 
                                                 student_por_df[outcome], 
                                                 test_size=0.3, 
                                                 random_state=42)


import xgboost as xgb
xgb_params = {
    'eta': 0.01,
    'max_depth': 3,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'objective': 'reg:linear',
    'seed' : 0
}

dtrain = xgb.DMatrix(X_train[features], y_train, feature_names = features)
dtest = xgb.DMatrix(X_test[features], y_test, feature_names = features)
evals = [(dtrain,'train'),(dtest,'eval')]
xgb_model = xgb.train (params = xgb_params,
              dtrain = dtrain,
              num_boost_round = 2000,
              verbose_eval=50, 
              early_stopping_rounds = 500,
              evals=evals,
              #feval = f1_score_cust,
              maximize = False)
 
[0]	train-rmse:11.6497	eval-rmse:11.9429
Multiple eval metrics have been passed: 'eval-rmse' will be used for early stopping.

Will train until eval-rmse hasn't improved in 500 rounds.
[50]	train-rmse:7.18897	eval-rmse:7.39667
[100]	train-rmse:4.50517	eval-rmse:4.63686
[150]	train-rmse:2.90578	eval-rmse:2.99388
[200]	train-rmse:1.98163	eval-rmse:2.05763
[250]	train-rmse:1.47129	eval-rmse:1.56264
[300]	train-rmse:1.20491	eval-rmse:1.32764
[350]	train-rmse:1.0693	eval-rmse:1.22909
[400]	train-rmse:0.995233	eval-rmse:1.18691
[450]	train-rmse:0.948895	eval-rmse:1.17263
[500]	train-rmse:0.916889	eval-rmse:1.16601
[550]	train-rmse:0.886623	eval-rmse:1.16357
[600]	train-rmse:0.862894	eval-rmse:1.16218
[650]	train-rmse:0.838753	eval-rmse:1.16097
[700]	train-rmse:0.819014	eval-rmse:1.16227
[750]	train-rmse:0.798209	eval-rmse:1.16462
[800]	train-rmse:0.782396	eval-rmse:1.16587
[850]	train-rmse:0.765591	eval-rmse:1.165
[900]	train-rmse:0.748515	eval-rmse:1.16553
[950]	train-rmse:0.732683	eval-rmse:1.16676
[1000]	train-rmse:0.717379	eval-rmse:1.16631
[1050]	train-rmse:0.702918	eval-rmse:1.16689
[1100]	train-rmse:0.687903	eval-rmse:1.16918
Stopping. Best iteration:
[640]	train-rmse:0.842125	eval-rmse:1.16081
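The eval-rmse column above is nothing exotic: it is simply the square root of the mean squared prediction error on the held-out set. A minimal sketch on toy numbers (stand-ins, not the notebook's actual predictions):

```python
import numpy as np

# toy stand-ins for held-out grades and model predictions
actual    = np.array([16.0, 11.0, 17.0, 10.0, 12.0])
predicted = np.array([16.8, 11.6, 17.1, 10.4, 11.4])

# RMSE: root of the mean of squared errors, in the same units as the grade
rmse = float(np.sqrt(np.mean((actual - predicted) ** 2)))
print(round(rmse, 3))  # 0.553
```

Because RMSE is in grade units, the 1.16 score reported here means a typical prediction lands within about a point of the true final grade on the 0-20 scale.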

In [353]:
# find poor-performing students and see how they compare to their peers
# plot the most important features
fig, ax = plt.subplots(figsize=(6,9))
xgb.plot_importance(xgb_model,  height=0.8, ax=ax, max_num_features=20)

plt.show()
In [354]:
# get a dataframe version of the model's feature importances
xgb_fea_imp = pd.DataFrame(list(xgb_model.get_fscore().items()),
                           columns=['feature', 'importance']).sort_values('importance', ascending=False)
xgb_fea_imp.head(10)
Out[354]:
feature importance
0 G2 1301
2 G1 669
1 absences 444
21 Dalc 412
13 age 390
10 freetime 279
19 health 262
17 famrel 239
37 traveltime 236
22 goout 236
In [325]:
print(xgb_model.predict(dtest)[0:10])
[16.843275 11.575173 17.108978 11.135402 11.382752 16.290283 17.242054
 10.359439 10.89928  10.70853 ]
In [355]:
key_features = list(xgb_fea_imp['feature'].values[0:40])
key_features
Out[355]:
['G2',
 'G1',
 'absences',
 'Dalc',
 'age',
 'freetime',
 'health',
 'famrel',
 'traveltime',
 'goout',
 'Medu',
 'Fedu',
 'Walc',
 'studytime',
 'failures',
 'reason_other',
 'Fjob_services',
 'schoolsup_no',
 'Mjob_other',
 'romantic_no',
 'famsup_no',
 'sex_F',
 'Fjob_at_home',
 'school_GP',
 'reason_reputation',
 'reason_course',
 'activities_no',
 'Mjob_services',
 'guardian_father',
 'famsize_GT3',
 'nursery_no',
 'Mjob_teacher',
 'address_R',
 'reason_home',
 'schoolsup_yes',
 'higher_no',
 'Mjob_at_home',
 'romantic_yes',
 'internet_no',
 'famsup_yes']
In [312]:
# Take students with a final grade of less than 10 out of 20
# (using the actual G3 here; the model's predictions could be used instead)
predicted_students_in_trouble = X_test[X_test['G3'] < 10]

# See on which features each one landed well below or well above their peers
for index, row in predicted_students_in_trouble.iterrows():
    print('Student ID:', index)
    for feat in key_features:
        if row[feat] < student_por_df[feat].quantile(0.25):
            print('\t', 'Below:', feat, row[feat], 'Class:', 
                  np.round(np.mean(student_por_df[feat]),2))
        if row[feat] > student_por_df[feat].quantile(0.75):
            print('\t','Above:', feat, row[feat], 'Class:', 
                  np.round(np.mean(student_por_df[feat]),2))
Student ID: 131
	 Below: G2 9 Class: 11.57
	 Above: absences 10 Class: 3.66
	 Above: goout 5 Class: 3.18
	 Above: failures 3 Class: 0.22
	 Above: reason_reputation 1 Class: 0.22
	 Above: Mjob_services 1 Class: 0.21
	 Above: higher_no 1 Class: 0.11
Student ID: 81
	 Below: G2 9 Class: 11.57
	 Below: age 15 Class: 16.74
	 Above: studytime 3 Class: 1.93
	 Below: schoolsup_no 0 Class: 0.9
	 Above: nursery_no 1 Class: 0.2
	 Above: reason_home 1 Class: 0.23
	 Above: schoolsup_yes 1 Class: 0.1
Student ID: 585
	 Below: G2 7 Class: 11.57
	 Below: G1 8 Class: 11.4
	 Below: freetime 2 Class: 3.18
	 Above: studytime 3 Class: 1.93
	 Above: Fjob_at_home 1 Class: 0.06
	 Above: higher_no 1 Class: 0.11
	 Above: internet_no 1 Class: 0.23
Student ID: 177
	 Below: G2 8 Class: 11.57
	 Below: G1 9 Class: 11.4
	 Below: Medu 1 Class: 2.51
	 Above: Walc 4 Class: 2.28
	 Above: failures 1 Class: 0.22
	 Above: guardian_father 1 Class: 0.24
	 Above: higher_no 1 Class: 0.11
	 Above: Mjob_at_home 1 Class: 0.21
Student ID: 174
	 Below: G2 8 Class: 11.57
	 Below: G1 8 Class: 11.4
	 Above: absences 8 Class: 3.66
	 Below: famrel 3 Class: 3.93
	 Above: failures 1 Class: 0.22
	 Above: higher_no 1 Class: 0.11
	 Above: Mjob_at_home 1 Class: 0.21
Student ID: 478
	 Below: G2 7 Class: 11.57
	 Below: G1 7 Class: 11.4
	 Below: health 1 Class: 3.54
	 Below: famrel 3 Class: 3.93
	 Below: Medu 1 Class: 2.51
	 Above: failures 3 Class: 0.22
	 Below: schoolsup_no 0 Class: 0.9
	 Above: guardian_father 1 Class: 0.24
	 Above: schoolsup_yes 1 Class: 0.1
	 Above: Mjob_at_home 1 Class: 0.21
	 Above: internet_no 1 Class: 0.23
Student ID: 522
	 Below: G2 8 Class: 11.57
	 Below: G1 8 Class: 11.4
	 Above: Fedu 4 Class: 2.31
Student ID: 163
	 Below: G2 9 Class: 11.57
	 Below: famrel 2 Class: 3.93
	 Above: goout 5 Class: 3.18
	 Below: Medu 1 Class: 2.51
	 Above: Walc 5 Class: 2.28
	 Above: failures 2 Class: 0.22
	 Above: higher_no 1 Class: 0.11
Student ID: 570
	 Below: G2 8 Class: 11.57
	 Below: G1 7 Class: 11.4
	 Above: Walc 4 Class: 2.28
	 Above: Mjob_services 1 Class: 0.21
Student ID: 257
	 Below: G2 8 Class: 11.57
	 Below: freetime 2 Class: 3.18
	 Below: goout 1 Class: 3.18
	 Above: Fedu 4 Class: 2.31
	 Above: nursery_no 1 Class: 0.2
	 Above: Mjob_teacher 1 Class: 0.11
Student ID: 148
	 Below: G2 9 Class: 11.57
	 Below: G1 8 Class: 11.4
	 Below: age 15 Class: 16.74
	 Above: freetime 5 Class: 3.18
	 Below: health 1 Class: 3.54
	 Above: traveltime 3 Class: 1.57
	 Above: goout 5 Class: 3.18
	 Below: Medu 1 Class: 2.51
	 Above: failures 1 Class: 0.22
Student ID: 447
	 Below: G1 8 Class: 11.4
	 Above: absences 8 Class: 3.66
	 Above: Dalc 5 Class: 1.5
	 Above: freetime 5 Class: 3.18
	 Above: traveltime 3 Class: 1.57
	 Above: goout 5 Class: 3.18
	 Above: Walc 5 Class: 2.28
	 Above: reason_other 1 Class: 0.11
	 Above: higher_no 1 Class: 0.11
	 Above: internet_no 1 Class: 0.23
Student ID: 518
	 Below: G2 5 Class: 11.57
	 Below: G1 8 Class: 11.4
	 Above: absences 8 Class: 3.66
	 Below: health 1 Class: 3.54
	 Below: famrel 2 Class: 3.93
	 Above: Fedu 4 Class: 2.31
	 Above: failures 1 Class: 0.22
	 Above: reason_reputation 1 Class: 0.22
	 Above: guardian_father 1 Class: 0.24
Student ID: 603
	 Below: G2 0 Class: 11.57
	 Below: G1 5 Class: 11.4
	 Below: goout 1 Class: 3.18
	 Above: reason_reputation 1 Class: 0.22
	 Above: Mjob_teacher 1 Class: 0.11
Student ID: 514
	 Below: G2 6 Class: 11.57
	 Below: G1 7 Class: 11.4
	 Below: freetime 1 Class: 3.18
	 Below: famrel 3 Class: 3.93
	 Above: Walc 4 Class: 2.28
	 Above: Fjob_at_home 1 Class: 0.06
	 Above: Mjob_services 1 Class: 0.21
Student ID: 568
	 Below: G1 6 Class: 11.4
	 Above: age 19 Class: 16.74
	 Below: freetime 2 Class: 3.18
	 Below: famrel 3 Class: 3.93
	 Below: goout 1 Class: 3.18
	 Above: failures 3 Class: 0.22
	 Above: Mjob_at_home 1 Class: 0.21
	 Above: internet_no 1 Class: 0.23
Student ID: 440
	 Below: G2 0 Class: 11.57
	 Below: G1 7 Class: 11.4
	 Above: Dalc 4 Class: 1.5
	 Above: goout 5 Class: 3.18
	 Below: Medu 1 Class: 2.51
	 Above: Walc 5 Class: 2.28
	 Above: reason_home 1 Class: 0.23
	 Above: Mjob_at_home 1 Class: 0.21
	 Above: internet_no 1 Class: 0.23
Student ID: 443
	 Below: G2 9 Class: 11.57
	 Below: G1 7 Class: 11.4
	 Above: absences 7 Class: 3.66
	 Below: age 15 Class: 16.74
	 Above: reason_reputation 1 Class: 0.22
	 Above: guardian_father 1 Class: 0.24
Student ID: 155
	 Below: G2 7 Class: 11.57
	 Below: G1 9 Class: 11.4
	 Above: absences 22 Class: 3.66
	 Above: goout 5 Class: 3.18
	 Above: reason_home 1 Class: 0.23
Student ID: 248
	 Below: G2 9 Class: 11.57
	 Below: G1 9 Class: 11.4
	 Below: famrel 3 Class: 3.93
	 Below: Medu 1 Class: 2.51
	 Above: reason_home 1 Class: 0.23
Student ID: 494
	 Below: G2 9 Class: 11.57
	 Below: G1 8 Class: 11.4
	 Above: goout 5 Class: 3.18
	 Below: Medu 1 Class: 2.51
	 Above: higher_no 1 Class: 0.11
	 Above: Mjob_at_home 1 Class: 0.21
Student ID: 563
	 Below: G2 0 Class: 11.57
	 Below: G1 7 Class: 11.4
	 Below: freetime 2 Class: 3.18
	 Below: famrel 1 Class: 3.93
	 Below: goout 1 Class: 3.18
	 Above: failures 1 Class: 0.22
	 Above: internet_no 1 Class: 0.23
Student ID: 432
	 Below: G2 6 Class: 11.57
	 Below: G1 6 Class: 11.4
	 Below: Medu 1 Class: 2.51
	 Above: failures 1 Class: 0.22
	 Above: reason_other 1 Class: 0.11
	 Above: guardian_father 1 Class: 0.24
	 Above: nursery_no 1 Class: 0.2
	 Above: higher_no 1 Class: 0.11
Student ID: 583
	 Below: G2 6 Class: 11.57
	 Below: G1 8 Class: 11.4
	 Above: freetime 5 Class: 3.18
	 Above: goout 5 Class: 3.18
	 Above: failures 1 Class: 0.22
	 Above: reason_other 1 Class: 0.11
	 Above: higher_no 1 Class: 0.11
Student ID: 370
	 Below: G2 8 Class: 11.57
	 Below: G1 8 Class: 11.4
	 Above: age 19 Class: 16.74
	 Above: traveltime 3 Class: 1.57
	 Below: Medu 1 Class: 2.51
	 Above: failures 2 Class: 0.22
	 Above: nursery_no 1 Class: 0.2
Student ID: 256
	 Below: G2 8 Class: 11.57
	 Below: G1 7 Class: 11.4
	 Above: absences 26 Class: 3.66
	 Below: health 1 Class: 3.54
	 Above: failures 1 Class: 0.22
	 Above: Fjob_at_home 1 Class: 0.06
	 Above: nursery_no 1 Class: 0.2
	 Above: higher_no 1 Class: 0.11
