Introduction

Applied data science is about everything that goes before and after your model, and it's critically important! Join me for a walkthrough of a great but often ignored skill set.

Code

Student Retention Model

Student Retention Model - Start to Finish, From Modeling Student Behavior to Helping At-Risk Cases


Student Performance Data Set

https://archive.ics.uci.edu/ml/datasets/student+performance

Data Set Information

This data addresses student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the paper source for more details).
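The note about G1 and G2 is easy to check once the data is loaded. Here is a minimal sketch, assuming a pandas DataFrame with G1, G2 and G3 columns. The helper names `grade_correlations` and `features_without_earlier_grades` are my own, not part of the dataset or the notebook. The demo frame reuses the grades from the first five rows of student-por.csv, which is far too few rows for a real estimate.

```python
import pandas as pd

def grade_correlations(df, target="G3", earlier=("G1", "G2")):
    """Pearson correlation of each earlier period grade with the final grade."""
    return df[list(earlier)].corrwith(df[target])

def features_without_earlier_grades(df, target="G3", earlier=("G1", "G2")):
    """Column names left once the target and the earlier grades are excluded."""
    excluded = {target, *earlier}
    return [col for col in df.columns if col not in excluded]

# Illustrative frame only: ages and grades of the first five student-por.csv rows.
sample = pd.DataFrame({
    "age": [18, 17, 15, 15, 16],
    "G1": [0, 9, 12, 14, 11],
    "G2": [11, 11, 13, 14, 13],
    "G3": [11, 11, 12, 14, 13],
})
print(grade_correlations(sample))
print(features_without_earlier_grades(sample))  # ['age']
```

With the full dataset you would pass the loaded DataFrame instead of `sample`. Training on the reduced feature list is the harder but more useful setup that the dataset note describes.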

# Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2 sex - student's sex (binary: 'F' - female or 'M' - male)
3 age - student's age (numeric: from 15 to 22)
4 address - student's home address type (binary: 'U' - urban or 'R' - rural)
5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)

# these grades are related with the course subject, Math or Portuguese:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)

Download the data found in the Data Folder and Data Set links and store it in the same folder where you intend to run the model.
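The files are semicolon-separated, so reading them with the default comma separator silently yields a single wide column. Here is a small sketch of a loader that guards against that. The function name `load_student_file` and the choice to check only the three grade columns are my own assumptions, not something the dataset requires.

```python
import io
import pandas as pd

REQUIRED_COLUMNS = {"G1", "G2", "G3"}  # the three grade columns described above

def load_student_file(path_or_buffer):
    """Read a semicolon-separated student file and check the grade columns exist."""
    df = pd.read_csv(path_or_buffer, sep=";")
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return df

# In-memory stand-in with two rows and a reduced set of columns; with the real
# download you would call load_student_file('student-por.csv') instead.
demo = io.StringIO("school;sex;age;G1;G2;G3\nGP;F;18;0;11;11\nGP;F;17;9;11;11\n")
print(load_student_file(demo).shape)  # (2, 6)
```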

In [344]:
import time
import random
import sys
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
import datetime
import matplotlib.dates as mdates
import seaborn as sns
warnings.filterwarnings("ignore")

In [345]:
student_por_df = pd.read_csv('student-por.csv', sep=';')
student_por_df.head()

Out[345]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 4 0 11 11
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 2 9 11 11
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 6 12 13 12
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 0 14 14 14
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 0 11 13 13

5 rows × 33 columns

In [346]:
student_por_df['G3'].describe()

Out[346]:
count    649.000000
mean      11.906009
std        3.230656
min        0.000000
25%       10.000000
50%       12.000000
75%       14.000000
max       19.000000
Name: G3, dtype: float64
In [335]:
student_por_df.shape

Out[335]:
(649, 33)
In [347]:
plt.plot(sorted(student_por_df['G3']))
plt.grid()

In [337]:
student_math_df = pd.read_csv('student-mat.csv', sep=';')
student_math_df.head()

Out[337]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 6 5 6 6
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 4 5 5 6
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 10 7 8 10
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 2 15 14 15
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 4 6 10 10

5 rows × 33 columns

In [338]:
student_por_df.describe()

Out[338]:
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc Walc health absences G1 G2 G3
count 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000 649.000000
mean 16.744222 2.514638 2.306626 1.568567 1.930663 0.221880 3.930663 3.180277 3.184900 1.502311 2.280431 3.536210 3.659476 11.399076 11.570108 11.906009
std 1.218138 1.134552 1.099931 0.748660 0.829510 0.593235 0.955717 1.051093 1.175766 0.924834 1.284380 1.446259 4.640759 2.745265 2.913639 3.230656
min 15.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000
25% 16.000000 2.000000 1.000000 1.000000 1.000000 0.000000 4.000000 3.000000 2.000000 1.000000 1.000000 2.000000 0.000000 10.000000 10.000000 10.000000
50% 17.000000 2.000000 2.000000 1.000000 2.000000 0.000000 4.000000 3.000000 3.000000 1.000000 2.000000 4.000000 2.000000 11.000000 11.000000 12.000000
75% 18.000000 4.000000 3.000000 2.000000 2.000000 0.000000 5.000000 4.000000 4.000000 2.000000 3.000000 5.000000 6.000000 13.000000 13.000000 14.000000
max 22.000000 4.000000 4.000000 4.000000 4.000000 3.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 32.000000 19.000000 19.000000 19.000000
In [339]:
student_math_df.shape

Out[339]:
(395, 33)

Exploration and feature engineering

In [318]:
# find all non-numerical data
non_mueric_features = [feat for feat in list(student_por_df) if feat not in list(student_por_df._get_numeric_data())]
for feat in non_mueric_features:
    print(feat, ':', set(student_por_df[feat]))


school : {'GP', 'MS'}
sex : {'M', 'F'}
famsize : {'LE3', 'GT3'}
Pstatus : {'T', 'A'}
Mjob : {'other', 'at_home', 'teacher', 'services', 'health'}
Fjob : {'other', 'at_home', 'services', 'teacher', 'health'}
reason : {'course', 'other', 'reputation', 'home'}
guardian : {'other', 'father', 'mother'}
schoolsup : {'yes', 'no'}
famsup : {'yes', 'no'}
paid : {'yes', 'no'}
activities : {'yes', 'no'}
nursery : {'yes', 'no'}
higher : {'yes', 'no'}
internet : {'yes', 'no'}
romantic : {'yes', 'no'}

In [348]:
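# one-hot encode each non-numerical column, naming the new columns <feature>_<value>
for feat in non_mueric_features:
    dummies = pd.get_dummies(student_por_df[feat]).rename(columns=lambda x: feat + '_' + str(x))
    student_por_df = pd.concat([student_por_df, dummies], axis=1)

# drop the original non-numerical columns now that they are encoded
student_por_df = student_por_df[[feat for feat in list(student_por_df) if feat not in non_mueric_features]]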
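For comparison, the same encoding can be done in a single `pd.get_dummies` call through its `columns` argument. This is a sketch, not the notebook's method. The helper name `one_hot_encode` and the three demo rows are made up for illustration.

```python
import pandas as pd

def one_hot_encode(df):
    """One-hot encode every non-numeric column, keeping numeric columns untouched."""
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    # dtype=int keeps the 0/1 display; recent pandas versions default to booleans.
    return pd.get_dummies(df, columns=categorical, dtype=int)

# Made-up rows, just to show the resulting column names.
sample = pd.DataFrame({
    "age": [18, 17, 15],
    "school": ["GP", "GP", "MS"],
    "higher": ["yes", "no", "yes"],
})
encoded = one_hot_encode(sample)
print(list(encoded.columns))
# ['age', 'school_GP', 'school_MS', 'higher_no', 'higher_yes']
```

Both approaches keep the two complementary columns for binary features (for example `higher_no` and `higher_yes`), which is why such pairs show up together in later output.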

In [349]:
student_por_df.shape

Out[349]:
(649, 59)
In [351]:
student_por_df.head()

Out[351]:
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc ... activities_no activities_yes nursery_no nursery_yes higher_no higher_yes internet_no internet_yes romantic_no romantic_yes
0 18 4 4 2 2 0 4 3 4 1 ... 1 0 0 1 0 1 1 0 1 0
1 17 1 1 1 2 0 5 3 3 1 ... 1 0 1 0 0 1 0 1 1 0
2 15 1 1 1 2 0 4 3 2 2 ... 1 0 0 1 0 1 0 1 1 0
3 15 4 2 1 3 0 3 2 2 1 ... 0 1 0 1 0 1 0 1 0 1
4 16 3 3 1 2 0 4 3 2 1 ... 1 0 0 1 0 1 1 0 1 0

5 rows × 59 columns

In [352]:
# create an xgboost model
# run a simple xgboost regression model on the final grade and check
# prep modeling code
outcome = 'G3'
# compare names with != so the check is an exact match, not a substring test
features = [feat for feat in list(student_por_df) if feat != outcome]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(student_por_df,
                                                    student_por_df[outcome],
                                                    test_size=0.3,
                                                    random_state=42)

import xgboost as xgb
xgb_params = {
    'eta': 0.01,
    'max_depth': 3,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'objective': 'reg:linear',  # named 'reg:squarederror' in newer XGBoost releases
    'seed': 0
}

dtrain = xgb.DMatrix(X_train[features], y_train, feature_names=features)
dtest = xgb.DMatrix(X_test[features], y_test, feature_names=features)
evals = [(dtrain, 'train'), (dtest, 'eval')]
xgb_model = xgb.train(params=xgb_params,
                      dtrain=dtrain,
                      num_boost_round=2000,
                      verbose_eval=50,
                      early_stopping_rounds=500,
                      evals=evals,
                      maximize=False)


[0]	train-rmse:11.6497	eval-rmse:11.9429
Multiple eval metrics have been passed: 'eval-rmse' will be used for early stopping.

Will train until eval-rmse hasn't improved in 500 rounds.
[50]	train-rmse:7.18897	eval-rmse:7.39667
[100]	train-rmse:4.50517	eval-rmse:4.63686
[150]	train-rmse:2.90578	eval-rmse:2.99388
[200]	train-rmse:1.98163	eval-rmse:2.05763
[250]	train-rmse:1.47129	eval-rmse:1.56264
[300]	train-rmse:1.20491	eval-rmse:1.32764
[350]	train-rmse:1.0693	eval-rmse:1.22909
[400]	train-rmse:0.995233	eval-rmse:1.18691
[450]	train-rmse:0.948895	eval-rmse:1.17263
[500]	train-rmse:0.916889	eval-rmse:1.16601
[550]	train-rmse:0.886623	eval-rmse:1.16357
[600]	train-rmse:0.862894	eval-rmse:1.16218
[650]	train-rmse:0.838753	eval-rmse:1.16097
[700]	train-rmse:0.819014	eval-rmse:1.16227
[750]	train-rmse:0.798209	eval-rmse:1.16462
[800]	train-rmse:0.782396	eval-rmse:1.16587
[850]	train-rmse:0.765591	eval-rmse:1.165
[900]	train-rmse:0.748515	eval-rmse:1.16553
[950]	train-rmse:0.732683	eval-rmse:1.16676
[1000]	train-rmse:0.717379	eval-rmse:1.16631
[1050]	train-rmse:0.702918	eval-rmse:1.16689
[1100]	train-rmse:0.687903	eval-rmse:1.16918
Stopping. Best iteration:
[640]	train-rmse:0.842125	eval-rmse:1.16081
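To put the best eval-rmse of about 1.16 in context, it helps to compare it with a baseline that ignores every feature and always predicts the training mean. The G3 standard deviation reported earlier (about 3.23) suggests such a baseline would land near 3.2, though the exact figure depends on the split. Here is a minimal sketch. The helper names are my own and the grades in the demo are made up.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mean_baseline_rmse(y_train, y_test):
    """RMSE obtained by predicting the training mean for every test row."""
    baseline = np.full(len(y_test), np.mean(y_train))
    return rmse(y_test, baseline)

# Made-up grades: the training mean is 12, so both test errors are exactly 1.
print(mean_baseline_rmse([10, 12, 14], [11, 13]))  # 1.0
```

In the notebook you would call `mean_baseline_rmse(y_train, y_test)` with the split created above.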


In [353]:
# plot the most important features, a first step toward understanding
# why struggling students differ from their peers
fig, ax = plt.subplots(figsize=(6,9))
xgb.plot_importance(xgb_model,  height=0.8, ax=ax, max_num_features=20)

plt.show()

In [354]:
# get dataframe version of important feature for model
xgb_fea_imp = pd.DataFrame(list(xgb_model.get_fscore().items()),
                           columns=['feature', 'importance']).sort_values('importance', ascending=False)
xgb_fea_imp.head(10)

Out[354]:
feature importance
0 G2 1301
2 G1 669
1 absences 444
21 Dalc 412
13 age 390
10 freetime 279
19 health 262
17 famrel 239
37 traveltime 236
22 goout 236
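The numbers from `get_fscore()` are split counts: how many times each feature was used to split across all trees. They are not a measure of error reduction, and they can favor features with many distinct values, such as grades and absences. Turning the counts into shares makes them easier to compare. Here is a small sketch with a helper name of my own, using only the top three counts from the table above.

```python
def importance_shares(fscore):
    """Convert raw split counts into shares that sum to 1, largest first."""
    total = sum(fscore.values())
    ranked = sorted(fscore.items(), key=lambda item: item[1], reverse=True)
    return [(name, count / total) for name, count in ranked]

# Counts for the three top features in the table above; in the notebook the
# full dictionary would come from xgb_model.get_fscore().
top_counts = {"G2": 1301, "G1": 669, "absences": 444}
for name, share in importance_shares(top_counts):
    print(f"{name}: {share:.2f}")
```

The shares printed here are relative to these three features only. With the full dictionary they would be smaller.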
In [325]:
print(xgb_model.predict(dtest)[0:10])

[16.843275 11.575173 17.108978 11.135402 11.382752 16.290283 17.242054
10.359439 10.89928  10.70853 ]

In [355]:
key_features = list(xgb_fea_imp['feature'].values[0:40])
key_features

Out[355]:
['G2',
'G1',
'absences',
'Dalc',
'age',
'freetime',
'health',
'famrel',
'traveltime',
'goout',
'Medu',
'Fedu',
'Walc',
'studytime',
'failures',
'reason_other',
'Fjob_services',
'schoolsup_no',
'Mjob_other',
'romantic_no',
'famsup_no',
'sex_F',
'Fjob_at_home',
'school_GP',
'reason_reputation',
'reason_course',
'activities_no',
'Mjob_services',
'guardian_father',
'famsize_GT3',
'nursery_no',
'Mjob_teacher',
'reason_home',
'schoolsup_yes',
'higher_no',
'Mjob_at_home',
'romantic_yes',
'internet_no',
'famsup_yes']
In [312]:
# Take test-set students with a final score of less than 10 over 20
# (note: this filters on the actual G3 column, not on the model's predictions)
predicted_students_in_trouble = X_test[X_test['G3'] < 10]

# See on which features they landed well below or well above their peers
for index, row in predicted_students_in_trouble.iterrows():
    print('Student ID:', index)
    for feat in key_features:
        # below the first quartile of the whole class
        if row[feat] < student_por_df[feat].quantile(0.25):
            print('\t', 'Below:', feat, row[feat], 'Class:',
                  np.round(np.mean(student_por_df[feat]), 2))
        # above the third quartile of the whole class
        if row[feat] > student_por_df[feat].quantile(0.75):
            print('\t', 'Above:', feat, row[feat], 'Class:',
                  np.round(np.mean(student_por_df[feat]), 2))

Student ID: 131
Below: G2 9 Class: 11.57
Above: absences 10 Class: 3.66
Above: goout 5 Class: 3.18
Above: failures 3 Class: 0.22
Above: reason_reputation 1 Class: 0.22
Above: Mjob_services 1 Class: 0.21
Above: higher_no 1 Class: 0.11
Student ID: 81
Below: G2 9 Class: 11.57
Below: age 15 Class: 16.74
Above: studytime 3 Class: 1.93
Below: schoolsup_no 0 Class: 0.9
Above: nursery_no 1 Class: 0.2
Above: reason_home 1 Class: 0.23
Above: schoolsup_yes 1 Class: 0.1
Student ID: 585
Below: G2 7 Class: 11.57
Below: G1 8 Class: 11.4
Below: freetime 2 Class: 3.18
Above: studytime 3 Class: 1.93
Above: Fjob_at_home 1 Class: 0.06
Above: higher_no 1 Class: 0.11
Above: internet_no 1 Class: 0.23
Student ID: 177
Below: G2 8 Class: 11.57
Below: G1 9 Class: 11.4
Below: Medu 1 Class: 2.51
Above: Walc 4 Class: 2.28
Above: failures 1 Class: 0.22
Above: guardian_father 1 Class: 0.24
Above: higher_no 1 Class: 0.11
Above: Mjob_at_home 1 Class: 0.21
Student ID: 174
Below: G2 8 Class: 11.57
Below: G1 8 Class: 11.4
Above: absences 8 Class: 3.66
Below: famrel 3 Class: 3.93
Above: failures 1 Class: 0.22
Above: higher_no 1 Class: 0.11
Above: Mjob_at_home 1 Class: 0.21
Student ID: 478
Below: G2 7 Class: 11.57
Below: G1 7 Class: 11.4
Below: health 1 Class: 3.54
Below: famrel 3 Class: 3.93
Below: Medu 1 Class: 2.51
Above: failures 3 Class: 0.22
Below: schoolsup_no 0 Class: 0.9
Above: guardian_father 1 Class: 0.24
Above: schoolsup_yes 1 Class: 0.1
Above: Mjob_at_home 1 Class: 0.21
Above: internet_no 1 Class: 0.23
Student ID: 522
Below: G2 8 Class: 11.57
Below: G1 8 Class: 11.4
Above: Fedu 4 Class: 2.31
Student ID: 163
Below: G2 9 Class: 11.57
Below: famrel 2 Class: 3.93
Above: goout 5 Class: 3.18
Below: Medu 1 Class: 2.51
Above: Walc 5 Class: 2.28
Above: failures 2 Class: 0.22
Above: higher_no 1 Class: 0.11
Student ID: 570
Below: G2 8 Class: 11.57
Below: G1 7 Class: 11.4
Above: Walc 4 Class: 2.28
Above: Mjob_services 1 Class: 0.21
Student ID: 257
Below: G2 8 Class: 11.57
Below: freetime 2 Class: 3.18
Below: goout 1 Class: 3.18
Above: Fedu 4 Class: 2.31
Above: nursery_no 1 Class: 0.2
Above: Mjob_teacher 1 Class: 0.11
Student ID: 148
Below: G2 9 Class: 11.57
Below: G1 8 Class: 11.4
Below: age 15 Class: 16.74
Above: freetime 5 Class: 3.18
Below: health 1 Class: 3.54
Above: traveltime 3 Class: 1.57
Above: goout 5 Class: 3.18
Below: Medu 1 Class: 2.51
Above: failures 1 Class: 0.22
Student ID: 447
Below: G1 8 Class: 11.4
Above: absences 8 Class: 3.66
Above: Dalc 5 Class: 1.5
Above: freetime 5 Class: 3.18
Above: traveltime 3 Class: 1.57
Above: goout 5 Class: 3.18
Above: Walc 5 Class: 2.28
Above: reason_other 1 Class: 0.11
Above: higher_no 1 Class: 0.11
Above: internet_no 1 Class: 0.23
Student ID: 518
Below: G2 5 Class: 11.57
Below: G1 8 Class: 11.4
Above: absences 8 Class: 3.66
Below: health 1 Class: 3.54
Below: famrel 2 Class: 3.93
Above: Fedu 4 Class: 2.31
Above: failures 1 Class: 0.22
Above: reason_reputation 1 Class: 0.22
Above: guardian_father 1 Class: 0.24
Student ID: 603
Below: G2 0 Class: 11.57
Below: G1 5 Class: 11.4
Below: goout 1 Class: 3.18
Above: reason_reputation 1 Class: 0.22
Above: Mjob_teacher 1 Class: 0.11
Student ID: 514
Below: G2 6 Class: 11.57
Below: G1 7 Class: 11.4
Below: freetime 1 Class: 3.18
Below: famrel 3 Class: 3.93
Above: Walc 4 Class: 2.28
Above: Fjob_at_home 1 Class: 0.06
Above: Mjob_services 1 Class: 0.21
Student ID: 568
Below: G1 6 Class: 11.4
Above: age 19 Class: 16.74
Below: freetime 2 Class: 3.18
Below: famrel 3 Class: 3.93
Below: goout 1 Class: 3.18
Above: failures 3 Class: 0.22
Above: Mjob_at_home 1 Class: 0.21
Above: internet_no 1 Class: 0.23
Student ID: 440
Below: G2 0 Class: 11.57
Below: G1 7 Class: 11.4
Above: Dalc 4 Class: 1.5
Above: goout 5 Class: 3.18
Below: Medu 1 Class: 2.51
Above: Walc 5 Class: 2.28
Above: reason_home 1 Class: 0.23
Above: Mjob_at_home 1 Class: 0.21
Above: internet_no 1 Class: 0.23
Student ID: 443
Below: G2 9 Class: 11.57
Below: G1 7 Class: 11.4
Above: absences 7 Class: 3.66
Below: age 15 Class: 16.74
Above: reason_reputation 1 Class: 0.22
Above: guardian_father 1 Class: 0.24
Student ID: 155
Below: G2 7 Class: 11.57
Below: G1 9 Class: 11.4
Above: absences 22 Class: 3.66
Above: goout 5 Class: 3.18
Above: reason_home 1 Class: 0.23
Student ID: 248
Below: G2 9 Class: 11.57
Below: G1 9 Class: 11.4
Below: famrel 3 Class: 3.93
Below: Medu 1 Class: 2.51
Above: reason_home 1 Class: 0.23
Student ID: 494
Below: G2 9 Class: 11.57
Below: G1 8 Class: 11.4
Above: goout 5 Class: 3.18
Below: Medu 1 Class: 2.51
Above: higher_no 1 Class: 0.11
Above: Mjob_at_home 1 Class: 0.21
Student ID: 563
Below: G2 0 Class: 11.57
Below: G1 7 Class: 11.4
Below: freetime 2 Class: 3.18
Below: famrel 1 Class: 3.93
Below: goout 1 Class: 3.18
Above: failures 1 Class: 0.22
Above: internet_no 1 Class: 0.23
Student ID: 432
Below: G2 6 Class: 11.57
Below: G1 6 Class: 11.4
Below: Medu 1 Class: 2.51
Above: failures 1 Class: 0.22
Above: reason_other 1 Class: 0.11
Above: guardian_father 1 Class: 0.24
Above: nursery_no 1 Class: 0.2
Above: higher_no 1 Class: 0.11
Student ID: 583
Below: G2 6 Class: 11.57
Below: G1 8 Class: 11.4
Above: freetime 5 Class: 3.18
Above: goout 5 Class: 3.18
Above: failures 1 Class: 0.22
Above: reason_other 1 Class: 0.11
Above: higher_no 1 Class: 0.11
Student ID: 370
Below: G2 8 Class: 11.57
Below: G1 8 Class: 11.4
Above: age 19 Class: 16.74
Above: traveltime 3 Class: 1.57
Below: Medu 1 Class: 2.51
Above: failures 2 Class: 0.22
Above: nursery_no 1 Class: 0.2
Student ID: 256
Below: G2 8 Class: 11.57
Below: G1 7 Class: 11.4
Above: absences 26 Class: 3.66
Below: health 1 Class: 3.54
Above: failures 1 Class: 0.22
Above: Fjob_at_home 1 Class: 0.06
Above: nursery_no 1 Class: 0.2
Above: higher_no 1 Class: 0.11
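The cell above selects students by their actual G3 value in the test set. To flag students before the final grade is known, the filter would run on the model's predictions instead. Here is a minimal sketch of that variant. The function name `flag_by_prediction` and the `predicted_G3` column are my own, and the demo rows and predictions are made up.

```python
import numpy as np
import pandas as pd

def flag_by_prediction(X_test, predictions, threshold=10):
    """Return the test rows whose predicted final grade falls below the threshold."""
    flagged = X_test.copy()
    flagged["predicted_G3"] = np.asarray(predictions)
    return flagged[flagged["predicted_G3"] < threshold]

# Made-up rows and predictions; in the notebook the predictions would come
# from xgb_model.predict(dtest), which is row-aligned with X_test.
X_demo = pd.DataFrame({"G1": [8, 14, 9], "G2": [9, 15, 10]}, index=[131, 3, 81])
preds = np.array([9.2, 14.8, 10.4])
print(flag_by_prediction(X_demo, preds).index.tolist())  # [131]
```

The flagged rows can then go through the same quartile comparison as above. The list of students would likely differ from the one printed here, since predictions and actual grades do not always agree.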


Show Notes

(pardon typos and formatting -
these are the notes I use to make the videos)