Modeling for Actionable Insights with XGBoost - What Can You Do about Your Predictions?
Introduction
Let's talk modeling for actionable insights! Building a predictive model is only the first step: your end user or customer won't know what to do with an AUC or RMSE score. But if you can tell them WHO is at risk, WHY, and WHAT they can do about it, that's actionable and can even be translated into dollar amounts! And we're going to do it with XGBoost on a C5.0 dataset entitled Customer Churn.

MORE:
Blog or code: http://www.viralml.com/video-content.html?fm=yt&v=XfPND5wA7Vw
Signup for my newsletter and more: http://www.viralml.com
Connect on Twitter: https://twitter.com/amunategui

My books on Amazon:
The Little Book of Fundamental Indicators: Hands-On Market Analysis with Python: Find Your Market Bearings with Python, Jupyter Notebooks, and Freely Available Data: https://amzn.to/2DERG3d
Monetizing Machine Learning: Quickly Turn Python ML Ideas into Web Applications on the Serverless Cloud: https://amzn.to/2PV3GCV
Grow Your Web Brand, Visibility & Traffic Organically: 5 Years of amunategui.github.Io and the Lessons I Learned from Growing My Online Community from the Ground Up
Fringe Tactics - Finding Motivation in Unusual Places: Alternative Ways of Coaxing Motivation Using Raw Inspiration, Fear, and In-Your-Face Logic: https://amzn.to/2DYWQas
Create Income Streams with Online Classes: Design Classes That Generate Long-Term Revenue: https://amzn.to/2VToEHK
Defense Against The Dark Digital Attacks: How to Protect Your Identity and Workflow in 2019: https://amzn.to/2Jw1AYS

Transcript
Hello friends, let's talk modeling for actionable insight! What do I mean by that? Well, building a predictive model is only the first step, as your end user or customer won't know what to do with an AUC or RMSE score. But if you can tell them who is at risk, why, and what they can do about it, that's actionable and can even be translated into dollar amounts! And we're going to do it with XGBoost on a dataset called Customer Churn.

Welcome to ViralML. My name is Manuel Amunategui, and I am the author of Monetizing Machine Learning, which shows how to extend your machine learning models to the web so everybody can enjoy them, and even looks at ways to monetize them through paywalls. I also have a free class, all on YouTube; start with the first video, then work your way down. So, sign up for my newsletter, connect, and subscribe.

So, back to actionable insight. Say you can tell your customer how to prevent someone from dropping out of the service, and it costs them $1,000 to acquire that person. Now you can put a dollar amount on the model, and that's a language that those in charge who write the checks understand. Your employer and customer will love you. We're going to use a data set from....
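To make the "dollar amount" idea concrete, here is a minimal back-of-the-envelope sketch. Only the $1,000 acquisition cost comes from the text above; the offer cost, the offer success rate, the function name, and the sample probabilities are all hypothetical placeholders you would replace with your own numbers.

```python
# Only ACQUISITION_COST comes from the text; the other constants are assumptions.
ACQUISITION_COST = 1000.0    # cost to acquire (or replace) one customer
OFFER_COST = 50.0            # assumed cost of one retention offer
OFFER_SUCCESS_RATE = 0.30    # assumed share of would-be churners the offer retains


def expected_net_savings(churn_probs, threshold=0.8):
    """Expected dollars saved by sending an offer to customers above threshold."""
    targeted = [p for p in churn_probs if p > threshold]
    # A targeted customer churns with probability p, and the offer is assumed
    # to retain a fixed fraction of those would-be churners.
    saved = sum(p * OFFER_SUCCESS_RATE * ACQUISITION_COST for p in targeted)
    spent = OFFER_COST * len(targeted)
    return saved - spent


probs = [0.95, 0.85, 0.40, 0.10]   # made-up model outputs
print(expected_net_savings(probs))
```

With these made-up inputs, only the two customers above the 0.8 threshold receive an offer, so the result is driven entirely by them.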
Code
from IPython.display import Image
Image(filename='double logos.png')
We'll use a data set called "Customer Churn". As the name implies, the data contains customer information and usage records from a phone company, including whether the customer churned or not. It includes daytime usage, international plans, and customer service calls, which we can use to understand and predict patterns of churn.
You can find the data set on many GitHub repos, with C5.0, and at http://amunategui.github.io/customer_churn.csv
import datetime
import warnings

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")
#churn_df = pd.read_csv('http://amunategui.github.io/customer_churn.csv')
churn_df = pd.read_csv('customer_churn.csv')
churn_df.head()
Feature Engineering
# Binarize area codes
churn_df['Area Code'] = churn_df['Area Code'].apply(str)
pd.get_dummies(churn_df['Area Code']).head()
More Feature Engineering - Transform true/false yes/no text into numerics
churn_df['State'].value_counts()[0:10]
# fix the outcome
churn_df['Churn?'] = np.where(churn_df['Churn?'] == 'True.', 1, 0)
churn_df["Int'l Plan"] = np.where(churn_df["Int'l Plan"] == 'yes', 1, 0)
churn_df['VMail Plan'] = np.where(churn_df['VMail Plan'] == 'yes', 1, 0)
# dummify states
pd.get_dummies(churn_df['State']).head()
# binarize categorical columns
churn_df = pd.concat([churn_df, pd.get_dummies(churn_df['State'])], axis=1)
churn_df = pd.concat([churn_df, pd.get_dummies(churn_df['Area Code'])], axis=1)
churn_df.head()
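The encoding steps above can be tried out on a tiny frame first. This is a sketch on made-up rows, not the real data; the `prefix` names and the `dtype=int` choice are my own additions, used here so the dummy columns cannot collide with existing column names and stay numeric.

```python
import pandas as pd

# Toy rows that mimic three of the columns in the churn data
toy = pd.DataFrame({
    "State": ["KS", "OH", "KS"],
    "Area Code": [415, 408, 415],
    "Int'l Plan": ["no", "yes", "no"],
})

# Map yes/no text to 1/0 (same idea as the np.where calls above)
toy["Int'l Plan"] = (toy["Int'l Plan"] == "yes").astype(int)

# Cast area codes to strings so they are treated as categories, then one-hot encode
toy["Area Code"] = toy["Area Code"].astype(str)
dummies = pd.get_dummies(toy[["State", "Area Code"]],
                         prefix=["State", "Area"], dtype=int)

# Append the dummy columns next to the original ones
encoded = pd.concat([toy, dummies], axis=1)
print(encoded.columns.tolist())
```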
# # check for nulls in data and impute if necessary
# for feat in list(churn_df):
#     if (len(churn_df[feat]) - churn_df[feat].count()) > 0:
#         print(feat)
#         print(len(churn_df[feat]) - churn_df[feat].count())
#         # tmp_df.loc[tmp_df[feat].isnull(), feat] = 0
churn_df.head()
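If you do want to run the null check that is commented out above, pandas can do it without a loop. A minimal sketch on made-up values; filling with 0 mirrors the commented-out imputation and is not necessarily the right choice for every feature.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column
toy = pd.DataFrame({
    "Day Mins": [180.5, np.nan, 210.0],
    "Eve Mins": [200.0, 195.0, np.nan],
})

# Count nulls per column and show only the columns that have any
null_counts = toy.isnull().sum()
print(null_counts[null_counts > 0])

# Impute with 0, as in the commented-out line above
filled = toy.fillna(0)
```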
list(churn_df)
features = [feat for feat in list(churn_df) if feat not in ['State', 'Churn?', 'Phone', 'Area Code']]
outcome = 'Churn?'
# run simple xgboost classification model and check
# prep modeling code
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(churn_df,
                                                    churn_df[outcome],
                                                    test_size=0.3,
                                                    random_state=42)
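Churn data is usually imbalanced, so one option worth considering is a stratified split, which keeps the churn rate the same in both halves. This is a sketch on synthetic arrays; the `stratify` argument is not used in the original code, and the 15% positive rate is invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.rand(200, 3)                    # synthetic features
y = np.array([1] * 30 + [0] * 170)      # synthetic labels, 15% positives

# stratify=y keeps the share of positives equal across train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(y_tr.mean(), y_te.mean())
```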
import xgboost as xgb
xgb_params = {
    'max_depth': 3,
    'eta': 0.05,
    # 'silent' is deprecated in recent XGBoost releases; 'verbosity' replaces it
    'verbosity': 1,
    'eval_metric': 'auc',
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective': 'binary:logistic',
    'seed': 0
}
dtrain = xgb.DMatrix(X_train[features], y_train, feature_names = features)
dtest = xgb.DMatrix(X_test[features], y_test, feature_names = features)
evals = [(dtrain,'train'),(dtest,'eval')]
xgb_model = xgb.train(params=xgb_params,
                      dtrain=dtrain,
                      num_boost_round=2000,
                      verbose_eval=50,
                      early_stopping_rounds=500,
                      evals=evals,
                      #feval = f1_score_cust,
                      maximize=True)
# plot the important features
fig, ax = plt.subplots(figsize=(6,9))
xgb.plot_importance(xgb_model, height=0.8, ax=ax)
plt.show()
# get dataframe version of important feature for model
xgb_fea_imp=pd.DataFrame(list(xgb_model.get_fscore().items()),
columns=['feature','importance']).sort_values('importance', ascending=False)
xgb_fea_imp.head(10)
Creating top/bottom percentiles to determine under/over use
churn_df['Day Mins'].quantile(0.25)
churn_df['Day Mins'].quantile(0.75)
pred_churn = xgb_model.predict(dtest)
plt.plot(sorted(pred_churn))
plt.grid()
# get all numerical features
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric_features = list(X_test.head().select_dtypes(include=numerics))
features_to_ignore = ['Account Length', 'Area Code','Churn?', 'Will_Churn']
numeric_features = [nf for nf in numeric_features if nf not in features_to_ignore]
row_counter = 0
X_test['Will_Churn'] = pred_churn
new_df = []
for index, row in X_test.iterrows():
    # only consider high prob churns
    if row['Will_Churn'] > 0.8:
        row_counter += 1
        new_df.append(row[list(churn_df)])
        for feat in numeric_features:
            # flag features that fall in the bottom or top quartile of the test set
            if row[feat] < X_test[feat].quantile(0.25):
                print('(ID:', row_counter, ')', feat, 'is below the 25th percentile')
            if row[feat] > X_test[feat].quantile(0.75):
                print('(ID:', row_counter, ')', feat, 'is above the 75th percentile')
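The loop above recomputes the quantiles for every row and only prints its findings. A possible variation, sketched here on made-up numbers, computes the quartiles once and collects the flags in a data frame so they can be handed to whoever acts on them. The toy values and the `flags_df` name are illustrative only.

```python
import pandas as pd

# Toy test set: the last customer has a high predicted churn probability
toy = pd.DataFrame({
    "Day Mins": [100.0, 150.0, 200.0, 250.0, 330.0],
    "Night Mins": [210.0, 190.0, 200.0, 205.0, 120.0],
    "Will_Churn": [0.10, 0.20, 0.15, 0.30, 0.92],
})
feats = ["Day Mins", "Night Mins"]

# Compute the quartile cut-offs once, outside the loop
q25 = toy[feats].quantile(0.25)
q75 = toy[feats].quantile(0.75)

flags = []
for idx, row in toy[toy["Will_Churn"] > 0.8].iterrows():
    for feat in feats:
        if row[feat] < q25[feat]:
            flags.append((idx, feat, "below 25th percentile"))
        elif row[feat] > q75[feat]:
            flags.append((idx, feat, "above 75th percentile"))

flags_df = pd.DataFrame(flags, columns=["row", "feature", "flag"])
print(flags_df)
```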
new_df[0]
# get all known not to churn
not_churn = X_train[X_train['Churn?'] == 0].copy()
find_closet_df = []
# add row to find insights
find_closet_df.append(new_df[0])
for index, row in not_churn.iterrows():
    find_closet_df.append(row[list(churn_df)])
find_closet_df = pd.DataFrame(find_closet_df)
find_closet_df['ID'] = [idx for idx in range(1,len(find_closet_df)+1)]
find_closet_df.head()
Find Closest Clusters to the Embedded Churn Risk
from sklearn.cluster import KMeans
num_clusters = 20
kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(find_closet_df[features])
labels = kmeans.labels_
find_closet_df['clusters'] = labels
find_closet_df.head()
We compare the row with a high probability of churn against non-churns
In this run, we find 13 non-churn rows in the same cluster as row 0, the customer with the high probability of churn. Based on how row 0 compares with these look-alike customers, we recommend offering daytime credits to this customer.
find_closet_df[find_closet_df['clusters']==6][features]
find_closet_df.head()
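The cluster number 6 above is specific to that run. A way to avoid hardcoding it, sketched here on synthetic data, is to read the label assigned to the first row (the embedded high-risk customer) and select its peers from there. Scaling the features first is my own assumption, added because k-means is distance based; the original code clusters the raw features.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two well-separated groups of synthetic non-churn customers
group_a = rng.normal(loc=[100, 200], scale=5, size=(20, 2))
group_b = rng.normal(loc=[300, 50], scale=5, size=(20, 2))
# The high-risk customer goes in row 0, as in find_closet_df above
risk_row = np.array([[305.0, 55.0]])
data = pd.DataFrame(np.vstack([risk_row, group_a, group_b]),
                    columns=["Day Mins", "Night Mins"])

# Scale so no single feature dominates the distance, then cluster
scaled = StandardScaler().fit_transform(data)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
data["clusters"] = km.labels_

# Look up the cluster of row 0 instead of hardcoding a number
risk_cluster = data["clusters"].iloc[0]
peers = data[(data["clusters"] == risk_cluster) & (data.index != 0)]
print(risk_cluster, len(peers))
```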
def risk_compare(cluster_df, cluster_number, var1, var2):
    # keep only the rows of the requested cluster and the two variables to plot
    mydat = cluster_df.copy()
    mydat = mydat[mydat['clusters'] == cluster_number]
    mydat = mydat[[var1, var2, 'clusters']]
    # differentiate high-risk churn customer (first row) by relabeling its cluster as 0
    mydat.iat[0, 2] = 0
    # recent seaborn versions require x and y to be passed as keywords
    sns.lmplot(x=var1, y=var2, data=mydat,
               fit_reg=False, hue="clusters",
               scatter_kws={"marker": "D", "s": 100})
    plt.xlabel(var1)
    plt.ylabel(var2)
    plt.show()
risk_compare(find_closet_df.copy(), 6, 'Night Mins', 'Night Calls')
risk_compare(find_closet_df.copy(), 6, 'Day Mins', 'Eve Mins')
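The plots show where the high-risk customer sits visually. To put a number on it, one option is to subtract the peers' median from the high-risk row and rank the gaps. This is a sketch on made-up values; the use of the median and the `gap` name are my own choices, not part of the original analysis.

```python
import pandas as pd

# Toy cluster: row 0 is the high-risk customer, the rest are non-churn peers
cluster = pd.DataFrame({
    "Day Mins": [320.0, 180.0, 190.0, 175.0],
    "Eve Mins": [200.0, 205.0, 195.0, 210.0],
})
risk = cluster.iloc[0]
peers = cluster.iloc[1:]

# Gap between the high-risk row and the typical peer, largest absolute gap first
gap = (risk - peers.median()).sort_values(key=abs, ascending=False)
print(gap)
```

The feature at the top of the list is the one where this customer differs most from the look-alike non-churners, which is a natural place to start when choosing an offer.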