Using Autoencoders and Keras to Better Understand your Customers
Introduction
An autoencoder is a great tool to dig deep into your data. If you are unsure of what to focus on or you want to look at the bigger picture, an unsupervised or semi-supervised model might be able to give you fresh new insights and new areas to investigate. It is a great complement to a traditional supervised model.
Code
from IPython.display import Image
Image(filename='autoencoders-keras.png', width='80%')
Measuring Customer Credit Risk with Autoencoders and Keras 2.0
Let's Look at a Simple Credit-Risk Example and Unearth New Patterns, Anomalies, and Actionable Insights
How to use unsupervised learning to get actionable insights
In this walk-through, we'll see how we can apply autoencoders to customers seeking loans and flag any abnormal behavior. Our autoencoder Keras code base builds on a great walk-through from Chitta Ranjan, "Extreme Rare Event Classification using Autoencoders in Keras", but we will take it further and reach actionable insights, because, as we all know, there isn't much data science without actionable insights.
The Open Source Statlog (German Credit Data) Data Set from the UC Irvine Machine Learning Repository
"This dataset classifies people described by a set of attributes as good or bad credit risks. Comes in two formats (one all numeric). Also comes with a cost matrix"
https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
The Autoencoder and anomaly detection
Autoencoding mostly aims at reducing the feature space in order to distill the essential aspects of the data, in contrast to more conventional deep learning, which blows up the feature space to capture non-linearities and subtle interactions within the data. Autoencoding can also be seen as a non-linear alternative to PCA. It is similar to what we use in image, music, and file compression: we compress the excess until the data is too distorted to be of any value.
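A minimal sketch of the compress-then-reconstruct idea, using scikit-learn's PCA as the linear stand-in (this example and its synthetic data are not part of the original notebook; an autoencoder with linear activations and MSE loss learns the same subspace):

```python
# Illustration: compress 10-D data down to 2 components, reconstruct it,
# and use per-sample reconstruction error as an anomaly score.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 points that mostly live on a 2-D plane inside a 10-D space
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

pca = PCA(n_components=2).fit(X)                 # "encoder": 10 -> 2
X_hat = pca.inverse_transform(pca.transform(X))  # "decoder": 2 -> 10
recon_error = np.mean((X - X_hat) ** 2, axis=1)  # per-sample reconstruction error
print(recon_error.mean())                        # tiny: data fits the learned structure

# A point far off the learned plane reconstructs poorly, i.e. looks anomalous
outlier = rng.normal(size=(1, 10)) * 10
out_hat = pca.inverse_transform(pca.transform(outlier))
print(np.mean((outlier - out_hat) ** 2))         # much larger error
```

The autoencoder below plays the same game, but the non-linear activations let it learn curved structure that PCA cannot.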
Anomaly detection (or outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset - Wikipedia.com
Anomaly detection is a big scientific domain, and with such a big domain come many associated techniques and tools. The autoencoder is one of those tools and the subject of this walk-through.
Finding actionable insights
At the end of the day, actionable insights are what data science is all about. If a person cannot apply the findings and predictions from a data science project, it is of little use beyond an academic exercise. What we are doing here is using Chitta's approach of training an autoencoder model on only a subset of the data, in this case only customers with good credit profiles. This forces the encoder to learn what a "good" customer looks like by reducing that data to its core essentials. Once the model is trained, we take out-of-sample data, have the model predict (i.e. compress then reconstruct the data), and compare the reconstruction error. The smaller the error, the closer the customer is to the "good" profile; the larger the error, the more anomalous the customer.
Investigative work
The out-of-sample customers with high reconstruction errors are to be scrutinized closely to understand why they differ from the typical "good" customer.
#https://towardsdatascience.com/extreme-rare-event-classification-using-autoencoders-in-keras-a565b386f098
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from tensorflow.keras import regularizers
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.metrics import recall_score, classification_report, auc, roc_curve, accuracy_score
from sklearn.metrics import precision_recall_fscore_support, f1_score
from pandas.api.types import is_numeric_dtype
import urllib.request
Download data from the UC Irvine Machine Learning Repository
https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
# the file is space-delimited with no header row
data = pd.read_csv(url, delimiter=' ', header=None)
data.columns = ['HasChecking', 'DurationInMonths', 'CreditHistory', 'CreditPurpose', 'CreditAmount',
'SavingsAccount', 'EmployedSince', 'InstallmentRatePercentIncome', 'StatusGender',
'OtherDebtorsGuarantors', 'ResidenceSince', 'Property', 'Age', 'OtherInstallmentPlans', 'Housing',
'NumberExistingCredits', 'Job', 'FamilyLiablities', 'HasPhone', 'ForeignWorker', 'CreditRisk']
data.head()
Untangling the categorical data
# Quick untangle of categorical data
numerical_features = [f for f in list(data) if is_numeric_dtype(data[f])]
numerical_features
non_numerical_features = [f for f in list(data) if f not in numerical_features]
non_numerical_features
# make dummy vars out of non_numerical_features
data_ready = pd.get_dummies(data, columns=non_numerical_features,
drop_first=False,
dummy_na=False)
print(data_ready.shape)
data_ready.head()
# from data notes we know that: (1 = Good, 2 = Bad)
data_ready[['CreditRisk','HasChecking_A11',
'HasChecking_A12',
'HasChecking_A13',
'HasChecking_A14']].groupby('CreditRisk').sum()
# fix binary outcome - good versus bad credit risk
# from data notes we know that: (1 = Good, 2 = Bad) so 0 will be good credit and 1 will be bad credit
data_ready['CreditRisk'].replace([1,2], [0,1], inplace=True)
data_ready['CreditRisk'].value_counts()
# save data set to file to run it into FastML.io
data_ready.to_csv("german-credit-scores-ready.csv", index=None)
Prepare train/test split for the autoencoder
print(data_ready.shape)
features = [f for f in list(data_ready) if f not in ['CreditRisk']]
print(len(features))
X_train, X_test, Y_train, Y_test = train_test_split(data_ready[features],
data_ready['CreditRisk'],
test_size=0.3,
random_state=1)
print('Data split - Train:', len(X_train), 'Test:', len(X_test))
# check for nulls
X_train.isnull().values.sum()
Create set of good-credit customers only
# Keep only good-credit customers (CreditRisk == 0) for training
X_train_0 = X_train.copy()
X_train_0['CreditRisk'] = Y_train
X_train_0 = X_train_0[X_train_0['CreditRisk']==0]
X_train_0 = X_train_0.drop('CreditRisk', axis=1)
X_test_0 = X_test.copy()
X_test_0['CreditRisk'] = Y_test
X_test_0 = X_test_0[X_test_0['CreditRisk']==0]
X_test_0 = X_test_0.drop('CreditRisk', axis=1)
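StandardScaler is imported above but never used, and the raw features mix 0/1 dummy columns with CreditAmount in the thousands, so the MSE loss will be dominated by the large-scale columns. A scaling step is often worth considering before training; here is a sketch on a tiny stand-in frame (the scaling step and the stand-in data are an addition, not part of the original walk-through):

```python
# Fit the scaler on training data only, then apply it to both splits,
# so no information leaks from the test set into preprocessing.
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train_0 = pd.DataFrame({'CreditAmount': [1000, 5000, 12000],
                          'HasChecking_A11': [1, 0, 1]})
X_test_0 = pd.DataFrame({'CreditAmount': [3000],
                         'HasChecking_A11': [0]})

scaler = StandardScaler().fit(X_train_0)     # learn mean/std from train only
X_train_0_s = scaler.transform(X_train_0)    # standardized numpy arrays
X_test_0_s = scaler.transform(X_test_0)
print(X_train_0_s.mean(axis=0))              # each column now centered near 0
```

If you add this step, remember that reconstruction errors are then measured in standardized units, so any fixed threshold must be re-derived.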
# Autoencoder parameters
nb_epoch = 500
batch_size = 128
input_dim = X_train_0.shape[1]
encoding_dim = 24
hidden_dim = int(encoding_dim / 2)
learning_rate = 1e-3  # note: used below as the L1 activity-regularization strength, not the optimizer's learning rate
# set up autoencoder layers
input_layer = Input(shape=(input_dim, ))
encoder = Dense(encoding_dim, activation="relu", activity_regularizer=regularizers.l1(learning_rate))(input_layer)
encoder = Dense(hidden_dim, activation="relu")(encoder)
decoder = Dense(hidden_dim, activation="relu")(encoder)
decoder = Dense(encoding_dim, activation="relu")(decoder)
decoder = Dense(input_dim, activation="linear")(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.summary()
autoencoder.compile(metrics=['accuracy'],
loss='mean_squared_error',
optimizer='adam')
cp = ModelCheckpoint(filepath="autoencoder_classifier.h5",
save_best_only=True,
verbose=0)
tb = TensorBoard(log_dir='./logs',
histogram_freq=0,
write_graph=True,
write_images=True)
history = autoencoder.fit(X_train_0, X_train_0,
epochs=nb_epoch,
batch_size=batch_size,
shuffle=True,
validation_data=(X_test_0, X_test_0),
verbose=1,
callbacks=[cp, tb]).history
test_x_predictions = autoencoder.predict(X_test)
print(test_x_predictions.shape)
mse = np.mean(np.power(X_test - test_x_predictions, 2), axis=1)
mse
Y_test.value_counts()
fpr, tpr, thresholds = roc_curve(Y_test, mse)
print('thresholds', np.mean(thresholds))
auc(fpr, tpr)
threshold_fixed = 21.05
accuracy_score(Y_test, [1 if s > threshold_fixed else 0 for s in mse])
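The 21.05 cut-off above is hard-coded from inspecting this particular run. One common alternative (an assumption on our part, not part of the original walk-through) is to derive the threshold from the ROC curve itself, e.g. by maximizing Youden's J = TPR − FPR; sketched here on stand-in scores:

```python
# Pick the reconstruction-error threshold that best separates the classes
# instead of hard-coding it.
import numpy as np
from sklearn.metrics import roc_curve

# stand-in data: anomalies (label 1) tend to have larger reconstruction error
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.5, 1.0, 1.2, 2.0, 2.5, 8.0, 9.5, 21.0])

fpr, tpr, thresholds = roc_curve(y_true, scores)
best = thresholds[np.argmax(tpr - fpr)]  # threshold with the best TPR/FPR trade-off
print(best)
```

Applied to the real run, `roc_curve(Y_test, mse)` is already computed above, so the same `argmax` trick drops straight in.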
error_df_test = pd.DataFrame({'Reconstruction_error': mse,
'True_class': Y_test})
error_df_test = error_df_test.reset_index()
groups = error_df_test.groupby('True_class')
fig, ax = plt.subplots()
for name, group in groups:
ax.plot(group.index, group.Reconstruction_error, marker='o', ms=3.5, linestyle='',
label= "Bad Credit Risk" if name == 1 else "Good Credit Risk")
ax.hlines(threshold_fixed, ax.get_xlim()[0], ax.get_xlim()[1], colors="r",
zorder=100, label='Threshold')
ax.legend()
plt.title("Reconstruction error for different classes")
plt.ylabel("Reconstruction error")
plt.xlabel("Data point index")
plt.show();
pred_y = [1 if e > threshold_fixed else 0 for e in error_df_test['Reconstruction_error'].values]
conf_matrix = confusion_matrix(error_df_test['True_class'], pred_y)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix,
xticklabels=["Good Credit Risk","Bad Credit Risk"],
yticklabels=["Good Credit Risk","Bad Credit Risk"],
annot=True, fmt="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
false_pos_rate, true_pos_rate, thresholds = roc_curve(error_df_test['True_class'],
error_df_test['Reconstruction_error'])
roc_auc = auc(false_pos_rate, true_pos_rate,)
plt.plot(false_pos_rate, true_pos_rate, linewidth=5, label='AUC = %0.3f'% roc_auc)
plt.plot([0,1],[0,1], linewidth=5)
plt.xlim([-0.01, 1])
plt.ylim([0, 1.01])
plt.legend(loc='lower right')
plt.title('Receiver operating characteristic curve (ROC)')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
What's going on with these anomalies?
Data legend
https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
# What are some of those features that we see via poor reconstruction error?
X_test['Error'] = mse
X_test['CreditRisk'] = Y_test
X_test = X_test.sort_values('Error', ascending=False)
X_test.head()
normal_values = X_test[X_test['CreditRisk']==0].mean()
normal_values
diff_values = normal_values - X_test.head(10)
diff_values = diff_values.T
diff_values
# pick an index header and analyze
customer = 236
to_analyze = pd.DataFrame(diff_values[customer])
to_analyze = to_analyze.sort_values(customer, ascending=True)
print(to_analyze.head(1))
print(to_analyze.tail(1))
X_test[['CreditAmount','CreditRisk']].groupby('CreditRisk').mean()
X_test[['DurationInMonths','CreditRisk']].groupby('CreditRisk').mean()
# pick an index header and analyze
customer = 887
to_analyze = pd.DataFrame(diff_values[customer])
to_analyze = to_analyze.sort_values(customer, ascending=True)
print(to_analyze.head(1))
print(to_analyze.tail(1))
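The two per-customer cells above repeat the same sort-and-inspect steps; a small helper (our addition, not in the original notebook, shown here on a stand-in `diff_values` frame shaped like the real one: features as rows, customer indices as columns) keeps the inspection in one place:

```python
import pandas as pd

def top_deviations(diff_values, customer, n=2):
    """Return the n features where this customer deviates most from the
    'good customer' average, in each direction (most negative, most positive)."""
    col = diff_values[customer].sort_values()
    return col.head(n), col.tail(n)

# stand-in frame: rows are features, columns are customer indices
diff_values = pd.DataFrame(
    {236: [-2500.0, -10.0, 0.3, 0.8],
     887: [1200.0, 5.0, -0.2, -0.6]},
    index=['CreditAmount', 'DurationInMonths', 'HasChecking_A11', 'Housing_A152'])

low, high = top_deviations(diff_values, 236)
print(low)   # features far below the 'good' average
print(high)  # features far above it
```

With the real `diff_values`, `top_deviations(diff_values, 236)` reproduces the head/tail printout above for any customer index in one call.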