CatBoost vs XGBoost - A Gentle Introduction to CatBoost - Free Udemy Class
Introduction
XGBoost has been one of the most powerful boosted models out there... and now along comes CatBoost. Let's explore how it compares to XGBoost in Python, trying CatBoost on both a classification dataset and a regression one. Let's have some fun!
Code
from IPython.display import Image
Image(filename='CatBoost vs Xgboost.png')
Let's pit CatBoost against XGBoost in a friendly classification battle! Cat Fight Time
I got some of my best scores on Kaggle using it! At one point I was ranked 185th, and I thank XGBoost for that (https://www.kaggle.com/amunategui) - lots of others have thanked it too. We still thank it today: it's integrated all over the place - scikit-learn, cloud providers - and I use it every day for customers on GCP, as it is now compatible with Cloud ML, so you can model terabytes of data with it.
GCP Built-in XGBoost algorithm https://cloud.google.com/ml-engine/docs/algorithms/xgboost-start
Scikit-Learn API Scikit-Learn Wrapper interface for XGBoost https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
# installing catboost and xgboost
# !pip3 install catboost --user
# !pip3 install xgboost --user
# Let's compare XGBoost to CatBoost
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
import catboost
print('catboost version:', catboost.__version__)
import xgboost
print('xgboost version:', xgboost.__version__)
Let's get an independent Titanic dataset from Vanderbilt University
titanic_df = pd.read_csv(
'http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv')
titanic_df.head()
# simple feature engineering
# strip first letter from cabin number if there
titanic_df['cabin'] = titanic_df['cabin'].replace(np.NaN, 'U')
titanic_df['cabin'] = [ln[0] for ln in titanic_df['cabin'].values]
titanic_df['cabin'] = titanic_df['cabin'].replace('U', 'Unknown')
titanic_df['cabin'].head()
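The deck-letter extraction above can be sketched on a toy series (made-up cabin values, same 'U' placeholder convention):

```python
import pandas as pd
import numpy as np

cabins = pd.Series(['C85', np.nan, 'E46', np.nan])
cabins = cabins.fillna('U')            # placeholder for missing cabins
decks = cabins.str[0]                  # keep only the first letter (the deck)
decks = decks.replace('U', 'Unknown')  # make the placeholder explicit
print(decks.tolist())                  # ['C', 'Unknown', 'E', 'Unknown']
```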
# create isfemale field and use numerical values
titanic_df['isfemale'] = np.where(titanic_df['sex'] == 'female', 1, 0)
# drop features not needed for model
titanic_df = titanic_df[[f for f in list(titanic_df) if f not in ['sex', 'name', 'boat','body', 'ticket', 'home.dest']]]
# make pclass actual categorical column
titanic_df['pclass'] = np.where(titanic_df['pclass'] == 1, 'First',
np.where(titanic_df['pclass'] == 2, 'Second', 'Third'))
titanic_df['embarked'] = titanic_df['embarked'].replace(np.NaN, 'Unknown')
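The nested np.where mapping used for pclass works like a vectorized if/elif/else; here it is on a toy series:

```python
import numpy as np
import pandas as pd

pclass = pd.Series([1, 2, 3, 1])
# nested np.where: 1 -> 'First', 2 -> 'Second', everything else -> 'Third'
labels = np.where(pclass == 1, 'First',
                  np.where(pclass == 2, 'Second', 'Third'))
print(list(labels))  # ['First', 'Second', 'Third', 'First']
```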
titanic_df.head()
# how many nulls do we have?
titanic_df.isna().sum()
# impute age to mean
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].mean())
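Mean imputation fills every missing value with the average of the non-missing ones; a toy sketch:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 30.0, np.nan])
ages = ages.fillna(ages.mean())  # mean of the non-missing values is 26.0
print(ages.tolist())             # [22.0, 26.0, 30.0, 26.0]
```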
titanic_df['age']
# SEED - play around with this variable as it will change winners
SEED = 1234 # try 0
CatBoost's Turn!
titanic_df.head()
# map categorical features
titanic_catboost_ready_df = titanic_df.dropna()
features = [feat for feat in list(titanic_catboost_ready_df) if feat != 'survived']
print(features)
titanic_categories = np.where(titanic_catboost_ready_df[features].dtypes != float)[0]  # non-float columns are treated as categorical
titanic_categories
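The dtype trick above can be checked on a toy frame: any column whose dtype isn't float gets flagged as categorical by position.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'pclass': ['First', 'Third'],
                   'age': [29.0, 40.0],
                   'cabin': ['C', 'Unknown']})
# columns whose dtype is not float are treated as categorical
cat_idx = np.where(df.dtypes != float)[0]
print(list(cat_idx))  # [0, 2] -> positions of 'pclass' and 'cabin'
```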
from catboost import CatBoostClassifier
X_train, X_test, y_train, y_test = train_test_split(titanic_catboost_ready_df[features],
                                                    titanic_catboost_ready_df[['survived']],
                                                    test_size=0.3,
                                                    random_state=SEED)
params = {'iterations':5000,
'learning_rate':0.01,
'cat_features':titanic_categories,
'depth':3,
'eval_metric':'AUC',
'verbose':200,
'od_type':"Iter", # overfit detector
'od_wait':500, # iterations to wait after the best one before stopping
'random_seed': SEED
}
cat_model = CatBoostClassifier(**params)
cat_model.fit(X_train, y_train,
eval_set=(X_test, y_test),
use_best_model=True, # True if we don't want to save trees created after iteration with the best validation score
plot=True
);
# Confusion matrix
dval_predictions = cat_model.predict(X_test)
dval_predictions
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, [1 if p > 0.5 else 0 for p in dval_predictions])
plt.figure(figsize = (6,4))
plt.ticklabel_format(style='plain', axis='y', useOffset=False)
sns.set(font_scale=1.4)
sns.heatmap(cm, annot=True, annot_kws={"size": 16})
plt.show()
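As a reminder of how to read the heatmap: scikit-learn's confusion matrix puts actual classes on the rows and predicted classes on the columns. A toy example with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
# rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm.tolist())  # [[2, 0], [1, 2]] -> one survivor misclassified
```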
cat_model.get_feature_importance()
feat_import = [t for t in zip(features, cat_model.get_feature_importance())]
feat_import_df = pd.DataFrame(feat_import, columns=['Feature', 'VarImp'])
feat_import_df = feat_import_df.sort_values('VarImp', ascending=False)
feat_import_df[feat_import_df['VarImp'] > 0]
XGBoost's Turn
Dummy/one-hot only for XGBoost
titanic_df.isnull().any()
def prepare_data_for_model(raw_dataframe, target_columns, drop_first = True, make_na_col = False):
# dummy all categorical fields
dataframe_dummy = pd.get_dummies(raw_dataframe, columns=target_columns,
drop_first=drop_first,
dummy_na=make_na_col)
return (dataframe_dummy)
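pd.get_dummies expands each categorical column into indicator columns; with drop_first=True the first category is dropped to avoid redundancy. A toy run (made-up values):

```python
import pandas as pd

df = pd.DataFrame({'embarked': ['S', 'C', 'Q'], 'age': [22.0, 38.0, 26.0]})
# drop_first=True drops the alphabetically-first category ('C')
dummies = pd.get_dummies(df, columns=['embarked'], drop_first=True)
print(list(dummies))  # ['age', 'embarked_Q', 'embarked_S']
```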
# create dummy features
titanic_xgboost_ready_df = prepare_data_for_model(titanic_df, target_columns=['pclass', 'cabin', 'embarked'])
titanic_xgboost_ready_df = titanic_xgboost_ready_df.dropna()
list(titanic_xgboost_ready_df)
# split data into train and test portions and model
features = [feat for feat in list(titanic_xgboost_ready_df) if feat != 'survived']
X_train, X_test, y_train, y_test = train_test_split(titanic_xgboost_ready_df[features],
titanic_xgboost_ready_df[['survived']],
test_size=0.3,
random_state=SEED)
import xgboost as xgb
xgb_params = {
'max_depth':3,
'eta':0.01,
'silent':0,
'eval_metric':'auc',
'subsample': 0.8,
'colsample_bytree': 0.8,
'objective':'binary:logistic',
'seed' : SEED
}
dtrain = xgb.DMatrix(X_train, y_train, feature_names=X_train.columns.values)
dtest = xgb.DMatrix(X_test, y_test, feature_names=X_test.columns.values)
evals = [(dtrain,'train'),(dtest,'eval')]
xgb_model = xgb.train ( params = xgb_params,
dtrain = dtrain,
num_boost_round = 5000,
verbose_eval=200,
early_stopping_rounds = 500,
evals=evals,
maximize = True)
# get dataframe version of important feature for model
xgb_fea_imp=pd.DataFrame(list(xgb_model.get_fscore().items()),
columns=['feature','importance']).sort_values('importance', ascending=False)
xgb_fea_imp.head(10)
# Confusion matrix
dval_predictions = xgb_model.predict(dtest)
dval_predictions
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, [1 if p > 0.5 else 0 for p in dval_predictions])
plt.figure(figsize = (6,4))
plt.ticklabel_format(style='plain', axis='y', useOffset=False)
sns.set(font_scale=1.4)
sns.heatmap(cm, annot=True, annot_kws={"size": 16})
plt.show()
Final Winners
xgb_model.best_score
cat_model.best_score_
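Both models report AUC as their best validation score. For reference, this is how AUC is computed with scikit-learn (toy labels and scores, not from either model):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
# AUC: probability a random positive scores higher than a random negative
auc = roc_auc_score(y_true, y_scores)
print(round(auc, 2))  # 0.75
```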
Let's dive deep into CatBoost!
Initially released on July 18, 2017, by Yandex researchers; it is open source.
pip install catboost
Quick start https://catboost.ai/docs/concepts/python-quickstart.html
CatBoostClassifier
from catboost.datasets import titanic
titanic_train, titanic_test = titanic()
print(titanic_train.head(3))
titanic_train.shape
titanic_test.shape
# pip install pandas-profiling
import pandas_profiling as pp
pp.ProfileReport(titanic_train)
# clean up NaNs
titanic_train.isnull().sum(axis=0)
# impute age to mean
titanic_train['Age'] = titanic_train['Age'].fillna(titanic_train['Age'].mean())
titanic_train['Embarked'] = titanic_train['Embarked'].replace(np.nan, 'Unknown', regex=True)
from catboost import CatBoostClassifier
# data split
outcome_name = 'Survived'
features_for_model = ['Pclass', 'Sex', 'Age', 'Embarked']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(titanic_train[features_for_model],
                                                    titanic_train[outcome_name],
                                                    test_size=0.3,
                                                    random_state=1)
# tell catboost which are categorical columns
titanic_categories = np.where(X_train[features_for_model].dtypes != float)[0]  # non-float columns are treated as categorical
print('titanic_categories:', titanic_categories)
params = {'iterations':1000,
'learning_rate':0.01,
'cat_features':titanic_categories,
'depth':3,
'eval_metric':'AUC',
'verbose':200,
'od_type':"Iter", # overfit detector
'od_wait':500,
}
model_classifier = CatBoostClassifier(**params)
model_classifier.fit(X_train, y_train,
eval_set=(X_test, y_test),
use_best_model=True,
plot= True
);
# feature importance
feat_import = [t for t in zip(features_for_model, model_classifier.get_feature_importance())]
feat_import_df = pd.DataFrame(feat_import, columns=['Feature', 'VarImp'])
feat_import_df = feat_import_df.sort_values('VarImp', ascending=False)
feat_import_df.head(20)
CatBoostRegressor
Boston house prices dataset
The Boston Housing dataset contains the prices of houses in various areas of Boston. Alongside price, it provides features such as the per-capita crime rate by town (CRIM), the proportion of non-retail business acres per town (INDUS), and the proportion of owner-occupied units built before 1940 (AGE).
from sklearn.datasets import load_boston
boston_dataset = load_boston()  # note: load_boston was removed in scikit-learn 1.2, so this requires an older version
boston_dataset.keys()
for ln in boston_dataset.DESCR.split('\n'):
print(ln)
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head(10)
# Our target variable - Median value of owner-occupied homes in $1000s
boston['MEDV'] = boston_dataset.target
boston.head()
pp.ProfileReport(boston)
# clean up NaNs
boston.isnull().sum(axis=0)
from catboost import CatBoostRegressor
# data split
outcome_name = 'MEDV'
features_for_model = [f for f in list(boston) if f not in [outcome_name, 'TAX']]
# treat float columns holding only whole numbers as categorical
boston_categories = np.where([boston[f].apply(float.is_integer).all() for f in features_for_model])[0]
print('boston_categories:', boston_categories)
# cast those columns to strings so CatBoost treats them as categorical
for feature in [features_for_model[f] for f in list(boston_categories)]:
    print(feature)
    boston[feature] = boston[feature].astype(str)
# data split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(boston[features_for_model],
                                                    boston[outcome_name],
                                                    test_size=0.3,
                                                    random_state=1)
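The whole-number heuristic used to spot categorical columns can be verified on a toy frame (made-up column values): a float column containing only integer values gets flagged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'CHAS': [0.0, 1.0, 0.0], 'CRIM': [0.1, 0.2, 0.3]})
# a float column whose values are all whole numbers is likely categorical
is_int_like = [df[c].apply(float.is_integer).all() for c in df.columns]
print(list(np.where(is_int_like)[0]))  # [0] -> only 'CHAS'
```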
params = {'iterations':5000,
'learning_rate':0.001,
'depth':3,
'loss_function':'RMSE',
'eval_metric':'RMSE',
'random_seed':55,
'cat_features':boston_categories,
'metric_period':200,
'od_type':"Iter",
'od_wait':20,
'verbose':True,
'use_best_model':True}
model_regressor = CatBoostRegressor(**params)
model_regressor.fit(X_train, y_train,
eval_set=(X_test, y_test),
use_best_model=True,
plot= True
);
# feature importance
feat_import = [t for t in zip(features_for_model, model_regressor.get_feature_importance())]
feat_import_df = pd.DataFrame(feat_import, columns=['Feature', 'VarImp'])
feat_import_df = feat_import_df.sort_values('VarImp', ascending=False)
feat_import_df.head(20)
Show Notes
(pardon typos and formatting - these are the notes I use to make the videos)
XGBoost has been one of the most powerful boosted models out there... and now along comes CatBoost. Let's explore how it compares to XGBoost in Python, trying CatBoost on both a classification dataset and a regression one. Let's have some fun!