Get the "Applied Data Science Edge"!

The ViralML School

Fundamental Market Analysis with Python - Find Your Own Answers On What Is Going on in the Financial Markets

Hot off the Press!

The Little Book of Fundamental Market Indicators

My New Book: "The Little Book of Fundamental Analysis: Hands-On Market Analysis with Python" is Out!

Grow Your Web Brand, Visibility & Traffic Organically

The Little Book of Fundamental Market Indicators

5 Years of amunategui.github.Io and the Lessons I Learned from Growing My Online Community from the Ground Up


Sign up for my newsletter and get my free intro class:

CatBoost vs XGBoost - A Gentle Introduction to CatBoost - Free Udemy Class

Introduction

XGBoost has long been one of the most powerful boosted models in existence... and now here comes CatBoost. Let's explore how CatBoost compares to XGBoost using Python, trying it out on both a classification dataset and a regression one. Let's have some fun!



Code



In [1]:
from IPython.display import Image
Image(filename='CatBoost vs Xgboost.png')
Out[1]:

Let's pit CatBoost against XGBoost in a friendly classification battle! Cat fight time!

I got some of my best scores on Kaggle using XGBoost! At one point I was ranked 185th (https://www.kaggle.com/amunategui), and I thanked XGBoost for it - lots of others did too. We still thank it today: it's integrated all over the place, from scikit-learn to the major cloud providers. I use it every day for customers on GCP, as it is now compatible with Cloud ML, so you can model terabytes of data with it.

GCP built-in XGBoost algorithm: https://cloud.google.com/ml-engine/docs/algorithms/xgboost-start

Scikit-Learn wrapper interface for XGBoost: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
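
As a quick illustration of that scikit-learn wrapper, here's a minimal sketch (the breast cancer dataset and the hyperparameters are just stand-ins, not part of this notebook):

# minimal sketch: XGBoost through its scikit-learn wrapper API
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

xgb_clf = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
xgb_clf.fit(X_train, y_train)
print('test accuracy:', xgb_clf.score(X_test, y_test))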

In [2]:
# install catboost and xgboost (uncomment to run)

# !pip3 install catboost --user
# !pip3 install xgboost --user
In [3]:
# Let's compare XGBoost to CatBoost

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split

import catboost
print('catboost version:', catboost.__version__)
import xgboost
print('xgboost version:', xgboost.__version__)
catboost version: 0.18
xgboost version: 0.72.1

Let's get an independent Titanic dataset from Vanderbilt University

In [4]:
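# note: the Vanderbilt biostat wiki has since moved; if this URL fails, the same
# titanic3.csv has been hosted at https://hbiostat.org/data/repo/titanic3.csv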
titanic_df = pd.read_csv(
    'http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv')
titanic_df.head()
Out[4]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
0 1 1 Allen, Miss. Elisabeth Walton female 29.00 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.92 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2.00 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.00 1 2 113781 151.5500 C22 C26 S NaN 135.0 Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.00 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
In [5]:
# simple feature engineering

# reduce cabin to its first letter (the deck), treating missing cabins as 'Unknown'
titanic_df['cabin'] = titanic_df['cabin'].replace(np.NaN, 'U') 
titanic_df['cabin'] = [ln[0] for ln in titanic_df['cabin'].values]
titanic_df['cabin'] = titanic_df['cabin'].replace('U', 'Unknown') 
titanic_df['cabin'].head()
Out[5]:
0    B
1    C
2    C
3    C
4    C
Name: cabin, dtype: object
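
For reference, the same deck extraction can be written as a single vectorized line with pandas' string accessor (an equivalent alternative, not what the notebook ran):

# equivalent vectorized version of the cabin cleanup above
titanic_df['cabin'] = titanic_df['cabin'].fillna('U').str[0].replace('U', 'Unknown')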
In [6]:
# create a numeric isfemale flag (1 = female, 0 = male)
titanic_df['isfemale'] = np.where(titanic_df['sex'] == 'female', 1, 0)

# drop features not needed for model 
titanic_df = titanic_df[[f for f in list(titanic_df) if f not in ['sex', 'name', 'boat','body', 'ticket', 'home.dest']]]

# convert the numeric pclass codes into readable categorical labels
titanic_df['pclass'] = np.where(titanic_df['pclass'] == 1, 'First', 
                                np.where(titanic_df['pclass'] == 2, 'Second', 'Third'))

# fill missing embarkation ports
titanic_df['embarked'] = titanic_df['embarked'].replace(np.NaN, 'Unknown') 


titanic_df.head()
Out[6]:
pclass survived age sibsp parch fare cabin embarked isfemale
0 First 1 29.00 0 0 211.3375 B S 1
1 First 1 0.92 1 2 151.5500 C S 0
2 First 0 2.00 1 2 151.5500 C S 1
3 First 0 30.00 1 2 151.5500 C S 0
4 First 0 25.00 1 2 151.5500 C S 1
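
Side note: the nested np.where above can also be expressed with pandas' .map, which reads more cleanly if there are more classes (an equivalent sketch; it assumes pclass still holds the raw 1/2/3 codes):

# equivalent mapping with .map, assuming pclass is still numeric
titanic_df['pclass'] = titanic_df['pclass'].map({1: 'First', 2: 'Second', 3: 'Third'})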
In [7]:
# how many nulls do we have?
titanic_df.isna().sum() 
Out[7]:
pclass        0
survived      0
age         263
sibsp         0
parch         0
fare          1
cabin         0
embarked      0
isfemale      0
dtype: int64
In [8]:
# impute missing ages with the mean age
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].mean())
titanic_df['age']
Out[8]:
0       29.000000
1        0.920000
2        2.000000
3       30.000000
4       25.000000
          ...    
1304    14.500000
1305    29.881138
1306    26.500000
1307    27.000000
1308    29.000000
Name: age, Length: 1309, dtype: float64
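
A refinement worth knowing, though not what we do here: impute age per passenger class instead of with the global mean, since ages skew by class. A minimal sketch:

# hypothetical alternative: class-wise mean imputation instead of the global mean
titanic_df['age'] = titanic_df.groupby('pclass')['age'].transform(
    lambda ages: ages.fillna(ages.mean()))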
In [9]:
# SEED - play around with this value, as it can change which model wins
SEED = 1234 # try 0

CatBoost's Turn!

In [10]:
titanic_df.head()
Out[10]:
pclass survived age sibsp parch fare cabin embarked isfemale
0 First 1 29.00 0 0 211.3375 B S 1
1 First 1 0.92 1 2 151.5500 C S 0
2 First 0 2.00 1 2 151.5500 C S 1
3 First 0 30.00 1 2 151.5500 C S 0
4 First 0 25.00 1 2 151.5500 C S 1
In [11]:
# identify categorical features by position (CatBoost takes column indices)
titanic_catboost_ready_df = titanic_df.dropna()  # drops the one row with a missing fare

features = [feat for feat in list(titanic_catboost_ready_df) if feat != 'survived']
print(features)

# non-float columns are treated as categorical
titanic_categories = np.where(titanic_catboost_ready_df[features].dtypes != np.float64)[0]
titanic_categories
['pclass', 'age', 'sibsp', 'parch', 'fare', 'cabin', 'embarked', 'isfemale']
Out[11]:
array([0, 2, 3, 5, 6, 7])
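
As a quick sanity check (not in the original cell), those positional indices map back to the following column names:

# map the positional indices back to feature names
print([features[i] for i in titanic_categories])
# ['pclass', 'sibsp', 'parch', 'cabin', 'embarked', 'isfemale']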
In [12]:
from catboost import CatBoostClassifier 

# note: titanic_df still contains one missing fare; CatBoost handles
# missing numeric values natively, so it is safe to pass through
X_train, X_test, y_train, y_test = train_test_split(titanic_df[features], 
                                                    titanic_df[['survived']], 
                                                    test_size=0.3, 
                                                    random_state=SEED)

params = {'iterations': 5000,
          'learning_rate': 0.01,
          'cat_features': titanic_categories,
          'depth': 3,
          'eval_metric': 'AUC',
          'verbose': 200,
          'od_type': 'Iter',  # overfitting detector based on iteration count
          'od_wait': 500,     # iterations to wait after the best one before stopping
          'random_seed': SEED
         }

cat_model = CatBoostClassifier(**params)
cat_model.fit(X_train, y_train,   
              eval_set=(X_test, y_test), 
              use_best_model=True,  # keep only the trees up to the best validation iteration
              plot=True);
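
Once training finishes, a follow-up cell like this one (not shown in the original) can score the held-out set; predict_proba, get_best_iteration, and scikit-learn's roc_auc_score are all standard API calls:

# score the held-out set and report the iteration the overfitting detector kept
from sklearn.metrics import roc_auc_score

test_probs = cat_model.predict_proba(X_test)[:, 1]  # probability of survival
print('test AUC:', roc_auc_score(y_test['survived'], test_probs))
print('best iteration:', cat_model.get_best_iteration())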