Get the "Applied Data Science Edge"!

The ViralML School

Fundamental Market Analysis with Python - Find Your Own Answers On What Is Going on in the Financial Markets

Web Work

Python Web Work - Prototyping Guide for Maker

Use HTML5 Templates, Serve Dynamic Content, Build Machine Learning Web Apps, Grow Audiences & Conquer the World!

Hot off the Press!

The Little Book of Fundamental Market Indicators

My New Book: "The Little Book of Fundamental Analysis: Hands-On Market Analysis with Python" is Out!

CatBoost vs XGBoost - A Gentle Introduction to CatBoost - Free Udemy Class

Introduction

XGBoost is one of the most powerful boosted models in existence until now... here comes CatBoost. Let's explore how it compares to XGBoost using Python and also explore CatBoost on both a classification dataset and a regression one. Let's have some fun!



If you liked it, please share it:

Code

CatBoost vs XGBoost Battle & CatBoost Introduction

CatBoost vs XGBoost - A CatBoost Introduction



Link to Udemy Class
In [1]:
from IPython.display import Image
Image(filename='CatBoost vs Xgboost.png' )
Out[1]:

Let's pit CatBoost against XGBoost in a friendly classificiation battle! Cat Fight Time

I got some of my best scores on Kaggle using it! At one point, I was ranked 185th and I thank XGBoost ( https://www.kaggle.com/amunategui ), lot's of others thanked XGBoost too. We still thank it today - it's integrated all over the place - scikit-learn, cloud providers, I use it everyday for customers in GCP as it is now compatible with Cloud ML, so you can model terrabytes of data using it.

GCP Built-in XGBoost algorithm https://cloud.google.com/ml-engine/docs/algorithms/xgboost-start

Scikit-Learn API Scikit-Learn Wrapper interface for XGBoost https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

In [2]:
# installing catboost and xgboost

# !pip3 install catboost --user
# !pip3 install xgboost --user
In [3]:
# Let's compare XGBoost to CatBoost

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split

import catboost
print('catboost version:', catboost.__version__)
import xgboost
print('xgboost version:', xgboost.__version__)
catboost version: 0.18
xgboost version: 0.72.1

Let's get an independent Titanic data set from the Vanderbilt University

In [4]:
titanic_df = pd.read_csv(
    'http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv')
titanic_df.head()
Out[4]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
0 1 1 Allen, Miss. Elisabeth Walton female 29.00 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.92 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2.00 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.00 1 2 113781 151.5500 C22 C26 S NaN 135.0 Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.00 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
In [5]:
# simple feature engineering

# strip first letter from cabin number if there
titanic_df['cabin'] = titanic_df['cabin'].replace(np.NaN, 'U') 
titanic_df['cabin'] = [ln[0] for ln in titanic_df['cabin'].values]
titanic_df['cabin'] = titanic_df['cabin'].replace('U', 'Unknown') 
titanic_df['cabin'].head()
Out[5]:
0    B
1    C
2    C
3    C
4    C
Name: cabin, dtype: object
In [6]:
# create isfemale field and use numerical values
titanic_df['isfemale'] = np.where(titanic_df['sex'] == 'female', 1, 0)

# drop features not needed for model 
titanic_df = titanic_df[[f for f in list(titanic_df) if f not in ['sex', 'name', 'boat','body', 'ticket', 'home.dest']]]

# make pclass actual categorical column
titanic_df['pclass'] = np.where(titanic_df['pclass'] == 1, 'First', 
                                np.where(titanic_df['pclass'] == 2, 'Second', 'Third'))


titanic_df['embarked'] = titanic_df['embarked'].replace(np.NaN, 'Unknown') 


titanic_df.head()
Out[6]:
pclass survived age sibsp parch fare cabin embarked isfemale
0 First 1 29.00 0 0 211.3375 B S 1
1 First 1 0.92 1 2 151.5500 C S 0
2 First 0 2.00 1 2 151.5500 C S 1
3 First 0 30.00 1 2 151.5500 C S 0
4 First 0 25.00 1 2 151.5500 C S 1
In [7]:
# how many nulls do we have?
titanic_df.isna().sum() 
Out[7]:
pclass        0
survived      0
age         263
sibsp         0
parch         0
fare          1
cabin         0
embarked      0
isfemale      0
dtype: int64
In [8]:
# impute age to mean
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].mean())
titanic_df['age']
Out[8]:
0       29.000000
1        0.920000
2        2.000000
3       30.000000
4       25.000000
5       48.000000
6       63.000000
7       39.000000
8       53.000000
9       71.000000
10      47.000000
11      18.000000
12      24.000000
13      26.000000
14      80.000000
15      29.881138
16      24.000000
17      50.000000
18      32.000000
19      36.000000
20      37.000000
21      47.000000
22      26.000000
23      42.000000
24      29.000000
25      25.000000
26      25.000000
27      19.000000
28      35.000000
29      28.000000
          ...    
1279    14.000000
1280    22.000000
1281    22.000000
1282    29.881138
1283    29.881138
1284    29.881138
1285    32.500000
1286    38.000000
1287    51.000000
1288    18.000000
1289    21.000000
1290    47.000000
1291    29.881138
1292    29.881138
1293    29.881138
1294    28.500000
1295    21.000000
1296    27.000000
1297    29.881138
1298    36.000000
1299    27.000000
1300    15.000000
1301    45.500000
1302    29.881138
1303    29.881138
1304    14.500000
1305    29.881138
1306    26.500000
1307    27.000000
1308    29.000000
Name: age, Length: 1309, dtype: float64
In [9]:
# SEED - play around with this variable as it will change winners
SEED = 1234 # try 0

CatBoost's Turn!

In [10]:
titanic_df.head()
Out[10]:
pclass survived age sibsp parch fare cabin embarked isfemale
0 First 1 29.00 0 0 211.3375 B S 1
1 First 1 0.92 1 2 151.5500 C S 0
2 First 0 2.00 1 2 151.5500 C S 1
3 First 0 30.00 1 2 151.5500 C S 0
4 First 0 25.00 1 2 151.5500 C S 1
In [11]:
# map categorical features
titanic_catboost_ready_df = titanic_df.dropna() 

features = [feat for feat in list(titanic_catboost_ready_df) if feat != 'survived']
print(features)
titanic_categories = np.where(titanic_catboost_ready_df[features].dtypes != np.float)[0]
titanic_categories
['pclass', 'age', 'sibsp', 'parch', 'fare', 'cabin', 'embarked', 'isfemale']
Out[11]:
array([0, 2, 3, 5, 6, 7])
In [12]:
from catboost import CatBoostClassifier 

X_train, X_test, y_train, y_test = train_test_split(titanic_df[features], 
                                                    titanic_df[['survived']], 
                                                    test_size=0.3, 
                                                     random_state=SEED)
 

params = {'iterations':5000,
        'learning_rate':0.01,
        'cat_features':titanic_categories,
        'depth':3,
        'eval_metric':'AUC',
        'verbose':200,
        'od_type':"Iter", # overfit detector
        'od_wait':500, # most recent best iteration to wait before stopping
        'random_seed': SEED
          }

cat_model = CatBoostClassifier(**params)
cat_model.fit(X_train, y_train,   
          eval_set=(X_test, y_test), 
          use_best_model=True, # True if we don't want to save trees created after iteration with the best validation score
          plot=True  
         );
 
0:	test: 0.8269924	best: 0.8269924 (0)	total: 67ms	remaining: 5m 35s
200:	test: 0.8437691	best: 0.8450862 (179)	total: 2.34s	remaining: 56s
400:	test: 0.8479008	best: 0.8484554 (369)	total: 4.44s	remaining: 50.9s
600:	test: 0.8509234	best: 0.8509511 (592)	total: 6.35s	remaining: 46.5s
800:	test: 0.8510205	best: 0.8514364 (786)	total: 8.43s	remaining: 44.2s
1000:	test: 0.8497865	best: 0.8520742 (851)	total: 10.3s	remaining: 41s
1200:	test: 0.8528645	best: 0.8530864 (1191)	total: 12.2s	remaining: 38.5s
1400:	test: 0.8538628	best: 0.8540014 (1359)	total: 15s	remaining: 38.6s
1600:	test: 0.8547363	best: 0.8552077 (1557)	total: 17.2s	remaining: 36.4s
1800:	test: 0.8551106	best: 0.8553602 (1787)	total: 19.3s	remaining: 34.2s
2000:	test: 0.8557762	best: 0.8557762 (2000)	total: 21.4s	remaining: 32s
2200:	test: 0.8568992	best: 0.8568992 (2197)	total: 23.7s	remaining: 30.2s
2400:	test: 0.8584244	best: 0.8586462 (2357)	total: 25.4s	remaining: 27.5s
2600:	test: 0.8599773	best: 0.8600882 (2591)	total: 27.7s	remaining: 25.5s
2800:	test: 0.8612528	best: 0.8612806 (2789)	total: 29.7s	remaining: 23.3s
3000:	test: 0.8619461	best: 0.8620570 (2876)	total: 32.1s	remaining: 21.4s
3200:	test: 0.8622789	best: 0.8626948 (3177)	total: 34.3s	remaining: 19.3s
3400:	test: 0.8620570	best: 0.8627225 (3345)	total: 36.7s	remaining: 17.3s
3600:	test: 0.8625007	best: 0.8628335 (3585)	total: 39.8s	remaining: 15.5s
3800:	test: 0.8633049	best: 0.8635267 (3778)	total: 43s	remaining: 13.6s
4000:	test: 0.8634158	best: 0.8639427 (3950)	total: 45.8s	remaining: 11.4s
4200:	test: 0.8631939	best: 0.8639427 (3950)	total: 48.7s	remaining: 9.26s
4400:	test: 0.8636099	best: 0.8639427 (3950)	total: 51.3s	remaining: 6.99s
Stopped by overfitting detector  (500 iterations wait)

bestTest = 0.8639426543
bestIteration = 3950

Shrink model to first 3951 iterations.
In [13]:
# Confusion matrix
dval_predictions = cat_model.predict(X_test)
dval_predictions

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, [1 if p > 0.5 else 0 for p in dval_predictions])

plt.figure(figsize = (6,4))
plt.ticklabel_format(style='plain', axis='y', useOffset=False)
sns.set(font_scale=1.4)
sns.heatmap(cm, annot=True, annot_kws={"size": 16}) 
plt.show()
In [14]:
cat_model.get_feature_importance()
Out[14]:
array([13.47392213, 14.06654777,  4.57348393,  4.04448054, 17.05781177,
        7.42358554,  7.38780932, 31.972359  ])
In [15]:
feat_import = [t for t in zip(features, cat_model.get_feature_importance())]
feat_import_df = pd.DataFrame(feat_import, columns=['Feature', 'VarImp'])
feat_import_df = feat_import_df.sort_values('VarImp', ascending=False)
feat_import_df[feat_import_df['VarImp'] > 0]
Out[15]:
Feature VarImp
7 isfemale 31.972359
4 fare 17.057812
1 age 14.066548
0 pclass 13.473922
5 cabin 7.423586
6 embarked 7.387809
2 sibsp 4.573484
3 parch 4.044481

XGBoost's Turn

Dummy/one-hot only for XGBoost

In [16]:
titanic_df.isnull().any()
Out[16]:
pclass      False
survived    False
age         False
sibsp       False
parch       False
fare         True
cabin       False
embarked    False
isfemale    False
dtype: bool
In [17]:
def prepare_data_for_model(raw_dataframe, target_columns, drop_first = True, make_na_col = False):
    # dummy all categorical fields 
    dataframe_dummy = pd.get_dummies(raw_dataframe, columns=target_columns, 
                                     drop_first=drop_first, 
                                     dummy_na=make_na_col)
    return (dataframe_dummy)

# create dummy features 
titanic_xgboost_ready_df = prepare_data_for_model(titanic_df, target_columns=['pclass', 'cabin', 'embarked'])
titanic_xgboost_ready_df = titanic_xgboost_ready_df.dropna() 

list(titanic_xgboost_ready_df)
Out[17]:
['survived',
 'age',
 'sibsp',
 'parch',
 'fare',
 'isfemale',
 'pclass_Second',
 'pclass_Third',
 'cabin_B',
 'cabin_C',
 'cabin_D',
 'cabin_E',
 'cabin_F',
 'cabin_G',
 'cabin_T',
 'cabin_Unknown',
 'embarked_Q',
 'embarked_S',
 'embarked_Unknown']
In [18]:
# split data into train and test portions and model
features = [feat for feat in list(titanic_xgboost_ready_df) if feat != 'survived']
X_train, X_test, y_train, y_test = train_test_split(titanic_xgboost_ready_df[features], 
                                                 titanic_xgboost_ready_df[['survived']], 
                                                test_size=0.3, 
                                                 random_state=SEED)
 

import xgboost  as xgb
xgb_params = {
    'max_depth':3, 
    'eta':0.01, 
    'silent':0, 
    'eval_metric':'auc',
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective':'binary:logistic',
    'seed' : SEED
}

dtrain = xgb.DMatrix(X_train, y_train, feature_names=X_train.columns.values)
dtest = xgb.DMatrix(X_test, y_test, feature_names=X_test.columns.values)

evals = [(dtrain,'train'),(dtest,'eval')]
xgb_model = xgb.train ( params = xgb_params,
              dtrain = dtrain,
              num_boost_round = 5000,
              verbose_eval=200, 
              early_stopping_rounds = 500,
              evals=evals,
              maximize = True)
[0]	train-auc:0.712492	eval-auc:0.693187
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 500 rounds.
[200]	train-auc:0.882827	eval-auc:0.862319
[400]	train-auc:0.895702	eval-auc:0.863279
[600]	train-auc:0.908131	eval-auc:0.870968
[800]	train-auc:0.918307	eval-auc:0.87386
[1000]	train-auc:0.925438	eval-auc:0.875056
[1200]	train-auc:0.931388	eval-auc:0.87589
[1400]	train-auc:0.93583	eval-auc:0.875779
[1600]	train-auc:0.939955	eval-auc:0.874611
Stopping. Best iteration:
[1161]	train-auc:0.930458	eval-auc:0.876808

In [19]:
# get dataframe version of important feature for model 
xgb_fea_imp=pd.DataFrame(list(xgb_model.get_fscore().items()),
columns=['feature','importance']).sort_values('importance', ascending=False)
xgb_fea_imp.head(10)
Out[19]:
feature importance
0 fare 3938
5 age 2916
4 isfemale 635
3 sibsp 567
8 embarked_S 376
7 pclass_Third 322
9 cabin_Unknown 289
1 parch 280
11 pclass_Second 152
2 embarked_Q 149
In [20]:
# Confusion matrix
dval_predictions = xgb_model.predict(dtest)
dval_predictions

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, [1 if p > 0.5 else 0 for p in dval_predictions])

plt.figure(figsize = (6,4))
plt.ticklabel_format(style='plain', axis='y', useOffset=False)
sns.set(font_scale=1.4)
sns.heatmap(cm, annot=True, annot_kws={"size": 16}) 
plt.show()

Final Winners

In [21]:
xgb_model.best_score
Out[21]:
0.876808
In [22]:
cat_model.best_score_
Out[22]:
{'learn': {'Logloss': 0.30515428206120315},
 'validation': {'Logloss': 0.4201347125001406, 'AUC': 0.8639426543175642}}

Let's dive deep into CatBoost!

Initial release date: July 18, 2017 by Yandex researchers and is open sourced

https://catboost.ai/

pip install catboost

Quick start https://catboost.ai/docs/concepts/python-quickstart.html

CatBoostClassifier

In [23]:
from catboost.datasets import titanic
titanic_train, titanic_test = titanic()

print(titanic_train.head(3))

titanic_train.shape
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
Out[23]:
(891, 12)
In [24]:
titanic_test.shape
Out[24]:
(418, 11)
In [ ]:
# pip install pandas-profiling
import pandas_profiling as pp
pp.ProfileReport(titanic_train)
In [25]:
# clean up NaNs
titanic_train.isnull().sum(axis=0)
Out[25]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
In [26]:
# impute age to mean
titanic_train['Age'] = titanic_train['Age'].fillna(titanic_train['Age'].mean())
titanic_train['Embarked'] = titanic_train['Embarked'].replace(np.nan, 'Unknown', regex=True)
In [27]:
from catboost import CatBoostClassifier

# data split
outome_name = 'Survived'
features_for_model =['Pclass', 'Sex', 'Age', 'Embarked']


# data split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(titanic_train[features_for_model], 
                                                 titanic_train[outome_name], 
                                                 test_size=0.3, 
                                                 random_state=1)

# tell catboost which are categorical columns
titanic_categories = np.where(X_train[features_for_model].dtypes != np.float)[0]
print('titanic_categories:', titanic_categories)
 
titanic_categories: [0 1 3]
In [28]:
params = {'iterations':1000,
        'learning_rate':0.01,
        'cat_features':titanic_categories,
        'depth':3,
        'eval_metric':'AUC',
        'verbose':200,
        'od_type':"Iter", # overfit detector
        'od_wait':500,  
         }

 

model_classifier = CatBoostClassifier(**params)
                       
model_classifier.fit(X_train, y_train, 
                     eval_set=(X_test, y_test),  
                     use_best_model=True, 
                     plot= True  
                    );
 
0:	test: 0.7554419	best: 0.7554419 (0)	total: 31.3ms	remaining: 31.3s
200:	test: 0.8107133	best: 0.8157431 (28)	total: 2.2s	remaining: 8.76s
400:	test: 0.8169082	best: 0.8169082 (343)	total: 4.01s	remaining: 6s
600:	test: 0.8160557	best: 0.8183007 (403)	total: 5.79s	remaining: 3.85s
800:	test: 0.8196363	best: 0.8213413 (738)	total: 7.89s	remaining: 1.96s
999:	test: 0.8202330	best: 0.8213413 (738)	total: 9.89s	remaining: 0us

bestTest = 0.8213412901
bestIteration = 738

Shrink model to first 739 iterations.
In [29]:
# feature importance 
feat_import = [t for t in zip(features_for_model, model_classifier.get_feature_importance())]
feat_import_df = pd.DataFrame(feat_import, columns=['Feature', 'VarImp'])
feat_import_df = feat_import_df.sort_values('VarImp', ascending=False)
feat_import_df.head(20)
Out[29]:
Feature VarImp
1 Sex 56.915324
0 Pclass 26.666115
2 Age 10.183606
3 Embarked 6.234955

CatBoostRegressor

Boston house prices dataset

The Boston Housing Dataset consists of price of houses in various places in Boston. Alongside with price, the dataset also provide information such as Crime (CRIM), areas of non-retail business in the town (INDUS), the age of people who own the house (AGE).

In [30]:
from sklearn.datasets import load_boston
boston_dataset = load_boston()
boston_dataset.keys()
Out[30]:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
In [31]:
for ln in boston_dataset.DESCR.split('\n'):
    print(ln)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

In [32]:
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head(10)
Out[32]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394.12 5.21
6 0.08829 12.5 7.87 0.0 0.524 6.012 66.6 5.5605 5.0 311.0 15.2 395.60 12.43
7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0 15.2 396.90 19.15
8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0 15.2 386.63 29.93
9 0.17004 12.5 7.87 0.0 0.524 6.004 85.9 6.5921 5.0 311.0 15.2 386.71 17.10
In [33]:
# Our target variable - Median value of owner-occupied homes in $1000s
boston['MEDV'] = boston_dataset.target
boston.head()
Out[33]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
In [ ]:
pp.ProfileReport(boston)
In [34]:
# clean up NaNs
boston.isnull().sum(axis=0)
Out[34]:
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64
In [35]:
from catboost import CatBoostRegressor

# data split
outome_name = 'MEDV'
features_for_model = [f for f in list(boston) if f not in [outome_name, 'TAX']]

# get categories and cast to string
boston_categories = np.where([boston[f].apply(float.is_integer).all() for f in features_for_model])[0]
print('boston_categories:', boston_categories)

# convert to values to string
for feature in [list(boston[features_for_model])[f] for f in list(boston_categories)]:
    print(feature)
    boston[feature] = boston[feature].to_string()


# data split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(boston[features_for_model], 
                                                 boston[outome_name], 
                                                 test_size=0.3, 
                                                 random_state=1)




params = {'iterations':5000,
        'learning_rate':0.001,
        'depth':3,
        'loss_function':'RMSE',
        'eval_metric':'RMSE',
        'random_seed':55,
        'cat_features':boston_categories,
        'metric_period':200,  
        'od_type':"Iter",  
        'od_wait':20,  
        'verbose':True,
        'use_best_model':True}


model_regressor = CatBoostRegressor(**params)

model_regressor.fit(X_train, y_train, 
          eval_set=(X_test, y_test),  
          use_best_model=True,  
          plot= True   
         );
boston_categories: [3 8]
CHAS
RAD
Warning: Overfitting detector is active, thus evaluation metric is calculated on every iteration. 'metric_period' is ignored for evaluation metric.
0:	learn: 9.0062199	test: 9.5913153	best: 9.5913153 (0)	total: 3.52ms	remaining: 17.6s
200:	learn: 8.1097535	test: 8.6739924	best: 8.6739924 (200)	total: 398ms	remaining: 9.51s
400:	learn: 7.3671975	test: 7.9052743	best: 7.9052743 (400)	total: 733ms	remaining: 8.41s
600:	learn: 6.7332418	test: 7.2399556	best: 7.2399556 (600)	total: 1.12s	remaining: 8.17s
800:	learn: 6.1944857	test: 6.6637468	best: 6.6637468 (800)	total: 1.51s	remaining: 7.92s
1000:	learn: 5.7395102	test: 6.1776956	best: 6.1776956 (1000)	total: 1.81s	remaining: 7.24s
1200:	learn: 5.3595767	test: 5.7679251	best: 5.7679251 (1200)	total: 2.07s	remaining: 6.55s
1400:	learn: 5.0351836	test: 5.4117721	best: 5.4117721 (1400)	total: 2.33s	remaining: 5.99s
1600:	learn: 4.7588279	test: 5.1042392	best: 5.1042392 (1600)	total: 2.7s	remaining: 5.74s
1800:	learn: 4.5228107	test: 4.8422938	best: 4.8422938 (1800)	total: 3.11s	remaining: 5.53s
2000:	learn: 4.3186420	test: 4.6256193	best: 4.6256193 (2000)	total: 3.43s	remaining: 5.14s
2200:	learn: 4.1376143	test: 4.4359809	best: 4.4359809 (2200)	total: 3.66s	remaining: 4.66s
2400:	learn: 3.9849409	test: 4.2810937	best: 4.2810937 (2400)	total: 3.92s	remaining: 4.24s
2600:	learn: 3.8455645	test: 4.1391164	best: 4.1391164 (2600)	total: 4.18s	remaining: 3.86s
2800:	learn: 3.7227215	test: 4.0203071	best: 4.0203071 (2800)	total: 4.46s	remaining: 3.5s
3000:	learn: 3.6157683	test: 3.9192963	best: 3.9192963 (3000)	total: 4.72s	remaining: 3.14s
3200:	learn: 3.5211869	test: 3.8313423	best: 3.8313423 (3200)	total: 4.97s	remaining: 2.8s
3400:	learn: 3.4370885	test: 3.7550608	best: 3.7550608 (3400)	total: 5.26s	remaining: 2.47s
3600:	learn: 3.3596317	test: 3.6827074	best: 3.6827074 (3600)	total: 5.54s	remaining: 2.15s
3800:	learn: 3.2897361	test: 3.6224526	best: 3.6224526 (3800)	total: 5.86s	remaining: 1.85s
4000:	learn: 3.2286628	test: 3.5723766	best: 3.5723766 (4000)	total: 6.12s	remaining: 1.53s
4200:	learn: 3.1709888	test: 3.5216147	best: 3.5216147 (4200)	total: 6.37s	remaining: 1.21s
4400:	learn: 3.1177826	test: 3.4758259	best: 3.4758259 (4400)	total: 6.61s	remaining: 900ms
4600:	learn: 3.0655620	test: 3.4298636	best: 3.4298636 (4600)	total: 6.83s	remaining: 593ms
4800:	learn: 3.0189495	test: 3.3891197	best: 3.3891197 (4800)	total: 7.08s	remaining: 293ms
4999:	learn: 2.9734377	test: 3.3499927	best: 3.3499927 (4999)	total: 7.33s	remaining: 0us

bestTest = 3.349992734
bestIteration = 4999

In [36]:
# feature importance 
feat_import = [t for t in zip(features_for_model, model_regressor.get_feature_importance())]
feat_import_df = pd.DataFrame(feat_import, columns=['Feature', 'VarImp'])
feat_import_df = feat_import_df.sort_values('VarImp', ascending=False)
feat_import_df.head(20)
Out[36]:
Feature VarImp
11 LSTAT 44.676129
5 RM 27.315121
7 DIS 7.292907
4 NOX 5.959828
9 PTRATIO 4.406179
0 CRIM 4.102309
6 AGE 2.862897
2 INDUS 1.677943
10 B 1.504728
1 ZN 0.201959
3 CHAS 0.000000
8 RAD 0.000000

Show Notes

(pardon typos and formatting -
these are the notes I use to make the videos)

XGBoost is one of the most powerful boosted models in existence until now... here comes CatBoost. Let's explore how it compares to XGBoost using Python and also explore CatBoost on both a classification dataset and a regression one. Let's have some fun!