The ViralML Show! Walkthrough of the dummyVars function from the {caret} package: Machine Learning with R

Brief Walkthrough Of The dummyVars Function From {caret}

Packages Used in this Walkthrough

{caret} - dummyVars function

As the name implies, the dummyVars function allows you to create dummy variables - in other words it translates text data into numerical data for modeling purposes.

If you are planning on doing predictive analytics or machine learning and want to use regression or any other modeling technique that requires numerical data, you will need to transform your text data into numbers; otherwise you run the risk of leaving a lot of information on the table.

In R, there are plenty of ways of translating text into numerical data. You can do it manually, use a base function such as model.matrix, or use a packaged function like dummyVars from the caret package. One of the big advantages of going with the caret package is that it’s full of features, including hundreds of algorithms and pre-processing functions. Once your data fits into caret’s modular design, it can be run through different models with minimal tweaking.
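As a rough sketch of the manual and base-R routes mentioned above, the snippet below builds indicator columns without caret. The `pets` data frame is a made-up toy example, and `model.matrix` is my choice of base function for illustration; it is not necessarily the one the walkthrough had in mind.

```r
# Hypothetical toy data, not part of the original walkthrough
pets <- data.frame(animal = factor(c("cat", "dog", "bird", "dog")))

# Manual route: one 0/1 column built by hand for a single level
pets$is_dog <- as.integer(pets$animal == "dog")

# Base-R route: "- 1" drops the intercept so every level gets its own column
dummies <- model.matrix(~ animal - 1, data = pets)

print(pets)
print(dummies)
```

Each row of `dummies` has exactly one 1, in the column for that row's level.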

Let’s look at a few examples of dummy variables. Suppose you have a survey question with 5 categorical values: very unhappy, unhappy, neutral, happy and very happy.

survey <- data.frame(service=c('very unhappy','unhappy','neutral','happy','very happy'))
print(survey)

##        service
## 1 very unhappy
## 2      unhappy
## 3      neutral
## 4        happy
## 5   very happy

You can easily translate this into a sequence of numbers from 1 to 5, where 3 means neutral and, for a linear model that thinks in fractions, 2.5 means somewhat unhappy and 4.88 means very happy. Here we have successfully transformed the survey question into a continuous numerical scale, so we do not need dummy variables; a simple rank column will do.

survey <- data.frame(service=c('very unhappy','unhappy','neutral','happy','very happy'), rank=c(1,2,3,4,5))
print(survey)

##        service rank
## 1 very unhappy    1
## 2      unhappy    2
## 3      neutral    3
## 4        happy    4
## 5   very happy    5
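The rank column here was typed by hand, which only works when the rows are already in scale order. One possible way to derive it for responses in any order is an ordered factor. The `responses` vector and the `scale_levels` name are assumptions made up for this sketch.

```r
# Hypothetical responses, in no particular order
responses <- c("happy", "very unhappy", "neutral", "very happy", "unhappy")

# Spell out the scale from lowest to highest so R knows the ordering
scale_levels <- c("very unhappy", "unhappy", "neutral", "happy", "very happy")
service <- factor(responses, levels = scale_levels, ordered = TRUE)

# The integer codes of an ordered factor follow the order of its levels
rank <- as.integer(service)

print(data.frame(service, rank))
```

Listing the levels explicitly matters: without `levels =`, R would sort them alphabetically and the ranks would no longer follow the satisfaction scale.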

So, the above could easily be used in a model that needs numbers, and the ‘rank’ variable still represents the data accurately in place of ‘service’. But this only works in specific situations where you have somewhat linear and continuous-like data. What happens with categorical values such as marital status, gender, or whether someone is alive?

Does it make sense to be a quarter female? Or half single? Even numerical data of a categorical nature may require transformation. Take the zip code system. Does the half-way point between two zip codes make geographical sense? Because that is how a regression model would use it.

It may work in a fuzzy-logic way but it won’t help in predicting much; therefore we need a more precise way of translating these values into numbers so that they can be regressed by the model.
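To make the zip code point concrete, here is a small base-R sketch that compares treating the codes as a number with treating them as categories. The `homes` data frame and its zip codes are invented for illustration.

```r
# Hypothetical zip codes stored as plain numbers
homes <- data.frame(zip = c(97201, 10001, 60601, 10001))

# As a number: a single column, so a model would interpolate between codes
as_number <- model.matrix(~ zip - 1, data = homes)

# As a category: one 0/1 column per distinct zip code
as_category <- model.matrix(~ factor(zip) - 1, data = homes)

print(ncol(as_number))
print(ncol(as_category))
print(as_category)
```

The numeric version gives the model one slope across all codes, while the categorical version lets each zip code carry its own effect.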

library(caret)
# check the help file for more details
?dummyVars

The dummyVars function breaks out unique values from a column into individual columns. If you have 1000 unique values in a column, dummying them will add 1000 new columns to your data set, so be careful. Let’s create a more complex data frame:

customers <- data.frame(
  id=c(10,20,30,40,50),
  gender=c('male','female','female','male','female'),
  mood=c('happy','sad','happy','sad','happy'),
  outcome=c(1,1,0,0,0))

Now ask the dummyVars function to dummify it. The function takes a standard R formula: something ~ (broken down) by something else or groups of other things. Here we simply use ~ . and dummyVars will transform all character and factor columns (the function never transforms numeric columns) and return the entire data set:

# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)

##   id gender.female gender.male mood.happy mood.sad outcome
## 1 10             0           1          1        0       1
## 2 20             1           0          0        1       1
## 3 30             1           0          1        0       0
## 4 40             0           1          0        1       0
## 5 50             1           0          1        0       0

If you want to transform just one column, name only that column in the formula; the function then returns a data frame based on that variable alone:

customers <- data.frame(
  id=c(10,20,30,40,50),
  gender=c('male','female','female','male','female'),
  mood=c('happy','sad','happy','sad','happy'),
  outcome=c(1,1,0,0,0))

dmy <- dummyVars(" ~ gender", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)

##   gender.female gender.male
## 1             0           1
## 2             1           0
## 3             1           0
## 4             0           1
## 5             1           0

The fullRank parameter is worth mentioning here. The general rule for creating dummy variables is to have one fewer variable than the number of categories present, which avoids perfect collinearity (the dummy variable trap). This removes a redundant, perfectly correlated column, and it also saves space. If you have a factor column with two levels, ‘male’ and ‘female’, you don’t need to transform it into two columns. Instead, you pick one of the levels, say female, and a row is female if it’s a 1 or male if it’s a 0.
Let’s turn on fullRank and try our data frame again:

customers <- data.frame(
  id=c(10,20,30,40,50),
  gender=c('male','female','female','male','female'),
  mood=c('happy','sad','happy','sad','happy'),
  outcome=c(1,1,0,0,0))

# spell out TRUE: the shorthand T is an ordinary variable that can be overwritten
dmy <- dummyVars(" ~ .", data = customers, fullRank = TRUE)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)

##   id gender.male mood.sad outcome
## 1 10           1        0       1
## 2 20           0        1       1
## 3 30           0        0       0
## 4 40           1        1       0
## 5 50           0        0       0

As you can see, it kept male and sad. If you are 0 in both columns, then you are female and happy.
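The collinearity that fullRank guards against can be checked numerically. The sketch below uses base R rather than caret: it builds the gender indicators with `model.matrix` and compares the number of columns with the matrix rank from `qr`. The variable names `full` and `reduced` are made up for this illustration.

```r
gender <- factor(c("male", "female", "female", "male", "female"))

# Intercept plus one column per level: the two dummies always sum to the intercept
full <- cbind(intercept = 1, model.matrix(~ gender - 1))

# Intercept plus a single dummy: the first level (female) becomes the baseline
reduced <- model.matrix(~ gender)

# A rank lower than the column count means one column is redundant
print(c(columns = ncol(full), rank = qr(full)$rank))
print(c(columns = ncol(reduced), rank = qr(reduced)$rank))
```

In the `full` matrix the rank falls one short of the column count, which is the dummy variable trap. Dropping one level brings the two back in line.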

Things to keep in mind

Don't dummy a large data set full of zip codes; you more than likely don't have the computing muscle to add an extra 43,000 columns to your data set.

You can dummify large, free-text columns. Before running the function, look for repeated words or sentences, keep only the top 50 of them, and replace the rest with 'others'. This will allow you to use that field without delving deeply into NLP.
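Both tips can be sketched in a few lines of base R: count the distinct values first to see how many columns dummying would add, then keep only the most frequent values and fold the rest into 'others'. The `comments` vector and the `lump_top_n` helper are hypothetical, and the example keeps the top 2 rather than the top 50 so the result stays readable.

```r
# Hypothetical free-text field with a few repeated answers
comments <- c("great", "great", "ok", "bad", "great",
              "ok", "slow", "rude", "ok", "great")

# Each distinct value would become its own dummy column
print(length(unique(comments)))

# Keep the n most frequent values and replace everything else with `other`
lump_top_n <- function(x, n, other = "others") {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  factor(ifelse(x %in% keep, x, other))
}

lumped <- lump_top_n(comments, n = 2)
print(table(lumped))
```

After lumping, the column has at most n + 1 levels, so the number of dummy columns stays bounded no matter how many distinct values the raw field contains.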

Hi there, this is Manuel Amunategui. If you're enjoying the content, don't forget to sign up for my newsletter at http://www.viralml.com

Full source code:

survey <- data.frame(service=c('very unhappy','unhappy','neutral','happy','very happy'))
print(survey)

survey <- data.frame(service=c('very unhappy','unhappy','neutral','happy','very happy'), rank=c(1,2,3,4,5))
print(survey)

library(caret)
?dummyVars # many options

customers <- data.frame(
  id=c(10,20,30,40,50),
  gender=c('male','female','female','male','female'),
  mood=c('happy','sad','happy','sad','happy'),
  outcome=c(1,1,0,0,0))

dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)
str(trsf)

# numeric columns are left alone; make outcome a factor so it is dummified too
customers$outcome <- as.factor(customers$outcome)

# transform just gender
dmy <- dummyVars(" ~ gender", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)

# use fullRank to avoid the 'dummy trap'
dmy <- dummyVars(" ~ .", data = customers, fullRank = TRUE)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)
