Just as the original Titanic VHS was published in two video cassettes, this Titanic analysis is also being published in two posts.

In this post–part 2–I’m going to be exploring random forests for the first time, and I will compare it to the outcome of the logistic regression I did last time.


Random forest vs. Logistic regression

Last time I explained how logistic regression uses a link function transforms non-linear relationships into linear ones. This means that although the outcome is discrete, a logistic regression assumes that the decision boundary given the variables (“feature space” in machine learning terminology) is linear. A random forest works a little differently. A random forest is made up of a number of decision trees, where each decision tree makes a series of decisions based on a subset of the data until it predicts the ultimate outcome, in this case survival or death. Because single decision trees are prone to over fitting, random forests are used to compensate by including multiple trees based on subsets of the data. Essentially, both algorithms can be used to form expectations about who would have survived the Titanic.

Practically, there are a few important differences to note:
1. It’s not necessary to dummy code the categorical variables
2. You don’t need to explicitly test interactions between variables (example gender and class), these are automatically identified by the algorithm.

After cleaning the data last time we ended up with this Titanic dataset

load("titanic.RData")
head(titanic)
##   Ticket PassengerId Survived Pclass
## 1 110152         258        1      1
## 2 110152         505        1      1
## 3 110152         760        1      1
## 4 110413         559        1      1
## 5 110413         586        1      1
## 6 110413         263        0      1
##                                                       Name    Sex Age
## 1                                     Cherry, Miss. Gladys female  30
## 2                                    Maioni, Miss. Roberta female  16
## 3 Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) female  33
## 4                   Taussig, Mrs. Emil (Tillie Mandelbaum) female  39
## 5                                      Taussig, Miss. Ruth female  18
## 6                                        Taussig, Mr. Emil   male  52
##   SibSp Parch Cabin Embarked tID Fclass Sclass fem EmbC EmbQ    price
## 1     0     0   B77        S   0      1      0   1    0    0 3.361532
## 2     0     0   B79        S   0      1      0   1    0    0 3.361532
## 3     0     0   B77        S   0      1      0   1    0    0 3.361532
## 4     1     1   E67        S   0      1      0   1    0    0 3.279030
## 5     0     2   E68        S   0      1      0   1    0    0 3.279030
## 6     1     1   E67        S   0      1      0   0    0    0 3.279030

The complete dataset is a mix of the test and train Kaggle sets, indicated by the variable tID, where 0 = train and 1 = test

I cleaned the data in the previous post, but before I run randomForest I want to be sure all categorical variables are recognized as such by R.

str(titanic)
## 'data.frame':    1309 obs. of  18 variables:
##  $ Ticket     : Factor w/ 929 levels "110152","110413",..: 1 1 1 2 2 2 3 3 682 683 ...
##  $ PassengerId: int  258 505 760 559 586 263 111 476 1227 1050 ...
##  $ Survived   : int  1 1 1 1 1 0 0 0 NA NA ...
##  $ Pclass     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Name       : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 157 506 709 805 803 804 672 165 1134 930 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 1 1 1 1 1 2 2 2 2 2 ...
##  $ Age        : num  30 16 33 39 18 ...
##  $ SibSp      : int  0 0 0 1 0 1 0 0 0 0 ...
##  $ Parch      : int  0 0 0 1 2 1 0 0 0 0 ...
##  $ Cabin      : Factor w/ 187 levels "","A10","A14",..: 42 44 42 136 137 136 54 3 53 176 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 4 ...
##  $ tID        : num  0 0 0 0 0 0 0 0 1 1 ...
##  $ Fclass     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Sclass     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fem        : int  1 1 1 1 1 0 0 0 0 0 ...
##  $ EmbC       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmbQ       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ price      : num  3.36 3.36 3.36 3.28 3.28 ...

It looks like Survived, Pclass, SibSp and Parch are coded as integers but this should be factor. Recode to factor

titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$SibSp <- as.factor(titanic$SibSp)
titanic$Parch <- as.factor(titanic$Parch)

str(titanic)
## 'data.frame':    1309 obs. of  18 variables:
##  $ Ticket     : Factor w/ 929 levels "110152","110413",..: 1 1 1 2 2 2 3 3 682 683 ...
##  $ PassengerId: int  258 505 760 559 586 263 111 476 1227 1050 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 1 NA NA ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Name       : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 157 506 709 805 803 804 672 165 1134 930 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 1 1 1 1 1 2 2 2 2 2 ...
##  $ Age        : num  30 16 33 39 18 ...
##  $ SibSp      : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 2 1 2 1 1 1 1 ...
##  $ Parch      : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 2 3 2 1 1 1 1 ...
##  $ Cabin      : Factor w/ 187 levels "","A10","A14",..: 42 44 42 136 137 136 54 3 53 176 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 4 ...
##  $ tID        : num  0 0 0 0 0 0 0 0 1 1 ...
##  $ Fclass     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Sclass     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fem        : int  1 1 1 1 1 0 0 0 0 0 ...
##  $ EmbC       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmbQ       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ price      : num  3.36 3.36 3.36 3.28 3.28 ...


Random Forest

Now we try a random forest.

Load package

## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.

Re-split data into test and train

train <- titanic[titanic$tID==0,]
test  <- titanic[titanic$tID==1,]

RandomForest command takes a formula as its first input in the form of y ~ a + b + c. In this case the formula is easy to write by hand, but it gets tedious to write out each variable name if there are more than ~10. This formula automates the process.

smalldata <- train[,c("Survived","Pclass","Sex","Age","SibSp","Parch","Embarked")]
varNames <- names(smalldata)

# Exclude Survived
varNames <- varNames[!varNames %in% c("Survived")]
 
# add + sign between independent variables
varNames <- paste(varNames, collapse = "+")
 
# Add Survived and convert to an object of type formula 
formula_rf <- as.formula(paste("Survived", varNames, sep = " ~ "))

Run random forest

model2 <- randomForest(formula_rf, data=smalldata, ntree=100, # number of decision trees in forest
                         importance=TRUE, # importance = to assess relative importance of predictors 
                         proximity=TRUE) # proximity = how similar two observations are, i.e. how likely to end in same leaf on different trees
model2
## 
## Call:
##  randomForest(formula = formula_rf, data = smalldata, ntree = 100,      importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 17.85%
## Confusion matrix:
##     0   1 class.error
## 0 506  43  0.07832423
## 1 116 226  0.33918129

The results will vary slightly each time we run it because they are based on a random selection of cases, but the OOB (out of bag) error rate should be approximately 18%. With this number, the lower it is the better. In the train data there are 342 survivals and total of 891 passengers, amounting to a survival rate of 38.38%. For reference, if we assumed that every passenger in the train set died, we would have an error rate of 38%. That the actual error rate is lower than the “random” error rate, tells us the classifier is doing something right. Turning to the confusion matrix, we see that the classifier was better at predicting death than survival. It has a false negative rate (the first row) of approx. 9% and a false positive rate of approx. 34%.


Model assessment

We start with a visual assessment of trace plots of the three error rates to check whether the estimates converged.

plot(model2)

The green line represents the type 1 error (false positive), black line is OOB error rate, and red line is type 2 error (false negative). These lines look pretty stable after about 50 trees, so I’m satisfied with the decision to limit the forest to 100 trees.


When running the random forest we told the model to track the importance of each variable. This we can print in a table and plot:

# Table
data.frame(importance(model2, type=2))
##          MeanDecreaseGini
## Pclass           44.45275
## Sex             111.06166
## Age              52.97534
## SibSp            19.72475
## Parch            14.24875
## Embarked         11.70935
# Variable Importance Plot
varImpPlot(model2, sort = T)

both the plot and the table show that sex is by far the most important variable. What I don’t like about either the table or the plot is that they only show the relative importance of each variable, and not their combined importance a la margins.


Make predictions

model2$predicted.response <- predict(model2 ,smalldata)
table(model2$predicted.response)
## 
##   0   1 
## 620 271
Accuracy <- mean(model2$predicted.response == smalldata$Survived)
Accuracy
## [1] 0.8686869

Conclusions

Accuracy is 87%. This is slightly higher than logistic regression, and the run time (and data cleaning time) was noticeably faster. However, with logistic regression we could easily see the relationship each variable had to the outcome, and using margins we could visualize the relationship between variables (i.e. interactions). Each variable is working much more behind the scenes with random forest–even the ability to see the importance of each variable is by default turned off. So, the tl;dr is that random forest is better for predicting the outcome and logistic regression is better for modeling the effect of each variable.

This post can be found on GitHub