Yes, this is yet another post about using the open source Titanic dataset to predict whether someone would live or die.

At this point, there’s not much new I (or anyone) can add to accuracy in predicting survival on the Titanic, so I’m going to focus on using this as an opportunity to explore a couple of R packages and teach myself some new machine learning techniques. I will be doing this over two blog posts. In the first post, I will be using logistic regression to calculate how certain attributes of passengers like gender and cabin class are related to their odds of survival. This gives me a chance to explore the R package margins for computing marginal predicted probabilities, which is based on the Stata command by the same name. Having a lot of experience with Stata, I am excited to see how intuitive margins is, and what types of plots I can make with it. In the second post I will be using R’s randomForest package to compare its accuracy with predictions based on the logistic regression.

First, a quick introduction to the data and research question.


The data

The Titanic data is free to download from Kaggle, where they have split it into a training and a test set. The training set is a .csv file recording the following 12 variables for 891 passengers on the Titanic. The test set contains the same variables for a further 418 passengers, except that “Survived” is omitted.

Varname Description
PassengerId Passenger ID
Survived Survival (0 = No; 1 = Yes)
Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name Name
Sex Sex
Age Age
SibSp Number of Siblings/Spouses Aboard
Parch Number of parents/children aboard
Ticket Ticket number
Fare Passenger Fare (British pound)
Cabin Cabin
Embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)



Data cleaning

The data is already fairly clean, but it has a few missings and some variables need to be modified before we can include them in an analysis.


Merge test and train data

We eventually want to use the test dataset (test.csv) to check the accuracy of our predictions based on the train dataset (train.csv), so we should clean both .csv files together. After all, it would be hard to use a user-generated variable in our prediction model if it doesn’t exist in the dataset! We now load and merge the test data with the train data:

test <- read.csv("test.csv")
test$tID <- 1 # test data is marked with a 1
test$Survived <- NA # test data is missing Survived
train <- read.csv("train.csv")
train$tID <- 0 # train data is marked with a 0
test <- test[,c(1,13,2:12)] # put variable names in same order as in train
titanic <- rbind(train, test)
rm(train,test)


Missing data

First, not all missings are coded as NA, so we convert these to NA

titanic[titanic == ""] <- NA

A quick glance at the data tells us that there are 263 missings for Age (20.09%), 1 missing for Fare (0.08%), 1014 missings for Cabin (77.46%), and 2 missing for Embarked (0.15%)

sapply(titanic, function(x) sum(is.na(x))) 
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0         418           0           0           0         263 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           1        1014           2 
##         tID 
##           0

We could try to impute some of the missings using multiple imputation, for example, but for the purposes of this project I will just go with quick and easy techniques:

  1. Remove Cabin - Generalized linear models by default use only the complete cases (those without missings on any variable) to conduct an analysis. If we include “Cabin” without imputing the missing values, we will only be performing our analysis on the 23% of passengers who aren’t missing this information. At the same time, imputing data for a variable that has so many missings should raise red flags. The simplest way to solve the problem of missing data in this case is to remove the variable altogether. If I were feeling really thorough I could test whether the passengers missing “Cabin” differ in any way from passengers not missing “Cabin”, but apparently I’m not feeling that thorough.
  2. Age - This also has quite a few missings. I will impute those missing values with the mean age for all other passengers (29.88).
  3. Fare - I will also impute missings with the mean (to be calculated–see below). It’s only one case, so shouldn’t have a huge effect on the outcome.
  4. Embarked - This I will impute with the mode, or most frequent port of departure (Southampton). Again, only two cases are missing data on where they embarked, so it shouldn’t affect our analysis too much.


Categorical to dummy variables

The variables Pclass, Sex, and Embarked are categorical. In order to include them in the analysis we need to convert them into dummy variables where each category is either represented by 0 or 1. For two category variables (e.g. Sex), this is just a matter of making one category represented by a 1 and the other by a 0. For three or more category variables (e.g. Pclass and Embarked), we will need to create N-1 dummy variables.

Convert Pclass into 1st class and 2nd class variables, using 3rd class as the reference category

titanic$Fclass <- as.integer(titanic$Pclass==1) # First class
titanic$Sclass <- as.integer(titanic$Pclass==2) # Second class

Convert gender into numeric variable, where 1 = women and 0 = men

titanic$fem <- as.integer(titanic$Sex=="female")

Convert embarked into categorical variables, with Southampton as reference category

titanic$Embarked[is.na(titanic$Embarked)] <- 'S' # fill 2 missing with mode
titanic$EmbC <- as.integer(titanic$Embarked=="C")
titanic$EmbQ <- as.integer(titanic$Embarked=="Q")


Fare

A look at the number of unique values for Ticket tells us that there are only 929 distinct tickets and no missings among the 1309 observations in our data. This indicates that some tickets cover multiple passengers.

length(unique(titanic$Ticket))
## [1] 929

The Fare variable refers to the price of the ticket, so if the ticket was for multiple passengers, the fare is also the cost of entry for multiple people. Our task then becomes finding which tickets cover multiple passengers, how many passengers they cover, and then calculating a fare per person (price).

Here I create the variable “count” which counts the number of people admitted per ticket

library(dplyr) # for group_by() and summarize()
aggdata <- titanic %>% group_by(Ticket) %>% summarize(count=n())
titanic <- merge(titanic,aggdata,by="Ticket")
rm(aggdata)

Price per person (Fare divided by count)

titanic$price <- titanic$Fare/titanic$count
titanic <- subset(titanic, select=-c(Fare,count)) # delete Fare and count variables

There are two final transformations necessary before we can work with price. First, we need to fill in the missing value with the mean:

# fill 1 missing value with mean
titanic$price[is.na(titanic$price)] <- mean(titanic$price, na.rm=T)

Second, price is positively skewed

hist(titanic$price) # see positive skew

so we will take the log to bring it closer to a normal distribution.

titanic$price <- log(titanic$price)
titanic$price[is.infinite(titanic$price)] <- 0 # the log of a zero fare is -Inf; recode these to 0
hist(titanic$price) # not exactly normal but much better


Age

We still need to impute the missings for age and see if the variable is normally distributed before we can call our data clean.

titanic$Age[is.na(titanic$Age)] <- mean(titanic$Age, na.rm=T) 

Age also looks normally distributed (more or less), so we can stop here.

hist(titanic$Age)


Test and train

Finally, our data is clean and our variables have been made. We now want to split the data again into test and train so we can run our analyses on the train data and test them on the test set.

train <- titanic[titanic$tID==0,]
test  <- titanic[titanic$tID==1,]



Logistic regression with predicted marginal probabilities

Now that the data is ready, we turn to our analysis. As previously stated, the goal here is to predict which passengers would have survived the Titanic. In this post I’m using logistic regression and marginal predicted probabilities to visualize the results.


Logistic regression

The first thing we want to do is select the variables we think will be important to the analysis. Here’s a reminder of the variables in our dataset.

names(train)
##  [1] "Ticket"      "PassengerId" "Survived"    "Pclass"      "Name"       
##  [6] "Sex"         "Age"         "SibSp"       "Parch"       "Cabin"      
## [11] "Embarked"    "tID"         "Fclass"      "Sclass"      "fem"        
## [16] "EmbC"        "EmbQ"        "price"

Ticket was a useful variable for merging, but probably not useful for determining survival. Same for PassengerId, Name[1], and tID. Cabin is another variable with so many missings that it isn’t useful. I’m not going to do extensive hypothesis testing, but based on having seen the movie, I have a few expectations:

  1. Women and younger passengers will be more likely to survive than men and older passengers.
  2. First and second class passengers will be more likely to survive than third class passengers.
  3. The combination of being first or second class and female will yield even higher odds of survival than either being first/second class or being female on its own.

The first two hypotheses I test by including fem, Age, Fclass, and Sclass in the regression. The third hypothesis I test by interacting Fclass and fem as well as Sclass and fem.

smalldata <- subset(train, select=-c(Ticket,PassengerId,Pclass,Name,Sex,Cabin,Embarked,tID))
model1 <- glm(Survived ~ Age + SibSp + Parch + Fclass + Sclass + fem + EmbC + EmbQ + price + fem*Fclass + fem*Sclass,
              family = binomial(link='logit'), data = smalldata)
summary(model1)
## 
## Call:
## glm(formula = Survived ~ Age + SibSp + Parch + Fclass + Sclass + 
##     fem + EmbC + EmbQ + price + fem * Fclass + fem * Sclass, 
##     family = binomial(link = "logit"), data = smalldata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0570  -0.6287  -0.4564   0.3734   2.5378  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.149144   0.514109  -2.235 0.025403 *  
## Age         -0.045891   0.008572  -5.354 8.62e-08 ***
## SibSp       -0.304910   0.110505  -2.759 0.005794 ** 
## Parch       -0.004907   0.123971  -0.040 0.968428    
## Fclass       1.465002   0.405326   3.614 0.000301 ***
## Sclass       0.237909   0.327208   0.727 0.467172    
## fem          1.830580   0.249507   7.337 2.19e-13 ***
## EmbC         0.468649   0.241653   1.939 0.052459 .  
## EmbQ         0.583574   0.309526   1.885 0.059379 .  
## price        0.245393   0.225603   1.088 0.276720    
## Fclass:fem   2.130893   0.673160   3.166 0.001548 ** 
## Sclass:fem   2.517509   0.570642   4.412 1.03e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance:  752.2  on 879  degrees of freedom
## AIC: 776.2
## 
## Number of Fisher Scoring iterations: 6


Results

A logistic regression, like all GLMs, uses a link function (in this case the logit) to relate a non-linear outcome to a linear combination of the predictors, because a linear model is computationally easy to fit. What this amounts to is that positive coefficients should be read as an increase in the logged odds of survival per unit increase in a given predictor variable, while negative coefficients can be interpreted as a decrease in the logged odds. There are ways to transform the logged odds to probabilities, both manually and using the R predict() function, but we will get to that in a moment. First, let’s see what the general relationship is between the predictor variables and the log odds of survival.

First of all, it appears James Cameron was on the right track with his “women and children” scene. The positive and significant coefficient for “fem” tells us that third class women were more likely to survive than third class men. The positive and significant coefficient for the interaction between Sclass and fem tells us that second class women were more likely to survive than both third class women and second class men. Ditto for first class women. Second class men had no survival advantage over third class men, as seen in the lack of a significant effect for Sclass. Having siblings or a spouse on the boat decreased the odds of survival, maybe because all those gallant Edwardian-era husbands were giving up their spots on a lifeboat for their wives. Sadly, there was no similar gallantry between parents and children. As predicted, the younger you were, the higher your odds of survival. Port of embarkation is not significant, and neither is ticket price.
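
As a brief aside, here is what the manual conversion from logged odds to probabilities looks like. The passenger profile below is purely illustrative and not part of the original analysis: a 29-year-old third class man with no relatives aboard, embarked at Southampton, with price set to its sample mean.

newpass <- data.frame(Age = 29, SibSp = 0, Parch = 0, Fclass = 0, Sclass = 0,
                      fem = 0, EmbC = 0, EmbQ = 0, price = mean(smalldata$price))
logodds <- predict(model1, newdata = newpass) # linear predictor = logged odds of survival
exp(logodds)/(1 + exp(logodds)) # manual inverse logit gives the predicted probability
predict(model1, newdata = newpass, type = "response") # predict() does the same conversion for us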


Margins

A useful command from Stata that has, since March 2017, also been available as an R package is margins. Its purpose is to calculate marginal predicted probabilities, and there are two ways to calculate these. First, there are Average Marginal Effects (AMEs), which are the default in the margins package. As far as I understand them, AMEs can be interpreted as the change in the probability of survival per unit change in each variable, averaged across all cases. Marginal Effects at the Means (MEMs), on the other hand, can be interpreted as the probability of survival for an observed value of a variable (e.g. first class), while holding the other variables constant at their means or another meaningful value (e.g. Age = 29, fem = 1). The main thing to keep in mind is that AMEs should be interpreted as a change in probability while MEMs should be interpreted as an absolute probability. I now demonstrate with AMEs, first by hand to build intuition, then with the margins package.
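
A rough hand-rolled version of an AME for the binary variable fem works like this: predict survival for every passenger as if they were female, then as if they were male, and average the difference in predicted probabilities. (This is a discrete-change approximation; as far as I know, margins uses a numerical derivative for non-factor variables like fem, so the numbers won’t match exactly, but the intuition is the same.)

d1 <- smalldata; d1$fem <- 1 # everyone recoded as female
d0 <- smalldata; d0$fem <- 0 # everyone recoded as male
mean(predict(model1, newdata = d1, type = "response") -
     predict(model1, newdata = d0, type = "response")) # average change in predicted probability of survival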

library(margins)
margins1 <- margins(model1)
summary(margins1)
##  factor     AME     SE       z      p   lower   upper
##     Age -0.0061 0.0011 -5.6298 0.0000 -0.0083 -0.0040
##    EmbC  0.0628 0.0321  1.9539 0.0507 -0.0002  0.1257
##    EmbQ  0.0781 0.0412  1.8982 0.0577 -0.0025  0.1588
##  Fclass  0.2934 0.0542  5.4168 0.0000  0.1873  0.3996
##     fem  0.3662 0.0218 16.7991 0.0000  0.3235  0.4090
##   Parch -0.0007 0.0166 -0.0396 0.9684 -0.0332  0.0319
##   price  0.0329 0.0301  1.0902 0.2756 -0.0262  0.0919
##  Sclass  0.1468 0.0353  4.1630 0.0000  0.0777  0.2159
##   SibSp -0.0408 0.0146 -2.7958 0.0052 -0.0694 -0.0122
plot(margins1, labels=c("Age","EmbC","EmbQ","Fclass","fem","Parch","price","Sclass","SibSp")) # I specify the x-axis labels because the plot() function didn't label my axis in the same order as the data. This way I can be sure each label lines up with the right AME 


Hypothesis testing

I had a hypothesis about how the effect of class would differ for men and women, so I’m going to use the margins command to illustrate that. By specifying the “at” argument, I calculate marginal effects of each variable for men and women separately.

margins2 <- margins(model1, at = list(fem = range(smalldata$fem)))
summary(margins2)
##  factor fem     AME     SE       z      p   lower   upper
##     Age   0 -0.0066 0.0012 -5.4318 0.0000 -0.0089 -0.0042
##     Age   1 -0.0070 0.0013 -5.5191 0.0000 -0.0094 -0.0045
##    EmbC   0  0.0671 0.0346  1.9413 0.0522 -0.0006  0.1349
##    EmbC   1  0.0711 0.0365  1.9504 0.0511 -0.0003  0.1426
##    EmbQ   0  0.0836 0.0448  1.8655 0.0621 -0.0042  0.1714
##    EmbQ   1  0.0886 0.0465  1.9053 0.0567 -0.0025  0.1797
##  Fclass   0  0.2099 0.0566  3.7108 0.0002  0.0990  0.3207
##  Fclass   1  0.5457 0.0958  5.6978 0.0000  0.3580  0.7334
##     fem   0  0.4399 0.0376 11.6908 0.0000  0.3662  0.5137
##     fem   1  0.3344 0.0384  8.7172 0.0000  0.2592  0.4096
##   Parch   0 -0.0007 0.0178 -0.0396 0.9684 -0.0355  0.0341
##   Parch   1 -0.0007 0.0188 -0.0396 0.9684 -0.0376  0.0361
##   price   0  0.0352 0.0323  1.0879 0.2766 -0.0282  0.0985
##   price   1  0.0372 0.0344  1.0820 0.2792 -0.0302  0.1047
##  Sclass   0  0.0341 0.0470  0.7250 0.4684 -0.0581  0.1262
##  Sclass   1  0.4181 0.0622  6.7240 0.0000  0.2963  0.5400
##   SibSp   0 -0.0437 0.0158 -2.7719 0.0056 -0.0746 -0.0128
##   SibSp   1 -0.0463 0.0165 -2.8108 0.0049 -0.0785 -0.0140

Note in particular the Fclass and Sclass margins:

df <- as.data.frame(summary(margins2))
df <- df[(df$factor=="Fclass" | df$factor=="Sclass"),c(1:3,7:8)]
df
##    factor fem       AME       lower     upper
## 7  Fclass   0 0.2098888  0.09902971 0.3207479
## 8  Fclass   1 0.5456941  0.35798217 0.7334060
## 15 Sclass   0 0.0340849 -0.05805506 0.1262249
## 16 Sclass   1 0.4181477  0.29626340 0.5400320

I now want to plot the marginal effects of class by gender. I wasn’t able to figure out how to do so using the plot() or cplot() functions within the margins package, so I’m just going to do it in ggplot2. Easy peasy.

library(ggplot2)
df$fem <- as.factor(df$fem) # fem needs to be explicitly labeled as factor instead of integer for ggplot2 to recognize it as such
p <- ggplot(df, aes(factor,AME)) + 
      geom_bar(stat = "identity", aes(fill=fem), position = position_dodge(width = .9)) + 
      xlab("Class") + 
      ylab("Average marginal effects")
p

Remember, these are average marginal effects rather than marginal effects at the means. As in the regression, the reference category is third class, so we can read the figure as the average increase in the probability of survival associated with being first or second class, shown separately for men and women. Interactions can be unintuitive, and even though all of this information is also visible in the regression results, the figure shows the roles of class and gender in a way that’s easy to grasp. Here we can easily see, for example, how second class women have a higher probability of survival than first class men.



Overall predictive power

Now that we have our results and understand them, we want to be able to use the logistic regression model to make predictions. After next week’s post using Random Forest, I will be able to compare models. But for illustrative purposes I’m now going to assess how closely the model’s predictions match the actual outcomes in the training data. This may be somewhat misleading because in-sample accuracy doesn’t account for the possibility of overfitting. Nonetheless:

Make predictions

predict_model1 = predict(model1, newdata = train,
                          type = 'response')
predict_model1 = ifelse(predict_model1 >0.5, 1, 0)
table(predict_model1)
## predict_model1
##   0   1 
## 616 275

Of the 891 cases in the train data, 275 passengers have a predicted probability of surviving the shipwreck greater than .5. Now I will test how accurate those predictions are compared to the actual values:

Accuracy <- mean(predict_model1 == train$Survived)
Accuracy
## [1] 0.8148148
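
If we also want to see where the model goes wrong, not just how often, a quick confusion matrix (a small addition to the workflow above; output omitted here) cross-tabulates the predictions against the observed outcomes:

table(Predicted = predict_model1, Actual = train$Survived) # off-diagonal cells are the misclassified passengers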



Conclusions

81% accurate. Next week I will see how logistic regression fares compared to Random Forest. My only conclusion at this point is that the margins package is fine for calculating margins, but not so good for graphing them.

This post can be found on GitHub


1. [As others have shown (e.g. this anonymous data scientist), Name does have some useful information, namely title (unmarried woman, married woman, man, no title), but I’m not going to address that here.]