Data and Statistical Analysis

Predictive Modelling Using R: Titanic


In previous posts, we talked about the advantages of using R and demonstrated how to solve a brain teaser with it. R is also a powerful tool for predictive modelling, which uses available data and statistics to forecast outcomes. In this example, we use R to predict the fate of passengers on the Titanic.


Background

On April 15, 1912, the Titanic sank about 600 km off the coast of Newfoundland. Of the 2,224 passengers and crew aboard, 1,502 died. We can use R to build a model capable of predicting the fate of the passengers and crew. The data is available for download from Kaggle and is already split into a train dataset and a test dataset.
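
Before exploring the data, we load the two CSV files into R. A minimal sketch, assuming the Kaggle files have been saved in the working directory as train.csv and test.csv:

train <- read.csv("train.csv") # file paths are assumptions; adjust as needed
test <- read.csv("test.csv")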

We will use R and the train dataset to build our model. Details about the train dataset are as follows:

str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

The test dataset contains all the variables in the train dataset except the one we have to predict: who survived ($ Survived). We can submit each of our predictions to Kaggle for scoring.


Gender and Age

“Women and children first.” That phrase might spring to mind when thinking about who was more likely to survive. Here we take a look at whether it holds in the data. Below is a graphical comparison of gender ($ Sex), age group (child vs adult, derived from $ Age), and survival ($ Survived). It is immediately obvious that males had a higher fatality rate than females. Children were about as likely to survive as not, while adults disproportionately did not survive.

library(ggplot2) # for qplot
library(RColorBrewer) # for brewer.pal colour palettes

# add a factor based on age group
train$age.grp[train$Age < 18] <- "child" # classify any age below 18 as child
train$age.grp[train$Age >= 18] <- "adult" # classify ages 18 and up as adult (missing ages stay NA)
train$age.grp <- as.factor(train$age.grp)

# plot of survival based on age group and Sex
qplot(Survived, Sex, data=train, geom="jitter", colour=age.grp) +
  scale_colour_manual(values=brewer.pal(4,"Spectral")) + theme_classic() +
  scale_x_continuous(breaks=c(0,1), labels=c("no", "yes"))

[Figure: jittered plot of survival (no/yes) vs. sex, coloured by age group]


Gender and Class

Here we take a look at gender ($ Sex), class ($ Pclass), and survival ($ Survived). Glancing at the figure below, it is clear that being in 1st, 2nd, or 3rd class affected your odds of survival. Third-class passengers fared the worst, especially the males. It is striking that very few females in 1st or 2nd class perished.

# plot of survival based on passenger class and Sex
qplot(Survived, Sex, data=train, geom="jitter", colour=factor(Pclass)) +
  scale_colour_manual(values=brewer.pal(4,"Spectral")) + theme_classic() +
  scale_x_continuous(breaks=c(0,1), labels=c("no", "yes"))

[Figure: jittered plot of survival (no/yes) vs. sex, coloured by passenger class]

We can see the difference class and gender made to the odds of survival, but how well do those variables alone predict survival? We can choose a simple model, make a prediction, and submit it to Kaggle for scoring.

fit <- lm(Survived ~ Sex + Pclass, data=train) # model
test$Survived <- round(predict(fit, test)) # predict survival from our model (rounded to 0 or 1)
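
To submit this prediction, Kaggle expects a two-column CSV containing PassengerId and the predicted Survived value. A minimal sketch (the file name is our own choice):

submission <- data.frame(PassengerId=test$PassengerId, Survived=test$Survived) # Kaggle's two-column format
write.csv(submission, "simple_submission.csv", row.names=FALSE)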

Making a simple prediction based on just gender and class is decent enough to earn a Public Score on Kaggle of 0.76555.


Random Forests

We can improve our accuracy using a simple Random Forest model. Random Forests combine many decision trees to build a single model. We will use all variables except the passengers' names, tickets, and cabin information, and we also need to clean up any missing data. This model earns a Public Score of 0.77033 on Kaggle.

library(randomForest)

set.seed(1)
# select predictor variables by name: drop the id, the target, the unused
# Name/Ticket/Cabin columns, and the age.grp factor added earlier
feature.names <- setdiff(names(train), c("PassengerId", "Survived", "Name", "Ticket", "Cabin", "age.grp"))

# clean up the missing data
train$Age[is.na(train$Age)] <- -1
test$Age[is.na(test$Age)] <- -1

train$Fare[is.na(train$Fare)] <- median(train$Fare, na.rm=T)
test$Fare[is.na(test$Fare)] <- median(test$Fare, na.rm=T)

train$Embarked[is.na(train$Embarked) | train$Embarked == ""] <- "S" # missing ports appear as "" or NA
test$Embarked[is.na(test$Embarked) | test$Embarked == ""] <- "S"

# encode character/factor columns as integers, with level codes shared
# between train and test (read.csv may produce either type)
for (f in feature.names) {
  if (is.character(train[[f]]) || is.factor(train[[f]])) {
    levels <- unique(c(as.character(train[[f]]), as.character(test[[f]])))
    train[[f]] <- as.integer(factor(as.character(train[[f]]), levels=levels))
    test[[f]]  <- as.integer(factor(as.character(test[[f]]),  levels=levels))
  }
}

forest.fit <- randomForest(train[,feature.names], factor(train$Survived), ntree=100, importance=T) # model
forest.pred <- predict(forest.fit, test[,feature.names]) # predict survival from our model
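
As before, we write the predictions to a two-column CSV for submission (the file name is our own choice). Because we set importance=T, we can also inspect which variables the forest found most informative:

forest.sub <- data.frame(PassengerId=test$PassengerId, Survived=as.integer(as.character(forest.pred))) # convert factor predictions back to 0/1
write.csv(forest.sub, "forest_submission.csv", row.names=FALSE)
varImpPlot(forest.fit) # variable importance plot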


Next Steps

We have shown how R can be used for predictive modelling, using the Titanic as an example. With further feature engineering, such as parsing out each passenger's title and determining their family size, the score can exceed 0.8.
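
As a sketch of those two features (the pattern assumes the usual "Surname, Title. Given names" format of the Name column):

# extract the title (e.g. "Mr", "Mrs", "Master") from the Name column
train$Title <- sub("\\..*", "", sub(".*, ", "", as.character(train$Name)))

# family size: siblings/spouses + parents/children + the passenger themselves
train$FamilySize <- train$SibSp + train$Parch + 1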


