- Intro
- Load library
- Load data
- Data exploration
- Data transformation
- Combine test and train for feature selection
- Recode Survived for caret train, as 1 and 0 are not valid level names in R
- Extract Title
- Group titles other than Miss, Master, Mr and Mrs into Others, since they are low frequency
- Make MissingAge predictor before we impute Age
- Use Title to predict Age
- Impute Embarked with the most frequent value
- Impute Fare by the mean Fare of Pclass 3
- Family name
- Transform Fare to log10(Fare + 1)
- Make Groupsize predictor
- Final look
- Training
- Evaluation
- Using another set of predictors
- Using a third set of predictors
- Using a fourth set of predictors
- Make another predictor
- Summary
Intro
This is the analysis I used to create my predictions for the Kaggle Titanic competition. The purpose of the analysis is to use machine learning to predict the survival of passengers on the Titanic. The dataset is provided as follows:
- The dataset is divided into a train and a test set.
- Passenger information such as Name, Age, Pclass, etc. is provided in both the train and test sets.
- The outcome (Survived) indicates whether a passenger survived or perished. It is only present in the training set.
- There are missing values in both datasets.
By applying different machine learning algorithms, predictions can be made for the passengers in the test dataset. In this project, I used a variety of methods; some are black boxes, while others can be visualized fully or partially.
Load library
library(tidyverse)
library(doParallel)
library(caret)
library(lubridate)
library(patchwork)
library(caretEnsemble)
library(pROC)
library(partykit)
Load data
train0 <- read_csv("./titanic/train.csv",
col_types = cols(
Survived = col_factor(ordered = FALSE, include_na = T),
Pclass = col_factor(),
Sex = col_factor(),
Embarked = col_factor()
))
intrain <- dim(train0)[1] # number of training rows, used later to split the combined data back apart
test0 <- read_csv("./titanic/test.csv",
col_types = cols(
Pclass = col_factor(),
Sex = col_factor(),
Embarked = col_factor()
))
Data exploration
First glimpse at our data
glimpse(train0)
## Observations: 891
## Variables: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex <fct> male, female, female, female, male, male, male, ma...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, ...
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q,...
We can see that:
- PassengerId is just a sequence of numbers to distinguish passengers, hence it has no predictive power.
- Name follows a pattern of family name, title and first name, so those components can potentially be extracted.
- SibSp and Parch: don't members of the same family share a family name? These possibly relate to the family name.
- Ticket does not follow any obvious pattern.
- Cabin has a lot of missing values.
Dealing with NA
How many NA values?
- in train set
train0 %>% map_df(~sum(is.na(.)))
## # A tibble: 1 x 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 177 0 0 0 0
## # ... with 2 more variables: Cabin <int>, Embarked <int>
- in test set
test0 %>% map_df(~sum(is.na(.)))
## # A tibble: 1 x 11
## PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 86 0 0 0 1 327
## # ... with 1 more variable: Embarked <int>
Dealing with NA:
- Remove Cabin, as there are too many NAs.
- Use the mean Fare of Pclass 3 for the one missing Fare value.
- Use the most frequent value for Embarked.
- The only predictor related to Age is Name (the title, e.g. Mr, Miss, etc.), so a missing Age is replaced by the mean Age of people with the same title.
Other methods such as knnImpute, bagImpute, or models that handle NA values natively could also be used to impute. However, those seem unnecessarily complicated here, since they consider all other predictors when filling in NAs, while we already know which predictors are directly related to the missing values.
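For reference, a minimal sketch of the knnImpute alternative with caret (not used here). Note that knnImpute operates on numeric columns only and also centers and scales them; the column selection below is just an illustration.
num_cols <- train0 %>% select(Age, SibSp, Parch, Fare) # numeric predictors only
pp <- preProcess(num_cols, method = "knnImpute")        # k-nearest-neighbour imputation
imputed <- predict(pp, num_cols)                        # Age filled in, but centered and scaled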
Visualization
Fare
p1 <- train0 %>% ggplot() + geom_histogram(aes(Fare, fill = Survived)) + labs(title = "Histogram of Fare \nin relation with Survived")
p2 <- train0 %>% ggplot() + geom_histogram(aes(Fare, fill = Survived), position = "fill") + labs(title = "Percentage of Fare \nin relation with Survived")
p1 + p2
The distribution of Fare is skewed to the right, so we transform it with log10 to normalize it.
p1 <- train0 %>% ggplot() + geom_histogram(aes(log10(Fare+1), fill = Survived)) + labs(title = "Histogram of Fare \nin relation with Survived")
p2 <- train0 %>% ggplot() + geom_histogram(aes(log10(Fare+1), fill = Survived), position = "fill") + labs(title = "Percentage of Fare \nin relation with Survived")
p1 + p2
There is a pattern in Fare in relation to Survived: the more expensive the ticket, the higher the chance of survival (to an extent).
Good predictor
Pclass
train0 %>% ggplot() + geom_bar(aes(fct_relevel(Pclass, "1", "2"), fill = Survived)) + scale_x_discrete(name = "Pclass")
Pclass 1 > Pclass 2 > Pclass 3 in survival rate, and there are many more people in Pclass 3 than in Pclass 1 and 2.
train0 %>% ggplot() + geom_boxplot(aes(fct_reorder(Pclass,Fare), Fare)) + scale_x_discrete(name = "Pclass")
The ticket cost increases from Pclass 3 (cheapest) to Pclass 1 (most expensive).
This makes sense, as the upper class who stayed in Pclass 1 had more money and hence a better chance to survive.
Good predictor
Sex
train0 %>% ggplot() + geom_bar(aes(Sex, fill = Survived))
Females had a better survival rate, as women and children were usually rescued first.
Good predictor
Data transformation
Combine test and train for feature selection
combined <- bind_rows(train0, test0)
Recode Survived for caret train, as 1 and 0 are not valid level names in R
combined$Survived <- fct_recode(combined$Survived, "Lived" = "1", "Died" = "0") %>%
fct_relevel("Lived")
Extract Title
combined <- combined %>% mutate(Title = str_extract(Name, ", (\\w)+")) %>%
mutate(Title = str_extract(Title, "\\w+"))
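For example, on the first passenger's name, the two-step extraction gives:
str_extract("Braund, Mr. Owen Harris", ", (\\w)+") # ", Mr"
str_extract(", Mr", "\\w+")                        # "Mr"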
Group titles other than Miss, Master, Mr and Mrs into Others, since they are low frequency
combined %>% ggplot() + geom_bar(aes(Title))
Then combine them into Others
combined <- combined %>%
mutate(Title = ifelse(Title %in% c("Miss", "Master", "Mr", "Mrs"), Title,"Others"))
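A quick check of the resulting Title distribution:
combined %>% count(Title, sort = TRUE)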
How does this Title predictor look?
combined[1:intrain,] %>% ggplot() + geom_bar(aes(Title, fill = fct_relevel(Survived, "Died"))) + labs(fill = "Survived")
Good predictor.
In combination with Pclass, we can see a more detailed picture.
combined[1:intrain,] %>%
group_by(Title, Pclass, Survived) %>%
summarise(each = n()) %>% ungroup() %>% group_by(Title, Pclass) %>% mutate(Tot = sum(each)) %>%
mutate(Survivalrate = each/Tot) %>%
ungroup() %>% filter(Survived == "Lived") %>%
ggplot() + geom_tile(aes(fct_relevel(Pclass,"1","2"), fct_relevel(Title,"Master", "Miss", "Mrs", "Mr"), fill = Survivalrate)) + scale_fill_viridis_c() + xlab("Pclass") + ylab("Title")
Make MissingAge predictor before we impute Age
combined <- combined %>% mutate(MissingAge = is.na(Age))
Use Title to predict Age
Master was the historical title for underage boys. Take the mean Age from the train set and use it to replace NA for all rows with the Master title.
meanMasterAge <- combined %>%
slice(1:dim(train0)[1]) %>%
filter(Title == "Master") %>%
summarise(mean(Age, na.rm = T)) %>%
pull()
combined <- combined %>%
mutate(Age = ifelse(Title == "Master" & is.na(Age), meanMasterAge ,Age ))
Impute Age based on the other Titles
for (t in c("Mr", "Miss", "Mrs", "Others")) {
  # mean Age of train-set passengers with this title
  meanAge <- mean(combined$Age[1:intrain][combined$Title[1:intrain] == t], na.rm = TRUE)
  combined$Age[is.na(combined$Age) & combined$Title == t] <- meanAge
}
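A sanity check that no missing Age values remain:
sum(is.na(combined$Age)) # expect 0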
Impute Embarked with the most frequent value
combined[is.na(combined$Embarked),]$Embarked <- "S"
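To confirm that "S" is indeed the most frequent port:
combined %>% count(Embarked, sort = TRUE)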
Impute Fare by the mean Fare of Pclass 3, since the one passenger with a missing Fare is in Pclass 3
combined[is.na(combined$Fare),]$Fare <- mean(combined[1:intrain,][combined[1:intrain,]$Pclass == 3,]$Fare, na.rm = T)
Family name
combined <- combined %>% mutate(Familyname = str_extract(Name, "^.+,")) %>%
mutate(Familyname = str_sub(Familyname,1, -2))
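For example:
str_extract("Braund, Mr. Owen Harris", "^.+,") # "Braund,"
str_sub("Braund,", 1, -2)                      # "Braund"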
Transform Fare to log10(Fare + 1)
combined <- combined %>% mutate(Fare = log10(Fare +1))
Make Groupsize predictor
There are three kinds of overlap between Ticket and family name:
- A. People with the same Ticket but a different family name.
- B. People with the same family name but a different Ticket.
- C. People with both the same family name and the same Ticket.
A person's Groupsize is the number of people sharing their ticket or their family name: nTicket + nFamilyname - nAll, by inclusion-exclusion, since people sharing both are counted twice. A toy example after the code below illustrates this.
combined <- combined %>% group_by(Ticket, Familyname) %>% mutate(nAll = n()) %>%
ungroup() %>% group_by(Ticket) %>% mutate(nTicket = n()) %>%
ungroup() %>% group_by(Familyname) %>% mutate(nFamilyname = n()) %>%
ungroup() %>% mutate(Groupsize = nTicket + nFamilyname - nAll)
combined <- combined %>% select(-nTicket, -nFamilyname)
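A toy example (hypothetical data) to illustrate the inclusion-exclusion count:
toy <- tibble(Ticket = c("T1", "T1", "T2"),
              Familyname = c("Smith", "Smith", "Smith"))
toy %>%
  group_by(Ticket, Familyname) %>% mutate(nAll = n()) %>%
  ungroup() %>% group_by(Ticket) %>% mutate(nTicket = n()) %>%
  ungroup() %>% group_by(Familyname) %>% mutate(nFamilyname = n()) %>%
  ungroup() %>% mutate(Groupsize = nTicket + nFamilyname - nAll)
# All three Smiths get Groupsize = 3: each shares a ticket or a family name
# with the other two passengers.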
How does this new Groupsize predictor look?
combined[1:intrain,] %>%
ggplot() + geom_bar(aes(Groupsize, fill = fct_relevel(Survived, "Died")), position = "fill")
combined[1:intrain,] %>%
group_by(Pclass,Groupsize, Title, Survived) %>%
summarise(each = n()) %>% ungroup() %>% group_by(Pclass, Groupsize, Title) %>% mutate(Tot = sum(each)) %>%
mutate(Survivalrate = each/Tot) %>%
ungroup() %>% filter(Survived == "Lived") %>%
ggplot() + geom_tile(aes(fct_relevel(Title, "Master", "Miss", "Mrs"), Groupsize, fill = Survivalrate)) + scale_fill_viridis_c() + xlab("Title") + ylab("Groupsize") + facet_wrap(.~fct_relevel(Pclass, "1", "2"))
There is some pattern, but it is not very clear.
Final look
glimpse(combined)
## Observations: 1,309
## Variables: 17
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived <fct> Died, Lived, Lived, Lived, Died, Died, Died, Died,...
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex <fct> male, female, female, female, male, male, male, ma...
## $ Age <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, ...
## $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare <dbl> 0.9164539, 1.8590380, 0.9506082, 1.7331973, 0.9566...
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, ...
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q,...
## $ Title <chr> "Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Mas...
## $ MissingAge <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FA...
## $ Familyname <chr> "Braund", "Cumings", "Heikkinen", "Futrelle", "All...
## $ nAll <int> 1, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6,...
## $ Groupsize <int> 2, 2, 1, 2, 2, 3, 3, 5, 6, 2, 3, 2, 1, 11, 1, 1, 6...
Training
Separate the train and test sets, since we are done with data transformation.
Set aside 20% of the train set as a holdout set to evaluate the models; this is more convenient than uploading predictions to Kaggle every time.
set.seed(12345)
notinholdout <- createDataPartition(train0$Survived, p = 0.8, list = F)
alltrain <- combined[1:intrain, ]
train <- combined[1:intrain, ] %>% slice(notinholdout)
holdout <- combined[1:intrain, ] %>% slice(-notinholdout)
test <- combined[-(1:intrain), ]
Validation folds
10-fold cross-validation repeated 5 times is used to evaluate the models' parameters. The folds are usually specified inside trainControl, but creating them separately is required by caretEnsemble to make sure all models are trained on consistent folds for ensembling.
index <- createMultiFolds(train$Survived, k = 10, times = 5)
Traincontrol
The index takes precedence over the other resampling parameters; we supply them just to get the labels right. We will use ROC as the metric, so twoClassSummary and classProbs must be specified.
trCtr_grid <- trainControl(method = "repeatedcv", repeats = 5,
number = 10, # match the k = 10 folds created above
index = index, savePredictions = "final", summaryFunction = twoClassSummary,
search = "grid", # By default
verboseIter=TRUE, classProbs = T)
trCtr_none <- trainControl(method = "none", classProbs = T)
Setup parallel computing
cl <- makeCluster(7)
registerDoParallel(cl)
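One housekeeping note: once all training below is finished, the cluster should be shut down (the original script does not do this), otherwise the worker processes linger.
stopCluster(cl)  # run after all training is done, not at this point
registerDoSEQ() # return foreach to sequential execution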
Get model info if needed
getModelInfo()$xgbTree$parameters
## parameter class label
## 1 nrounds numeric # Boosting Iterations
## 2 max_depth numeric Max Tree Depth
## 3 eta numeric Shrinkage
## 4 gamma numeric Minimum Loss Reduction
## 5 colsample_bytree numeric Subsample Ratio of Columns
## 6 min_child_weight numeric Minimum Sum of Instance Weight
## 7 subsample numeric Subsample Percentage
Choose Hyperparameters
C5.0grid <- expand.grid(.trials = c(1:9, (1:10)*10),
.model = c("tree", "rules"),
.winnow = c(TRUE, FALSE))
SVMgrid <- expand.grid(sigma = c(0, 0.01, 0.04, 0.2),
C= c(seq(0,1,0.2),10,500))
XGBgrid <- expand.grid(nrounds = 100, # fixed; depends on data size
max_depth = 6, # higher values make the model more complex and more likely to overfit; only one value here due to computational cost
eta = c(0.01, 0.05, 0.1), # needs fine-tuning
gamma = 0, # usually OK to leave at 0
min_child_weight = c(1, 2, 3), # higher values make the model more conservative; needs fine-tuning
colsample_bytree = c(.4, .7, 1), # subsample by columns
subsample = 1) # subsample by row; leave at 1 since we are doing k-fold CV
rpartgrid <- expand.grid(cp = runif(30,0,0.5))
rfgrid <- expand.grid(mtry = 1:8)
Train models
Choose predictors
formula <- as.formula(Survived ~ Pclass + Sex + Title +
Age + MissingAge + SibSp + Parch + Groupsize +
Fare + Embarked)
We will use caretEnsemble's caretList to train multiple models conveniently.
set.seed(67659)
modelList <- caretList(
formula, data = train,
trControl=trCtr_grid,
metric = "ROC",
tuneList=list(
rf=caretModelSpec(method="rf", tuneGrid= rfgrid),
SVM=caretModelSpec(method="svmRadial", tuneGrid = SVMgrid,
preProcess = c("scale", "center")),
xgb=caretModelSpec(method="xgbTree", tuneGrid = XGBgrid),
rpart= caretModelSpec(method = "rpart", tuneGrid = rpartgrid),
C5.0 = caretModelSpec(method = "C5.0", tuneGrid = C5.0grid)
)
)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 6 on full training set
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Aggregating results
## Selecting tuning parameters
## Fitting sigma = 0.2, C = 0.4 on full training set
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 100, max_depth = 6, eta = 0.05, gamma = 0, colsample_bytree = 0.4, min_child_weight = 2, subsample = 1 on full training set
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.0228 on full training set
## Aggregating results
## Selecting tuning parameters
## Fitting trials = 100, model = rules, winnow = FALSE on full training set
Evaluation
Evaluate on train data
dotplot(resamples(modelList))
- rpart seems to be the worst model.
Let's see if there is any significant difference between them.
diff(resamples(modelList), metric = "ROC") %>% summary()
##
## Call:
## summary.diff.resamples(object = .)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## ROC
## rf SVM xgb rpart C5.0
## rf 0.019479 -0.007292 0.073719 -0.002654
## SVM 8.044e-05 -0.026771 0.054240 -0.022133
## xgb 0.01167 6.282e-07 0.081011 0.004639
## rpart < 2.2e-16 2.825e-11 < 2.2e-16 -0.076373
## C5.0 1.00000 4.563e-05 1.00000 < 2.2e-16
- From here we can see that rpart and SVM are actually worse than the rest of the models.
Any correlation between models?
modelCor(resamples(modelList))
## rf SVM xgb rpart C5.0
## rf 1.0000000 0.8413727 0.9532823 0.7763147 0.9494574
## SVM 0.8413727 1.0000000 0.8049344 0.6800322 0.8247340
## xgb 0.9532823 0.8049344 1.0000000 0.7609024 0.9228143
## rpart 0.7763147 0.6800322 0.7609024 1.0000000 0.7549545
## C5.0 0.9494574 0.8247340 0.9228143 0.7549545 1.0000000
- All models are highly correlated, so ensembling them might not yield any improvement. This is not unexpected, as these models are quite strong. So we will not use an ensemble for now.
Let's have a look at the importance of predictors in the tree-based models.
ggplot(varImp(modelList$rf)) + labs(title = "Random Forest") +
ggplot(varImp(modelList$C5.0)) + labs(title = "C5.0") +
ggplot(varImp(modelList$xgb)) + labs(title = "XGB") +
ggplot(varImp(modelList$rpart)) + labs(title = "rpart")
The MissingAge predictor does not seem to do well across most models; in contrast, Groupsize performs really well.
Let's visualize the tree from the rpart model.
plot(as.party(modelList$rpart$finalModel))
What a simple tree! It achieves almost 80% accuracy on unseen data (see the next part).
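If the plot is hard to read, the same fitted tree can also be printed as text:
print(modelList$rpart$finalModel)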
Evaluate on holdout data
Accuracy check:
map(modelList, ~predict(., newdata = holdout)) %>%
map( ~ confusionMatrix(holdout$Survived, .)) %>%
map_df(~.$overall["Accuracy"])
## # A tibble: 1 x 5
## rf SVM xgb rpart C5.0
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.808 0.825 0.819 0.797 0.825
ROC and AUC
modROC <- map(modelList, ~predict(. , newdata = holdout, type = "prob")) %>%
map(~roc(predictor = .x$Lived,
response = holdout$Survived,
levels = rev(levels(holdout$Survived)),
print.auc = TRUE)
)
aucROC <- modROC %>% map_df(~.$auc) %>% gather(key = "model", value = "ROC") %>%
arrange(desc(ROC)) %>%
mutate(ROC = round(ROC, digits = 3)) %>% unite("auc", model, ROC, sep = "'s AUC: ") %>%
pull() %>% str_c(collapse = "\n")
ggroc(modROC) + geom_text(aes(x = 0.5, y = 0.5, label = aucROC), color = "black") + scale_color_discrete(name = "Model")
Ranking by performance, best first:
- C5.0
- Random Forest
- Extreme Gradient Boosting
- Support Vector Machine
- Decision tree (rpart)
Evaluate on test data
Extract the best tunes from the models, then retrain them on all the training data with no resampling. After that, predict on the test data and write the predictions to CSV files to submit to Kaggle.
set.seed(6759)
rf_f <- train(formula, data = alltrain, trControl = trCtr_none, method = "rf",
tuneGrid = modelList$rf$bestTune)
SVM_f <- train(formula, data = alltrain, trControl = trCtr_none, method = "svmRadial",
tuneGrid = modelList$SVM$bestTune, preProcess = c("scale", "center"))
xgb_f <- train(formula, data = alltrain, trControl = trCtr_none, method = "xgbTree",
tuneGrid = modelList$xgb$bestTune)
rpart_f <- train(formula, data = alltrain, trControl = trCtr_none, method = "rpart",
tuneGrid = modelList$rpart$bestTune)
C5.0_f <- train(formula, data = alltrain, trControl = trCtr_none, method = "C5.0",
tuneGrid = modelList$C5.0$bestTune)
all <- list(rf = rf_f, SVM = SVM_f, xgb = xgb_f, rpart = rpart_f, C5.0 = C5.0_f)
finaltest <- predict(all, newdata = test, na.action = na.pass)
prefix <- "all_ROC"
for (i in names(finaltest)){
bind_cols(PassengerID = test0$PassengerId, Survived = finaltest[[i]]) %>%
mutate(Survived = fct_recode(Survived, "1" = "Lived", "0" = "Died")) %>%
write_csv(str_c("./prediction/", prefix,"_" ,i,"_", today(), ".csv"))
}
Results are:
- Extreme Gradient Boosting: 0.77990
- Support Vector Machine: 0.78947
- Random Forest: 0.76555
- C5.0: 0.78468
- Single decision tree: 0.77511
Using another set of predictors
Let's see if we can improve the test-set results by using another set of predictors and training on all the available data.
formula1 <- as.formula(Survived ~ Pclass + Sex + Title +
Groupsize +
Fare + Embarked)
set.seed(67659)
modelList1 <- caretList(
formula1, data = alltrain,
trControl=trCtr_grid,
metric = "ROC",
tuneList=list(
rf=caretModelSpec(method="rf", tuneGrid= rfgrid),
SVM=caretModelSpec(method="svmRadial", tuneGrid = SVMgrid,
preProcess = c("scale", "center")),
xgb=caretModelSpec(method="xgbTree", tuneGrid = XGBgrid),
rpart= caretModelSpec(method = "rpart", tuneGrid = rpartgrid),
C5.0 = caretModelSpec(method = "C5.0", tuneGrid = C5.0grid)
)
)
finaltest <- modelList1 %>% map(~ predict(., newdata = test, na.action = na.pass))
prefix <- "all_ROC_1"
for (i in names(finaltest)){
bind_cols(PassengerID = test0$PassengerId, Survived = finaltest[[i]]) %>%
mutate(Survived = fct_recode(Survived, "1" = "Lived", "0" = "Died")) %>%
write_csv(str_c("./prediction/", prefix,"_" ,i,"_", today(), ".csv"))
}
Results are:
- Extreme Gradient Boosting: 0.77511
- Support Vector Machine: 0.79904
- Random Forest: 0.80382
- C5.0: 0.77511
- Single decision tree: 0.77511
We can see a significant increase in accuracy using Random Forest with fewer predictors, and hence less overfitting. A slight improvement can be observed in the other models, except for C5.0.
Using a third set of predictors
formula2 <- as.formula(Survived ~ Pclass + Title +
Groupsize +
Fare + Embarked)
set.seed(67659)
modelList2 <- caretList(
formula2, data = alltrain,
trControl=trCtr_grid,
metric = "ROC",
tuneList=list(
rf=caretModelSpec(method="rf", tuneGrid= rfgrid),
SVM=caretModelSpec(method="svmRadial", tuneGrid = SVMgrid,
preProcess = c("scale", "center")),
xgb=caretModelSpec(method="xgbTree", tuneGrid = XGBgrid),
rpart= caretModelSpec(method = "rpart", tuneGrid = rpartgrid),
C5.0 = caretModelSpec(method = "C5.0", tuneGrid = C5.0grid)
)
)
finaltest <- modelList2 %>% map(~ predict(., newdata = test, na.action = na.pass))
prefix <- "all_ROC_2"
for (i in names(finaltest)){
bind_cols(PassengerID = test0$PassengerId, Survived = finaltest[[i]]) %>%
mutate(Survived = fct_recode(Survived, "1" = "Lived", "0" = "Died")) %>%
write_csv(str_c("./prediction/", prefix,"_" ,i,"_", today(), ".csv"))
}
Results are:
- Extreme Gradient Boosting: 0.78468
- Support Vector Machine: 0.80382
- Random Forest: 0.80382
- C5.0: 0.78468
- Single decision tree: 0.77511
Using a fourth set of predictors
formula3 <- as.formula(Survived ~ Pclass +
Groupsize + Age + Sex +
Fare + Embarked)
set.seed(67659)
modelList3 <- caretList(
formula3,
data = alltrain,
trControl=trCtr_grid,
metric = "ROC",
tuneList=list(
rf=caretModelSpec(method="rf", tuneGrid= rfgrid),
SVM=caretModelSpec(method="svmRadial", tuneGrid = SVMgrid,
preProcess = c("scale", "center")),
xgb=caretModelSpec(method="xgbTree", tuneGrid = XGBgrid),
rpart= caretModelSpec(method = "rpart", tuneGrid = rpartgrid),
C5.0 = caretModelSpec(method = "C5.0", tuneGrid = C5.0grid)
)
)
finaltest <- modelList3 %>% map(~ predict(., newdata = test, na.action = na.pass))
prefix <- "all_ROC_3"
for (i in names(finaltest)){
bind_cols(PassengerID = test0$PassengerId, Survived = finaltest[[i]]) %>%
mutate(Survived = fct_recode(Survived, "1" = "Lived", "0" = "Died")) %>%
write_csv(str_c("./prediction/", prefix,"_" ,i,"_", today(), ".csv"))
}
Results are:
- Extreme Gradient Boosting: 0.77033
- Support Vector Machine: 0.78468
- Random Forest: 0.79904
- C5.0: 0.76076
- Single decision tree: 0.77511
Make another predictor
Recode the Title of children under 15 as Kid:
alltrain <- alltrain %>% mutate(Title = ifelse(Age < 15, "Kid", Title))
test <- test %>% mutate(Title = ifelse(Age < 15, "Kid", Title))
formula4 <- as.formula(Survived ~ Pclass + Title +
Groupsize +
Fare + Embarked)
set.seed(67659)
modelList4 <- caretList(
formula4, data = alltrain,
trControl=trCtr_grid,
metric = "ROC",
tuneList=list(
rf=caretModelSpec(method="rf", tuneGrid= rfgrid),
SVM=caretModelSpec(method="svmRadial", tuneGrid = SVMgrid,
preProcess = c("scale", "center")),
xgb=caretModelSpec(method="xgbTree", tuneGrid = XGBgrid),
rpart= caretModelSpec(method = "rpart", tuneGrid = rpartgrid),
C5.0 = caretModelSpec(method = "C5.0", tuneGrid = C5.0grid)
)
)
finaltest <- modelList4 %>% map(~ predict(., newdata = test, na.action = na.pass))
prefix <- "all_ROC_4"
for (i in names(finaltest)){
bind_cols(PassengerID = test0$PassengerId, Survived = finaltest[[i]]) %>%
mutate(Survived = fct_recode(Survived, "1" = "Lived", "0" = "Died")) %>%
write_csv(str_c("./prediction/", prefix,"_" ,i,"_", today(), ".csv"))
}
Results are:
- Extreme Gradient Boosting: 0.79425
- Support Vector Machine: 0.78947
- Random Forest: 0.78947
- C5.0: 0.80382
- Single decision tree: 0.77511
Summary
We can easily see that women and children were prioritized for rescue, and that first class had a higher chance of surviving compared to the other classes.
I used many machine learning algorithms for this project; I believe XGBoost could be fine-tuned further to obtain better results.
The highest score put me in the top 9% on Kaggle. However, I think complicated black-box machine learning should be accompanied by simpler algorithms that are easy for humans to understand.