-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathKaggle_Titanic_pp.Rmd
204 lines (155 loc) · 7.7 KB
/
Kaggle_Titanic_pp.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
---
title: "Titanic - Machine Learning from Disaster (Kaggle Competition)"
author: "Yixi Deng"
output:
html_document:
df_print: paged
toc: true
toc_float: true
---
# Score: 0.78947
# Loading the data
```{r}
setwd("C:/Users/User/Desktop/JobSeeking/Project/Kaggle/titanic/120721")
train <- read.csv(file="train.csv", stringsAsFactors = FALSE, header = TRUE)
test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)
```
## Overview
As we have read in the dataset as a `data.table` object, we can use `head(test)` to look the first 6 rows. Next, we can use `summary()` to check summary statistics such as mean, min, and max values for each variable to see if there are any obvious outliers in the data and if there are any nulls in any of the columns (`NA's: number of nulls` will appear in the output if there are any nulls).
```{r}
# Examine passengers data in train dataset
head(train)
```
```{r}
# Examine passengers data in test dataset
head(test)
```
```{r}
# Summarise the data to check for nulls and possible outliers for the train dataset
summary(train)
```
```{r}
# Summarise the data to check for nulls and possible outliers for the test dataset
summary(test)
```
There are 177 nulls in `Age` column in train dataset and 86 nulls in `Age` column and 1 null in `Fare` column in test dataset; morever, we found there are some blank cells in the train dataset. So next, we need to see how many nulls and blank cells are in each variable.
# Identifying missing value
## Missing values in train dataset
```{r}
# Count both NA and blank cels for train dataset
sapply(train, function(x) sum(is.na(x)|x == ""))
```
```{r}
# Show them in percentage for train dataset
sapply(train, function(x) sum(is.na(x)|x == "")/length(x)*100)
```
There are 177 nulls in `Age` column, 687 blank cells in `Cabin` column, and 2 blank cells in `Embarked` column, which indicate that 19.87% of data in Age column, 77.10% of data in `Cabin` column, and 0.22% of data in `Embarked` column are missing in the train dataset. We have to leave `Cabin` column with those blank cells for it has too many missing data; however, we could try to fill the missing values in both `Age` and `Embarked` based on the existing data.
## Missing values in test dataset
```{r}
# Count both NA and blank cells for test dataset
sapply(test, function(x) sum(is.na(x)|x == ""))
```
```{r}
# Show them in percentage for test dataset
sapply(test, function(x) sum(is.na(x)|x == "")/length(x)*100)
```
There are 86 nulls in `Age` column, 327 blank cells in `Cabin` column, and 1 blank cell in `Fare` column, which indicate that 20.57% of data in `Age` column, 78.23% of data in `Cabin` column, and 0.24% of data in `Embarked` column are missing in the test dataset. We also have to leave `Cabin` column with those blank cells for it has too many missing data; on the other hand, we could try to fill the missing values in both `Age` and `Fare` based on the existing data.
# Filling missing values
## Fill the missing values in Age variable
```{r}
# Fill the missing values of Age variable in train dataset with median
age.median.train <- median(train$Age,na.rm = TRUE)
train[is.na(train$Age), "Age"] <- age.median.train
sapply(train, function(x) sum(is.na(x)|x == ""))
```
```{r}
# Fill the missing values of Age variable in test dataset with median
age.median.test <- median(test$Age,na.rm = TRUE)
test[is.na(test$Age), "Age"] <- age.median.test
sapply(test, function(x) sum(is.na(x)|x == ""))
```
We use the median of `Age` to fill the missing values in the `Age` column for both train and test datasets. Now no missing value exists in the `Age` column any more.
## Fill the missing values in Fare variable
```{r}
# Fill the missing values of Fare variable in test dataset with median
fare.median.test <- median(test$Fare,na.rm = TRUE)
test[is.na(test$Fare), "Fare"] <- fare.median.test
sapply(test, function(x) sum(is.na(x)|x == ""))
```
We use the median of `Fare` to fill the missing values in the `Fare` column for test dataset. There is no missing `Fare` value now.
## Fill the missing values in Embarked variable
```{r}
# Fill the missing values of Embarked variable in test dataset with "S"
table(train$Embarked)
train[train$Embarked =='',"Embarked"] <- 'S'
table(train$Embarked)
sapply(train, function(x) sum(is.na(x)|x == ""))
```
We use the 'S' to fill the missing `Embarked` cell, for it is the value with the highest frequency, which would affect the result least if we use the wrong value to fill the blank cell.
# Fitting the model
Before we fit the model, we need to use `str()` to look at the format of each column to see whether we need change the format of some variables into appropriate ones.
```{r}
str(train)
```
We found `Survived`, `Pclass`, `Sex`, and `Embarked` variables need to be changed to factor format.
```{r}
train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)
train$Sex <- as.factor(train$Sex)
train$Embarked <- as.factor(train$Embarked)
```
```{r}
str(train)
```
Now, we can begin to fit our model.
First, we fit the model with all variables except `PassengerId`, `Name`, `Ticket` and `Cabin`. The reason to remove `Cabin` is obvious, for it has too many missing values. We also removed `PassengerId`, `Name`, and `Ticket`, for they are all variables with unique values, which we may take into account to use them to make new variables if we have more evidences to show they are related to the survive rate.
```{r}
model_0 <- glm(train, formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
family = "binomial")
summary(model_0)
```
In model_0, we found that the P-values of `Parch` and `Fare` are greater than 0.05, so we decided to remove `Parch` to see whether the new model could have all variables with p-values which are less than 0.05 and have a lower AIC value.
```{r}
# Remove Parch variable
model_01 <- glm(train, formula = Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked,
family = "binomial")
summary(model_01)
```
In model_01, we found the AIC value decreased which means it is a better model than model_0; however, the p-value of `Fare` is still larger than 0.05. So we remove `Fare` and fit a new model.
```{r}
# Remove Fare variable
model_02 <- glm(train, formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked,
family = "binomial")
summary(model_02)
```
Now, the model_02 has all variables with p-values which are less than 0.05 and has the smallest AIC value, so it is our final fitted model.
# Prediction
Before we use the fitted model to predict `Survived` values in test data, we need to use `str()` to look at the format of each column to see whether the formats of all variables are the same as those in train dataset.
```{r}
str(test)
```
`Survived`, `Pclass`, `Sex`, and `Embarked` variables need to be changed to factor format.
```{r}
test$Survived <- NA
test$Survived <- as.factor(test$Survived)
test$Pclass <- as.factor(test$Pclass)
test$Sex <- as.factor(test$Sex)
test$Embarked <- as.factor(test$Embarked)
```
```{r}
str(test)
```
Now, we are ready to predict with random forest algorithm.
```{r}
library(randomForest)
```
```{r}
model <- randomForest(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked,
data = train, ntree = 500, ntry = 3, nodesize = 0.01 * nrow(test))
Survived <- predict(model, newdata=test)
PassengerId <- test$PassengerId
output.df <- as.data.frame(PassengerId)
output.df$Survived <- Survived
write.csv(output.df, file="kaggle_submission_13.csv", row.names = FALSE)
```
Finally, we can use our output to see how accurate our prediction is.