This project uses accelerometer measurements collected from 6 people over time. The data contain accelerometer measurements for different types of activities, together with a label identifying the quality of each activity. The goal of the project is to build a prediction model that predicts this label for the given test data set.
The report describes each step taken to build the model and all the preprocessing applied to the data sets along the way.
The caret package is used for this project. The training and test data sets are read in first; the test data is not touched until the final model is built.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
ptrain <- read.csv("data/pml-training.csv")
ptest <- read.csv("data/pml-testing.csv")
To estimate the out-of-sample error, the training set ptrain is split into a smaller training set, ptrain1, and a validation set, validation.
set.seed(1000)
inTrain <- createDataPartition(y = ptrain$classe, p = 0.7, list = FALSE)
ptrain1 <- ptrain[inTrain, ]
validation <- ptrain[-inTrain, ]
We now analyze each feature of ptrain1 and remove those that contribute little to the final model. Features that are almost entirely NA are not useful for building a model, and features with near-zero variance contribute almost nothing. Identifier variables such as the user name do not help the model either; they only make it more complex. We remove these features from ptrain1 and apply the same transformation to the validation set.
# remove features with near-zero variance
nzVar <- nearZeroVar(ptrain1)
ptrain1 <- ptrain1[, -nzVar]
# Applying same to validation set
validation <- validation[, -nzVar]
# identify features that are mostly (> 95%) NA and remove them
mostlyNAs <- sapply(ptrain1, function(x) mean(is.na(x))) > 0.95
ptrain1 <- ptrain1[, !mostlyNAs]
validation <- validation[, !mostlyNAs]
# remove variables that are not relevant for prediction: the first 5 columns (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp) are identifiers and timestamps
ptrain1 <- ptrain1[, -(1:5)]
validation <- validation[, -(1:5)]
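As a quick sanity check (not part of the original write-up), it is worth confirming how much this cleanup shrinks the feature set; the column count should drop substantially from the raw data's:
# quick check: dimensions of the cleaned training and validation sets
dim(ptrain1)
dim(validation)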
First, I planned to build the model using a neural network, since the problem is a supervised classification task and neural nets are good at classification. To make the model more robust, I use 3-fold cross validation to build it.
# Use 3-fold cross validation to build the model
controlPara <- trainControl(method = "cv", number = 3, verboseIter = F)
# build the model using ptrain1
modelFit1 <- train(classe ~ ., data = ptrain1, method = "nnet", trControl = controlPara, trace = F)
## Loading required package: nnet
The fitted model is used to predict the labels for the validation set and estimate the out-of-sample error. A confusion matrix compares the output of the model against the actual labels of the data.
# use modelFit1 to predict classe for the validation set
prediction <- predict(modelFit1, newdata = validation)
# Show the confusion matrix with other relevant information to evaluate our model
confusionMatrix(validation$classe, prediction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 821 0 691 0 162
## B 36 4 828 0 271
## C 16 5 925 0 80
## D 17 0 726 0 221
## E 41 0 741 0 300
##
## Overall Statistics
##
## Accuracy : 0.348
## 95% CI : (0.336, 0.361)
## No Information Rate : 0.665
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.192
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.882 0.44444 0.237 NA 0.290
## Specificity 0.828 0.80684 0.949 0.836 0.839
## Pos Pred Value 0.490 0.00351 0.902 NA 0.277
## Neg Pred Value 0.974 0.99895 0.385 NA 0.847
## Prevalence 0.158 0.00153 0.665 0.000 0.176
## Detection Rate 0.140 0.00068 0.157 0.000 0.051
## Detection Prevalence 0.284 0.19354 0.174 0.164 0.184
## Balanced Accuracy 0.855 0.62564 0.593 NA 0.564
The performance of this model is unsatisfactory, with accuracy of only about 35%, and the confusion matrix shows heavy misclassification. nnet fits a single-hidden-layer feed-forward network with default tuning, so a more flexible classifier may do considerably better.
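Before abandoning neural networks entirely, one option would be to tune nnet harder. The sketch below is illustrative rather than part of the original analysis: it assumes that centering and scaling the inputs (nnet is sensitive to predictor scale) and searching over the hidden-layer size and weight decay might improve the fit.
# illustrative sketch (assumed tuning values): search over nnet's size and
# decay parameters, and center/scale the inputs before fitting
nnetGrid <- expand.grid(size = c(5, 10, 15), decay = c(0.1, 0.01))
modelFit1b <- train(classe ~ ., data = ptrain1, method = "nnet",
                    trControl = controlPara, tuneGrid = nnetGrid,
                    preProcess = c("center", "scale"),
                    trace = F, maxit = 200)
In this report, though, we move on to a different algorithm instead.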
Let’s build a second model with Random Forest and evaluate its accuracy on the validation set. The model is again built with 3-fold cross validation.
# Use 3-fold cross validation to build the model
controlPara <- trainControl(method = "cv", number = 3, verboseIter = F)
# build the model using ptrain1
modelFit2 <- train(classe ~ ., data = ptrain1, method = "rf", trControl = controlPara)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
As before, the fitted model is used to predict the labels for the validation set and estimate the out-of-sample error, and a confusion matrix compares the model's output against the actual labels.
# use modelFit2 to predict classe for the validation set
prediction <- predict(modelFit2, newdata = validation)
# Show the confusion matrix with other relevant information to evaluate our model
confusionMatrix(validation$classe, prediction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 7 1131 1 0 0
## C 0 2 1024 0 0
## D 0 0 5 959 0
## E 0 0 0 5 1077
##
## Overall Statistics
##
## Accuracy : 0.997
## 95% CI : (0.995, 0.998)
## No Information Rate : 0.286
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.996
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.996 0.998 0.994 0.995 1.000
## Specificity 1.000 0.998 1.000 0.999 0.999
## Pos Pred Value 1.000 0.993 0.998 0.995 0.995
## Neg Pred Value 0.998 1.000 0.999 0.999 1.000
## Prevalence 0.286 0.193 0.175 0.164 0.183
## Detection Rate 0.284 0.192 0.174 0.163 0.183
## Detection Prevalence 0.284 0.194 0.174 0.164 0.184
## Balanced Accuracy 0.998 0.998 0.997 0.997 0.999
The modelFit2 model built with Random Forest is far better than our previous model modelFit1: validation accuracy is about 99.7%, so the estimated out-of-sample error is under 1%. We will use this as our final model and apply it to the test data set.
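For completeness, the estimated out-of-sample error can be read directly off the confusion matrix; a minimal sketch:
# estimated out-of-sample error = 1 - validation accuracy
cm <- confusionMatrix(validation$classe, prediction)
1 - as.numeric(cm$overall["Accuracy"])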
We divided the original training data set ptrain into ptrain1 and validation only to estimate the out-of-sample error. Now we rebuild the model using the complete training data set ptrain, applying the same preprocessing steps to both the ptrain and ptest data sets.
# remove features with near-zero variance
nzVar <- nearZeroVar(ptrain)
ptrain <- ptrain[, -nzVar]
# Applying same to test data set
ptest <- ptest[, -nzVar]
# identify features that are mostly (> 95%) NA and remove them
mostlyNAs <- sapply(ptrain, function(x) mean(is.na(x))) > 0.95
ptrain <- ptrain[, !mostlyNAs]
ptest <- ptest[, !mostlyNAs]
# remove variables that are not relevant for prediction: the first 5 columns (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp) are identifiers and timestamps
ptrain <- ptrain[, -(1:5)]
ptest <- ptest[, -(1:5)]
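A quick check (not part of the original analysis) that the two data sets still line up after identical preprocessing; this assumes, as in the source data, that ptest carries a problem_id column in place of classe:
# after identical preprocessing, ptrain and ptest should differ only in
# their final column (classe vs. problem_id)
setdiff(names(ptrain), names(ptest))   # expected: "classe"
setdiff(names(ptest), names(ptrain))   # expected: "problem_id"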
Now we fit the final model on the complete training data set.
# Use 3-fold cross validation to build the model
controlPara <- trainControl(method = "cv", number = 3, verboseIter = F)
# build the final model using ptrain
modelFinal <- train(classe ~ ., data = ptrain, method = "rf", trControl = controlPara)
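As an optional diagnostic (not part of the original write-up), caret's varImp shows which sensor features the forest leans on most:
# optional diagnostic: variable importance for the final random forest
varImp(modelFinal)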
Now we use our final model modelFinal to predict the class of the test data set; the same preprocessing has already been applied to the test data as to the training data.
The test data set ptest contains 20 observations, and the prediction for each observation is written to a separate file.
# predict the class of test data
prediction <- predict(modelFinal, newdata = ptest)
# create the character vector for prediction class
preds <- as.character(prediction)
# write the prediction for each observation to a separate file
write_to_files <- function(x) {
    # make sure the output directory exists before writing
    dir.create("output", showWarnings = FALSE)
    n <- length(x)
    for (i in 1:n) {
        filename <- paste0("output/problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = F, row.names = F, col.names = F)
    }
}
# create prediction files to submit
write_to_files(preds)
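To confirm the submission files were created, list the output directory; there should be 20 files:
# confirm that 20 prediction files were written
list.files("output")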