This project uses accelerometer measurements collected from 6 people over time. The data contain accelerometer measurements for different types of activities, together with a label identifying the quality of each activity. The goal of the project is to build a prediction model that predicts this label for the given test data set.
The report describes each step taken to build the model and all the preprocessing applied to the data sets along the way.
The caret package is used for this project. The training and test data sets are read in first; the test data is not touched until the final model is built.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
ptrain <- read.csv("data/pml-training.csv")
ptest <- read.csv("data/pml-testing.csv")
To estimate the out-of-sample error, the training set ptrain is split into a smaller training set, ptrain1, and a validation set, validation.
set.seed(1000)
inTrain <- createDataPartition(y = ptrain$classe, p = 0.7, list = FALSE)
ptrain1 <- ptrain[inTrain, ]
validation <- ptrain[-inTrain, ]
We now analyze each feature of ptrain1 and remove those that contribute little to the final model. Features that are almost entirely NA are not useful for building a model, and features with near-zero variance contribute almost nothing. Identifier variables such as the user name do not help the model either; they only make it more complex. We remove these features from ptrain1 and apply the same transformation to the validation set.
# remove features with near-zero variance
nzVar <- nearZeroVar(ptrain1)
ptrain1 <- ptrain1[, -nzVar]
# Applying same to validation set
validation <- validation[, -nzVar]
# identify features that are mostly (> 95%) NA and remove them
mostlyNAs <- sapply(ptrain1, function(x) mean(is.na(x))) > 0.95
ptrain1 <- ptrain1[, !mostlyNAs]
validation <- validation[, !mostlyNAs]
# remove variables that are not relevant for prediction: the first 5 columns (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp) are identifiers and timestamps
ptrain1 <- ptrain1[, -(1:5)]
validation <- validation[, -(1:5)]
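As a quick sanity check (not part of the original write-up), it is worth confirming how much this cleanup shrinks the feature set; the column count should drop substantially from the raw data's:
# quick check: dimensions of the cleaned training and validation sets
dim(ptrain1)
dim(validation)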
First, I planned to build the model using a neural network, since the problem is a supervised classification task and neural nets are good at classification. To make the model more robust, I use 3-fold cross validation to build it.
# Use 3-fold cross validation to build the model
controlPara <- trainControl(method = "cv", number = 3, verboseIter = F)
# build the model using ptrain1
modelFit1 <- train(classe ~ ., data = ptrain1, method = "nnet", trControl = controlPara, trace = F)
## Loading required package: nnet
The fitted model is used to predict the labels for the validation set and estimate the out-of-sample error. A confusion matrix compares the output of the model against the actual labels of the data.
# use modelFit1 to predict classe for the validation set
prediction <- predict(modelFit1, newdata = validation)
# Show the confusion matrix with other relevant information to evaluate our model
confusionMatrix(validation$classe, prediction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 821 0 691 0 162
## B 36 4 828 0 271
## C 16 5 925 0 80
## D 17 0 726 0 221
## E 41 0 741 0 300
##
## Overall Statistics
##
## Accuracy : 0.348
## 95% CI : (0.336, 0.361)
## No Information Rate : 0.665
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.192
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.882 0.44444 0.237 NA 0.290
## Specificity 0.828 0.80684 0.949 0.836 0.839
## Pos Pred Value 0.490 0.00351 0.902 NA 0.277
## Neg Pred Value 0.974 0.99895 0.385 NA 0.847
## Prevalence 0.158 0.00153 0.665 0.000 0.176
## Detection Rate 0.140 0.00068 0.157 0.000 0.051
## Detection Prevalence 0.284 0.19354 0.174 0.164 0.184
## Balanced Accuracy 0.855 0.62564 0.593 NA 0.564
The performance of this model is unsatisfactory, with accuracy of only about 35%, and the confusion matrix shows heavy misclassification. nnet fits a single-hidden-layer feed-forward network with default tuning, so a more flexible classifier may do considerably better.
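Before abandoning neural networks entirely, one option would be to tune nnet harder. The sketch below is illustrative rather than part of the original analysis: it assumes that centering and scaling the inputs (nnet is sensitive to predictor scale) and searching over the hidden-layer size and weight decay might improve the fit.
# illustrative sketch (assumed tuning values): search over nnet's size and
# decay parameters, and center/scale the inputs before fitting
nnetGrid <- expand.grid(size = c(5, 10, 15), decay = c(0.1, 0.01))
modelFit1b <- train(classe ~ ., data = ptrain1, method = "nnet",
                    trControl = controlPara, tuneGrid = nnetGrid,
                    preProcess = c("center", "scale"),
                    trace = F, maxit = 200)
In this report, though, we move on to a different algorithm instead.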
Let’s build a second model with Random Forest and evaluate its accuracy on the validation set. The model is again built with 3-fold cross validation.
# Use 3-fold cross validation to build the model
controlPara <- trainControl(method = "cv", number = 3, verboseIter = F)
# build the model using ptrain1
modelFit2 <- train(classe ~ ., data = ptrain1, method = "rf", trControl = controlPara)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
As before, the fitted model is used to predict the labels for the validation set and estimate the out-of-sample error, and a confusion matrix compares the model's output against the actual labels.
# use modelFit2 to predict classe for the validation set
prediction <- predict(modelFit2, newdata = validation)
# Show the confusion matrix with other relevant information to evaluate our model
confusionMatrix(validation$classe, prediction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 7 1131 1 0 0
## C 0 2 1024 0 0
## D 0 0 5 959 0
## E 0 0 0 5 1077
##
## Overall Statistics
##
## Accuracy : 0.997
## 95% CI : (0.995, 0.998)
## No Information Rate : 0.286
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.996
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.996 0.998 0.994 0.995 1.000
## Specificity 1.000 0.998 1.000 0.999 0.999
## Pos Pred Value 1.000 0.993 0.998 0.995 0.995
## Neg Pred Value 0.998 1.000 0.999 0.999 1.000
## Prevalence 0.286 0.193 0.175 0.164 0.183
## Detection Rate 0.284 0.192 0.174 0.163 0.183
## Detection Prevalence 0.284 0.194 0.174 0.164 0.184
## Balanced Accuracy 0.998 0.998 0.997 0.997 0.999
The modelFit2 model built with Random Forest is far better than our previous model modelFit1: validation accuracy is about 99.7%, so the estimated out-of-sample error is under 1%. We will use this as our final model and apply it to the test data set.
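For completeness, the estimated out-of-sample error can be read directly off the confusion matrix; a minimal sketch:
# estimated out-of-sample error = 1 - validation accuracy
cm <- confusionMatrix(validation$classe, prediction)
1 - as.numeric(cm$overall["Accuracy"])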
We divided the original training data set ptrain into ptrain1 and validation only to estimate the out-of-sample error. Now we rebuild the model using the complete training data set ptrain, applying the same preprocessing steps to both the ptrain and ptest data sets.
# remove features with near-zero variance
nzVar <- nearZeroVar(ptrain)
ptrain <- ptrain[, -nzVar]
# Applying same to test data set
ptest <- ptest[, -nzVar]
# identify features that are mostly (> 95%) NA and remove them
mostlyNAs <- sapply(ptrain, function(x) mean(is.na(x))) > 0.95
ptrain <- ptrain[, !mostlyNAs]
ptest <- ptest[, !mostlyNAs]
# remove variables that are not relevant for prediction: the first 5 columns (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp) are identifiers and timestamps
ptrain <- ptrain[, -(1:5)]
ptest <- ptest[, -(1:5)]
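A quick check (not part of the original analysis) that the two data sets still line up after identical preprocessing; this assumes, as in the source data, that ptest carries a problem_id column in place of classe:
# after identical preprocessing, ptrain and ptest should differ only in
# their final column (classe vs. problem_id)
setdiff(names(ptrain), names(ptest))   # expected: "classe"
setdiff(names(ptest), names(ptrain))   # expected: "problem_id"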
Now we fit the final model on the complete training data set.
# Use 3-fold cross validation to build the model
controlPara <- trainControl(method = "cv", number = 3, verboseIter = F)
# build the final model using ptrain
modelFinal <- train(classe ~ ., data = ptrain, method = "rf", trControl = controlPara)
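As an optional diagnostic (not part of the original write-up), caret's varImp shows which sensor features the forest leans on most:
# optional diagnostic: variable importance for the final random forest
varImp(modelFinal)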
Now we use our final model modelFinal to predict the class of the test data set; the same preprocessing has already been applied to the test data as to the training data.
The test data set ptest contains 20 observations, and the prediction for each observation is written to a separate file.
# predict the class of test data
prediction <- predict(modelFinal, newdata = ptest)
# create the character vector for prediction class
preds <- as.character(prediction)
# write the prediction for each observation to a separate file
write_to_files <- function(x) {
    # make sure the output directory exists before writing
    dir.create("output", showWarnings = FALSE)
    n <- length(x)
    for (i in 1:n) {
        filename <- paste0("output/problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = F, row.names = F, col.names = F)
    }
}
# create prediction files to submit
write_to_files(preds)
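To confirm the submission files were created, list the output directory; there should be 20 files:
# confirm that 20 prediction files were written
list.files("output")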