PML_Exercise
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har
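The code below refers to training and testing data frames that are never loaded in the transcript. As a minimal sketch, assuming the standard course files pml-training.csv and pml-testing.csv sit in the working directory (the file names and the na.strings choice are assumptions, not shown in the original):
library(caret)          # createDataPartition, train, confusionMatrix
library(randomForest)   # backend used by method = "rf"
# Read empty strings and "#DIV/0!" as NA so the NA filter below catches sparse columns
training <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
testing  <- read.csv("pml-testing.csv",  na.strings = c("NA", "", "#DIV/0!"))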
From the page, we know that the timestamp and window columns are only time/record bookkeeping and have no bearing on our response variable classe, so we remove them as shown below.
> trainingN <- subset(training, select =
+ -c(X, user_name,
+ raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp,
+ new_window, num_window)
+ )
After that, exploring the data shows that many variables are mostly filled with NAs. These columns carry little information, possibly because those measurements were never taken, so we remove them as well.
> # TRUE when more than half of a column's values are NA (helper, not used below)
> var.bad <- function(x){ sum(is.na(x)) > length(x)/2 }
> # Return the indices of columns that have fewer NAs than half the number of rows
> matrix.littleNA <- function(m){
+     threshold <- dim(m)[1]/2
+     r <- c()
+     for(v in 1:dim(m)[2]){
+         if(sum(is.na(m[, v])) < threshold) r[length(r)+1] <- v
+     }
+     r
+ }
> trainingN <- subset(trainingN, select = matrix.littleNA(trainingN))
> summary(trainingN)
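The same column filter can be written more compactly with colMeans; this is only an equivalent sketch of the step above, not what the original transcript ran:
# Keep columns in which fewer than half of the entries are NA
trainingN <- trainingN[, colMeans(is.na(trainingN)) < 0.5]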
Now that the data is clean, let's build the model.
To build the model, I split the cleaned data into two parts: one for training and one for validation. I train a random forest on the training part and use cross-validation within it to tune the model (the number of variables tried at each split).
> set.seed(16888)
> inTrain <- createDataPartition(y=trainingN$classe, p=0.1, list=FALSE)
> training.rf <- trainingN[inTrain, ]
> testing.rf <- trainingN[-inTrain, ]
> modFit <- train(classe ~ ., method="rf", trControl=trainControl(method="cv"), data=training.rf)
> print(modFit$finalModel)
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 27
OOB estimate of error rate: 4.94%
Confusion matrix:
     A   B   C   D   E class.error
A  545   2   2   7   2  0.02329749
B   18 353   6   2   1  0.07105263
C    1  17 321   4   0  0.06413994
D    3   5  12 301   1  0.06521739
E    1   6   3   4 347  0.03878116
I used the default options for the cross-validation, so the training subset was again split into 10 folds. This can be seen from the per-fold results below.
> print(modFit$resample)
Accuracy Kappa Resample
1 0.9000000 0.8736097 Fold02
2 0.9191919 0.8974625 Fold01
3 0.9504950 0.9371812 Fold04
4 0.9393939 0.9229172 Fold03
5 0.9306931 0.9122502 Fold06
6 0.9306931 0.9122066 Fold05
7 0.8910891 0.8615576 Fold08
8 0.8888889 0.8580923 Fold07
9 0.8600000 0.8222222 Fold10
10 0.8383838 0.7948718 Fold09
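The number of folds comes from the number argument of trainControl; the call below is only an illustrative sketch (not run for the results above) of how to request 5 folds instead of the default 10:
# Illustration: 5-fold CV instead of the default 10 folds
ctrl5   <- trainControl(method = "cv", number = 5)
modFit5 <- train(classe ~ ., method = "rf", trControl = ctrl5, data = training.rf)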
To estimate the accuracy of the model, we use the independent validation part of the data:
> testing.predict <- predict(modFit, testing.rf)
> confusionMatrix(testing.rf$classe, testing.predict)
Confusion Matrix and Statistics
          Reference
Prediction    A    B    C    D    E
         A 4949   34   20    7   12
         B  134 3099  144   32    8
         C    0   87 2965   25    2
         D   14   12  138 2722    8
         E    3   39   36   47 3121
Overall Statistics
Accuracy : 0.9546
95% CI : (0.9514, 0.9576)
No Information Rate : 0.2888
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9425
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9704   0.9474   0.8977   0.9608   0.9905
Specificity            0.9942   0.9779   0.9921   0.9884   0.9914
Pos Pred Value         0.9855   0.9069   0.9630   0.9406   0.9615
Neg Pred Value         0.9881   0.9879   0.9768   0.9925   0.9979
Prevalence             0.2888   0.1852   0.1871   0.1604   0.1784
Detection Rate         0.2803   0.1755   0.1679   0.1542   0.1767
Detection Prevalence   0.2844   0.1935   0.1744   0.1639   0.1838
Balanced Accuracy      0.9823   0.9627   0.9449   0.9746   0.9909
From the output, we estimate the out-of-sample accuracy to be about 0.95 (95% CI 0.9514 to 0.9576), which corresponds to an expected out-of-sample error of roughly 5%.
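The same estimate can also be computed directly from the validation predictions; a small sketch:
# Out-of-sample accuracy and expected error on the held-out validation set
oos.accuracy <- mean(testing.predict == testing.rf$classe)
oos.error    <- 1 - oos.accuracy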
Finally, let's predict on the testing data.
> testing[, "classe"] = predict(modFit, testing)
> testing[, c("problem_id", "classe")]
problem_id classe
1 1 B
2 2 A
3 3 B
4 4 A
5 5 A
6 6 E
7 7 D
8 8 B
9 9 A
10 10 A
11 11 B
12 12 C
13 13 B
14 14 A
15 15 E
16 16 E
17 17 A
18 18 D
19 19 A
20 20 B
The accuracy is already good even though I only used 10 percent of the data for training, in order to get an immediate result. It could be improved by training on more of the data, at the cost of a longer runtime.
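As a sketch of that improvement (the 70/30 split and the parallel backend are assumptions, not part of the run above):
library(doParallel)            # optional: parallel backend for caret's resampling
cl <- makePSOCKcluster(4)      # adjust the number of workers to your machine
registerDoParallel(cl)
# Train on 70% of the cleaned data instead of 10%
inTrain2    <- createDataPartition(y = trainingN$classe, p = 0.7, list = FALSE)
modFit.full <- train(classe ~ ., method = "rf",
                     trControl = trainControl(method = "cv"),
                     data = trainingN[inTrain2, ])
stopCluster(cl)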
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.