PML_Exercise
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har
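The code below refers to training and testing data frames that are never loaded in the transcript. As a minimal sketch, assuming the standard course files pml-training.csv and pml-testing.csv sit in the working directory (the file names and the na.strings choice are assumptions, not shown in the original):
library(caret)          # createDataPartition, train, confusionMatrix
library(randomForest)   # backend used by method = "rf"
# Read empty strings and "#DIV/0!" as NA so the NA filter below catches sparse columns
training <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
testing  <- read.csv("pml-testing.csv",  na.strings = c("NA", "", "#DIV/0!"))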
From the page, we know that the timestamp and window columns are only time/record bookkeeping and have no bearing on our response variable classe, so we remove them as shown below.
> trainingN <- subset(training, select =
+ -c(X, user_name,
+ raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp,
+ new_window, num_window)
+ )
After that, exploring the data shows that many variables are mostly filled with NAs. These columns carry little information, possibly because those measurements were never taken, so we remove them as well.
> # TRUE when more than half of a column's values are NA (helper, not used below)
> var.bad <- function(x){ sum(is.na(x)) > length(x)/2 }
> # Return the indices of columns that have fewer NAs than half the number of rows
> matrix.littleNA <- function(m){
+     threshold <- dim(m)[1]/2
+     r <- c()
+     for(v in 1:dim(m)[2]){
+         if(sum(is.na(m[, v])) < threshold) r[length(r)+1] <- v
+     }
+     r
+ }
> trainingN <- subset(trainingN, select = matrix.littleNA(trainingN))
> summary(trainingN)
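The same column filter can be written more compactly with colMeans; this is only an equivalent sketch of the step above, not what the original transcript ran:
# Keep columns in which fewer than half of the entries are NA
trainingN <- trainingN[, colMeans(is.na(trainingN)) < 0.5]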
Now that the data is clean, let's build the model.
To build the model, I split the cleaned data into two parts: one for training and one for validation. I train a random forest on the training part and use cross-validation within it to tune the model (the number of variables tried at each split).
> set.seed(16888)
> inTrain <- createDataPartition(y=trainingN$classe, p=0.1, list=FALSE)
> training.rf <- trainingN[inTrain, ]
> testing.rf <- trainingN[-inTrain, ]
> modFit <- train(classe ~ ., method="rf", trControl=trainControl(method="cv"), data=training.rf)
> print(modFit$finalModel)
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 27
OOB estimate of error rate: 4.94%
Confusion matrix:
     A   B   C   D   E class.error
A  545   2   2   7   2  0.02329749
B   18 353   6   2   1  0.07105263
C    1  17 321   4   0  0.06413994
D    3   5  12 301   1  0.06521739
E    1   6   3   4 347  0.03878116
I used the default options for the cross-validation, so the training subset was again split into 10 folds. This can be seen from the per-fold results below.
> print(modFit$resample)
Accuracy Kappa Resample
1 0.9000000 0.8736097 Fold02
2 0.9191919 0.8974625 Fold01
3 0.9504950 0.9371812 Fold04
4 0.9393939 0.9229172 Fold03
5 0.9306931 0.9122502 Fold06
6 0.9306931 0.9122066 Fold05
7 0.8910891 0.8615576 Fold08
8 0.8888889 0.8580923 Fold07
9 0.8600000 0.8222222 Fold10
10 0.8383838 0.7948718 Fold09
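The number of folds comes from the number argument of trainControl; the call below is only an illustrative sketch (not run for the results above) of how to request 5 folds instead of the default 10:
# Illustration: 5-fold CV instead of the default 10 folds
ctrl5   <- trainControl(method = "cv", number = 5)
modFit5 <- train(classe ~ ., method = "rf", trControl = ctrl5, data = training.rf)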
To estimate the accuracy of the model, we use the independent validation part of the data:
> testing.predict <- predict(modFit, testing.rf)
> confusionMatrix(testing.rf$classe, testing.predict)
Confusion Matrix and Statistics
          Reference
Prediction    A    B    C    D    E
         A 4949   34   20    7   12
         B  134 3099  144   32    8
         C    0   87 2965   25    2
         D   14   12  138 2722    8
         E    3   39   36   47 3121
Overall Statistics
Accuracy : 0.9546
95% CI : (0.9514, 0.9576)
No Information Rate : 0.2888
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9425
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9704   0.9474   0.8977   0.9608   0.9905
Specificity            0.9942   0.9779   0.9921   0.9884   0.9914
Pos Pred Value         0.9855   0.9069   0.9630   0.9406   0.9615
Neg Pred Value         0.9881   0.9879   0.9768   0.9925   0.9979
Prevalence             0.2888   0.1852   0.1871   0.1604   0.1784
Detection Rate         0.2803   0.1755   0.1679   0.1542   0.1767
Detection Prevalence   0.2844   0.1935   0.1744   0.1639   0.1838
Balanced Accuracy      0.9823   0.9627   0.9449   0.9746   0.9909
From the output, we estimate the out-of-sample accuracy to be about 0.95 (95% CI 0.9514 to 0.9576), which corresponds to an expected out-of-sample error of roughly 5%.
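The same estimate can also be computed directly from the validation predictions; a small sketch:
# Out-of-sample accuracy and expected error on the held-out validation set
oos.accuracy <- mean(testing.predict == testing.rf$classe)
oos.error    <- 1 - oos.accuracy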
Finally, let's predict on the testing data.
> testing[, "classe"] = predict(modFit, testing)
> testing[, c("problem_id", "classe")]
problem_id classe
1 1 B
2 2 A
3 3 B
4 4 A
5 5 A
6 6 E
7 7 D
8 8 B
9 9 A
10 10 A
11 11 B
12 12 C
13 13 B
14 14 A
15 15 E
16 16 E
17 17 A
18 18 D
19 19 A
20 20 B
The accuracy is already good even though I only used 10 percent of the data for training, in order to get an immediate result. It could be improved by training on more of the data, at the cost of a longer runtime.
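As a sketch of that improvement (the 70/30 split and the parallel backend are assumptions, not part of the run above):
library(doParallel)            # optional: parallel backend for caret's resampling
cl <- makePSOCKcluster(4)      # adjust the number of workers to your machine
registerDoParallel(cl)
# Train on 70% of the cleaned data instead of 10%
inTrain2    <- createDataPartition(y = trainingN$classe, p = 0.7, list = FALSE)
modFit.full <- train(classe ~ ., method = "rf",
                     trControl = trainControl(method = "cv"),
                     data = trainingN[inTrain2, ])
stopCluster(cl)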
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.