{"id":945,"date":"2014-07-14T07:18:52","date_gmt":"2014-07-14T12:18:52","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=945"},"modified":"2021-09-22T15:25:24","modified_gmt":"2021-09-22T19:25:24","slug":"analyzing-fitness-data-from-wearable-devices-in-matlab","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2014\/07\/14\/analyzing-fitness-data-from-wearable-devices-in-matlab\/","title":{"rendered":"Analyzing Fitness Data from Wearable Devices in MATLAB"},"content":{"rendered":"\r\n<div class=\"content\"><!--introduction--><p>Collecting and tracking health and fitness data with wearable devices is about to go mainstream as the smartphone giants like Apple, Google and Samsung jump into the fray. But if you collect data, what's the point if you don't analyze it?<\/p><p>Today's guest blogger, <a href=\"\" rel=\"author\">\r\nToshi Takeuchi<\/a>, would like to share an analysis of a weight lifting\r\ndataset he found in a public repository.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#bcd57c7c-cdf3-4fce-b9d6-19238e7e53d7\">Motivation, dataset, and prediction accuracy<\/a><\/li><li><a href=\"#cee589db-8cc9-4bfb-ae79-db17e1054ff7\">Data preprocessing and exploratory analysis<\/a><\/li><li><a href=\"#985b9e19-a7f9-4877-ba7c-bd7931369a9f\">Predictive Modeling with Random Forest<\/a><\/li><li><a href=\"#22cfa97a-2b83-4689-9db9-ba9bbd10d3c1\">Plot misclassification errors by number of trees<\/a><\/li><li><a href=\"#5dd56fa6-429e-47b8-8fcc-6eadfd3ff4dd\">Variable Importance<\/a><\/li><li><a href=\"#a00863d2-6678-43c1-955c-eced9fbfb1d9\">Evaluate trade-off with ROC plot<\/a><\/li><li><a href=\"#e1a4143a-6f85-4549-a1f6-916c54a940be\">The reduced model with 12 features<\/a><\/li><li><a href=\"#7ada591a-d672-4de0-b861-6120bfeeabb2\">Conclusion and the next steps - integrate your code into your app<\/a><\/li><\/ul><\/div><h4>Motivation, dataset, and prediction accuracy<a 
name=\"bcd57c7c-cdf3-4fce-b9d6-19238e7e53d7\"><\/a><\/h4><p>The <a title=\"http:\/\/groupware.les.inf.puc-rio.br\/har (link no longer works)\">Human Activity Recognition (HAR)<\/a> Weight Lifting Exercise Dataset provides measurements to determine \"how well an activity was performed\". Six subjects each performed one set of 10 Unilateral Dumbbell Biceps Curls in five different ways.<\/p><p>When I came across this dataset, I immediately thought of building a mobile app to advise end users whether they are performing the exercise correctly, and if not, which common mistakes they are making. I used the powerful 'Random Forest' algorithm to see if I could build a successful predictive model to enable such an app. I was able to achieve <b>99% prediction accuracy<\/b> with this dataset and I would like to share my results with you.<\/p><p>The dataset provides 39,242 samples with 159 variables labeled with 5 types of activity to detect - 1 correct method and 4 common mistakes:<\/p><div><ol><li>exactly according to the specification (Class A)<\/li><li>throwing the elbows to the front (Class B)<\/li><li>lifting the dumbbell only halfway (Class C)<\/li><li>lowering the dumbbell only halfway (Class D)<\/li><li>throwing the hips to the front (Class E)<\/li><\/ol><\/div><p>Sensors were placed on the subjects' belts, armbands, gloves, and dumbbells, as described below:<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/on-body-sensing-schema.png\" alt=\"\"> <\/p><p><b>Citation<\/b> <i>Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.<\/i><\/p><h4>Data preprocessing and exploratory analysis<a name=\"cee589db-8cc9-4bfb-ae79-db17e1054ff7\"><\/a><\/h4><p>Usually you cannot use raw data directly. 
Preprocessing is an important part of your analysis workflow that has significant downstream impact.<\/p><div><ol><li>Load the dataset and inspect data for missing values<\/li><li>Partition the dataset for cross validation<\/li><li>Clean and normalize variables<\/li><li>Select predictor variables (features)<\/li><\/ol><\/div><p>Among those steps, <a href=\"http:\/\/en.wikipedia.org\/wiki\/Cross-validation_%28statistics%29\">cross validation<\/a> is a key step specific to predictive modeling. Roughly speaking, you hold out part of the available data for testing later, and build models using the remaining dataset. The held-out set is called the 'test set' and the set we use for modeling is called the 'training set'. This makes it more difficult to <a href=\"http:\/\/en.wikipedia.org\/wiki\/Overfitting\">overfit<\/a> your model, because you can test your model against the data you didn't use in the modeling process, giving you a realistic idea of how the model would actually perform with unknown data.<\/p><p>Exploratory analysis usually begins with visualizing data to get oriented with its nuances. Plots of the first four variables below show that:<\/p><div><ol><li>data is organized by class - like 'AAAAABBBBBCCCCC'. This can be an artifact of the way the data was collected and real-life data may not be structured like this, so we want to use more realistic data to build our model. 
We can fix this issue by randomly reshuffling the data.<\/li><li>data points cluster around a few different mean values - indicating that measurements were taken by devices calibrated in a few different ways<\/li><li>those variables exhibit a distinct pattern for Class E (colored in magenta) - those variables will be useful to isolate it<\/li><\/ol><\/div><p>Review <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/preprocess.m\"><tt>preprocess.m<\/tt><\/a> if you are curious about the details.<\/p><pre class=\"codeinput\">preprocess\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/analyzeFitnessData_01.png\" alt=\"\"> <h4>Predictive Modeling with Random Forest<a name=\"985b9e19-a7f9-4877-ba7c-bd7931369a9f\"><\/a><\/h4><p>The dataset has some issues with calibration. We could further preprocess the data in order to remove calibration gaps. This time, however, I would like to use a flexible predictive algorithm called <a href=\"http:\/\/en.wikipedia.org\/wiki\/Random_forest\">Random Forest<\/a>. In MATLAB, this algorithm is implemented in the <a href=\"https:\/\/www.mathworks.com\/help\/stats\/treebagger.html\">TreeBagger<\/a> class available in <a href=\"https:\/\/www.mathworks.com\/products\/statistics\/\">Statistics Toolbox<\/a>.<\/p><p>Random Forest became popular particularly after it was used by a number of winners in <a href=\"http:\/\/www.kaggle.com\">Kaggle competitions<\/a>. It uses a large ensemble of decision trees (thus 'forest') trained on random subsets of data and uses the majority votes of those trees to predict the result. It tends to produce a highly accurate result, but the complexity of the algorithm makes it slow and difficult to interpret.<\/p><p>To accelerate the computation, I will enable the parallel option supported by <a href=\"https:\/\/www.mathworks.com\/products\/parallel-computing\/\">Parallel Computing Toolbox<\/a>. 
You can comment out unnecessary code if you don't use it.<\/p><p>Once the model is built, you will see the <a href=\"https:\/\/www.mathworks.com\/help\/stats\/confusionmat.html\">confusion matrix<\/a> that compares the actual class labels to predicted class labels. If everything lines up on a diagonal line, then you achieved 100% accuracy. Off-diagonal numbers are misclassification errors.<\/p><p>The model has a very high prediction accuracy even though we saw earlier that our dataset had some calibration issues.<\/p><p>Initialize parallel option - comment out if you don't use parallel<\/p><pre class=\"codeinput\">poolobj = gcp(<span class=\"string\">'nocreate'<\/span>); <span class=\"comment\">% don't create a new pool even if no pool exists<\/span>\r\n<span class=\"keyword\">if<\/span> isempty(poolobj)\r\n    parpool(<span class=\"string\">'local'<\/span>,2);\r\n<span class=\"keyword\">end<\/span>\r\nopts = statset(<span class=\"string\">'UseParallel'<\/span>,true);\r\n<\/pre><pre class=\"codeoutput\">Starting parallel pool (parpool) using the 'local' profile ... 
connected to 2 workers.\r\n<\/pre><p>Create a Random Forest model with 100 trees, parallel enabled...<\/p><pre class=\"codeinput\">rfmodel = TreeBagger(100,table2array(Xtrain),Ytrain,<span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'Method'<\/span>,<span class=\"string\">'classification'<\/span>,<span class=\"string\">'oobvarimp'<\/span>,<span class=\"string\">'on'<\/span>,<span class=\"string\">'options'<\/span>,opts);\r\n<\/pre><pre>Here's the non-parallel version of the same model\r\n  rfmodel = TreeBagger(100,table2array(Xtrain),Ytrain,...\r\n      'Method','classification','oobvarimp','on');<\/pre><p>Predict the class labels for the test set.<\/p><pre class=\"codeinput\">[Ypred,Yscore]= predict(rfmodel,table2array(Xtest));\r\n<\/pre><p>Compute the confusion matrix and prediction accuracy.<\/p><pre class=\"codeinput\">C = confusionmat(Ytest,categorical(Ypred));\r\ndisp(array2table(C,<span class=\"string\">'VariableNames'<\/span>,rfmodel.ClassNames,<span class=\"string\">'RowNames'<\/span>,rfmodel.ClassNames))\r\nfprintf(<span class=\"string\">'Prediction accuracy on test set: %f\\n\\n'<\/span>, sum(C(logical(eye(5))))\/sum(sum(C)))\r\n<\/pre><pre class=\"codeoutput\">          A       B      C      D      E \r\n         ____    ___    ___    ___    ___\r\n    A    1133      1      0      0      0\r\n    B       4    728      1      0      0\r\n    C       0      3    645      3      0\r\n    D       0      0      8    651      0\r\n    E       0      0      0      6    741\r\nPrediction accuracy on test set: 0.993374\r\n\r\n<\/pre><h4>Plot misclassification errors by number of trees<a name=\"22cfa97a-2b83-4689-9db9-ba9bbd10d3c1\"><\/a><\/h4><p>I happened to pick 100 trees in the model, but you can check the misclassification errors relative to the number of trees used in the prediction. 
The plot shows that 100 trees is overkill - we could use fewer trees to speed things up.<\/p><pre class=\"codeinput\">figure\r\nplot(oobError(rfmodel));\r\nxlabel(<span class=\"string\">'Number of Grown Trees'<\/span>);\r\nylabel(<span class=\"string\">'Out-of-Bag Classification Error'<\/span>);\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/analyzeFitnessData_02.png\" alt=\"\"> <h4>Variable Importance<a name=\"5dd56fa6-429e-47b8-8fcc-6eadfd3ff4dd\"><\/a><\/h4><p>One major criticism of Random Forest is that it is a black box algorithm and it is not easy to understand what it is doing. However, Random Forest can provide the variable importance measure, which corresponds to the change in prediction error with and without the presence of a given variable in the model.<\/p><p>For our hypothetical weight lifting trainer mobile app, Random Forest would be too cumbersome and slow to implement, so you want to use a simpler prediction model with fewer predictor variables. Random Forest can help you select which predictors to drop without sacrificing the prediction accuracy too much.<\/p><p>Let's see how you can do this with <tt>TreeBagger<\/tt>.<\/p><p>Get the names of variables<\/p><pre class=\"codeinput\">vars = Xtrain.Properties.VariableNames;\r\n<\/pre><p>Get and sort the variable importance scores. 
Because we set <tt>'oobvarimp'<\/tt> to <tt>'on'<\/tt>, the model contains <tt>OOBPermutedVarDeltaError<\/tt>, which acts as the variable importance measure.<\/p><pre class=\"codeinput\">varimp = rfmodel.OOBPermutedVarDeltaError';\r\n[~,idxvarimp]= sort(varimp);\r\nlabels = vars(idxvarimp);\r\n<\/pre><p>Plot the sorted scores.<\/p><pre class=\"codeinput\">figure\r\nbarh(varimp(idxvarimp),1); ylim([1 52]);\r\nset(gca, <span class=\"string\">'YTickLabel'<\/span>,labels, <span class=\"string\">'YTick'<\/span>,1:numel(labels))\r\ntitle(<span class=\"string\">'Variable Importance'<\/span>); xlabel(<span class=\"string\">'score'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/analyzeFitnessData_03.png\" alt=\"\"> <h4>Evaluate trade-off with ROC plot<a name=\"a00863d2-6678-43c1-955c-eced9fbfb1d9\"><\/a><\/h4><p>Now let's examine the trade-off between the number of predictor variables and prediction accuracy. The <a href=\"http:\/\/en.wikipedia.org\/wiki\/Receiver_operating_characteristic\">Receiver operating characteristic (ROC)<\/a> plot provides a convenient way to visualize and compare performance of binary classifiers. You plot the false positive rate against the true positive rate at various prediction thresholds to produce the curves. If you get a completely random result, the curve should follow a diagonal line. If you get 100% accuracy, then the curve should hug the upper left corner. This means you can use the area under the curve (AUC) to evaluate how well each model performs.<\/p><p>Let's plot ROC curves with different sets of predictor variables, using the \"C\" class as the positive class, since we can only do this one class at a time, and the previous confusion matrix shows more misclassification errors for this class than others. 
You can use <a href=\"https:\/\/www.mathworks.com\/help\/stats\/perfcurve.html\"><tt>perfcurve<\/tt><\/a> to compute ROC curves.<\/p><p>Check out <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/myROCplot.m\"><tt>myROCplot.m<\/tt><\/a> to see the details.<\/p><pre class=\"codeinput\">nFeatures = [3,5,10,15,20,25,52];\r\nmyROCplot(Xtrain,Ytrain,Xtest,Ytest,<span class=\"string\">'C'<\/span>,nFeatures,varimp)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/analyzeFitnessData_04.png\" alt=\"\"> <h4>The reduced model with 12 features<a name=\"e1a4143a-6f85-4549-a1f6-916c54a940be\"><\/a><\/h4><p>Based on the previous analysis, it looks like you can achieve a high accuracy rate even if you use as few as 10 features. Let's say we settled for 12 features. We now know you don't have to use the data from the glove for prediction, so that's one less sensor our hypothetical end users would have to buy. Given this result, I may even consider dropping the arm band, and just stick with the belt and dumbbell sensors.<\/p><p>Show which features were included.<\/p><pre class=\"codeinput\">disp(table(varimp(idxvarimp(1:12)),<span class=\"string\">'RowNames'<\/span>,vars(idxvarimp(1:12)),<span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'VariableNames'<\/span>,{<span class=\"string\">'Importance'<\/span>}));\r\n<\/pre><pre class=\"codeoutput\">                        Importance\r\n                        __________\r\n    accel_belt_y        0.69746   \r\n    pitch_dumbbell      0.77255   \r\n    accel_arm_z         0.78941   \r\n    accel_belt_x        0.81696   \r\n    magnet_arm_y        0.81946   \r\n    accel_arm_x         0.87168   \r\n    magnet_arm_x        0.87897   \r\n    accel_dumbbell_x    0.92222   \r\n    magnet_forearm_x     1.0172   \r\n    total_accel_belt     1.0461   \r\n    gyros_arm_z          1.1077   \r\n    gyros_belt_x         1.1235   \r\n<\/pre><p>Shut down 
the parallel pool.<\/p><pre class=\"codeinput\">delete(poolobj);\r\n<\/pre><h4>Conclusion and the next steps - integrate your code into your app<a name=\"7ada591a-d672-4de0-b861-6120bfeeabb2\"><\/a><\/h4><p>Despite my initial misgivings about the data, we were able to maintain high prediction accuracy with a Random Forest model using just 12 features. However, Random Forest is probably not an ideal model to implement on a mobile app given its memory footprint and slow response time.<\/p><p>The next step is to find a simpler model, such as <a href=\"https:\/\/www.mathworks.com\/help\/stats\/mnrfit.html\">logistic regression<\/a>, that can perform decently. You may need to do more preprocessing of the data to make it work.<\/p><p>Finally, I have never tried this before, but you could generate C code out of MATLAB to incorporate it into a mobile app. Watch this webinar, <a href=\"https:\/\/www.mathworks.com\/videos\/matlab-to-iphone-made-easy-90834.html\">MATLAB to iPhone Made Easy<\/a>, for more details.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/iphoneWebinar.png\" alt=\"\"> <\/p><p>Do you track your activities with wearable devices? Have you tried analyzing the data? 
Tell us about your experience <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=945#respond\">here<\/a>!<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_ed798123a2fb4ebea8fda0aeecda6627() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='ed798123a2fb4ebea8fda0aeecda6627 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' ed798123a2fb4ebea8fda0aeecda6627';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2014 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_ed798123a2fb4ebea8fda0aeecda6627()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2014a<br><\/p><\/div><!--\r\ned798123a2fb4ebea8fda0aeecda6627 ##### SOURCE BEGIN #####\r\n%% Analyzing Fitness Data from Wearable Devices in MATLAB\r\n% Collecting and tracking health and fitness data with wearable devices is\r\n% about to go mainstream as the smartphone giants like Apple, Google and\r\n% Samsung jump into the fray. But if you collect data, what's the point\r\n% if you don't analyze it?\r\n%\r\n% <html>Today's guest blogger, <a href=\"\" rel=\"author\">\r\n% Toshi Takeuchi<\/a>, would like to share an analysis of a weight lifting\r\n% dataset he found in a public repository.<\/html>\r\n%\r\n%% Motivation, dataset, and prediction accuracy\r\n% The <http:\/\/groupware.les.inf.puc-rio.br\/har Human Activity Recognition\r\n% (HAR)> Weight Lifting Exercise Dataset provides measurements to determine\r\n% \"how well an activity was performed\". 
6 subjects performed 1 set of \r\n% 10 Unilateral Dumbbell Biceps Curl in 5 different ways. \r\n% \r\n% When I came across this dataset, I immediately thought of building a\r\n% mobile app to advise end users whether they are performing the exercise\r\n% correctly, and if not, which common mistakes they are making. I used the\r\n% powerful 'Random Forest' algorithm to see if I could build a\r\n% successful predictive model to enable such an app. I was able to\r\n% achieve *99% prediction accuracy* with this dataset and I would like to\r\n% share my results with you.\r\n% \r\n% The dataset provides 39,242 samples with 159 variables labeled with 5\r\n% types of activity to detect - 1 correct method and 4 common mistakes:\r\n% \r\n% # exactly according to the specification (Class A)\r\n% # throwing the elbows to the front (Class B)\r\n% # lifting the dumbbell only halfway (Class C)\r\n% # lowering the dumbbell only halfway (Class D)\r\n% # throwing the hips to the front (Class E)\r\n% \r\n% Sensors were placed on the subjects' belts, armbands, glove and\r\n% dumbbells, as described below:\r\n% \r\n% <<on-body-sensing-schema.png>>\r\n%\r\n% *Citation*\r\n% _Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. \r\n% Qualitative Activity Recognition of Weight Lifting Exercises. \r\n% Proceedings of 4th International Conference in Cooperation with SIGCHI \r\n% (Augmented Human '13) . Stuttgart, Germany: ACM SIGCHI, 2013._\r\n% Read more: http:\/\/groupware.les.inf.puc-rio.br\/har#ixzz34dpS6oks\r\n%\r\n%% Data preprocessing and exploratory analysis\r\n% Usually you cannot use raw data directly. Preprocessing is an important\r\n% part of your analysis workflow that has significant downstream impact. 
\r\n% \r\n% # Load the dataset and inspect data for missing values\r\n% # Partition the dataset for cross validation\r\n% # Clean and normalize variables\r\n% # Select predictor variables (features)\r\n%\r\n% Among those steps,\r\n% <http:\/\/en.wikipedia.org\/wiki\/Cross-validation_%28statistics%29 cross\r\n% validation> is a key step specific to predictive modeling. Roughly\r\n% speaking, you hold out part of available data for testing later, and\r\n% build models using the remaining dataset. The held out set is called the\r\n% 'test set' and the set we use for modeling is called the 'training set'.\r\n% This makes it more difficult to <http:\/\/en.wikipedia.org\/wiki\/Overfitting\r\n% overfit> your model, because you can test your model against the data you\r\n% didn't use in the modeling process, giving you a realistic idea how the\r\n% model would actually perform with unknown data.\r\n%\r\n% Exploratory analysis usually begins with visualizing data to get oriented\r\n% with its nuances. Plots of the first four variables below show that:\r\n% \r\n% # data is organized by class - like 'AAAAABBBBBCCCCC'. This can be an\r\n% artifact of the way the data was collected and real life data may not\r\n% be structured like this, so we want to use more realistic data to\r\n% build our model. We can fix this issue by randomly reshuffling the data.\r\n% # data points cluster around a few different mean values - \r\n% indicating that measurements were taken by devices calibrated in a few \r\n% different ways\r\n% # those variables exhibit a distinct pattern for Class E (colored in \r\n% magenta) - those variables will be useful to isolate it\r\n%\r\n% Review <https:\/\/blogs.mathworks.com\/images\/loren\/2014\/preprocess.m\r\n% |preprocess.m|> if you are curious about the details.\r\n\r\npreprocess\r\n\r\n%% Predictive Modeling with Random Forest\r\n% The dataset has some issues with calibration. 
We could further\r\n% preprocess the data in order to remove calibration gaps. This time,\r\n% however, I would like to use a flexible predictive algorithm called\r\n% <http:\/\/en.wikipedia.org\/wiki\/Random_forest Random Forest>. In MATLAB,\r\n% this algorithm is implemented in the\r\n% <https:\/\/www.mathworks.com\/help\/stats\/treebagger.html TreeBagger> class\r\n% available in <https:\/\/www.mathworks.com\/products\/statistics\/ Statistics\r\n% Toolbox>.\r\n% \r\n% Random Forest became popular particularly after it was used by a number\r\n% of winners in <http:\/\/www.kaggle.com Kaggle competitions>. It uses a\r\n% large ensemble of decision trees (thus 'forest') trained on random\r\n% subsets of data and uses the majority votes of those trees to predict the\r\n% result. It tends to produce a highly accurate result, but the complexity\r\n% of the algorithm makes it slow and difficult to interpret.\r\n% \r\n% To accelerate the computation, I will enable the parallel option\r\n% supported by <https:\/\/www.mathworks.com\/products\/parallel-computing\/\r\n% Parallel Computing Toolbox>. You can comment out unnecessary code if you\r\n% don't use it.\r\n%\r\n% Once the model is built, you will see the \r\n% <https:\/\/www.mathworks.com\/help\/stats\/confusionmat.html confusion matrix>\r\n% that compares the actual class labels to predicted class labels. If \r\n% everything lines up on a diagonal line, then you achieved 100% accuracy. \r\n% Off-diagonal numbers are misclassification errors. \r\n%\r\n% The model has a very high prediction accuracy even though we saw earlier\r\n% that our dataset had some calibration issues. 
\r\n%%\r\n% Initialize parallel option - comment out if you don't use parallel\r\npoolobj = gcp('nocreate'); % don't create a new pool even if no pool exists\r\nif isempty(poolobj)\r\n    parpool('local',2);\r\nend\r\nopts = statset('UseParallel',true);\r\n%%\r\n% Create a Random Forest model with 100 trees, parallel enabled...\r\nrfmodel = TreeBagger(100,table2array(Xtrain),Ytrain,...\r\n    'Method','classification','oobvarimp','on','options',opts);\r\n%%\r\n%  Here's the non-parallel version of the same model\r\n%    rfmodel = TreeBagger(100,table2array(Xtrain),Ytrain,...\r\n%        'Method','classification','oobvarimp','on');\r\n%%\r\n% Predict the class labels for the test set.\r\n[Ypred,Yscore]= predict(rfmodel,table2array(Xtest));\r\n%%\r\n% Compute the confusion matrix and prediction accuracy.\r\nC = confusionmat(Ytest,categorical(Ypred));\r\ndisp(array2table(C,'VariableNames',rfmodel.ClassNames,'RowNames',rfmodel.ClassNames))\r\nfprintf('Prediction accuracy on test set: %f\\n\\n', sum(C(logical(eye(5))))\/sum(sum(C)))\r\n\r\n%% Plot misclassification errors by number of trees\r\n% I happened to pick 100 trees in the model, but you can check the\r\n% misclassification errors relative to the number of trees used in the\r\n% prediction. The plot shows that 100 trees is overkill - we could use\r\n% fewer trees to speed things up. \r\n\r\nfigure\r\nplot(oobError(rfmodel));\r\nxlabel('Number of Grown Trees');\r\nylabel('Out-of-Bag Classification Error');\r\n\r\n%% Variable Importance\r\n% One major criticism of Random Forest is that it is a black box algorithm\r\n% and it is not easy to understand what it is doing. 
However, Random Forest\r\n% can provide the variable importance measure, which corresponds to the\r\n% change in prediction error with and without the presence of a given\r\n% variable in the model.\r\n%\r\n% For our hypothetical weight lifting trainer mobile app, Random Forest\r\n% would be too cumbersome and slow to implement, so you want to use a\r\n% simpler prediction model with fewer predictor variables. Random Forest\r\n% can help you select which predictors to drop without\r\n% sacrificing the prediction accuracy too much. \r\n%\r\n% Let's see how you can do this with |TreeBagger|. \r\n\r\n%%\r\n% Get the names of variables \r\nvars = Xtrain.Properties.VariableNames;\r\n%%\r\n% Get and sort the variable importance scores.\r\n% Because we set |'oobvarimp'| to |'on'|, the model contains \r\n% |OOBPermutedVarDeltaError|, which acts as the variable importance measure.\r\nvarimp = rfmodel.OOBPermutedVarDeltaError';\r\n[~,idxvarimp]= sort(varimp);\r\nlabels = vars(idxvarimp);\r\n%%\r\n% Plot the sorted scores.\r\nfigure\r\nbarh(varimp(idxvarimp),1); ylim([1 52]);\r\nset(gca, 'YTickLabel',labels, 'YTick',1:numel(labels))\r\ntitle('Variable Importance'); xlabel('score')\r\n\r\n%% Evaluate trade-off with ROC plot\r\n% Now let's examine the trade-off between the number of predictor variables and\r\n% prediction accuracy. The\r\n% <http:\/\/en.wikipedia.org\/wiki\/Receiver_operating_characteristic Receiver\r\n% operating characteristic (ROC)> plot provides a convenient way to\r\n% visualize and compare performance of binary classifiers. You plot the\r\n% false positive rate against the true positive rate at various prediction\r\n% thresholds to produce the curves. If you get a completely random result,\r\n% the curve should follow a diagonal line. If you get 100% accuracy, then\r\n% the curve should hug the upper left corner. 
This means you can use the\r\n% area under the curve (AUC) to evaluate how well each model performs.\r\n%\r\n% Let's plot ROC curves with different sets of predictor variables, using\r\n% the \"C\" class as the positive class, since we can only do this one class\r\n% at a time, and the previous confusion matrix shows more misclassification\r\n% errors for this class than others. You can use\r\n% <https:\/\/www.mathworks.com\/help\/stats\/perfcurve.html |perfcurve|> to\r\n% compute ROC curves.\r\n%\r\n% Check out <https:\/\/blogs.mathworks.com\/images\/loren\/2014\/myROCplot.m\r\n% |myROCplot.m|> to see the details.\r\n\r\nnFeatures = [3,5,10,15,20,25,52];\r\nmyROCplot(Xtrain,Ytrain,Xtest,Ytest,'C',nFeatures,varimp)\r\n\r\n%% The reduced model with 12 features\r\n% Based on the previous analysis, it looks like you can achieve a high\r\n% accuracy rate even if you use as few as 10 features. Let's say we settled\r\n% for 12 features. We now know you don't have to use the data from the\r\n% glove for prediction, so that's one less sensor our hypothetical end\r\n% users would have to buy. Given this result, I may even consider dropping\r\n% the arm band, and just stick with the belt and dumbbell sensors.\r\n%%\r\n% Show which features were included.\r\ndisp(table(varimp(idxvarimp(1:12)),'RowNames',vars(idxvarimp(1:12)),...\r\n    'VariableNames',{'Importance'}));\r\n%%\r\n% Shut down the parallel pool.\r\ndelete(poolobj);\r\n\r\n%% Conclusion and the next steps - integrate your code into your app\r\n% Despite my initial misgivings about the data, we were able to maintain\r\n% high prediction accuracy with a Random Forest model using just 12\r\n% features. However, Random Forest is probably not an ideal model to\r\n% implement on a mobile app given its memory footprint and slow response\r\n% time.\r\n%\r\n% The next step is to find a simpler model, such as \r\n% <https:\/\/www.mathworks.com\/help\/stats\/mnrfit.html logistic regression>,\r\n% that can perform decently. 
You may need to do more preprocessing of the \r\n% data to make it work. \r\n%\r\n% Finally, I have never tried this before, but you could generate C code\r\n% out of MATLAB to incorporate it into a mobile app. Watch this webinar,\r\n% <https:\/\/www.mathworks.com\/videos\/matlab-to-iphone-made-easy-90834.html\r\n% MATLAB to iPhone Made Easy>, for more details. \r\n%\r\n% <<iphoneWebinar.png>>\r\n%\r\n% Do you track your activities with wearable devices? Have you tried\r\n% analyzing the data? Tell us about your experience\r\n% <https:\/\/blogs.mathworks.com\/loren\/?p=945#respond here>!\r\n##### SOURCE END ##### ed798123a2fb4ebea8fda0aeecda6627\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/iphoneWebinar.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>Collecting and tracking health and fitness data with wearable devices is about to go mainstream as the smartphone giants like Apple, Google and Samsung jump into the fray. But if you collect data, what's the point if you don't analyze it?... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2014\/07\/14\/analyzing-fitness-data-from-wearable-devices-in-matlab\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[33,49,39,61],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/945"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=945"}],"version-history":[{"count":8,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/945\/revisions"}],"predecessor-version":[{"id":4725,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/945\/revisions\/4725"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=945"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=945"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=945"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}