{"id":297,"date":"2011-11-21T15:55:34","date_gmt":"2011-11-21T15:55:34","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/2011\/11\/21\/subset-selection-and-regularization\/"},"modified":"2011-11-15T20:16:24","modified_gmt":"2011-11-15T20:16:24","slug":"subset-selection-and-regularization","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2011\/11\/21\/subset-selection-and-regularization\/","title":{"rendered":"Subset Selection and Regularization"},"content":{"rendered":"<div xmlns:mwsh=\"https:\/\/www.mathworks.com\/namespace\/mcode\/v1\/syntaxhighlight.dtd\" class=\"content\">\r\n   <introduction>\r\n      <p><i>This week Richard Willey from technical marketing will be guest blogging about subset selection and regularization.<\/i> This week's blog posting is motivated by a pair of common challenges that occur in applied curve fitting.\r\n      <\/p>\r\n      <div>\r\n         <ul>\r\n            <li>The first challenge is how best to create accurate predictive models when your independent variables exhibit strong correlation.\r\n                In many cases you can improve upon the results of an ordinary least squares regression if you reduce the number of predictors\r\n               or, alternatively, shrink the coefficient values towards zero.\r\n            <\/li>\r\n            <li>The second challenge involves generating an accurate predictive model when you are constrained with respect to the number\r\n               of predictors you are working with.  For example, assume that you need to embed your model onto a controller.  You have 20\r\n               possible predictors to choose from, but you only have enough memory to allow for 8 independent variables.\r\n            <\/li>\r\n         <\/ul>\r\n      <\/div>\r\n      <p>This blog post will show two different sets of techniques to solve these related challenges.  
The first set of techniques\r\n         are based on a combination of feature selection and cross validation.  The second set of techniques use regularization\r\n         algorithms like ridge regression, lasso and the elastic net. All of these algorithms can be found in <a href=\"https:\/\/www.mathworks.com\/products\/statistics\/\">Statistics Toolbox<\/a>.\r\n      <\/p>\r\n   <\/introduction>\r\n   <h3>Contents<\/h3>\r\n   <div>\r\n      <ul>\r\n         <li><a href=\"#1\">Create a Dataset<\/a><\/li>\r\n         <li><a href=\"#13\">Compare the Results<\/a><\/li>\r\n         <li><a href=\"#15\">Perform a Simulation<\/a><\/li>\r\n         <li><a href=\"#18\">Introducing Sequential Feature Selection<\/a><\/li>\r\n         <li><a href=\"#19\">Generate a Data Set<\/a><\/li>\r\n         <li><a href=\"#25\">Run a Simulation<\/a><\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <h3>Create a Dataset<a name=\"1\"><\/a><\/h3>\r\n   <p>To start with, I am going to generate a dataset that is deliberately engineered to be very difficult to fit using traditional\r\n      linear regression.  The dataset has three characteristics that present challenges for linear regression:\r\n   <\/p>\r\n   <div>\r\n      <ol>\r\n         <li>The dataset is very \"wide\".  There are relatively large numbers of (possible) predictors compared to the number of observations.<\/li>\r\n         <li>The predictors are strongly correlated with one another.<\/li>\r\n         <li>There is a large amount of noise in the dataset.<\/li>\r\n      <\/ol>\r\n   <\/div><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">clear <span style=\"color: #A020F0\">all<\/span>\r\nclc\r\nrng(1998);<\/pre><p>Create eight <tt>X<\/tt> variables. The mean of each variable will be equal to zero.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">mu = [0 0 0 0 0 0 0 0];<\/pre><p>The variables are correlated with one another. 
The covariance between predictors <tt>i<\/tt> and <tt>j<\/tt> is specified as 0.5^|i-j|, so the correlation decays as the predictors get farther apart:<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">i = 1:8;\r\nmatrix = abs(bsxfun(@minus,i',i));\r\ncovariance = repmat(.5,8,8).^matrix;<\/pre><p>Use these parameters to generate a set of multivariate normal random numbers.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">X = mvnrnd(mu, covariance, 20);<\/pre><p>Create a hyperplane that describes <tt>Y = f(X)<\/tt> and add in a noise vector. Note that only three of the eight predictors have coefficient values &gt; 0. This means that five\r\n      of the eight predictors have zero impact on the actual model.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">Beta = [3; 1.5; 0; 0; 2; 0; 0; 0];\r\nds = dataset(Beta);\r\nY = X * Beta + 3 * randn(20,1);<\/pre><p>For those who are interested, this example was motivated by Robert <a title=\"http:\/\/www-stat.stanford.edu\/~tibs\/lasso\/lasso.pdf (link no longer works)\">Tibshirani's classic 1996 paper on the lasso<\/a>:\r\n   <\/p>\r\n   <p>Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., Vol. 58, No. 1, pages\r\n      267-288.\r\n   <\/p>\r\n   <p>Let's model <tt>Y = f(X)<\/tt> using ordinary least squares regression.  
We'll compare the coefficients that were estimated by the regression model to the known vector <tt>Beta<\/tt>.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">b = regress(Y,X);\r\nds.Linear = b;\r\ndisp(ds)<\/pre><pre style=\"font-style:oblique\">    Beta    Linear  \r\n      3       3.5819\r\n    1.5      0.44611\r\n      0      0.92272\r\n      0     -0.84134\r\n      2       4.7091\r\n      0      -2.5195\r\n      0      0.17123\r\n      0     -0.42067\r\n<\/pre><p>As expected, the regression model is performing very poorly.<\/p>\r\n   <div>\r\n      <ul>\r\n         <li>The regression is estimating coefficient values for all eight predictors.  Objectively, we \"know\" that five of these coefficients\r\n            should be zero.\r\n         <\/li>\r\n         <li>Furthermore, the coefficients that are being estimated are very far removed from their true values.<\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>Let's see if we can improve on these results using a technique called \"All possible subset regression\". First, I'm going to\r\n      generate 255 different regression models which encompass every possible combination of the eight predictors. Next, I am going\r\n      to measure the accuracy of the resulting regression using the cross validated mean squared error. 
Finally, I'm going to select\r\n      the regression model that produced the most accurate results and compare this to our original regression model.\r\n   <\/p>\r\n   <p>Create an index for the regression subsets.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">index = dec2bin(1:255);\r\nindex = index == <span style=\"color: #A020F0\">'1'<\/span>;\r\nresults = double(index);\r\nresults(:,9) = zeros(length(results),1);<\/pre><p>Generate the regression models.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\"><span style=\"color: #0000FF\">for<\/span> i = 1:length(index)\r\n    foo = index(i,:);\r\n    rng(1981);\r\n    regf = @(XTRAIN, YTRAIN, XTEST)(XTEST*regress(YTRAIN,XTRAIN));\r\n    results(i,9) = crossval(<span style=\"color: #A020F0\">'mse'<\/span>, X(:,foo), Y,<span style=\"color: #A020F0\">'kfold'<\/span>, 5, <span style=\"color: #A020F0\">'predfun'<\/span>,regf);\r\n<span style=\"color: #0000FF\">end<\/span><\/pre><p>Sort the outputs by MSE.  
I'll show you the first few.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">index = sortrows(results, 9);\r\nindex(1:10,:)<\/pre><pre style=\"font-style:oblique\">ans =\r\n  Columns 1 through 5\r\n            1            1            0            0            1\r\n            1            1            0            1            1\r\n            1            0            1            0            1\r\n            1            1            0            0            1\r\n            1            0            0            0            1\r\n            1            1            0            1            1\r\n            1            0            0            1            1\r\n            1            0            0            0            1\r\n            1            0            1            0            1\r\n            1            0            0            1            1\r\n  Columns 6 through 9\r\n            1            0            0       10.239\r\n            1            0            0       10.533\r\n            1            0            0       10.615\r\n            0            0            0       10.976\r\n            1            0            0       11.062\r\n            0            0            0       11.257\r\n            1            0            0       11.641\r\n            0            0            0       11.938\r\n            0            0            0       12.068\r\n            0            0            0       12.249\r\n<\/pre><h3>Compare the Results<a name=\"13\"><\/a><\/h3><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">beta = regress(Y, X(:,logical(index(1,1:8))));\r\nSubset = zeros(8,1);\r\nds.Subset = Subset;\r\nds.Subset(logical(index(1,1:8))) = beta;\r\ndisp(ds)<\/pre><pre style=\"font-style:oblique\">    Beta    Linear      Subset \r\n      3       3.5819     3.4274\r\n    1.5      0.44611     1.1328\r\n      0      0.92272          0\r\n      0     
-0.84134          0\r\n      2       4.7091     4.1048\r\n      0      -2.5195    -2.0063\r\n      0      0.17123          0\r\n      0     -0.42067          0\r\n<\/pre><p>The results are rather striking.<\/p>\r\n   <div>\r\n      <ul>\r\n         <li>Our new model only includes 4 predictors.  Variables 3, 4, 7, and 8 have been eliminated from the regression model.  Furthermore,\r\n            the variables that were excluded are ones that we \"know\" should be zeroed out.\r\n         <\/li>\r\n         <li>Equally significant, the coefficients that are being estimated from the resulting regression model are much closer to the\r\n            \"true\" values.\r\n         <\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>All possible subset regression appears to have generated a significantly better model.  The <tt>R^2<\/tt> value for this regression model isn't as good as the original linear regression; however, if we're trying to generate predictions\r\n      from a new data set, we'd expect this model to perform significantly better.\r\n   <\/p>\r\n   <h3>Perform a Simulation<a name=\"15\"><\/a><\/h3>\r\n   <p>It's always dangerous to rely on the results of a single experiment.  If we were to change either of the noise vectors that\r\n      we used to generate our original dataset we might end up with radically different results. This next block of code will repeat\r\n      our original experiment a hundred times, each time adding in different noise vectors.  Our output will be a bar chart that\r\n      shows how frequently each of the different variables was included in the resulting model.  I'm using parallel computing to
I'm using parallel computing to\r\n      speed things up though you could perform this (more slowly) in serial.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">matlabpool <span style=\"color: #A020F0\">open<\/span>\r\n\r\nsumin = zeros(1,8);\r\n<span style=\"color: #0000FF\">parfor<\/span> j=1:100\r\n    mu = [0 0 0 0 0 0 0 0];\r\n    Beta = [3; 1.5; 0; 0; 2; 0; 0; 0];\r\n    i = 1:8;\r\n    matrix = abs(bsxfun(@minus,i',i));\r\n    covariance = repmat(.5,8,8).^matrix;\r\n    X = mvnrnd(mu, covariance, 20);\r\n    Y = X * Beta + 3 * randn(size(X,1),1);\r\n\r\n    index = dec2bin(1:255);\r\n    index = index == <span style=\"color: #A020F0\">'1'<\/span>;\r\n    results = double(index);\r\n    results(:,9) = zeros(length(results),1);\r\n\r\n    <span style=\"color: #0000FF\">for<\/span> k = 1:length(index)\r\n        foo = index(k,:);\r\n        regf = @(XTRAIN, YTRAIN, XTEST)(XTEST*regress(YTRAIN,XTRAIN));\r\n        results(k,9) = crossval(<span style=\"color: #A020F0\">'mse'<\/span>, X(:,foo), Y,<span style=\"color: #A020F0\">'kfold'<\/span>, 5, <span style=\"color: #0000FF\">...<\/span>\r\n            <span style=\"color: #A020F0\">'predfun'<\/span>,regf);\r\n    <span style=\"color: #0000FF\">end<\/span>\r\n\r\n<span style=\"color: #228B22\">% Sort the outputs by MSE<\/span>\r\nindex = sortrows(results, 9);\r\nsumin = sumin + index(1,1:8);\r\n<span style=\"color: #0000FF\">end<\/span>  <span style=\"color: #228B22\">% parfor<\/span>\r\n\r\nmatlabpool <span style=\"color: #A020F0\">close<\/span><\/pre><pre style=\"font-style:oblique\">Starting matlabpool using the 'local4' configuration ... connected to 3 labs.\r\nSending a stop signal to all the labs ... 
stopped.\r\n<\/pre><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\"><span style=\"color: #228B22\">%Plot results.<\/span>\r\nbar(sumin)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/297\/featureSelection_01.png\"> <p>The results clearly show that \"All possible subset selection\" is doing a really good job identifying the key variables that\r\n      drive our model. Variables 1, 2, and 5 show up in most of the models.  Individual distractors work their way into the models\r\n      as well; however, their frequency is nowhere near as high as that of the real predictors.\r\n   <\/p>\r\n   <h3>Introducing Sequential Feature Selection<a name=\"18\"><\/a><\/h3>\r\n   <p>All possible subset selection suffers from one enormous limitation.  It's computationally infeasible to apply this technique\r\n      to datasets that contain large numbers of (potential) predictors.  Our first example had eight possible predictors.  Testing\r\n      all possible subsets required generating (2^8)-1 = 255 different models.  Suppose that we were evaluating a model with 20 possible\r\n      predictors.  All possible subsets would require us to test 1,048,575 different models.  With 40 predictors we'd be looking\r\n      at over a trillion different combinations.\r\n   <\/p>\r\n   <p>Sequential feature selection is a more modern approach that tries to define a smart path through the search space.  In turn,\r\n      this should allow us to identify a good set of coefficients while ensuring that the problem is still computationally feasible.\r\n   <\/p>\r\n   <p>Sequential feature selection works as follows:<\/p>\r\n   <div>\r\n      <ul>\r\n         <li>Start by testing each possible predictor one at a time.  
Identify the single predictor that generates the most accurate model.\r\n             This predictor is automatically added to the model.\r\n         <\/li>\r\n         <li>Next, one at a time, add each of the remaining predictors to a model that includes the single best variable.  Identify the\r\n            variable that improves the accuracy of the model the most.\r\n         <\/li>\r\n         <li>Test the two models for statistical significance.  If the new model is not significantly more accurate than the original model,\r\n            stop the process.  If, however, the improvement is statistically significant, go on and search for the next best variable.\r\n         <\/li>\r\n         <li>Repeat this process until you can't identify a new variable that has a statistically significant impact on the model.<\/li>\r\n      <\/ul>\r\n   <\/div><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\"><span style=\"color: #228B22\">% This next bit of code will show you how to use sequential feature<\/span>\r\n<span style=\"color: #228B22\">% selection to solve a significantly larger problem.<\/span><\/pre><h3>Generate a Data Set<a name=\"19\"><\/a><\/h3><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">clear <span style=\"color: #A020F0\">all<\/span>\r\nclc\r\nrng(1981);\r\n\r\nZ12 = randn(100,40);\r\nZ1 = randn(100,1);\r\nX = Z12 + repmat(Z1, 1,40);\r\nB0 = zeros(10,1);\r\nB1 = ones(10,1);\r\nBeta = 2 * vertcat(B0,B1,B0,B1);\r\nds = dataset(Beta);\r\n\r\nY = X * Beta + 15*randn(100,1);<\/pre><p>Use linear regression to fit the model.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">b = regress(Y,X);\r\nds.Linear = b;<\/pre><p>Use sequential feature selection to fit the model.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">opts = statset(<span style=\"color: #A020F0\">'display'<\/span>,<span style=\"color: 
#A020F0\">'iter'<\/span>);\r\n\r\nfun = @(x0,y0,x1,y1) norm(y1-x1*(x0\\y0))^2;  <span style=\"color: #228B22\">% residual sum of squares<\/span>\r\n[in,history] = sequentialfs(fun,X,Y,<span style=\"color: #A020F0\">'cv'<\/span>,5, <span style=\"color: #A020F0\">'options'<\/span>,opts)<\/pre><pre style=\"font-style:oblique\">Start forward sequential feature selection:\r\nInitial columns included:  none\r\nColumns that can not be included:  none\r\nStep 1, added column 26, criterion value 868.909\r\nStep 2, added column 31, criterion value 566.166\r\nStep 3, added column 9, criterion value 425.096\r\nStep 4, added column 35, criterion value 369.597\r\nStep 5, added column 16, criterion value 314.445\r\nStep 6, added column 5, criterion value 283.324\r\nStep 7, added column 18, criterion value 269.421\r\nStep 8, added column 15, criterion value 253.78\r\nStep 9, added column 40, criterion value 242.04\r\nStep 10, added column 27, criterion value 234.184\r\nStep 11, added column 12, criterion value 223.271\r\nStep 12, added column 17, criterion value 216.464\r\nStep 13, added column 14, criterion value 212.635\r\nStep 14, added column 22, criterion value 207.885\r\nStep 15, added column 7, criterion value 205.253\r\nStep 16, added column 33, criterion value 204.179\r\nFinal columns included:  5 7 9 12 14 15 16 17 18 22 26 27 31 33 35 40 \r\nin =\r\n  Columns 1 through 12\r\n     0     0     0     0     1     0     1     0     1     0     0     1\r\n  Columns 13 through 24\r\n     0     1     1     1     1     1     0     0     0     1     0     0\r\n  Columns 25 through 36\r\n     0     1     1     0     0     0     1     0     1     0     1     0\r\n  Columns 37 through 40\r\n     0     0     0     1\r\nhistory = \r\n      In: [16x40 logical]\r\n    Crit: [1x16 double]\r\n<\/pre><p>I've configured <tt>sequentialfs<\/tt> to show each iteration of the feature selection process.  
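The criterion handle passed to sequentialfs is worth unpacking. As a minimal sketch (not from the original post; the candidate columns and the fold split below are hypothetical, chosen only for illustration), this is what the handle computes for one candidate column set on one cross-validation fold:

```matlab
% Sketch of what the sequentialfs criterion evaluates for one candidate
% column set on one cross-validation fold. The columns and the fold split
% here are hypothetical; sequentialfs chooses its own partitions.
cols  = [26 31];                  % a hypothetical candidate set of columns
test  = 1:20;                     % illustrative hold-out fold
train = setdiff(1:100, test);     % remaining rows form the training fold

x0 = X(train, cols);  y0 = Y(train);   % training data for this fold
x1 = X(test,  cols);  y1 = Y(test);    % held-out data for this fold

% Same expression as the anonymous function fun above: fit by least
% squares on the training fold, then return the residual sum of squares
% on the held-out fold.
rss = norm(y1 - x1*(x0\y0))^2;
```

Per the sequentialfs documentation, the values returned by the criterion are summed across the folds and divided by the total number of test observations, which is why the criterion printed at each step behaves like a cross-validated mean squared error.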
You can see each of the variables as it gets added to the model,\r\n      along with the change in the cross validated mean squared error.\r\n   <\/p>\r\n   <p>Generate a linear regression using the optimal set of features<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">beta = regress(Y,X(:,in));\r\n\r\nds.Sequential = zeros(length(ds),1);\r\nds.Sequential(in) = beta<\/pre><pre style=\"font-style:oblique\">ds = \r\n    Beta    Linear      Sequential\r\n    0       -0.74167          0   \r\n    0       -0.66373          0   \r\n    0         1.2025          0   \r\n    0        -1.7087          0   \r\n    0         1.6493     2.5441   \r\n    0         1.3721          0   \r\n    0        -1.6411    -1.1617   \r\n    0        -3.5482          0   \r\n    0         4.1643     2.9544   \r\n    0        -1.9067          0   \r\n    2        0.28763          0   \r\n    2         1.7454     2.9983   \r\n    2         0.1005          0   \r\n    2         2.2226     3.5806   \r\n    2         4.4912     4.0527   \r\n    2         4.7129     5.3879   \r\n    2         0.9606     1.9982   \r\n    2         5.9834     5.1789   \r\n    2         3.9699          0   \r\n    2        0.94454          0   \r\n    0          2.861          0   \r\n    0        -2.1236    -2.4748   \r\n    0        -1.6324          0   \r\n    0         1.8677          0   \r\n    0       -0.51398          0   \r\n    0         3.6856     4.0398   \r\n    0        -3.3103    -4.1396   \r\n    0       -0.98227          0   \r\n    0         1.4756          0   \r\n    0       0.026193          0   \r\n    2         2.9373     3.8792   \r\n    2        0.63056          0   \r\n    2         2.3509     1.9472   \r\n    2         0.8035          0   \r\n    2         2.9242       4.35   \r\n    2         0.4794          0   \r\n    2        -0.6632          0   \r\n    2         0.6066          0   \r\n    2       -0.70758          0   \r\n    2         
4.4814     3.8164   \r\n<\/pre><p>As you can see, <tt>sequentialfs<\/tt> has generated a much more parsimonious model than the ordinary least squares regression.  We also have a model that should\r\n      generate significantly more accurate predictions.  Moreover, if we needed to (hypothetically) extract the five most important\r\n      variables we could simply grab the output from Steps 1:5.\r\n   <\/p>\r\n   <h3>Run a Simulation<a name=\"25\"><\/a><\/h3>\r\n   <p>As a last step, we'll once again run a simulation to benchmark the technique.  Here we can see that the feature\r\n      selection technique is doing a very good job identifying the true predictors while screening out the distractors.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">matlabpool <span style=\"color: #A020F0\">open<\/span>\r\n\r\nsumin = 0;\r\n<span style=\"color: #0000FF\">parfor<\/span> j=1:100\r\n    Z12 = randn(100,40);\r\n    Z1 = randn(100,1);\r\n    X = Z12 + repmat(Z1, 1,40);\r\n    B0 = zeros(10,1);\r\n    B1 = ones(10,1);\r\n    Beta = 2 * vertcat(B0,B1,B0,B1);\r\n    Y = X * Beta + 15*randn(100,1);\r\n\r\n    fun = @(x0,y0,x1,y1) norm(y1-x1*(x0\\y0))^2;  <span style=\"color: #228B22\">% residual sum of squares<\/span>\r\n    [in,history] = sequentialfs(fun,X,Y,<span style=\"color: #A020F0\">'cv'<\/span>,5);\r\n\r\n    sumin = in + sumin;\r\n<span style=\"color: #0000FF\">end<\/span>\r\n\r\nbar(sumin)\r\n\r\nmatlabpool <span style=\"color: #A020F0\">close<\/span><\/pre><pre style=\"font-style:oblique\">Starting matlabpool using the 'local4' configuration ... connected to 3 labs.\r\nSending a stop signal to all the labs ... stopped.\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/297\/featureSelection_02.png\"> <p>One last comment:  As you can see, neither <tt>sequentialfs<\/tt> nor all possible subset selection is perfect.  
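The scaling advantage discussed earlier can be put in concrete numbers. This quick count (a sketch added here, not code from the original post) compares how many candidate models each approach fits for n predictors, ignoring the constant factor contributed by 5-fold cross validation:

```matlab
% Rough count of candidate models evaluated by each search strategy,
% ignoring the constant factor from k-fold cross validation.
n = [8 20 40];

% All possible subsets: every non-empty subset of the n predictors.
allSubsets = 2.^n - 1;        % 255, 1048575, and roughly 1.1e12

% Forward selection: at most n candidate fits in round 1, n-1 in round 2,
% and so on, for a worst case of n + (n-1) + ... + 1 = n*(n+1)/2 fits.
sequential = n.*(n+1)/2;      % 36, 210, and 820

disp([n' allSubsets' sequential'])
```

Even with 40 predictors, the sequential search stays in the hundreds of model fits, which is why it remains feasible where the exhaustive search does not.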
Neither of these techniques was able to capture all of the true variables\r\n      while excluding the distractors.  It's important to recognize that these data sets were deliberately designed to be difficult\r\n      to fit.  [We don't want to benchmark performance using \"easy\" problems where any old technique might succeed.]\r\n   <\/p>\r\n   <p>Next week's blog post will focus on a radically different way to solve these same types of problems.  Stay tuned for a look\r\n      at regularization techniques such as ridge regression, lasso and the elastic net.\r\n   <\/p>\r\n   <p>Here are a couple of questions that you might want to consider:<\/p>\r\n   <div>\r\n      <ol>\r\n         <li>All possible subset methods are O(2^n).  What is the order of complexity of sequential feature selection?<\/li>\r\n         <li>Other than scaling issues, can you think of any possible problems with sequential feature selection?  Are there situations\r\n            where the algorithm might miss important variables?<\/li>\r\n      <\/ol>\r\n   <\/div><script language=\"JavaScript\">\r\n<!--\r\n\r\n    function grabCode_e359acae97de460f9c0dbca9011fcebe() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='e359acae97de460f9c0dbca9011fcebe ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' e359acae97de460f9c0dbca9011fcebe';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML 
parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. \r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        author = 'Richard Willey';\r\n        copyright = 'Copyright 2011 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add author and copyright lines at the bottom if specified.\r\n        if ((author.length > 0) || (copyright.length > 0)) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (author.length > 0) {\r\n                d.writeln('% _' + author + '_');\r\n            }\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n      \r\n      d.title = title + ' (MATLAB code)';\r\n      d.close();\r\n      }   \r\n      \r\n-->\r\n<\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_e359acae97de460f9c0dbca9011fcebe()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n            the MATLAB code \r\n            <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; 7.13<br><\/p>\r\n<\/div>\r\n<!--\r\ne359acae97de460f9c0dbca9011fcebe ##### SOURCE BEGIN #####\r\n%% Subset Selection and Regularization  \r\n% _This week Richard Willey from technical marketing will be guest blogging\r\n% about subset selection and regularization._ This week's blog posting is\r\n% motivated by a pair of common challenges that occur in applied curve\r\n% fitting.\r\n%\r\n% * The first challenge is how best to create accurate predictive models when\r\n% your independent variables exhibit strong correlation.  
In many cases you\r\n% can improve upon the results of an ordinary least squares regression if\r\n% you reduce the number of predictors or, alternatively, shrink the\r\n% coefficient values towards zero.\r\n% * The second challenge involves generating an accurate predictive model\r\n% when you are constrained with respect to the number of predictors you are\r\n% working with.  For example, assume that you need to embed your model onto\r\n% a controller.  You have 20 possible predictors to choose from, but you\r\n% only have enough memory to allow for 8 independent variables.\r\n%\r\n% This blog post will show two different sets of techniques to solve \r\n% these related challenges.  The first set of techniques are based on a\r\n% combination of feature selection and cross validation.  The second set of\r\n% techniques use regularization algorithms like ridge regression, lasso\r\n% and the elastic net. All of these algorithms can be found in \r\n% <https:\/\/www.mathworks.com\/products\/statistics\/ Statistics Toolbox>.\r\n\r\n%% Create a Dataset\r\n% To start with, I am going to generate a dataset that is deliberately\r\n% engineered to be very difficult to fit using traditional linear\r\n% regression.  The dataset has three characteristics that present\r\n% challenges for linear regression:\r\n%\r\n% # The dataset is very \"wide\".  There are relatively large numbers of\r\n% (possible) predictors compared to the number of observations.\r\n% # The predictors are strongly correlated with one another.\r\n% # There is a large amount of noise in the dataset.\r\n\r\nclear all\r\nclc\r\nrng(1998);\r\n\r\n%%\r\n% Create eight |X| variables. The mean of each variable will be equal to\r\n% zero.\r\nmu = [0 0 0 0 0 0 0 0];\r\n\r\n%%\r\n% The variables are correlated with one another. 
The covariance between predictors i and j is\r\n% specified as 0.5^|i-j|, so the correlation decays as the predictors get\r\n% farther apart:\r\ni = 1:8;\r\nmatrix = abs(bsxfun(@minus,i',i));\r\ncovariance = repmat(.5,8,8).^matrix;\r\n\r\n%%\r\n% Use these parameters to generate a set of multivariate normal \r\n% random numbers.\r\nX = mvnrnd(mu, covariance, 20);\r\n\r\n%%\r\n% Create a hyperplane that describes |Y = f(X)| and add in a noise vector.\r\n% Note that only three of the eight predictors have coefficient values > 0.\r\n% This means that five of the eight predictors have zero impact on the\r\n% actual model.\r\nBeta = [3; 1.5; 0; 0; 2; 0; 0; 0];\r\nds = dataset(Beta);\r\nY = X * Beta + 3 * randn(20,1);\r\n\r\n%%\r\n% For those who are interested, this example was motivated by Robert\r\n% <http:\/\/www-stat.stanford.edu\/~tibs\/lasso\/lasso.pdf Tibshirani's classic\r\n% 1996 paper on the lasso>: \r\n%   \r\n% Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. \r\n% J. Royal. Statist. Soc B., Vol. 58, No. 1, pages 267-288.\r\n\r\n%% \r\n% Let's model |Y = f(X)| using ordinary least squares regression.  We'll\r\n% compare the coefficients that were estimated by the regression model to\r\n% the known vector |Beta|.\r\n\r\nb = regress(Y,X);\r\nds.Linear = b;\r\ndisp(ds)\r\n%%\r\n% As expected, the regression model is performing very poorly.  \r\n%\r\n% * The regression is estimating coefficient values for all eight\r\n% predictors.  Objectively, we \"know\" that five of these coefficients\r\n% should be zero.\r\n% * Furthermore, the coefficients that are being estimated are very far\r\n% removed from their true values.\r\n%%\r\n% Let's see if we can improve on these results using a technique called\r\n% \"All possible subset regression\". First, I'm going to generate 255\r\n% different regression models which encompass every possible combination of\r\n% the eight predictors. 
Next, I am going to measure the accuracy of the\r\n% resulting regression using the cross validated mean squared error.\r\n% Finally, I'm going to select the regression model that produced the most\r\n% accurate results and compare this to our original regression model.\r\n%%\r\n% Create an index for the regression subsets.\r\nindex = dec2bin(1:255);\r\nindex = index == '1';\r\nresults = double(index);\r\nresults(:,9) = zeros(length(results),1);\r\n%%\r\n% Generate the regression models.\r\nfor i = 1:length(index)\r\n    foo = index(i,:);\r\n    rng(1981);\r\n    regf = @(XTRAIN, YTRAIN, XTEST)(XTEST*regress(YTRAIN,XTRAIN));\r\n    results(i,9) = crossval('mse', X(:,foo), Y,'kfold', 5, 'predfun',regf);\r\nend\r\n%%\r\n% Sort the outputs by MSE.  I'll show you the first few.\r\nindex = sortrows(results, 9);\r\nindex(1:10,:)\r\n\r\n%% Compare the Results\r\n\r\nbeta = regress(Y, X(:,logical(index(1,1:8))));\r\nSubset = zeros(8,1);\r\nds.Subset = Subset;\r\nds.Subset(logical(index(1,1:8))) = beta;\r\ndisp(ds)\r\n\r\n%%\r\n% The results are rather striking.\r\n%\r\n% * Our new model only includes 4 predictors.  Variables 3, 4, 7, and 8 have\r\n% been eliminated from the regression model.  Furthermore, the variables\r\n% that were excluded are ones that we \"know\" should be zeroed out.\r\n% * Equally significant, the coefficients that are being estimated from the\r\n% resulting regression model are much closer to the \"true\" values.\r\n%\r\n% All possible subset regression appears to have generated a significantly\r\n% better model.  The |R^2| value for this regression model isn't as good\r\n% as the original linear regression; however, if we're trying to generate\r\n% predictions from a new data set, we'd expect this model to perform\r\n% significantly better.\r\n\r\n%% Perform a Simulation\r\n% It's always dangerous to rely on the results of a single experiment.  
If\r\n% we were to change either of the noise vectors that we used to generate\r\n% our original dataset, we might end up with radically different results.\r\n% This next block of code will repeat our original experiment a hundred\r\n% times, each time adding in different noise vectors.  Our output will be a\r\n% bar chart that shows how frequently each of the different variables was\r\n% included in the resulting model.  I'm using parallel computing to speed\r\n% things up, though you could perform this (more slowly) in serial.\r\n\r\nmatlabpool open\r\n\r\nsumin = zeros(1,8);\r\nparfor j=1:100\r\n    mu = [0 0 0 0 0 0 0 0];\r\n    Beta = [3; 1.5; 0; 0; 2; 0; 0; 0];\r\n    i = 1:8;\r\n    matrix = abs(bsxfun(@minus,i',i));\r\n    covariance = repmat(.5,8,8).^matrix;\r\n    X = mvnrnd(mu, covariance, 20);\r\n    Y = X * Beta + 3 * randn(size(X,1),1);\r\n\r\n    index = dec2bin(1:255);\r\n    index = index == '1';\r\n    results = double(index);\r\n    results(:,9) = zeros(length(results),1);\r\n\r\n    for k = 1:length(index)\r\n        foo = index(k,:);\r\n        regf = @(XTRAIN, YTRAIN, XTEST)(XTEST*regress(YTRAIN,XTRAIN));\r\n        results(k,9) = crossval('mse', X(:,foo), Y,'kfold', 5, ...\r\n            'predfun',regf);\r\n    end\r\n\r\n    % Sort the outputs by MSE.\r\n    index = sortrows(results, 9);\r\n    sumin = sumin + index(1,1:8);\r\nend  % parfor\r\n\r\nmatlabpool close\r\n%%\r\n% Plot the results.\r\nbar(sumin)\r\n%%\r\n% The results clearly show that \"All possible subset selection\" is doing a\r\n% really good job identifying the key variables that drive our model.\r\n% Variables 1, 2, and 5 show up in most of the models.  Individual\r\n% distractors work their way into the models as well; however, their\r\n% frequency is nowhere near as high as that of the real predictors.\r\n\r\n%% Introducing Sequential Feature Selection\r\n% All possible subset selection suffers from one enormous limitation.  
It's\r\n% computationally infeasible to apply this technique to datasets that\r\n% contain large numbers of (potential) predictors.  Our first example had\r\n% eight possible predictors.  Testing all possible subsets required\r\n% generating (2^8)-1 = 255 different models.  Suppose that we were evaluating\r\n% a model with 20 possible predictors.  All possible subsets would require\r\n% us to test 1,048,575 different models.  With 40 predictors we'd be looking\r\n% at over a trillion different combinations.\r\n%\r\n% Sequential feature selection is a more modern approach that tries to\r\n% define a smart path through the search space.  In turn, this should allow\r\n% us to identify a good subset of predictors while ensuring that the\r\n% problem is still computationally feasible.\r\n%\r\n% Sequential feature selection works as follows:\r\n%\r\n% * Start by testing each possible predictor one at a time.  Identify the\r\n% single predictor that generates the most accurate model.  This predictor\r\n% is automatically added to the model.\r\n% * Next, one at a time, add each of the remaining predictors to a model\r\n% that includes the single best variable.  Identify the variable that\r\n% improves the accuracy of the model the most.\r\n% * Test the two models for statistical significance.  If the new model is\r\n% not significantly more accurate than the original model, stop the\r\n% process.  
If, however, the new model is significantly more accurate,\r\n% go and search for the third-best variable.\r\n% * Repeat this process until you can't identify a new variable that has a\r\n% statistically significant impact on the model.\r\n\r\n% This next bit of code will show you how to use sequential feature\r\n% selection to solve a significantly larger problem.\r\n\r\n%% Generate a Data Set\r\n\r\nclear all\r\nclc\r\nrng(1981);\r\n\r\nZ12 = randn(100,40);\r\nZ1 = randn(100,1);\r\nX = Z12 + repmat(Z1, 1,40);\r\nB0 = zeros(10,1);\r\nB1 = ones(10,1);\r\nBeta = 2 * vertcat(B0,B1,B0,B1);\r\nds = dataset(Beta);\r\n\r\nY = X * Beta + 15*randn(100,1);\r\n%%\r\n% Use linear regression to fit the model.\r\nb = regress(Y,X);\r\nds.Linear = b;\r\n\r\n%%\r\n% Use sequential feature selection to fit the model.\r\nopts = statset('display','iter');\r\n\r\nfun = @(x0,y0,x1,y1) norm(y1-x1*(x0\\y0))^2;  % residual sum of squares\r\n[in,history] = sequentialfs(fun,X,Y,'cv',5, 'options',opts)\r\n\r\n%%\r\n% I've configured |sequentialfs| to show each iteration of the feature\r\n% selection process.  You can see each of the variables as it gets added to\r\n% the model, along with the change in the cross-validated mean squared\r\n% error.\r\n\r\n%%\r\n% Generate a linear regression using the optimal set of features.\r\nbeta = regress(Y,X(:,in));\r\n\r\nds.Sequential = zeros(length(ds),1);\r\nds.Sequential(in) = beta\r\n%%\r\n% As you can see, |sequentialfs| has generated a much more parsimonious\r\n% model than the ordinary least squares regression.  We also have a model\r\n% that should generate significantly more accurate predictions.  Moreover,\r\n% if we needed to (hypothetically) extract the five most important\r\n% variables, we could simply grab the variables selected in the first five\r\n% steps.\r\n\r\n%% Run a Simulation\r\n% As a last step, we'll once again run a simulation to benchmark the\r\n% technique.  
Here, once again, we can see that the feature selection\r\n% technique is doing a very good job identifying the true predictors while\r\n% screening out the distractors.\r\n\r\nmatlabpool open\r\n\r\nsumin = 0;\r\nparfor j=1:100   \r\n    Z12 = randn(100,40);\r\n    Z1 = randn(100,1);\r\n    X = Z12 + repmat(Z1, 1,40);\r\n    B0 = zeros(10,1);\r\n    B1 = ones(10,1);\r\n    Beta = 2 * vertcat(B0,B1,B0,B1);\r\n    Y = X * Beta + 15*randn(100,1);\r\n    \r\n    fun = @(x0,y0,x1,y1) norm(y1-x1*(x0\\y0))^2;  % residual sum of squares\r\n    [in,history] = sequentialfs(fun,X,Y,'cv',5);\r\n    \r\n    sumin = in + sumin;\r\nend\r\n\r\nbar(sumin)\r\n\r\nmatlabpool close\r\n%%\r\n% One last comment:  As you can see, neither |sequentialfs| nor all\r\n% possible subset selection is perfect.  Neither of these techniques was\r\n% able to capture all of the true variables while excluding the\r\n% distractors.  It's important to recognize that these data sets were\r\n% deliberately designed to be difficult to fit.  [We don't want to\r\n% benchmark performance using \"easy\" problems where any old technique might\r\n% succeed.]\r\n%%\r\n% Next week's blog post will focus on a radically different way to solve\r\n% these same types of problems.  Stay tuned for a look at regularization\r\n% techniques such as ridge regression, lasso and the elastic net.\r\n%%\r\n% Here are a couple questions that you might want to consider:\r\n% # All possible subset methods are O(2^n).  What is the order of \r\n% complexity of sequential feature selection?\r\n% # Other than scaling issues, can you think of any possible problems with\r\n% sequential feature selection?  Are there situations where the algorithm\r\n% might miss important variables?\r\n\r\n\r\n##### SOURCE END ##### e359acae97de460f9c0dbca9011fcebe\r\n-->","protected":false},"excerpt":{"rendered":"<p>\r\n   \r\n      This week Richard Willey from technical marketing will be guest blogging about subset selection and regularization. 
This week's blog posting is motivated by a pair of common challenges... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2011\/11\/21\/subset-selection-and-regularization\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[47,48],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/297"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=297"}],"version-history":[{"count":0,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/297\/revisions"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=297"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=297"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=297"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}