{"id":298,"date":"2011-11-29T14:41:05","date_gmt":"2011-11-29T14:41:05","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/2011\/11\/29\/subset-selection-and-regularization-part-2\/"},"modified":"2016-08-04T08:45:46","modified_gmt":"2016-08-04T13:45:46","slug":"subset-selection-and-regularization-part-2","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2011\/11\/29\/subset-selection-and-regularization-part-2\/","title":{"rendered":"Subset Selection and Regularization (Part 2)"},"content":{"rendered":"<div xmlns:mwsh=\"https:\/\/www.mathworks.com\/namespace\/mcode\/v1\/syntaxhighlight.dtd\" class=\"content\">\r\n   <introduction>\r\n      <p><i>This week Richard Willey from technical marketing will finish his two part presentation on subset selection and regularization.<\/i><\/p>\r\n      <p>In a <a href=\"https:\/\/blogs.mathworks.com\/loren\/2011\/11\/21\/subset-selection-and-regularization\/\">recent posting<\/a>, we examined how to use sequential feature selection to improve predictive accuracy when modeling wide data sets with highly\r\n         correlated variables.  This week, we're going to solve the same problems using regularization algorithms such as lasso, the\r\n         elastic net, and ridge regression.  Mathematically, these algorithms work by penalizing the size of the regression coefficients\r\n         in the model.\r\n      <\/p>\r\n      <p>Standard linear regression works by estimating a set of coefficients that minimize the sum of the squared error between the\r\n         observed values and the fitted values from the model.  
Regularization techniques like ridge regression, lasso, and the elastic\r\n         net introduce an additional term to this minimization problem.\r\n      <\/p>\r\n      <div>\r\n         <ul>\r\n            <li>Ridge regression identifies a set of regression coefficients that minimize the sum of the squared errors plus the sum of the\r\n               squared regression coefficients multiplied by a weight parameter <img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_eq40602.png\"> . <img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_eq40602.png\">  can take any nonnegative value.  A <img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_eq40602.png\">  value of zero is equivalent to a standard linear regression.  As <img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_eq40602.png\">  increases in size, regression coefficients shrink towards zero.\r\n            <\/li>\r\n            <li>Lasso minimizes the sum of the squared errors plus the sum of the absolute values of the regression coefficients.<\/li>\r\n            <li>The elastic net is a weighted average of the lasso and the ridge solutions.<\/li>\r\n         <\/ul>\r\n      <\/div>\r\n      <p>The introduction of this additional term forces the regression coefficients towards zero, generating a simpler model with greater\r\n         predictive accuracy.\r\n      <\/p>\r\n      <p>Let's see regularization in action by using lasso to solve the same problem we looked at last week.<\/p>\r\n   <\/introduction>\r\n   <h3>Contents<\/h3>\r\n   <div>\r\n      <ul>\r\n         <li><a href=\"#1\">Recreate Data Set 1 from the Previous Post<\/a><\/li>\r\n         <li><a href=\"#2\">Use Lasso to Fit the Model<\/a><\/li>\r\n         <li><a 
href=\"#4\">Create a Plot Showing Mean Square Error Versus Lambda<\/a><\/li>\r\n         <li><a href=\"#5\">Use the Stats Structure to Extract a Set of Model Coefficients.<\/a><\/li>\r\n         <li><a href=\"#6\">Run a Simulation<\/a><\/li>\r\n         <li><a href=\"#8\">Choosing the Best Technique<\/a><\/li>\r\n         <li><a href=\"#9\">Conclusion<\/a><\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <h3>Recreate Data Set 1 from the Previous Post<a name=\"1\"><\/a><\/h3><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">clear <span style=\"color: #A020F0\">all<\/span>\r\nclc\r\nrng(1998);\r\nmu = [0 0 0 0 0 0 0 0];\r\ni = 1:8;\r\nmatrix = abs(bsxfun(@minus,i',i));\r\ncovariance = repmat(.5,8,8).^matrix;\r\nX = mvnrnd(mu, covariance, 20);\r\nBeta = [3; 1.5; 0; 0; 2; 0; 0; 0];\r\nds = dataset(Beta);\r\nY = X * Beta + 3 * randn(20,1);\r\nb = regress(Y,X);\r\nds.Linear = b;<\/pre><h3>Use Lasso to Fit the Model<a name=\"2\"><\/a><\/h3>\r\n   <p>The syntax for the <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011b\/toolbox\/stats\/lasso.html\"><tt>lasso<\/tt><\/a> command is very similar to that used by linear regression. In this line of code, I am going to estimate a set of coefficients\r\n      <tt>B<\/tt> that models <tt>Y<\/tt> as a function of <tt>X<\/tt>.  To avoid overfitting, I'm going to apply five-fold cross validation.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">[B Stats] = lasso(X,Y, <span style=\"color: #A020F0\">'CV'<\/span>, 5);<\/pre><p>When we perform a linear regression, we generate a single set of regression coefficients.  By default <tt>lasso<\/tt> will create 100 different models.  Each model is estimated using a slightly larger <img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_eq40602.png\"> . All of the model coefficients are stored in array <tt>B<\/tt>.  
The rest of the information about the model is stored in a structure named <tt>Stats<\/tt>.\r\n   <\/p>\r\n   <p>Let's look at the first five sets of coefficients inside of <tt>B<\/tt>.  As you move along each row you can see that as <img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_eq40602.png\">  increases, the model coefficients generally shrink towards zero.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">disp(B(:,1:5))\r\ndisp(Stats)<\/pre><pre style=\"font-style:oblique\">       3.9147       3.9146       3.9145       3.9143       3.9142\r\n      0.13502      0.13498      0.13494      0.13488      0.13482\r\n      0.85283      0.85273      0.85262      0.85247      0.85232\r\n     -0.92775     -0.92723      -0.9267       -0.926     -0.92525\r\n       3.9415       3.9409       3.9404       3.9397       3.9389\r\n      -2.2945       -2.294      -2.2936       -2.293      -2.2924\r\n       1.3566       1.3567       1.3568       1.3569       1.3571\r\n     -0.14796     -0.14803      -0.1481     -0.14821     -0.14833\r\n         Intercept: [1x100 double]\r\n            Lambda: [1x100 double]\r\n             Alpha: 1\r\n                DF: [1x100 double]\r\n               MSE: [1x100 double]\r\n    PredictorNames: {}\r\n                SE: [1x100 double]\r\n      LambdaMinMSE: 0.585\r\n         Lambda1SE: 1.6278\r\n       IndexMinMSE: 78\r\n          Index1SE: 89\r\n<\/pre><h3>Create a Plot Showing Mean Square Error Versus Lambda<a name=\"4\"><\/a><\/h3>\r\n   <p>The natural question to ask at this point is \"OK, which of these 100 different models should I use?\"  We can answer\r\n      that question using <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011b\/toolbox\/stats\/lassoplot.html\"><tt>lassoPlot<\/tt><\/a>. 
<tt>lassoPlot<\/tt> generates a plot that displays the relationship between <img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_eq40602.png\">  and the cross validated mean square error (MSE) of the resulting model.  Each red dot shows the MSE for one of the\r\n      models.  The vertical line segments stretching out from each dot are error bars for each estimate.\r\n   <\/p>\r\n   <p>You can also see a pair of vertical dashed lines.  The line on the right identifies the <img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_eq40602.png\">  value that minimizes the cross validated MSE. The line on the left indicates the highest value of <img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_eq40602.png\">  whose MSE is within one standard error of the minimum MSE.  In general, people will choose the <img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_eq40602.png\">  that minimizes the MSE.  
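Both of these values are reported in the <tt>Stats<\/tt> structure shown above, so you can retrieve them programmatically (the variable names below are just illustrative):\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">minMSELambda = Stats.Lambda(Stats.IndexMinMSE)   <span style=\"color: #228B22\">% identical to Stats.LambdaMinMSE<\/span>\r\noneSELambda  = Stats.Lambda(Stats.Index1SE)      <span style=\"color: #228B22\">% identical to Stats.Lambda1SE<\/span><\/pre><p>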
On occasion, if a more parsimonious model is considered particularly advantageous, a user might\r\n      choose some other <img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_eq40602.png\">  value that falls between the two line segments.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">lassoPlot(B, Stats, <span style=\"color: #A020F0\">'PlotType'<\/span>, <span style=\"color: #A020F0\">'CV'<\/span>)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_01.png\"> <h3>Use the Stats Structure to Extract a Set of Model Coefficients.<a name=\"5\"><\/a><\/h3>\r\n   <p>The <img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/298\/Regularization_eq40602.png\">  value that minimizes the MSE is stored in the <tt>Stats<\/tt> structure.  You can use this information to index into <tt>B<\/tt> and extract the set of coefficients that minimize the MSE.\r\n   <\/p>\r\n   <p>Much as in the feature selection example, we can see that the lasso algorithm has eliminated four of the five distractors\r\n      from the resulting model.  
This new, more parsimonious model will be significantly more accurate for prediction than a standard\r\n      linear regression.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">ds.Lasso = B(:,Stats.IndexMinMSE);\r\ndisp(ds)<\/pre><pre style=\"font-style:oblique\">    Beta    Linear      Lasso   \r\n      3       3.5819      3.0591\r\n    1.5      0.44611      0.3811\r\n      0      0.92272    0.024131\r\n      0     -0.84134           0\r\n      2       4.7091      1.5654\r\n      0      -2.5195           0\r\n      0      0.17123      1.3499\r\n      0     -0.42067           0\r\n<\/pre><h3>Run a Simulation<a name=\"6\"><\/a><\/h3>\r\n   <p>Here, once again, it's very dangerous to base any kind of analysis on a single observation.  Let's use a simulation to compare\r\n      the accuracy of a linear regression with the lasso.  We'll start by preallocating some variables.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">MSE = zeros(100,1);\r\nmse = zeros(100,1);\r\nCoeff_Num = zeros(100,1);\r\nBetas = zeros(8,100);\r\ncv_Reg_MSE = zeros(1,100);<\/pre><p>Next, we'll generate 100 different models and estimate the number of coefficients contained in the lasso model as well as\r\n      the difference in the cross validated MSE between a standard linear regression and the lasso model.\r\n   <\/p>\r\n   <p>As you can see, on average, the lasso model only contains 4.5 terms (the standard linear regression model includes 8).  More\r\n      importantly, the cross validated MSE for the linear regression model is about 30% larger than that generated from the <tt>lasso<\/tt>.  This is an incredibly powerful result.  
The <tt>lasso<\/tt> algorithm is every bit as easy to apply as standard linear regression; however, it offers significant improvements in predictive\r\n      accuracy compared to regression.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">rng(1998);\r\n\r\n<span style=\"color: #0000FF\">for<\/span> i = 1 : 100\r\n\r\n    X = mvnrnd(mu, covariance, 20);\r\n    Y = X * Beta + randn(20,1);\r\n\r\n    [B Stats] = lasso(X,Y, <span style=\"color: #A020F0\">'CV'<\/span>, 5);\r\n    Betas(:,i) = B(:,Stats.IndexMinMSE) ~= 0;\r\n    Coeff_Num(i) = sum(B(:,Stats.IndexMinMSE) ~= 0);\r\n    MSE(i) = Stats.MSE(:, Stats.IndexMinMSE);\r\n\r\n    regf = @(XTRAIN, ytrain, XTEST)(XTEST*regress(ytrain,XTRAIN));\r\n    cv_Reg_MSE(i) = crossval(<span style=\"color: #A020F0\">'mse'<\/span>,X,Y,<span style=\"color: #A020F0\">'predfun'<\/span>,regf, <span style=\"color: #A020F0\">'kfold'<\/span>, 5);\r\n\r\n<span style=\"color: #0000FF\">end<\/span>\r\n\r\nNumber_Lasso_Coefficients = mean(Coeff_Num);\r\ndisp(Number_Lasso_Coefficients)\r\n\r\nMSE_Ratio = median(cv_Reg_MSE)\/median(MSE);\r\ndisp(MSE_Ratio)<\/pre><pre style=\"font-style:oblique\">         4.57\r\n       1.2831\r\n<\/pre><h3>Choosing the Best Technique<a name=\"8\"><\/a><\/h3>\r\n   <p>Regularization methods and feature selection techniques both have unique strengths and weaknesses.  Let's close this blog\r\n      posting with some practical guidance regarding pros and cons for the various techniques.\r\n   <\/p>\r\n   <p>Regularization techniques have two major advantages compared to feature selection.<\/p>\r\n   <div>\r\n      <ul>\r\n         <li>Regularization techniques are able to operate on much larger datasets than feature selection methods.  Lasso and ridge regression\r\n            can be applied to datasets that contain thousands - even tens of thousands - of variables.  
Even sequential feature selection\r\n            is usually too slow to cope with this many possible predictors.\r\n         <\/li>\r\n         <li>Regularization algorithms often generate more accurate predictive models than feature selection.  Regularization operates\r\n            over a continuous space while feature selection operates over a discrete space. As a result, regularization is often able\r\n            to fine-tune the model and produce more accurate estimates.\r\n         <\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>However, feature selection methods also have their advantages:<\/p>\r\n   <div>\r\n      <ul>\r\n         <li>Regularization techniques are only available for a small number of model types.  Notably, regularization can be applied to\r\n            linear regression and logistic regression.  However, if you're working with some other modeling technique - say, a boosted decision\r\n            tree - you'll typically need to apply feature selection techniques.\r\n         <\/li>\r\n         <li>Feature selection is easier to understand and explain to third parties. Never underestimate the importance of being able to\r\n            describe your methods when sharing your results.\r\n         <\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>With this said and done, each of the three regularization techniques also offers its own unique advantages and disadvantages.<\/p>\r\n   <div>\r\n      <ul>\r\n         <li>Because lasso uses an L1 norm, it tends to force individual coefficient values completely to zero.  As a result, lasso\r\n            works very well as a feature selection algorithm.  It quickly identifies a small number of key variables.\r\n         <\/li>\r\n         <li>In contrast, ridge regression uses an L2 norm for the coefficients (the penalty term is the sum of the squared coefficients).  Ridge\r\n            regression tends to spread coefficient shrinkage across a larger number of coefficients.  
If you think that your model should\r\n            contain a large number of coefficients, ridge regression is probably a better choice than lasso.\r\n         <\/li>\r\n         <li>Last, but not least, we have the elastic net, which is able to compensate for a very specific limitation of lasso.  Lasso is\r\n            unable to identify more predictors than you have observations.\r\n         <\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>Let's assume that you are running a cancer research study.<\/p>\r\n   <div>\r\n      <ul>\r\n         <li>You have gene sequences for 500 different cancer patients<\/li>\r\n         <li>You're trying to determine which of 15,000 different genes have a significant impact on the progression of the disease.<\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>Sequential feature selection is completely impractical with this many different variables.  You can't use ridge regression\r\n      because it won't force coefficients all the way to zero. At the same time, you can't use lasso since you might\r\n      need to identify more than 500 different genes.  
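One way out is to blend the two penalties: the same <tt>lasso<\/tt> command fits such a model when you set its <tt>'Alpha'<\/tt> parameter strictly between 0 and 1 (the 0.5 below is just an illustrative choice).\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">[B Stats] = lasso(X, Y, <span style=\"color: #A020F0\">'Alpha'<\/span>, 0.5, <span style=\"color: #A020F0\">'CV'<\/span>, 5);   <span style=\"color: #228B22\">% mixes the L1 and L2 penalties<\/span><\/pre><p>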
The elastic net is one possible solution.\r\n   <\/p>\r\n   <h3>Conclusion<a name=\"9\"><\/a><\/h3>\r\n   <p>If you'd like more information on this topic, there is a MathWorks webinar titled <a href=\"https:\/\/www.mathworks.com\/company\/events\/webinars\/wbnr59911.html?id=59911&amp;p1=923401052&amp;p2=923401070\">Computational Statistics: Feature Selection, Regularization, and Shrinkage<\/a>, which provides a more detailed treatment of these topics.\r\n   <\/p>\r\n   <p>In closing, I'd like to ask whether any of you have practical examples of applying feature selection or regularization algorithms\r\n      in your work.\r\n   <\/p>\r\n   <div>\r\n      <ul>\r\n         <li>Have you ever used feature selection?<\/li>\r\n         <li>Do you see an opportunity to apply lasso or ridge regression in your work?<\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>If so, please post <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=298#respond\">here<\/a>.\r\n   <\/p><script language=\"JavaScript\">\r\n<!--\r\n\r\n    function grabCode_70800e67086e4d37aeee1ea03b81b69f() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='70800e67086e4d37aeee1ea03b81b69f ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 70800e67086e4d37aeee1ea03b81b69f';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        author = 'Richard Willey';\r\n        copyright = 'Copyright 2011 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add author and copyright lines at the bottom if specified.\r\n        if ((author.length > 0) || (copyright.length > 0)) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (author.length > 0) {\r\n                d.writeln('% _' + author + '_');\r\n            }\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n      \r\n      d.title = title + ' (MATLAB code)';\r\n      d.close();\r\n      }   \r\n      \r\n-->\r\n<\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_70800e67086e4d37aeee1ea03b81b69f()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n            the MATLAB code \r\n            <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; 7.13<br><\/p>\r\n<\/div>\r\n<!--\r\n70800e67086e4d37aeee1ea03b81b69f ##### SOURCE BEGIN #####\r\n%% Subset Selection and Regularization (Part 2) \r\n% _This week Richard Willey from technical marketing will finish his two\r\n% part presentation on subset selection and regularization._  \r\n% \r\n% In a\r\n% <https:\/\/blogs.mathworks.com\/loren\/2011\/11\/21\/subset-selection-and-regularization\/\r\n% recent posting>, we examined how to use sequential feature selection to\r\n% improve predictive accuracy when modeling wide data sets with highly\r\n% correlated variables.  
This week, we're going to solve the same problems\r\n% using regularization algorithms such as lasso, the elastic net, and ridge\r\n% regression.  Mathematically, these algorithms work by penalizing the size\r\n% of the regression coefficients in the model.\r\n%\r\n% Standard linear regression works by estimating a set of coefficients\r\n% that minimize the sum of the squared error between the observed \r\n% values and the fitted values from the model.  Regularization techniques \r\n% like ridge regression, lasso, and the elastic net introduce an \r\n% additional term to this minimization problem.  \r\n%\r\n% * Ridge regression identifies a set of regression coefficients that\r\n% minimize the sum of the squared errors plus the sum of the squared\r\n% regression coefficients multiplied by a weight parameter $$ \\lambda $$.\r\n% $$ \\lambda $$ can take any nonnegative value.  A $$ \\lambda $$\r\n% value of zero is equivalent to a standard linear regression.  As $$\r\n% \\lambda $$ increases in size, regression coefficients shrink towards\r\n% zero.\r\n% * Lasso minimizes the sum of the squared errors plus the sum of the\r\n% absolute values of the regression coefficients.\r\n% * The elastic net is a weighted average of the lasso and the ridge\r\n% solutions.\r\n%\r\n% The introduction of this additional term forces the regression\r\n% coefficients towards zero, generating a simpler model with greater\r\n% predictive accuracy.  
\r\n%\r\n% Let's see regularization in action by using lasso to solve the same\r\n% problem we looked at last week.\r\n \r\n%% Recreate Data Set 1 from the Previous Post  \r\n \r\nclear all\r\nclc\r\nrng(1998);\r\nmu = [0 0 0 0 0 0 0 0];\r\ni = 1:8;\r\nmatrix = abs(bsxfun(@minus,i',i));\r\ncovariance = repmat(.5,8,8).^matrix;\r\nX = mvnrnd(mu, covariance, 20);\r\nBeta = [3; 1.5; 0; 0; 2; 0; 0; 0];\r\nds = dataset(Beta);\r\nY = X * Beta + 3 * randn(20,1);\r\nb = regress(Y,X);\r\nds.Linear = b;\r\n \r\n%% Use Lasso to Fit the Model\r\n% The syntax for the\r\n% <https:\/\/www.mathworks.com\/help\/releases\/R2011b\/toolbox\/stats\/lasso.html |lasso|> command\r\n% is very similar to that used by linear regression. In this line of code,\r\n% I am going to estimate a set of coefficients |B| that models |Y| as a\r\n% function of |X|.  To avoid overfitting, I'm going to apply five-fold\r\n% cross validation.\r\n[B Stats] = lasso(X,Y, 'CV', 5);\r\n \r\n%%\r\n% When we perform a linear regression, we generate a single set of\r\n% regression coefficients.  By default |lasso| will create 100 different\r\n% models.  Each model is estimated using a slightly larger $$ \\lambda $$. \r\n% All of the model coefficients are stored in array |B|.  The rest of the\r\n% information about the model is stored in a structure named |Stats|.\r\n%\r\n% Let's look at the first five sets of coefficients inside of |B|.  As you\r\n% move along each row you can see that as $$ \\lambda $$ increases, the\r\n% model coefficients generally shrink towards zero.\r\ndisp(B(:,1:5))\r\ndisp(Stats)\r\n \r\n%% Create a Plot Showing Mean Square Error Versus |Lambda|\r\n% The natural question to ask at this point is \"OK, which of these\r\n% 100 different models should I use?\"  
We can answer that question using\r\n% <https:\/\/www.mathworks.com\/help\/releases\/R2011b\/toolbox\/stats\/lassoplot.html |lassoPlot|>.\r\n% |lassoPlot| generates a plot that displays the relationship between $$\r\n% \\lambda $$ and the cross validated mean square error (MSE) of the\r\n% resulting model.  Each red dot shows the MSE for one of the\r\n% models.  The vertical line segments stretching out from each dot are error\r\n% bars for each estimate.\r\n%\r\n% You can also see a pair of vertical dashed lines.  The line on the right\r\n% identifies the $$ \\lambda $$ value that minimizes the cross validated\r\n% MSE. The line on the left indicates the highest value of $$ \\lambda $$\r\n% whose MSE is within one standard error of the minimum MSE.  In general,\r\n% people will choose the $$ \\lambda $$ that minimizes the MSE.  On occasion,\r\n% if a more parsimonious model is considered particularly advantageous, a\r\n% user might choose some other $$ \\lambda $$ value that falls between the\r\n% two line segments.\r\nlassoPlot(B, Stats, 'PlotType', 'CV')\r\n \r\n%% Use the Stats Structure to Extract a Set of Model Coefficients.\r\n% The $$ \\lambda $$ value that minimizes the MSE is stored in the\r\n% |Stats| structure.  You can use this information to index into |B| and\r\n% extract the set of coefficients that minimize the MSE.\r\n%\r\n% Much as in the feature selection example, we can see that the lasso\r\n% algorithm has eliminated four of the five distractors from the resulting\r\n% model.  This new, more parsimonious model will be significantly more\r\n% accurate for prediction than a standard linear regression.\r\n%\r\nds.Lasso = B(:,Stats.IndexMinMSE);\r\ndisp(ds)\r\n \r\n%%  Run a Simulation\r\n% Here, once again, it's very dangerous to base any kind of analysis on a\r\n% single observation.  Let's use a simulation to compare the accuracy of a\r\n% linear regression with the lasso.  
We'll start by preallocating some\r\n% variables.\r\nMSE = zeros(100,1);\r\nmse = zeros(100,1);\r\nCoeff_Num = zeros(100,1);\r\nBetas = zeros(8,100);\r\ncv_Reg_MSE = zeros(1,100);\r\n \r\n%%\r\n% Next, we'll generate 100 different models and estimate the number of\r\n% coefficients contained in the lasso model as well as the difference in\r\n% the cross validated MSE between a standard linear regression and the\r\n% lasso model.\r\n%\r\n% As you can see, on average, the lasso model only contains 4.5 terms (the\r\n% standard linear regression model includes 8).  More importantly, the\r\n% cross validated MSE for the linear regression model is about 30% larger\r\n% than that generated from the |lasso|.  This is an incredibly powerful\r\n% result.  The |lasso| algorithm is every bit as easy to apply as standard\r\n% linear regression; however, it offers significant improvements in\r\n% predictive accuracy compared to regression.\r\n \r\nrng(1998);\r\n \r\nfor i = 1 : 100\r\n    \r\n    X = mvnrnd(mu, covariance, 20);\r\n    Y = X * Beta + randn(20,1);\r\n    \r\n    [B Stats] = lasso(X,Y, 'CV', 5);\r\n    Betas(:,i) = B(:,Stats.IndexMinMSE) ~= 0;\r\n    Coeff_Num(i) = sum(B(:,Stats.IndexMinMSE) ~= 0);\r\n    MSE(i) = Stats.MSE(:, Stats.IndexMinMSE);\r\n    \r\n    regf = @(XTRAIN, ytrain, XTEST)(XTEST*regress(ytrain,XTRAIN));\r\n    cv_Reg_MSE(i) = crossval('mse',X,Y,'predfun',regf, 'kfold', 5);\r\n        \r\nend\r\n \r\nNumber_Lasso_Coefficients = mean(Coeff_Num);\r\ndisp(Number_Lasso_Coefficients)\r\n \r\nMSE_Ratio = median(cv_Reg_MSE)\/median(MSE);\r\ndisp(MSE_Ratio)\r\n\r\n%% Choosing the Best Technique\r\n% Regularization methods and feature selection techniques both have \r\n% unique strengths and weaknesses.  
Let's close this blog posting with \r\n% some practical guidance regarding pros and cons for the various\r\n% techniques.\r\n%\r\n% Regularization techniques have two major advantages compared to feature\r\n% selection.\r\n%\r\n% * Regularization techniques are able to operate on much larger datasets \r\n% than feature selection methods.  Lasso and ridge regression can be\r\n% applied to datasets that contain thousands - even tens of thousands - of\r\n% variables.  Even sequential feature selection is usually too slow to cope \r\n% with this many possible predictors.\r\n% * Regularization algorithms often generate more accurate predictive\r\n% models than feature selection.  Regularization operates over a\r\n% continuous space while feature selection operates over a discrete space.\r\n% As a result, regularization is often able to fine-tune the model and\r\n% produce more accurate estimates.\r\n%\r\n% However, feature selection methods also have their advantages:\r\n%\r\n% * Regularization techniques are only available for a small number of model\r\n% types.  Notably, regularization can be applied to linear regression and\r\n% logistic regression.  However, if you're working with some other modeling\r\n% technique - say, a boosted decision tree - you'll typically need to apply\r\n% feature selection techniques.\r\n% * Feature selection is easier to understand and explain to third parties.\r\n% Never underestimate the importance of being able to describe your methods\r\n% when sharing your results.\r\n%\r\n% With this said and done, each of the three regularization techniques also\r\n% offers its own unique advantages and disadvantages.\r\n%\r\n% * Because lasso uses an L1 norm, it tends to force individual coefficient\r\n% values completely to zero.  As a result, lasso works very well as a\r\n% feature selection algorithm.  
It quickly identifies a small number of key\r\n% variables.\r\n% * In contrast, ridge regression uses an L2 norm for the coefficients\r\n% (the penalty term is the sum of the squared coefficients).  Ridge regression\r\n% tends to spread coefficient shrinkage across a larger number of\r\n% coefficients.  If you think that your model should contain a large number\r\n% of coefficients, ridge regression is probably a better choice than lasso.\r\n% * Last, but not least, we have the elastic net, which is able to\r\n% compensate for a very specific limitation of lasso.  Lasso is unable to\r\n% identify more predictors than you have observations.  \r\n% \r\n% Let's assume that you are running a cancer research study.  \r\n%\r\n% * You have gene sequences for 500 different cancer patients\r\n% * You're trying to determine which of 15,000 different genes have a\r\n% significant impact on the progression of the disease.\r\n%\r\n% Sequential feature selection is completely impractical with this \r\n% many different variables.  You can't use ridge regression because \r\n% it won't force coefficients all the way to zero. \r\n% At the same time, you can't use lasso since you might need to identify \r\n% more than 500 different genes.  The elastic net is one possible solution.\r\n\r\n%% Conclusion\r\n% If you'd like more information on this topic, there is a MathWorks\r\n% webinar titled\r\n% <https:\/\/www.mathworks.com\/company\/events\/webinars\/wbnr59911.html?id=59911&p1=923401052&p2=923401070\r\n% Computational Statistics:  Feature Selection, Regularization, and\r\n% Shrinkage>, which provides a more detailed treatment of these topics.\r\n%\r\n% In closing, I'd like to ask whether any of you have practical\r\n% examples of applying feature selection or regularization algorithms in your\r\n% work.\r\n%\r\n% * Have you ever used feature selection?  
\r\n% * Do you see an opportunity to apply lasso or ridge regression in your\r\n% work?\r\n% \r\n% If so, please post here <https:\/\/blogs.mathworks.com\/loren\/?p=298#respond here>.\r\n\r\n\r\n\r\n\r\n##### SOURCE END ##### 70800e67086e4d37aeee1ea03b81b69f\r\n-->","protected":false},"excerpt":{"rendered":"<p>\r\n   \r\n      This week Richard Willey from technical marketing will finish his two part presentation on subset selection and regularization.\r\n      In a recent posting, we examined how to use... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2011\/11\/29\/subset-selection-and-regularization-part-2\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[47,48],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/298"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=298"}],"version-history":[{"count":4,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/298\/revisions"}],"predecessor-version":[{"id":1952,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/298\/revisions\/1952"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=298"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=298"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=298"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/
{rel}","templated":true}]}}