{"id":11136,"date":"2024-05-04T10:33:38","date_gmt":"2024-05-04T14:33:38","guid":{"rendered":"https:\/\/blogs.mathworks.com\/cleve\/?p=11136"},"modified":"2024-05-04T14:47:32","modified_gmt":"2024-05-04T18:47:32","slug":"r-squared-is-bigger-better","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/cleve\/2024\/05\/04\/r-squared-is-bigger-better\/","title":{"rendered":"R-squared.  Is Bigger Better?"},"content":{"rendered":"<div class=\"content\"><!--introduction-->\r\n<p>The <i>coefficient of determination<\/i>, R-squared or <tt>R^2<\/tt>, is a popular statistic that describes how well a regression model fits data. It measures the proportion of variation in data that is predicted by a model. However, that is <i>all<\/i> that <tt>R^2<\/tt> measures. It is <i>not<\/i> appropriate for any other use. For example, it does not support extrapolation beyond the domain of the data. It does not suggest that one model is preferable to another.<\/p>\r\n<p>I recently watched high school students participate in the final round of a national mathematical modeling competition. The teams' presentations were excellent; they were well-prepared, mathematically sophisticated, and informative. Unfortunately, many of the presentations abused <tt>R^2<\/tt>. It was used to compare different fits, to justify extrapolation, and to recommend public policy.<\/p>\r\n<p>This was not the first time that I have seen abuses of <tt>R^2<\/tt>. As educators and authors of mathematical software, we must do more to expose its limitations. There are dozens of pages and videos on the web describing <tt>R^2<\/tt>, but few of them warn about possible misuse.<\/p>\r\n<p>\r\n<tt>R^2<\/tt> is easily computed. 
If <tt>y<\/tt> is a vector of observations, <tt>f<\/tt> is a fit to the data and <tt>ybar = mean(y)<\/tt>, then<\/p>\r\n<pre>   R^2 = 1 - norm(y-f)^2\/norm(y-ybar)^2<\/pre>\r\n<p>If the data are centered, then <tt>ybar = 0<\/tt> and <tt>R^2<\/tt> is between zero and one.<\/p>\r\n<!--\/introduction-->\r\n<p>One of my favorite examples is the United States Census. Here is the population, in millions, every ten years since 1900.<\/p>\r\n<pre>   t         p\r\n  ____    _______\r\n  1900     75.995\r\n  1910     91.972\r\n  1920    105.711\r\n  1930    123.203\r\n  1940    131.669\r\n  1950    150.697\r\n  1960    179.323\r\n  1970    203.212\r\n  1980    226.505\r\n  1990    249.633\r\n  2000    281.422\r\n  2010    308.746\r\n  2020    331.449<\/pre>\r\n<p>There are 13 observations. So, we can do a least-squares fit by a polynomial of any degree less than 12 and can interpolate by a polynomial of degree 12. Here are four such fits and the corresponding <tt>R^2<\/tt> values. As the degree increases, so does <tt>R^2<\/tt>. 
Interpolation fits the data exactly and earns a perfect score.<\/p>\r\n<p>Which fit would you choose to predict the population in 2030, or even to estimate the population between census years?<\/p>\r\n<pre class=\"codeinput\">R2_census\r\n<\/pre>\r\n<img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/R2_blog_01.png\" alt=\"\"> <p>Thanks to Peter Perkins and Tom Lane for help with this post.<\/p>\r\n<script language=\"JavaScript\"> <!-- \r\n    function grabCode_c876d07b32bc454d8001573d08d2f7aa() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='c876d07b32bc454d8001573d08d2f7aa ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' c876d07b32bc454d8001573d08d2f7aa';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2024 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script>\r\n<p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\">\r\n<br>\r\n<a href=\"javascript:grabCode_c876d07b32bc454d8001573d08d2f7aa()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript>\r\n<\/span><\/a>\r\n<br>\r\n<br>\r\n      Published with MATLAB&reg; R2024a<br>\r\n<\/p>\r\n<\/div>\r\n<!--\r\nc876d07b32bc454d8001573d08d2f7aa ##### SOURCE BEGIN #####\r\n%% R-squared.  Is Bigger Better?\r\n% The _coefficient of determination_, R-squared or |R^2|, is a popular\r\n% statistic that describes how well a regression model fits data.\r\n% It measures the proportion of variation in data that is\r\n% predicted by a model.  However, that is _all_ that |R^2|\r\n% measures.  It is _not_ appropriate for any other use.  For example,\r\n% it does not support extrapolation beyond the domain of the data.\r\n% It does not suggest that one model is preferable to\r\n% another.  
\r\n%\r\n% I recently watched high school students participate in the\r\n% final round of a national mathematical modeling competition.\r\n% The teams' presentations were excellent; they were\r\n% well-prepared, mathematically sophisticated, and informative.\r\n% Unfortunately, many of the presentations abused |R^2|.  It was\r\n% used to compare different fits, to justify extrapolation,\r\n% and to recommend public policy.\r\n%\r\n% This was not the first time that I have seen abuses of |R^2|.\r\n% As educators and authors of mathematical software, we must\r\n% do more to expose its limitations.  There are dozens of pages\r\n% and videos on the web describing |R^2|, but few of them warn\r\n% about possible misuse.\r\n%\r\n% |R^2| is easily computed. \r\n% If |y| is a vector of observations, |f| is a fit to\r\n% the data and |ybar = mean(y)|, then\r\n% \r\n%     R^2 = 1 - norm(y-f)^2\/norm(y-ybar)^2 \r\n%\r\n% If the data are centered, then |ybar = 0| and |R^2| is between\r\n% zero and one.\r\n\r\n%%\r\n% One of my favorite examples is the United States Census.\r\n% Here is the population, in millions, every ten years since 1900. \r\n%\r\n%     t         p   \r\n%    ____    _______\r\n%    1900     75.995\r\n%    1910     91.972\r\n%    1920    105.711\r\n%    1930    123.203\r\n%    1940    131.669\r\n%    1950    150.697\r\n%    1960    179.323\r\n%    1970    203.212\r\n%    1980    226.505\r\n%    1990    249.633\r\n%    2000    281.422\r\n%    2010    308.746\r\n%    2020    331.449\r\n%\r\n% There are 13 observations.  So, we can do a least-squares fit\r\n% by a polynomial of any degree less than 12 and can\r\n% interpolate by a polynomial of degree 12.  Here are four such\r\n% fits and the corresponding |R^2| values.  As the degree increases,\r\n% so does |R^2|.  
Interpolation fits the data exactly and earns\r\n% a perfect score.\r\n%\r\n% Which fit would you choose to predict the population in 2030,\r\n% or even to estimate the population between census years?\r\n\r\nR2_census\r\n\r\n%%\r\n% Thanks to Peter Perkins and Tom Lane for help with this post.\r\n\r\n##### SOURCE END ##### c876d07b32bc454d8001573d08d2f7aa\r\n-->\r\n","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/degree7.png\" class=\"img-responsive attachment-post-thumbnail size-post-thumbnail wp-post-image\" alt=\"\" decoding=\"async\" loading=\"lazy\" \/><\/div><!--introduction-->\r\n<p>The <i>coefficient of determination<\/i>, R-squared or <tt>R^2<\/tt>, is a popular statistic that describes how well a regression model fits data. It measures the proportion of variation in data that is predicted by a model. However, that is <i>all<\/i> that <tt>R^2<\/tt> measures. It is <i>not<\/i> appropriate for any other use. For example, it does not support extrapolation beyond the domain of the data. It does not suggest that one model is preferable to another.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/cleve\/2024\/05\/04\/r-squared-is-bigger-better\/\">read more >><\/a><\/p>","protected":false},"author":78,"featured_media":11160,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[37,48],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/11136"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/users\/78"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/comments?post=11136"}],"version-history":[{"count":7,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/11136\/revisions"}],"predecessor-version":[{"id":11178,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/11136\/revisions\/11178"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media\/11160"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media?parent=11136"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/categories?post=11136"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/tags?post=11136"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}