{"id":385,"date":"2012-03-16T14:14:29","date_gmt":"2012-03-16T19:14:29","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=385"},"modified":"2012-03-05T14:15:15","modified_gmt":"2012-03-05T19:15:15","slug":"new-regression-capabilities-in-release-2012a","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2012\/03\/16\/new-regression-capabilities-in-release-2012a\/","title":{"rendered":"New Regression Capabilities in Release 2012A"},"content":{"rendered":"<div xmlns:mwsh=\"https:\/\/www.mathworks.com\/namespace\/mcode\/v1\/syntaxhighlight.dtd\" class=\"content\">\r\n   <introduction>\r\n      <p><i>This week Richard Willey from technical marketing will be guest blogging about new regression capabilities shipping with the\r\n            12a <a href=\"https:\/\/www.mathworks.com\/products\/statistics\/\">Statistics Toolbox<\/a> release.<\/i><\/p>\r\n   <\/introduction>\r\n   <h3>Contents<\/h3>\r\n   <div>\r\n      <ul>\r\n         <li><a href=\"#2\">Create a dataset<\/a><\/li>\r\n         <li><a href=\"#3\">Use linear regression to model Y as a function of X<\/a><\/li>\r\n         <li><a href=\"#5\">Use myFit for analysis<\/a><\/li>\r\n         <li><a href=\"#7\">Look for patterns in the residuals<\/a><\/li>\r\n         <li><a href=\"#9\">Look for autocorrelation in the residuals<\/a><\/li>\r\n         <li><a href=\"#10\">Look for outliers<\/a><\/li>\r\n         <li><a href=\"#12\">Use the resulting model for prediction<\/a><\/li>\r\n         <li><a href=\"#13\">Discover the full set of methods available with regression objects<\/a><\/li>\r\n         <li><a href=\"#14\">Discover the full set of information included in the regression object<\/a><\/li>\r\n         <li><a href=\"#15\">Formulas<\/a><\/li>\r\n         <li><a href=\"#17\">Modeling a high order polynomial (Option 1)<\/a><\/li>\r\n         <li><a href=\"#18\">Modeling a high order polynomial (Option 2)<\/a><\/li>\r\n         <li><a href=\"#19\">Nonlinear regression:  
Generate our data<\/a><\/li>\r\n         <li><a href=\"#20\">Nonlinear regression:  Generate a fit<\/a><\/li>\r\n         <li><a href=\"#21\">Nonlinear regression: Work with the resulting model<\/a><\/li>\r\n         <li><a href=\"#22\">Nonlinear regression:  Alternative ways to specify the regression model<\/a><\/li>\r\n         <li><a href=\"#23\">Conclusion<\/a><\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>The 12a release of Statistics Toolbox includes new functions for<\/p>\r\n   <div>\r\n      <ul>\r\n         <li>Linear regression<\/li>\r\n         <li>Nonlinear regression<\/li>\r\n         <li>Logistic regression (and other types of generalized linear models)<\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>These regression techniques aren&#8217;t new to Statistics Toolbox.  What is new is that MathWorks added a wide set of support\r\n      functions that simplify common analysis tasks like plotting, outlier detection, generating predictions, performing stepwise\r\n      regression, applying robust regression...\r\n   <\/p>\r\n   <p>We'll start with a simple example using linear regression.<\/p>\r\n   <h3>Create a dataset<a name=\"2\"><\/a><\/h3>\r\n   <p>I'm going to generate a basic dataset in which the relationship between X and Y is modeled by a straight line (Y = mX + B)\r\n      and add in some normally distributed noise. 
Next, I'll generate a scatter plot showing the relationship between X and Y.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">clear <span style=\"color: #A020F0\">all<\/span>\r\nclc\r\nrng(1998);\r\n\r\nX = linspace(1,100,50);\r\nX = X';\r\nY = 7*X + 50 + 30*randn(50,1);\r\nNew_X = 100 * rand(10,1);\r\n\r\nscatter(X,Y, <span style=\"color: #A020F0\">'.'<\/span>)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/385\/fit12a_01.png\"> <h3>Use linear regression to model Y as a function of X<a name=\"3\"><\/a><\/h3>\r\n   <p>Looking at the data, we can see a clear linear relationship between X and Y.  I'm going to use the new LinearModel function\r\n      to model Y as a function of X.  The output from this function will be stored as an object named \"myFit\", which is displayed\r\n      on the screen.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">myFit = LinearModel.fit(X,Y)<\/pre><pre style=\"font-style:oblique\">myFit = \r\nLinear regression model:\r\n    y ~ 1 + x1\r\n\r\nEstimated Coefficients:\r\n                   Estimate    SE         tStat     pValue    \r\n    (Intercept)      51.6       8.2404    6.2618    9.9714e-08\r\n    x1             7.0154      0.14131    49.644    6.4611e-43\r\n\r\nNumber of observations: 50, Error degrees of freedom: 48\r\nRoot Mean Squared Error: 29.1\r\nR-squared: 0.981,  Adjusted R-Squared 0.98\r\nF-statistic vs. constant model: 2.46e+03, p-value = 6.46e-43\r\n<\/pre><p>The first line shows the linear regression model.  When we perform a regression we need to specify a model that describes the\r\n      relationship between our variables.  By default, LinearModel assumes that you want to model the relationship as a straight\r\n      line with an intercept term.  The expression \"y ~ 1 + x1\" describes this model.  
Formally, this expression translates as \"Y\r\n      is modeled as a linear function which includes an intercept and a variable\".  Once again note that we are representing a model\r\n      of the form Y = mX + B...\r\n   <\/p>\r\n   <p>The next block of text includes estimates for the coefficients, along with basic information regarding the reliability of\r\n      those estimates.\r\n   <\/p>\r\n   <p>Finally, we have basic information about the goodness-of-fit including the R-square, the adjusted R-square and the Root Mean\r\n      Squared Error.\r\n   <\/p>\r\n   <h3>Use myFit for analysis<a name=\"5\"><\/a><\/h3>\r\n   <p>Earlier, I mentioned that the new regression functions include a wide variety of support functions that automate different\r\n      analysis tasks. Let's look at some of them.\r\n   <\/p>\r\n   <p>First, let's generate a plot that we can use to evaluate the quality of the resulting fit.  We'll do so by applying the standard\r\n      MATLAB plot command to \"myFit\".\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">plot(myFit)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/385\/fit12a_02.png\"> <p>Notice that this simple command creates a plot with a wealth of information including<\/p>\r\n   <div>\r\n      <ul>\r\n         <li>A scatter plot of the original dataset<\/li>\r\n         <li>A line showing our fit<\/li>\r\n         <li>Confidence intervals for the fit<\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>MATLAB has also automatically labelled our axes and added a legend.<\/p>\r\n   <h3>Look for patterns in the residuals<a name=\"7\"><\/a><\/h3>\r\n   <p>Alternatively, let's assume that we wanted to see whether there was any pattern to the residuals.  (A noticeable pattern to\r\n      the residuals might suggest that our model is too simple and that it failed to capture a real-world trend in the data set. 
\r\n      This technique can also be used to check and see whether the noise component is constant across the dataset).\r\n   <\/p>\r\n   <p>Here, I'll pass \"myFit\" to the new \"plotResiduals\" method and tell plotResiduals to plot the residuals versus the fitted values.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">figure\r\nplotResiduals(myFit, <span style=\"color: #A020F0\">'fitted'<\/span>)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/385\/fit12a_03.png\"> <p>My plot looks like random noise - which in this case is a very good thing.<\/p>\r\n   <h3>Look for autocorrelation in the residuals<a name=\"9\"><\/a><\/h3>\r\n   <p>Autocorrelation in my data set could also throw off the quality of my fit.  The following command will modify the residual\r\n      plot by plotting residuals versus lagged residuals.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">figure\r\nplotResiduals(myFit, <span style=\"color: #A020F0\">'lagged'<\/span>)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/385\/fit12a_04.png\"> <h3>Look for outliers<a name=\"10\"><\/a><\/h3>\r\n   <p>Suppose that I wanted to check for outliers...  We also have a plot for that.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">figure\r\nplotDiagnostics(myFit, <span style=\"color: #A020F0\">'cookd'<\/span>)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/385\/fit12a_05.png\"> <p>Cook's Distance is a metric that is commonly used to see whether a dataset contains any outliers.  For any given data point,\r\n      Cook's Distance is calculated by performing a brand new regression that excludes that data point.  
Cook's distance measures\r\n      how much the shape of the curve changes between the two fits.  If the curve moves by a large amount, that data point has a\r\n      great deal of influence on the model and might very well be an outlier.\r\n   <\/p>\r\n   <div>\r\n      <ul>\r\n         <li>The red crosses show the Cook's Distance for each point in the data set.<\/li>\r\n         <li>The horizontal line shows \"Three times the average Cook's Distance for all the points in the data set\".  Data points whose\r\n            Cook's Distance is greater than three times the mean are often considered possible outliers.\r\n         <\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>In this example, none of our data points look as if they are outliers.<\/p>\r\n   <h3>Use the resulting model for prediction<a name=\"12\"><\/a><\/h3>\r\n   <p>Last, but not least, let's assume that we wanted to use our model for prediction.  This is as easy as applying the \"predict\"\r\n      method.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">Predictions = predict(myFit, New_X)<\/pre><pre style=\"font-style:oblique\">Predictions =\r\n       310.51\r\n       596.17\r\n       64.866\r\n       490.79\r\n       306.06\r\n       308.87\r\n       621.55\r\n       366.15\r\n       543.32\r\n       317.33\r\n<\/pre><h3>Discover the full set of methods available with regression objects<a name=\"13\"><\/a><\/h3>\r\n   <p>I hope that you agree that all these built in plots and analysis routines represent a significant improvement in usability.\r\n       However, if you're anything like me, your immediate reaction is going to be \"Great, you've built a lot of nice stuff, but\r\n      how do you expect me to find out about it?\"\r\n   <\/p>\r\n   <p>What I'd like to do now is show you a couple of simple tricks that you can use to discover all the new capabilities that we've\r\n      added.  
The first trick is to recognize that \"myFit\" is an object and that objects have methods associated with them.  All\r\n      of the commands that we've used so far like \"plot\", \"plotResiduals\", and \"predict\" are methods for the LinearModel object.\r\n   <\/p>\r\n   <p>Any time that I'm working with one of the built in objects that ship with MATLAB, my first step is to inspect the full set\r\n      of methods that ship with that object.  This is as easy as typing methods(myFit) at the command line.  I can use this to immediately\r\n      discover all the built in capabilities that ship with the object. If one of those options catches my eye, I can use the help\r\n      system to get more information.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">methods(myFit)<\/pre><pre style=\"font-style:oblique\">\r\nMethods for class LinearModel:\r\n\r\naddTerms              plot                  plotSlice             \r\nanova                 plotAdded             predict               \r\ncoefCI                plotAdjustedResponse  random                \r\ncoefTest              plotDiagnostics       removeTerms           \r\ndisp                  plotEffects           step                  \r\ndwtest                plotInteraction       \r\nfeval                 plotResiduals         \r\n\r\nStatic methods:\r\n\r\nfit                   stepwise              \r\n\r\n<\/pre><h3>Discover the full set of information included in the regression object<a name=\"14\"><\/a><\/h3>\r\n   <p>Here's another really useful trick to learn about the new regression objects.  You can use the MATLAB variable editor to walk\r\n      through the object and see all the information that is available.\r\n   <\/p>\r\n   <p>You should have an object named \"myFit\" in the MATLAB workspace.  
Double clicking on the object will open the object in the\r\n      Variable Editor.\r\n   <\/p>\r\n   <h3>Formulas<a name=\"15\"><\/a><\/h3>\r\n   <p>At the start of this blog there was some brief introduction to \"formulas\".  I'd like to conclude this talk by providing a\r\n      bit more information about formulas.  Regression analysis requires the ability to specify a model that describes the relationship\r\n      between your predictors and your response variables.\r\n   <\/p>\r\n   <p>Let's change our initial example such that we're working with a high order polynomial rather than a straight line.  I'm also\r\n      going to change this from a curve fitting problem to a surface fitting problem.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">X1 = 100 * randn(100,1);\r\nX2 = 100 * rand(100,1);\r\nX = [X1, X2];\r\nY = 3*X1.^2 + 5*X1.*X2 + 7* X2.^2 + 9*X1 + 11*X2 + 30 + 100*randn(100,1);\r\nmyFit2 = LinearModel.fit(X,Y)<\/pre><pre style=\"font-style:oblique\">myFit2 = \r\nLinear regression model:\r\n    y ~ 1 + x1 + x2\r\n\r\nEstimated Coefficients:\r\n                   Estimate    SE       tStat     pValue   \r\n    (Intercept)     25771      11230    2.2949     0.023893\r\n    x1             178.75      54.95     3.253    0.0015717\r\n    x2             613.39      187.2    3.2767    0.0014579\r\n\r\nNumber of observations: 100, Error degrees of freedom: 97\r\nRoot Mean Squared Error: 5.42e+04\r\nR-squared: 0.211,  Adjusted R-Squared 0.195\r\nF-statistic vs. constant model: 13, p-value = 1.03e-05\r\n<\/pre><p>Let's take a look at the output from this example.  
We can see, almost immediately, that something has gone wrong with our\r\n      fit.\r\n   <\/p>\r\n   <div>\r\n      <ul>\r\n         <li>The R^2 value is pretty bad<\/li>\r\n         <li>The regression coefficients are nowhere near the ones we specified when we created the dataset<\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>If we look at the line that describes the linear regression model we can see what went wrong.  By default, LinearModel will\r\n      fit a plane to the dataset.  Here, the relationship between X and Y is modelled as a high order polynomial.  We need to pass\r\n      this additional piece of information to \"LinearModel\".\r\n   <\/p>\r\n   <h3>Modeling a high order polynomial (Option 1)<a name=\"17\"><\/a><\/h3>\r\n   <p>Here are a couple different ways that I can use LinearModel to model a high order polynomial.  The first option is to write\r\n      out the formula by hand.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">myFit2 = LinearModel.fit(X,Y, <span style=\"color: #A020F0\">'y ~ 1 + x1^2 + x2^2 + x1:x2 + x1 + x2'<\/span>)<\/pre><pre style=\"font-style:oblique\">myFit2 = \r\nLinear regression model:\r\n    y ~ 1 + x1*x2 + x1^2 + x2^2\r\n\r\nEstimated Coefficients:\r\n                   Estimate    SE           tStat      pValue     \r\n    (Intercept)    32.317         38.071    0.84886        0.39811\r\n    x1             9.1735         0.2119     43.291     6.9346e-64\r\n    x2              12.19         1.6568     7.3577     6.9388e-11\r\n    x1:x2          4.9985      0.0033943     1472.6    7.0691e-207\r\n    x1^2           2.9992      0.0006268       4785    5.5265e-255\r\n    x2^2           6.9849       0.015281     457.11    3.9814e-159\r\n\r\nNumber of observations: 100, Error degrees of freedom: 94\r\nRoot Mean Squared Error: 101\r\nR-squared: 1,  Adjusted R-Squared 1\r\nF-statistic vs. 
constant model: 7.02e+06, p-value = 3.23e-260\r\n<\/pre><h3>Modeling a high order polynomial (Option 2)<a name=\"18\"><\/a><\/h3>\r\n   <p>Alternatively, I can simply use the string \"poly22\" to indicate a second order polynomial for both X1 and X2 and automatically\r\n      generate all the appropriate terms and cross terms.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">myFit2 = LinearModel.fit(X, Y, <span style=\"color: #A020F0\">'poly22'<\/span>)<\/pre><pre style=\"font-style:oblique\">myFit2 = \r\nLinear regression model:\r\n    y ~ 1 + x1^2 + x1*x2 + x2^2\r\n\r\nEstimated Coefficients:\r\n                   Estimate    SE           tStat      pValue     \r\n    (Intercept)    32.317         38.071    0.84886        0.39811\r\n    x1             9.1735         0.2119     43.291     6.9346e-64\r\n    x2              12.19         1.6568     7.3577     6.9388e-11\r\n    x1^2           2.9992      0.0006268       4785    5.5265e-255\r\n    x1:x2          4.9985      0.0033943     1472.6    7.0691e-207\r\n    x2^2           6.9849       0.015281     457.11    3.9814e-159\r\n\r\nNumber of observations: 100, Error degrees of freedom: 94\r\nRoot Mean Squared Error: 101\r\nR-squared: 1,  Adjusted R-Squared 1\r\nF-statistic vs. constant model: 7.02e+06, p-value = 3.23e-260\r\n<\/pre><h3>Nonlinear regression:  Generate our data<a name=\"19\"><\/a><\/h3>\r\n   <p>Let's consider a nonlinear regression example.  This time around, we'll work with a sine curve.  
The equation for a sine curve\r\n      is governed by four key parameters.\r\n   <\/p>\r\n   <div>\r\n      <ol>\r\n         <li>The frequency<\/li>\r\n         <li>The amplitude<\/li>\r\n         <li>The vertical shift<\/li>\r\n         <li>The phase shift<\/li>\r\n      <\/ol>\r\n   <\/div>\r\n   <p>We'll start by generating our dataset.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">X = linspace(0, 6*pi, 90);\r\nX = X';\r\nY = 10 + 3*(sin(1*X + 5)) + .2*randn(90,1);<\/pre><h3>Nonlinear regression:  Generate a fit<a name=\"20\"><\/a><\/h3>\r\n   <p>Next we'll use the NonLinearModel function to perform a nonlinear regression.  Here we need to<\/p>\r\n   <div>\r\n      <ol>\r\n         <li>Specify a formula that describes the relationship between X and Y<\/li>\r\n         <li>Provide some reasonable starting conditions for the optimization solvers<\/li>\r\n      <\/ol>\r\n   <\/div><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">myFit3 = NonLinearModel.fit(X,Y, <span style=\"color: #A020F0\">'y ~ b0 + b1*sin(b2*x + b3)'<\/span>, [11, 2.5, 1.1, 5.5])<\/pre><pre style=\"font-style:oblique\">myFit3 = \r\nNonlinear regression model:\r\n    y ~ b0 + b1*sin(b2*x + b3)\r\n\r\nEstimated Coefficients:\r\n          Estimate    SE           tStat     pValue     \r\n    b0     9.9751       0.02469    404.02    9.0543e-143\r\n    b1     2.9629      0.033784    87.703     6.4929e-86\r\n    b2    0.99787     0.0021908    455.47     3.029e-147\r\n    b3     5.0141      0.023119    216.88    1.4749e-119\r\n\r\nNumber of observations: 90, Error degrees of freedom: 86\r\nRoot Mean Squared Error: 0.227\r\nR-Squared: 0.989,  Adjusted R-Squared 0.989\r\nF-statistic vs. constant model: 2.58e+03, p-value = 4.36e-84\r\n<\/pre><h3>Nonlinear regression: Work with the resulting model<a name=\"21\"><\/a><\/h3>\r\n   <p>Once again, the output from the regression analysis is an object which we can use for analysis.  
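For instance, prediction should work much as it did in the linear case; here's a minimal sketch (the query points below are made up purely for illustration):\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">New_Points = [pi\/2; pi];   % hypothetical query points\r\nSine_Predictions = predict(myFit3, New_Points)<\/pre><p>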
For example:<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">figure\r\nscatter(X,Y)\r\nhold <span style=\"color: #A020F0\">on<\/span>\r\nplot(X, myFit3.Fitted, <span style=\"color: #A020F0\">'r'<\/span>)<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/385\/fit12a_06.png\"> <h3>Nonlinear regression:  Alternative ways to specify the regression model<a name=\"22\"><\/a><\/h3>\r\n   <p>One last point to be aware of:  The syntax that I have been using to specify regression models is based on \"Wilkinson's notation\".\r\n       This is a standard syntax that is commonly used in Statistics.  If you prefer, you also have the option to specify your model\r\n      using anonymous functions.\r\n   <\/p>\r\n   <p>For example, that command could have been written as<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">myFit4 = NonLinearModel.fit(X,Y, @(b,x)(b(1) + b(2)*sin(b(3)*x + b(4))), [11, 2.5, 1.1, 5.5])<\/pre><pre style=\"font-style:oblique\">myFit4 = \r\nNonlinear regression model:\r\n    y ~ (b1 + b2*sin(b3*x + b4))\r\n\r\nEstimated Coefficients:\r\n          Estimate    SE           tStat     pValue     \r\n    b1     9.9751       0.02469    404.02    9.0543e-143\r\n    b2     2.9629      0.033784    87.703     6.4929e-86\r\n    b3    0.99787     0.0021908    455.47     3.029e-147\r\n    b4     5.0141      0.023119    216.88    1.4749e-119\r\n\r\nNumber of observations: 90, Error degrees of freedom: 86\r\nRoot Mean Squared Error: 0.227\r\nR-Squared: 0.989,  Adjusted R-Squared 0.989\r\nF-statistic vs. constant model: 2.58e+03, p-value = 4.36e-84\r\n<\/pre><h3>Conclusion<a name=\"23\"><\/a><\/h3>\r\n   <p>I've been working with Statistics Toolbox for close to five years now. 
Other than the introduction of dataset arrays a few\r\n      years back, nothing has gotten me nearly as excited as the release of these new regression capabilities.  So, it seemed fitting\r\n      to conclude with an example that combines dataset arrays and the new regression objects.\r\n   <\/p>\r\n   <p>Tell me what you think is going on in the following example.  (Extra credit if you can work in the expression \"dummy variable\")<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">load <span style=\"color: #A020F0\">carsmall<\/span>\r\n\r\nds = dataset(MPG,Weight);\r\nds.Year = ordinal(Model_Year);\r\n\r\nmdl = LinearModel.fit(ds,<span style=\"color: #A020F0\">'MPG ~ Year + Weight^2'<\/span>)<\/pre><pre style=\"font-style:oblique\">mdl = \r\nLinear regression model:\r\n    MPG ~ 1 + Weight + Year + Weight^2\r\n\r\nEstimated Coefficients:\r\n                   Estimate      SE            tStat      pValue    \r\n    (Intercept)        54.206        4.7117     11.505    2.6648e-19\r\n    Weight          -0.016404     0.0031249    -5.2493    1.0283e-06\r\n    Year_76            2.0887       0.71491     2.9215     0.0044137\r\n    Year_82            8.1864       0.81531     10.041    2.6364e-16\r\n    Weight^2       1.5573e-06    4.9454e-07      3.149     0.0022303\r\n\r\nNumber of observations: 94, Error degrees of freedom: 89\r\nRoot Mean Squared Error: 2.78\r\nR-squared: 0.885,  Adjusted R-Squared 0.88\r\nF-statistic vs. 
constant model: 172, p-value = 5.52e-41\r\n<\/pre><p>Post your answers <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=385#respond\">here<\/a>.\r\n   <\/p><script language=\"JavaScript\">\r\n<!--\r\n\r\n    function grabCode_3fa6aab696514fcb80b3158f689b4e51() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='3fa6aab696514fcb80b3158f689b4e51 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 3fa6aab696514fcb80b3158f689b4e51';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        author = 'Richard Willey';\r\n        copyright = 'Copyright 2012 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add author and copyright lines at the bottom if specified.\r\n        if ((author.length > 0) || (copyright.length > 0)) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (author.length > 0) {\r\n                d.writeln('% _' + author + '_');\r\n            }\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n      \r\n      d.title = title + ' (MATLAB code)';\r\n      d.close();\r\n      }   \r\n      \r\n-->\r\n<\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_3fa6aab696514fcb80b3158f689b4e51()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n            the MATLAB code \r\n            <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; 7.14<br><\/p>\r\n<\/div>\r\n<!--\r\n3fa6aab696514fcb80b3158f689b4e51 ##### SOURCE BEGIN #####\r\n%% New Regression Capabilities in Release 2012A\r\n% _This week Richard Willey from technical marketing will be guest blogging\r\n% about new regression capabilities shipping with the 12a <https:\/\/www.mathworks.com\/products\/statistics\/ Statistics Toolbox> release._\r\n \r\n\r\n\r\n%%\r\n% The 12a release of Statistics Toolbox includes new functions for \r\n%\r\n% * Linear regression\r\n% * Nonlinear regression \r\n% * Logistic regression (and other types of generalized linear models)\r\n%\r\n% These regression techniques aren\u00e2\u20ac\u2122t new to Statistics Toolbox.  
What is\r\n% new is that MathWorks addded a wide set of support functions that \r\n% simplify common analysis tasks like plotting, outlier detection, \r\n% generating predictions, performing stepwise regression, applying robust \r\n% regression...\r\n%\r\n% We'll start with a simple example using linear regression.\r\n \r\n \r\n%% Create a dataset\r\n%\r\n% I'm going to generate a basic dataset in which the relationship between X\r\n% and Y is modeled by a straight line (Y = mX + B) and add in some normally \r\n% distributed noise. Next, I'll generate a scatter plot showing the \r\n% relationship between X and Y\r\n \r\nclear all\r\nclc\r\nrng(1998);\r\n \r\nX = linspace(1,100,50);\r\nX = X';\r\nY = 7*X + 50 + 30*randn(50,1);\r\nNew_X = 100 * rand(10,1);\r\n \r\nscatter(X,Y, '.')\r\n \r\n%% Use linear regression to model Y as a function of X\r\n%\r\n% Looking at the data, we can see a clear linear relationship between X \r\n% and Y.  I'm going to use the new LinearModel function to model\r\n% Y as a function of X.  The output from this function will be stored as an\r\n% object named \"myFit\" which is displayed on the screen.\r\n \r\nmyFit = LinearModel.fit(X,Y)\r\n\r\n%%\r\n% The first line shows the linear regression model.  When perform a\r\n% regression we need to specify a model that describes the relationship\r\n% between our variables.  By default, LinearModel assumes that you want to\r\n% model the relationship as a straight line with an intercept term.  The\r\n% expression \"y ~ 1 + x1\" describes this model.  Formally, this expression \r\n% translates as \"Y is modeled as a linear function which includes an \r\n% intercept and a variable\".  Once again note that we are representing a \r\n% model of the form Y = mX + B...\r\n%\r\n% The next block of text includes estimates for the coefficients, along\r\n% with basic information regarding the reliability of those estimates.  
\r\n%\r\n% Finally, we have basic information about the goodness-of-fit including\r\n% the R-square, the adjusted R-square and the Root Mean Squared Error.\r\n \r\n%% Use myFit for analysis\r\n%\r\n% Earlier, I mentioned that the new regression functions include a wide\r\n% variety of support functions that automate different analysis tasks.\r\n% Let's look at some them.\r\n%\r\n% First, lets generate a plot that we can use to evaluate the quality of\r\n% the resulting fit.  We'll do so by applying the standard MATLAB plot\r\n% command to \"myFit\".\r\n \r\nplot(myFit)\r\n\r\n%%\r\n% Notice that this simple command creates a plot with a wealth of\r\n% information including\r\n%\r\n% * A scatter plot of the original dataset\r\n% * A line showing our fit\r\n% * Confidence intervals for the fit\r\n%\r\n% MATLAB has also automatically labelled our axes and added a legend.\r\n \r\n%% Look for patterns in the residuals\r\n%\r\n% Alternatively, lets assume that we wanted to see whether there was any\r\n% pattern to the residuals.  (A noticeable pattern to the residuals might\r\n% suggest that our model is to simple and that it failed to capture a real\r\n% work trend in the data set.  This technique can also be used to check and\r\n% see whether the noise component is constant across the dataset).  \r\n%\r\n% Here, I'll pass \"myFit\" to the new \"plotResiduals\" method and tell\r\n% plotResiduals to plot the residuals versus the fitted values.\r\n \r\nfigure\r\nplotResiduals(myFit, 'fitted') \r\n\r\n%%\r\n% My plot looks like random noise - which in this case is a very good\r\n% thing.  \r\n \r\n%% Look for autocorrelation in the residuals\r\n%\r\n% Autocorrelation in my data set could also throw off the quality \r\n% of my fit.  The following command will modify the residual plot by \r\n% plotting residuals versus lagged residuals.\r\n \r\nfigure\r\nplotResiduals(myFit, 'lagged')\r\n \r\n%% Look for outliers\r\n%\r\n% Suppose that I wanted to check for outliers...  
We also have a plot for\r\n% that.\r\n \r\nfigure\r\nplotDiagnostics(myFit, 'cookd')\r\n\r\n%%\r\n% Cook's Distance is a metric that is commonly used to see whether a\r\n% dataset contains any outliers.  For any given data point, Cook's Distance\r\n% is calculated by performing a brand new regression that excludes that\r\n% data point.  Cook's distance measures how much the shape of the curve\r\n% changes between the two fits.  If the curve moves by a large amount, that\r\n% data point has a great deal of influence on the model and might very well\r\n% be an outlier.  \r\n%\r\n% * The red crosses show the Cook's Distance for each point in the data set.  \r\n% * The horizontal line shows \"Three times the average Cook's Distance for\r\n% all the points in the data set\".  Data points whose Cook's Distance is \r\n% greater than three times the mean are often considered possible outliers.\r\n%\r\n% In this example, non of our data points look as if they are outliers. \r\n \r\n%% Use the resulting model for prediction\r\n%\r\n% Last, but not least, lets assume that we wanted to use our model for\r\n% prediction.  This is as easy as applying the \"predict\" method.\r\n \r\nPredictions = predict(myFit, New_X)\r\n \r\n%% Discover the full set of methods available with regression objects\r\n%\r\n% I hope that you agree that all these built in plots and analysis routines\r\n% represent a significant improvement in usability.  However, if you're\r\n% anything like me, you immediate reaction is going to be \"Great, you've\r\n% built a lot of nice stuff, however, how do you expect me to find out\r\n% about this?\"\r\n%\r\n% What I'd like to do now is show you a couple of simple tricks that you\r\n% can use to discover all the new cabilities that we've added.  The first\r\n% trick is to recognize that \"myFit\" is an object and that objects have\r\n% methods associated with them.  
All of the commands that we've used so far,\r\n% like "plot", "plotResiduals", and "predict", are methods of the\r\n% LinearModel object.\r\n%\r\n% Any time that I'm working with one of the built-in objects that ship with\r\n% MATLAB, my first step is to inspect the full set of methods that ship with\r\n% that object.  This is as easy as typing methods(myFit) at the command\r\n% line.  I can use this to immediately discover all the built-in \r\n% capabilities that ship with the object. If one of those options catches\r\n% my eye, I can use the help system to get more information.\r\n \r\nmethods(myFit) \r\n \r\n%% Discover the full set of information included in the regression object\r\n%\r\n% Here's another really useful trick to learn about the new regression\r\n% objects.  You can use the MATLAB Variable Editor to walk through the\r\n% object and see all the information that is available.  \r\n%\r\n% You should have an object named "myFit" in the MATLAB workspace.  Double\r\n% clicking on the object will open it in the Variable Editor. \r\n \r\n%% Formulas\r\n%\r\n% At the start of this post there was a brief introduction to\r\n% "formulas".  I'd like to provide a bit more information about them.\r\n% Regression analysis requires the ability to\r\n% specify a model that describes the relationship between your predictors\r\n% and your response variables.\r\n%\r\n% Let's change our initial example such that we're working with a high\r\n% order polynomial rather than a straight line.  I'm also going to change\r\n% this from a curve fitting problem to a surface fitting problem.\r\n \r\nX1 = 100 * randn(100,1);\r\nX2 = 100 * rand(100,1);\r\nX = [X1, X2];\r\nY = 3*X1.^2 + 5*X1.*X2 + 7*X2.^2 + 9*X1 + 11*X2 + 30 + 100*randn(100,1); \r\nmyFit2 = LinearModel.fit(X,Y)\r\n\r\n%%\r\n% Let's take a look at the output from this example.  
We can see, almost\r\n% immediately, that something has gone wrong with our fit.\r\n%\r\n% * The R^2 value is pretty bad\r\n% * The regression coefficients are nowhere near the ones we specified when\r\n% we created the dataset\r\n%\r\n% If we look at the line that describes the linear regression model we can \r\n% see what went wrong.  By default, LinearModel will fit a plane to the\r\n% dataset.  Here, however, the relationship between X and Y is modelled as\r\n% a high order polynomial.  We need to pass this additional piece of\r\n% information to "LinearModel".\r\n \r\n%% Modeling a high order polynomial (Option 1)\r\n%\r\n% Here are a couple of different ways that I can use LinearModel to model a\r\n% high order polynomial.  The first option is to write out the formula by\r\n% hand.\r\n \r\nmyFit2 = LinearModel.fit(X,Y, 'y ~ 1 + x1^2 + x2^2 + x1:x2 + x1 + x2')\r\n \r\n%% Modeling a high order polynomial (Option 2)\r\n%\r\n% Alternatively, I can simply use the string "poly22" to indicate a\r\n% second order polynomial in both X1 and X2 and automatically generate all\r\n% the appropriate terms and cross terms.\r\n \r\nmyFit2 = LinearModel.fit(X, Y, 'poly22')\r\n \r\n%% Nonlinear regression:  Generate our data\r\n%\r\n% Let's consider a nonlinear regression example.  This time around,\r\n% we'll work with a sine curve.  The equation for a sine curve is governed\r\n% by four key parameters.\r\n%\r\n% # The frequency\r\n% # The amplitude\r\n% # The vertical shift\r\n% # The phase shift\r\n%\r\n% We'll start by generating our dataset\r\n \r\nX = linspace(0, 6*pi, 90);\r\nX = X';\r\nY = 10 + 3*(sin(1*X + 5)) + .2*randn(90,1);\r\n \r\n%% Nonlinear regression:  Generate a fit\r\n%\r\n% Next, we'll use the NonLinearModel function to perform a nonlinear\r\n% regression.  
Here we need to \r\n%\r\n% # Specify a formula that describes the relationship between X and Y\r\n% # Provide some reasonable starting conditions for the optimization\r\n% solver\r\n \r\nmyFit3 = NonLinearModel.fit(X,Y, 'y ~ b0 + b1*sin(b2*x + b3)', [11, 2.5, 1.1, 5.5])\r\n \r\n%% Nonlinear regression: Work with the resulting model\r\n%\r\n% Once again, the output from the regression analysis is an object which we\r\n% can use for analysis.  For example:\r\n%\r\n \r\nfigure\r\nscatter(X,Y)\r\nhold on\r\nplot(X, myFit3.Fitted, 'r')\r\n \r\n%% Nonlinear regression:  Alternative ways to specify the regression model\r\n%\r\n% One last point to be aware of:  the syntax that I have been using to\r\n% specify regression models is based on "Wilkinson's notation".  This is a\r\n% standard syntax that is commonly used in statistics.  If you prefer, you\r\n% also have the option to specify your model using anonymous functions.\r\n%\r\n% For example, that command could have been written as\r\n \r\nmyFit4 = NonLinearModel.fit(X,Y, @(b,x)(b(1) + b(2)*sin(b(3)*x + b(4))), [11, 2.5, 1.1, 5.5])\r\n \r\n%% Conclusion\r\n%\r\n% I've been working with Statistics Toolbox for close to five years now.\r\n% Other than the introduction of dataset arrays a few years back, nothing\r\n% has gotten me nearly as excited as the release of these new regression\r\n% capabilities.  So, it seemed fitting to conclude with an example that\r\n% combines dataset arrays and the new regression objects.\r\n%\r\n% Tell me what you think is going on in the following example.  
(Extra \r\n% credit if you can work in the expression \"dummy variable\")\r\n% \r\nload carsmall\r\n \r\nds = dataset(MPG,Weight);\r\nds.Year = ordinal(Model_Year);\r\n \r\nmdl = LinearModel.fit(ds,'MPG ~ Year + Weight^2')\r\n\r\n%%\r\n% Post your answers <https:\/\/blogs.mathworks.com\/loren\/?p=385#respond\r\n% here>.\r\n\r\n\r\n\r\n\r\n\r\n\r\n##### SOURCE END ##### 3fa6aab696514fcb80b3158f689b4e51\r\n-->","protected":false},"excerpt":{"rendered":"<p>\r\n   \r\n      This week Richard Willey from technical marketing will be guest blogging about new regression capabilities shipping with the\r\n            12a Statistics Toolbox release.\r\n   \r\n  ... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2012\/03\/16\/new-regression-capabilities-in-release-2012a\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[47,48],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/385"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=385"}],"version-history":[{"count":5,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/385\/revisions"}],"predecessor-version":[{"id":390,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/385\/revisions\/390"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=385"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=385"},{"taxonomy":"post_tag",
"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=385"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}