{"id":3211,"date":"2012-02-10T10:05:51","date_gmt":"2012-02-10T15:05:51","guid":{"rendered":"https:\/\/blogs.mathworks.com\/pick\/?p=3211"},"modified":"2012-02-10T10:05:51","modified_gmt":"2012-02-10T15:05:51","slug":"finding-the-best","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/pick\/2012\/02\/10\/finding-the-best\/","title":{"rendered":"Finding the best distribution that fits the data"},"content":{"rendered":"<div xmlns:mwsh=\"https:\/\/www.mathworks.com\/namespace\/mcode\/v1\/syntaxhighlight.dtd\" class=\"content\">\r\n   <p><a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/15007\">Jiro<\/a>'s pick this week is <a href=\"\"><tt>allfitdist<\/tt><\/a> by <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/84146\">Mike Sheppard<\/a>.\r\n   <\/p>\r\n   <p>As an application engineer, I go out and deliver seminars on various topics (check out some upcoming <a href=\"https:\/\/www.mathworks.com\/company\/events\/seminars.html\">events<\/a>!), and one of the topics that seem to drum up a lot of interest is data modeling\/fitting. It's a very broad topic and spans\r\n      pretty much all industries. People are always trying to model some phenomenon so that they can use the model to make predictions,\r\n      understand its characteristics, or optimize it. The techniques people can use vary based on what they are trying to model.\r\n   <\/p>\r\n   <p>Probably the most common technique is <a href=\"http:\/\/en.wikipedia.org\/wiki\/Parametric_model\">parametric modeling<\/a>, where you know the form (equation) of the model. There are different ways of doing this in MATLAB, including commands like\r\n      <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011b\/techdoc\/ref\/polyfit.html\"><tt>polyfit<\/tt><\/a> and the <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011b\/techdoc\/ref\/mldivide.html\">backslash operator<\/a>. 
Many other approaches are available through toolboxes such as <a href=\"https:\/\/www.mathworks.com\/products\/curvefitting\/\">Curve Fitting Toolbox<\/a>, <a href=\"https:\/\/www.mathworks.com\/products\/statistics\/\">Statistics Toolbox<\/a>, and <a href=\"https:\/\/www.mathworks.com\/products\/optimization\/\">Optimization Toolbox<\/a>.\r\n   <\/p>\r\n   <p>During one of my seminars on these modeling techniques, a user came up to me and asked whether it was possible to get an equation\r\n      just by providing some data (inputs and outputs). This is not possible without some assumptions about the model, but I hear this\r\n      question from time to time. When I dig in a little, it turns out that, most of the time, people have some idea of the\r\n      form of the model, such as a power series. But if they truly want a black-box model, there are plenty of techniques\r\n      for that, such as <a href=\"http:\/\/en.wikipedia.org\/wiki\/Decision_tree_learning\">decision tree learning<\/a> (1), <a href=\"http:\/\/en.wikipedia.org\/wiki\/Artificial_neural_network\">artificial neural networks<\/a> (2), and <a href=\"http:\/\/en.wikipedia.org\/wiki\/System_identification\">system identification<\/a> (3).\r\n   <\/p>\r\n   <p>Back to the story... The question from this user at the seminar generated a healthy discussion back at the office on how to\r\n      address this type of question. The key issue is that a virtually infinite number of equations could describe a data\r\n      set. Without any constraints on the form, it's impossible to return a single equation. But then one of my colleagues pointed\r\n      out that this type of question may be more reasonable when it is about distribution fitting. The scope is much smaller, and\r\n      there may be a finite set of candidate distributions that could be tested.\r\n   <\/p>\r\n   <p>This is where Mike's <tt>allfitdist<\/tt> comes into play. 
Statistics Toolbox supports a long list of <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011b\/toolbox\/stats\/bqt29ct.html\">distributions<\/a>, both parametric and nonparametric. <tt>allfitdist<\/tt> fits all valid parametric distributions to the data and sorts them by a metric, such as the Bayesian information criterion (BIC), that you can use to compare the goodness of\r\n      fit.\r\n   <\/p>\r\n   <p>Here's an example of finding the best-fitting distribution for a random data set drawn from a normal distribution\r\n      (mu=5, sigma=3), treating the underlying distribution as unknown.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\"><span style=\"color: #228B22\">% Create a normally distributed (mu: 5, sigma: 3) random data set<\/span>\r\nx = normrnd(5, 3, 1e4, 1);\r\n\r\n<span style=\"color: #228B22\">% Compute and plot results. The results are sorted by the Bayesian information<\/span>\r\n<span style=\"color: #228B22\">% criterion (BIC).<\/span>\r\n[D, PD] = allfitdist(x, <span style=\"color: #A020F0\">'PDF'<\/span>);<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/pick\/jiro\/potw_allfitdist\/potw_allfitdist_01.png\"> <p>And the best fit is...<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">D(1)<\/pre><pre style=\"font-style:oblique\">ans = \r\n            DistName: 'normal'\r\n               NLogL: 2.5148e+004\r\n                 BIC: 5.0314e+004\r\n                 AIC: 5.0300e+004\r\n                AICc: 5.0300e+004\r\n          ParamNames: {'mu'  'sigma'}\r\n    ParamDescription: {'location'  'scale'}\r\n              Params: [5.0093 2.9918]\r\n             Paramci: [2x2 double]\r\n            ParamCov: [2x2 double]\r\n             Support: [1x1 struct]\r\n<\/pre><p>Notice that it found the normal distribution to be the best fit, with estimated parameters (mu and sigma) close to the actual values.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid 
rgb(200,200,200)\">fprintf(<span style=\"color: #A020F0\">'%10s\\t'<\/span>  , D(1).ParamNames{:}); fprintf(<span style=\"color: #A020F0\">'\\n'<\/span>);\r\nfprintf(<span style=\"color: #A020F0\">'%10.2f\\t'<\/span>, D(1).Params       ); fprintf(<span style=\"color: #A020F0\">'\\n'<\/span>);<\/pre><pre style=\"font-style:oblique\">        mu\t     sigma\t\r\n      5.01\t      2.99\t\r\n<\/pre><p><b>Comments<\/b><\/p>\r\n   <p>The function is very well documented with good examples. This is definitely a must-have for anyone doing statistical analysis\r\n      of real-world data. Give it a try and let us know what you think <a href=\"https:\/\/blogs.mathworks.com\/pick\/?p=3211#respond\">here<\/a> or leave a <a href=\"#comments\">comment<\/a> for Mike.\r\n   <\/p>\r\n   <div>\r\n      <ol>\r\n         <li><a href=\"https:\/\/www.mathworks.com\/products\/statistics\/\">Statistics Toolbox<\/a><\/li>\r\n         <li><a href=\"https:\/\/www.mathworks.com\/products\/neural-network\/\">Neural Network Toolbox<\/a><\/li>\r\n         <li><a href=\"https:\/\/www.mathworks.com\/products\/sysid\/\">System Identification Toolbox<\/a><\/li>\r\n      <\/ol>\r\n   <\/div><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\">Published with MATLAB&reg; 7.13<br><\/p>\r\n<\/div>","protected":false},"excerpt":{"rendered":"<p>\r\n   Jiro's pick this week is allfitdist by Mike Sheppard.\r\n   \r\n   As an application engineer, I go out and deliver seminars on various topics (check out some upcoming events!), and one of the... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/pick\/2012\/02\/10\/finding-the-best\/\">read more >><\/a><\/p>","protected":false},"author":35,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/3211"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/users\/35"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/comments?post=3211"}],"version-history":[{"count":7,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/3211\/revisions"}],"predecessor-version":[{"id":3219,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/3211\/revisions\/3219"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/media?parent=3211"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/categories?post=3211"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/tags?post=3211"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}