{"id":110,"date":"2007-10-11T15:05:26","date_gmt":"2007-10-11T20:05:26","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/2007\/10\/11\/a-way-to-account-for-missing-data\/"},"modified":"2018-07-25T15:22:35","modified_gmt":"2018-07-25T20:22:35","slug":"a-way-to-account-for-missing-data","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2007\/10\/11\/a-way-to-account-for-missing-data\/","title":{"rendered":"A Way to Account for Missing Data"},"content":{"rendered":"<div class=\"content\">\n<p>MATLAB has had the concept of <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/nan.html\">Not-a-Number<\/a>, also known as <tt>NaN<\/tt>, for quite some time. Following the <a title=\"No Longer Working: https:\/\/en.wikipedia.org\/wiki\/IEEE_754\/\">IEEE 754 Standard for Binary Floating-Point Arithmetic<\/a>, some floating-point calculations result in <tt>NaN<\/tt>, for example, <tt>0\/0<\/tt>. You can also use <tt>NaN<\/tt> values as placeholders in numeric arrays, for example to denote missing data. If you do so, how do you operate on these arrays and get answers that account for them as missing? 
I'll show an example here.<\/p>\n<p>&nbsp;<\/p>\n<h3>Contents<\/h3>\n<div>\n<ul>\n<li><a href=\"#1\">Sample Data Set<\/a><\/li>\n<li><a href=\"#2\">Calculating the Column Means<\/a><\/li>\n<li><a href=\"#4\">Calculating the Column Means Accounting for NaN Values<\/a><\/li>\n<li><a href=\"#9\">Generalizing to Other Dimensions<\/a><\/li>\n<li><a href=\"#10\">Missing Any Data Yourself?<\/a><\/li>\n<\/ul>\n<\/div>\n<h3>Sample Data Set<a name=\"1\"><\/a><\/h3>\n<p>Let's create a dataset that has some missing values.<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">m = 10;\r\nn = 3;\r\ndata = randn(m,n);\r\nmissing = abs(data) &gt; 1.2;\r\ndata(missing) = NaN<\/pre>\n<pre style=\"font-style: oblique;\">data =\r\n   -0.3999       NaN   -1.0106\r\n    0.6900    0.2573    0.6145\r\n    0.8156   -1.0565    0.5077\r\n    0.7119       NaN       NaN\r\n       NaN   -0.8051    0.5913\r\n    0.6686    0.5287   -0.6436\r\n    1.1908    0.2193    0.3803\r\n       NaN   -0.9219   -1.0091\r\n   -0.0198       NaN   -0.0195\r\n   -0.1567   -0.0592   -0.0482\r\n<\/pre>\n<h3>Calculating the Column Means<a name=\"2\"><\/a><\/h3>\n<p>Now let's calculate the mean of the data, columnwise.<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">meanc = sum(data)\/m<\/pre>\n<pre style=\"font-style: oblique;\">meanc =\r\n   NaN   NaN   NaN\r\n<\/pre>\n<p>Assuming <tt>NaN<\/tt> indicates missing values, the mean that we've just calculated isn't very useful since the <tt>NaN<\/tt> values propagate into the mean.<\/p>\n<h3>Calculating the Column Means Accounting for NaN Values<a name=\"4\"><\/a><\/h3>\n<p>Now let's try calculating the mean, while disregarding the missing values. To do so, first we need to <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/find.html\"><tt>find<\/tt><\/a> those values. Actually we will do this using logical indexing, a useful concept in MATLAB. 
We'll generate a matrix with <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/logical.html\"><tt>logical<\/tt><\/a> values, i.e., <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/true.html\"><tt>true<\/tt><\/a> and <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/false.html\"><tt>false<\/tt><\/a>, <tt>true<\/tt> indicating locations where <tt>NaN<\/tt> values do not exist in our data.<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">notNaN = ~isnan(data)<\/pre>\n<pre style=\"font-style: oblique;\">notNaN =\r\n     1     0     1\r\n     1     1     1\r\n     1     1     1\r\n     1     0     0\r\n     0     1     1\r\n     1     1     1\r\n     1     1     1\r\n     0     1     1\r\n     1     0     1\r\n     1     1     1\r\n<\/pre>\n<p>Next we find out how many in each column are legitimate data values.<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">howMany = sum(notNaN)<\/pre>\n<pre style=\"font-style: oblique;\">howMany =\r\n     8     7     9\r\n<\/pre>\n<p>We replace the missing <tt>data<\/tt> values with <tt>0<\/tt>.<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">data(~notNaN) = 0<\/pre>\n<pre style=\"font-style: oblique;\">data =\r\n   -0.3999         0   -1.0106\r\n    0.6900    0.2573    0.6145\r\n    0.8156   -1.0565    0.5077\r\n    0.7119         0         0\r\n         0   -0.8051    0.5913\r\n    0.6686    0.5287   -0.6436\r\n    1.1908    0.2193    0.3803\r\n         0   -0.9219   -1.0091\r\n   -0.0198         0   -0.0195\r\n   -0.1567   -0.0592   -0.0482\r\n<\/pre>\n<p>Next we <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/sum.html\"><tt>sum<\/tt><\/a> those values.<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">columnTot = sum(data)<\/pre>\n<pre style=\"font-style: oblique;\">columnTot =\r\n    3.5006   -1.8373   -0.6373\r\n<\/pre>\n<p>And finally we compute the 
column means.<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">colMean = columnTot .\/ howMany<\/pre>\n<pre style=\"font-style: oblique;\">colMean =\r\n    0.4376   -0.2625   -0.0708\r\n<\/pre>\n<h3>Generalizing to Other Dimensions<a name=\"9\"><\/a><\/h3>\n<p><a href=\"https:\/\/www.mathworks.com\/products\/statistics\/\">Statistics Toolbox<\/a> contains functionality similar to what we've just stepped through with the function <a href=\"https:\/\/www.mathworks.com\/help\/stats\/nanmean.html\"><tt>nanmean<\/tt><\/a>, and allows you to choose which dimension to calculate the mean along. In addition, the toolbox includes a suite of related functions for dealing with missing data.<\/p>\n<h3>Missing Any Data Yourself?<a name=\"10\"><\/a><\/h3>\n<p>Do you work with data sets that have gaps or missing data? How do you handle them? Post your thoughts <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=110#respond\">here<\/a>.<\/p>\n<p><script>\/\/ <![CDATA[\nfunction grabCode_cf8d59115997461d9f3f2888422baec4() {\n        \/\/ Remember the title so we can use it in the new page\n        title = document.title;\n\n        \/\/ Break up these strings so that their presence\n        \/\/ in the Javascript doesn't mess up the search for\n        \/\/ the MATLAB code.\n        t1='cf8d59115997461d9f3f2888422baec4 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\n        t2='##### ' + 'SOURCE END' + ' #####' + ' cf8d59115997461d9f3f2888422baec4';\n    \n        b=document.getElementsByTagName('body')[0];\n        i1=b.innerHTML.indexOf(t1)+t1.length;\n        i2=b.innerHTML.indexOf(t2);\n \n        code_string = b.innerHTML.substring(i1, i2);\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\n\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \n        \/\/ in the XML parser.\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\n        \/\/ doesn't go ahead and substitute the 
less-than character. \n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\n\n        author = 'Loren Shure';\n        copyright = 'Copyright 2007 The MathWorks, Inc.';\n\n        w = window.open();\n        d = w.document;\n        d.write('\n\n\n\n\n\n<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add author and copyright lines at the bottom if specified.\r\n        if ((author.length > 0) || (copyright.length > 0)) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (author.length > 0) {\r\n                d.writeln('% _' + author + '_');\r\n            }\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\n\n\n\n\n\n\n\\n');\n      \n      d.title = title + ' (MATLAB code)';\n      d.close();\n      }\n\/\/ ]]><\/script><\/p>\n<p style=\"text-align: right; font-size: xx-small; font-weight: lighter; font-style: italic; color: gray;\"><a><span style=\"font-size: x-small; font-style: italic;\">Get<br \/>\nthe MATLAB code<br \/>\n<noscript>(requires JavaScript)<\/noscript><\/span><\/a><\/p>\n<p>Published with MATLAB\u00ae 7.5<\/p>\n<\/div>\n<p><!--\ncf8d59115997461d9f3f2888422baec4 ##### SOURCE BEGIN #####\n%% A Way to Account for Missing Data\n% MATLAB has the concept of\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/nan.html Not-a-Number>,\n% also known as |NaN| for quite some time.  Following the\n% <https:\/\/en.wikipedia.org\/wiki\/IEEE_754\/ IEEE 754 Standard for Binary Floating-Point Arithmetic>,\n% some floating point\n% calculations result in |NaN|, for example, |0\/0|.  You can also use them\n% as placeholders in numeric arrays, for example to denote missing data.\n% If you do so, how to you operate on these arrays and get answers that\n% account for them as missing?  
I'll show an example here.\n%% Sample Data Set\n% Let's create a dataset that has some missing values.\nm = 10;\nn = 3;\ndata = randn(m,n);\nmissing = abs(data) > 1.2;\ndata(missing) = NaN\n%% Calculating the Column Means\n% Now let's calculate the mean of the data, columnwise.\nmeanc = sum(data)\/m\n%%\n% Assuming |NaN| indicates missing values, the mean that we've just\n% calculated isn't very useful since the |NaN| values propagate into the\n% mean.\n%% Calculating the Column Means Accounting for NaN Values\n% Now let's try calculating the mean, while disregarding the missing\n% values.  To do so, first we need to\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/find.html |find|>\n% those values.  Actually we will do this using\n% <https:\/\/www.mathworks.com\/access\/helpdesk\/help\/techdoc\/matlab_prog\/f1-85462.html#bq7egb6-1 logical indexing>,\n% a useful concept in MATLAB.  We'll generate a matrix with <https:\/\/www.mathworks.com\/help\/matlab\/ref\/logical.html |logical|>\n% values, i.e., <https:\/\/www.mathworks.com\/help\/matlab\/ref\/true.html |true|>\n% and <https:\/\/www.mathworks.com\/help\/matlab\/ref\/false.html |false|>,\n% |true| indicating locations where |NaN| values do not exist in our data.\nnotNaN = ~isnan(data)\n%%\n% Next we find out how many in each column are legitimate data values.\nhowMany = sum(notNaN)\n%%\n% We replace the missing |data| values with |0|.\ndata(~notNaN) = 0\n%%\n% Next we <https:\/\/www.mathworks.com\/help\/matlab\/ref\/sum.html |sum|>\n% those values.\ncolumnTot = sum(data)\n%%\n% And finally we compute the column means.\ncolMean = columnTot .\/ howMany\n%% Generalizing to Other Dimensions\n% <https:\/\/www.mathworks.com\/products\/statistics\/ Statistics Toolbox>\n% contains functionality similar to what we've just stepped through with\n% the function <https:\/\/www.mathworks.com\/help\/stats\/nanmean.html |nanmean|>, and\n% allows you to choose which dimension to calculate the mean along.  
In\n% addition, the toolbox includes a <https:\/\/www.mathworks.com\/access\/helpdesk\/help\/toolbox\/stats\/bq_w_hm.html#bq_w_ie-5 suite of related functions>\n% for dealing with missing data.\n%% Missing Any Data Yourself?\n% Do you work with data sets that have gaps or missing data?  How do you\n% handle them?  Post your thoughts <https:\/\/blogs.mathworks.com\/loren\/?p=110#respond here>.\n\n##### SOURCE END ##### cf8d59115997461d9f3f2888422baec4\n--><\/p>\n","protected":false},"excerpt":{"rendered":"<p>\nMATLAB has the concept of Not-a-Number, also known as NaN for quite some time. Following the IEEE 754 Standard for Binary Floating-Point Arithmetic, some floating point calculations result in NaN,... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2007\/10\/11\/a-way-to-account-for-missing-data\/\">read more >><\/a><\/p>\n","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[4],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/110"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=110"}],"version-history":[{"count":5,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/110\/revisions"}],"predecessor-version":[{"id":3009,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/110\/revisions\/3009"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=110"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/cate
gories?post=110"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=110"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}