Loren on the Art of MATLAB

October 11th, 2007

A Way to Account for Missing Data

MATLAB has had the concept of Not-a-Number, also known as NaN, for quite some time. Following the IEEE 754 Standard for Binary Floating-Point Arithmetic, some floating point calculations result in NaN, for example, 0/0. You can also use NaN values as placeholders in numeric arrays, for example to denote missing data. If you do so, how do you operate on these arrays and get answers that treat those values as missing? I'll show an example here.
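
As a quick illustration of why NaN needs special handling, here is a minimal sketch. The variable names are made up for this example; the point is that NaN never compares equal to anything, so isnan is the way to detect it, and a single NaN is enough to turn a sum into NaN.

```matlab
% 0/0 is an invalid operation under IEEE 754 and yields NaN.
x = 0/0;

% NaN compares unequal to everything, including itself,
% so test for it with isnan rather than ==.
isMissing    = isnan(x);   % true
equalsItself = (x == x);   % false

% One NaN placeholder propagates through ordinary arithmetic.
y = [1 NaN 3];
total = sum(y);            % NaN
```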

Sample Data Set

Let's create a dataset that has some missing values.

m = 10;
n = 3;
data = randn(m,n);
missing = abs(data) > 1.2;
data(missing) = NaN
data =
   -0.3999       NaN   -1.0106
    0.6900    0.2573    0.6145
    0.8156   -1.0565    0.5077
    0.7119       NaN       NaN
       NaN   -0.8051    0.5913
    0.6686    0.5287   -0.6436
    1.1908    0.2193    0.3803
       NaN   -0.9219   -1.0091
   -0.0198       NaN   -0.0195
   -0.1567   -0.0592   -0.0482

Calculating the Column Means

Now let's calculate the mean of the data, columnwise.

meanc = sum(data)/m
meanc =
   NaN   NaN   NaN

Assuming NaN indicates missing values, the mean that we've just calculated isn't very useful since the NaN values propagate into the mean.

Calculating the Column Means Accounting for NaN Values

Now let's try calculating the mean while disregarding the missing values. To do so, we first need to find those values. We will do this with a logical mask, the basis of logical indexing, a useful concept in MATLAB. We'll generate a matrix of logical values (true and false), with true marking the locations in our data that do not hold NaN.

notNaN = ~isnan(data)
notNaN =
     1     0     1
     1     1     1
     1     1     1
     1     0     0
     0     1     1
     1     1     1
     1     1     1
     0     1     1
     1     0     1
     1     1     1
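
For readers new to logical indexing, here is a small sketch on a vector, with made-up values. Indexing with the mask pulls out only the entries where the mask is true. For a single vector that is all you need to get a mean. For a matrix, the columns can keep different numbers of values, so the result of indexing is one long column vector rather than a matrix, which is why the steps that follow count and sum column by column instead.

```matlab
% A vector with two missing entries (illustrative values).
x = [4 NaN 7 NaN 1];

% Logical mask: true where the value is present.
valid = ~isnan(x);

% Logical indexing keeps only the entries where the mask is true.
kept = x(valid);           % [4 7 1]
vecMean = mean(kept);      % 4

% On a matrix, the same indexing returns a single column vector,
% so the column structure is lost.
M = [1 NaN; 2 5];
flat = M(~isnan(M));       % [1; 2; 5]
```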

Next we find out how many in each column are legitimate data values.

howMany = sum(notNaN)
howMany =
     8     7     9

We replace the missing data values with 0.

data(~notNaN) = 0
data =
   -0.3999         0   -1.0106
    0.6900    0.2573    0.6145
    0.8156   -1.0565    0.5077
    0.7119         0         0
         0   -0.8051    0.5913
    0.6686    0.5287   -0.6436
    1.1908    0.2193    0.3803
         0   -0.9219   -1.0091
   -0.0198         0   -0.0195
   -0.1567   -0.0592   -0.0482

Next we sum those values.

columnTot = sum(data)
columnTot =
    3.5006   -1.8373   -0.6373

And finally we compute the column means.

colMean = columnTot ./ howMany
colMean =
    0.4376   -0.2625   -0.0708
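
The steps above can be collected into one short sketch. This version uses a small made-up matrix and works on a copy (called filled here, a name chosen for this example) so that the original data keeps its NaN markers. It also passes the dimension to sum explicitly, which avoids surprises if the data ever has a single row. Note what happens to a column with no valid data at all: the division is 0/0, so the mean comes out as NaN.

```matlab
% Small data set; the middle column is entirely missing.
data = [2 NaN   1;
        4 NaN NaN;
      NaN NaN   5];

notNaN  = ~isnan(data);        % mask of valid entries
howMany = sum(notNaN, 1);      % valid count per column: [2 0 2]

filled = data;                 % copy, so data keeps its NaNs
filled(~notNaN) = 0;           % zeros do not affect the sums

colMean = sum(filled, 1) ./ howMany;   % [3 NaN 3]
```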

Generalizing to Other Dimensions

Statistics Toolbox contains functionality similar to what we've just stepped through with the function nanmean, and allows you to choose which dimension to calculate the mean along. In addition, the toolbox includes a suite of related functions for dealing with missing data.
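
If you don't have the toolbox, the same zero-substitution idea extends to other dimensions by passing the dimension to sum. This is only a sketch of that idea on made-up data, not a reimplementation of nanmean.

```matlab
data = [1 NaN   3;
        2   4 NaN;
      NaN   6   9];

notNaN = ~isnan(data);
filled = data;
filled(~notNaN) = 0;

% Same recipe, with the dimension made explicit.
colMean = sum(filled, 1) ./ sum(notNaN, 1);   % 1-by-3: [1.5 5 6]
rowMean = sum(filled, 2) ./ sum(notNaN, 2);   % 3-by-1: [2; 3; 7.5]
```

In releases from R2015a onward, mean(data, dim, 'omitnan') in base MATLAB gives the same kind of result without the toolbox.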

Missing Any Data Yourself?

Do you work with data sets that have gaps or missing data? How do you handle them? Post your thoughts here.

Published with MATLAB® 7.5

12 Responses to “A Way to Account for Missing Data”

  1. Michel Slivitzky replied on :

    I am constantly working with missing data and using Matlab functions nanmean, nansum and nanstd.

    Some additional functions like nancorrcoef are also available in the newsgroup.

    Why bother substituting zeros?

  2. Loren replied on :

    Michel-

    I substitute zeros so I can do the sums without indexing. The reason is that not every column is guaranteed to have the same number of NaNs.

    –Loren

  3. Michel Slivitzky replied on :

    I do not see why it matters.
    If you want the sum over the columns do nansum over the columns; if you want the rows, nansum over the rows

    Y = nansum(X,dim)

    The only problem is that these functions are available only in the Statistics Toolbox

  4. Loren replied on :

    Michel-

    My point wasn’t to say to not use nansum but to show HOW to do this sort of operation in MATLAB.

    –Loren

  5. Jos vdG replied on :

    Another approach is to use accumarray

    notNaN = ~isnan(data) ;
    [r,c] = find(notNaN) ;
    r(:) = 1 ;
    colSum = accumarray([r c],data(notNaN))
    colProd = accumarray([r c],data(notNaN),[1,size(data,2)],@prod)
    colMean = accumarray([r c],data(notNaN),[1,size(data,2)],@mean)

    etc …

    Jos
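
A variation on the accumarray approach above, as a sketch with made-up data: since every row subscript is set to 1 anyway, the valid values can be grouped by column index alone. The NaN fill value is a choice made for this example. Without it, accumarray fills groups that received no values with 0, which would make an all-missing column look like it has a mean of zero.

```matlab
% The last column is entirely missing.
data = [1 NaN NaN;
        2   4 NaN;
      NaN   6 NaN];

notNaN = ~isnan(data);
[~, c] = find(notNaN);         % column index of each valid entry
vals   = data(notNaN);         % valid values, in the same order as c
nCols  = size(data, 2);

% Group values by column and average each group; columns with no
% values get the fill value NaN instead of the default 0.
colMean = accumarray(c, vals, [nCols 1], @mean, NaN).';   % [1.5 5 NaN]
```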

  6. Duane Hanselman replied on :

    For those looking for a partial solution without the stats toolbox, there is #10235 on the file exchange. It demonstrates what Loren illustrates here for all common stat measures.

  7. Doug Hull replied on :

    I covered the uses of NaN in graphics in a movie on my blog earlier this week:

    http://blogs.mathworks.com/pick/2007/10/08/matlab-basics-video-using-nan-as-placeholder-data-in-graphics/

    Doug

  8. Matt G replied on :

    I run into this exact problem all the time. A few months ago I posted a function in the File Exchange named “ignorenan”. It also uses the accumarray function but handles n-d data and has an input to specify which dimension to operate on. Hopefully someone else may find it useful…

  9. Tim Davis replied on :

    Replacing NaN’s with zeros prior to heavy-duty computation is a good thing (assuming that it still gives you the right answer of course). NaN’s, Inf’s and the like cause the floating-point hardware to slow *way* down. Try this:

    A=rand(2000);
    B=rand(2000);
    C=nan(2000);
    tic; E = A+B ; toc
    tic; D = A+C ; toc

    The first computation takes 0.1 seconds on my Pentium 4 desktop in MATLAB 7.5. The 2nd takes 1.3 seconds. So NaN’s are great when used carefully, but keep an eye on performance if MATLAB seems sluggish when you abuse them.

  10. Chris Rodgers replied on :

    Is a sum over logicals, as in howMany=sum(isnan(DataMatrix)), optimized for speed? It seems inefficient to convert logicals to doubles and then add, especially since it already had to iterate through the entire matrix to do the isnan check.

  11. Loren replied on :

    Chris-

    MATLAB is smart enough to not convert the logicals to doubles in their entirety before doing the summation. You can convince yourself on Windows by watching the Task Manager as you perform the operation on a large enough array.

    –Loren
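
A small sketch of what counting with a logical mask looks like from the outside, using made-up data. It says nothing about how MATLAB implements the summation internally. It only shows that the mask itself is stored as logical, that sum returns an ordinary double count, and that nnz gives the same total when you want one count for the whole array rather than one per column.

```matlab
x = [1 NaN 3 NaN 5 6];

mask = ~isnan(x);              % class logical, one byte per element

countBySum = sum(mask);        % 4, returned as a double
countByNnz = nnz(mask);        % 4, number of true entries

maskClass  = class(mask);          % 'logical'
countClass = class(countBySum);    % 'double'
```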

  12. neuro11 replied on :

    Hi Loren, thanks for the article; it was quite helpful for me. I was handling some data where I needed a NaN replacement, so a nansum-type function was not helpful.


Loren Shure works on design of the MATLAB language at The MathWorks. She writes here about once a week on MATLAB programming and related topics.

These postings are the author's and don't necessarily represent the opinions of The MathWorks.