Loren on the Art of MATLAB

Note

Loren on the Art of MATLAB has been archived and will not be updated.

A Way to Account for Missing Data

MATLAB has had the concept of Not-a-Number, also known as NaN, for quite some time. Following the IEEE 754 Standard for Binary Floating-Point Arithmetic, some floating-point calculations result in NaN, for example, 0/0. You can also use NaN values as placeholders in numeric arrays, for example to denote missing data. If you do so, how do you operate on these arrays and get answers that account for the NaN values as missing? I'll show an example here.
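Before working with a full data set, it may help to see how NaN behaves on its own. The following is a minimal sketch; the variable names are just for illustration.

```matlab
% NaN arises from indeterminate floating-point operations such as 0/0.
x = 0/0;

% NaN never compares equal to anything, including itself,
% so use isnan rather than == to detect it.
sameAsItself = (x == x);   % false
detected     = isnan(x);   % true

% Arithmetic involving NaN yields NaN, which is why a single
% missing value contaminates an entire sum.
v     = [1 2 NaN 4];
total = sum(v);            % NaN
```

This propagation is the reason a plain sum or mean over data with placeholders cannot be used directly.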

 

Sample Data Set

Let's create a dataset that has some missing values.

m = 10;
n = 3;
data = randn(m,n);
missing = abs(data) > 1.2;
data(missing) = NaN
data =
   -0.3999       NaN   -1.0106
    0.6900    0.2573    0.6145
    0.8156   -1.0565    0.5077
    0.7119       NaN       NaN
       NaN   -0.8051    0.5913
    0.6686    0.5287   -0.6436
    1.1908    0.2193    0.3803
       NaN   -0.9219   -1.0091
   -0.0198       NaN   -0.0195
   -0.1567   -0.0592   -0.0482

Calculating the Column Means

Now let's calculate the mean of the data, columnwise.

meanc = sum(data)/m
meanc =
   NaN   NaN   NaN

Assuming NaN indicates missing values, the mean that we've just calculated isn't very useful since the NaN values propagate into the mean.

Calculating the Column Means Accounting for NaN Values

Now let's calculate the mean while disregarding the missing values. To do so, we first need to find those values. We will do this using logical indexing, a useful concept in MATLAB. We'll generate a matrix of logical values, i.e., true and false, with true marking the locations where our data does not contain NaN.

notNaN = ~isnan(data)
notNaN =
     1     0     1
     1     1     1
     1     1     1
     1     0     0
     0     1     1
     1     1     1
     1     1     1
     0     1     1
     1     0     1
     1     1     1
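To make the idea of logical indexing concrete, here is a small sketch on a single column. The values in col are made up for illustration and are not taken from the data set above.

```matlab
% Small hypothetical column with two missing entries.
col = [0.5; NaN; -1.5; NaN; 4.0];

keep  = ~isnan(col);   % logical mask: true where a value is present
valid = col(keep);     % logical indexing keeps only the masked elements

% Mean of the three remaining values: (0.5 - 1.5 + 4.0) / 3 = 1
colMeanValid = mean(valid);
```

Extracting the valid values like this works well for one column at a time. For a whole matrix, columns may have different numbers of valid entries, which is why the steps in this post count and sum instead.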

Next we find out how many entries in each column are legitimate data values.

howMany = sum(notNaN)
howMany =
     8     7     9

We replace the missing data values with 0 so that they contribute nothing to the column sums.

data(~notNaN) = 0
data =
   -0.3999         0   -1.0106
    0.6900    0.2573    0.6145
    0.8156   -1.0565    0.5077
    0.7119         0         0
         0   -0.8051    0.5913
    0.6686    0.5287   -0.6436
    1.1908    0.2193    0.3803
         0   -0.9219   -1.0091
   -0.0198         0   -0.0195
   -0.1567   -0.0592   -0.0482

Next we sum those values.

columnTot = sum(data)
columnTot =
    3.5006   -1.8373   -0.6373

And finally we compute the column means.

colMean = columnTot ./ howMany
colMean =
    0.4376   -0.2625   -0.0708
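The steps above can be collected into a few lines. This is a sketch on a small made-up matrix rather than the random data used in this post; it works on a copy (the name filled is my own) so the original NaN placeholders are preserved, and it includes a column that is entirely missing to show what happens in that case.

```matlab
% Hypothetical data; the last column has no valid entries at all.
data = [  1  NaN    3  NaN
          2    4  NaN  NaN
        NaN    6    9  NaN
          3    8  NaN  NaN ];

notNaN  = ~isnan(data);      % true where a value is present
howMany = sum(notNaN);       % valid entries per column: [3 3 2 0]

filled = data;               % work on a copy, leave data untouched
filled(~notNaN) = 0;         % zeros add nothing to the sums

colMean = sum(filled) ./ howMany;   % [2 6 6 NaN]
```

The all-missing column yields 0/0, which is NaN. That seems like a reasonable answer here, since there is no data from which to compute a mean.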

Generalizing to Other Dimensions

Statistics Toolbox provides the function nanmean, which does what we've just stepped through and also lets you choose the dimension along which to calculate the mean. In addition, the toolbox includes a suite of related functions for dealing with missing data.
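If you don't have the toolbox, the same approach extends to other dimensions by passing a dimension argument to sum. The following is only a sketch of that idea on made-up data, not the toolbox implementation of nanmean; the variable dim is an assumption introduced for illustration.

```matlab
% Hypothetical 2-by-3 data set with one missing value per row.
data = [  1  NaN  3
        NaN    5  7 ];

dim = 2;                     % 1 = down the columns, 2 = across the rows

notNaN = ~isnan(data);
filled = data;
filled(~notNaN) = 0;

% Divide the sum of the valid values by their count along dim.
meanAlongDim = sum(filled, dim) ./ sum(notNaN, dim);   % [2; 6]
```

Setting dim = 1 instead reproduces the columnwise calculation from earlier in this post.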

Missing Any Data Yourself?

Do you work with data sets that have gaps or missing data? How do you handle them? Post your thoughts here.

Published with MATLAB® 7.5

