Loren on the Art of MATLAB

Turn ideas into MATLAB

The Missing Link

Posted by Loren Shure,

Of course the data we collect is always perfect - NOT! Maybe yours is different. What can go wrong? So many things. Instruments drift, web sites go down, power goes out, ... So what can you do if you have gaps in your data, and the analysis you want to perform won't tolerate that condition? You might decide you need to fill in missing values.

We've been working on supplying functionality that makes dealing with missing data easier for a long time, starting with the introduction of NaN values right in the beginning. In a floating point array, NaNs act as placeholders. That's great, but what can you do from there?

Missing Capabilities

Some functions, or variants of them, work differently if your array contains any NaN values, e.g., mean.
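As a small illustration of that point, here is a sketch with made-up sample values. It compares the default behavior of mean with its 'omitnan' variant, and shows that max skips NaN values by default.

```matlab
% Hypothetical sample data with one missing reading
x = [1 2 NaN 4];

m1 = mean(x);              % NaN: a single NaN poisons the default mean
m2 = mean(x, 'omitnan');   % 2.3333: the NaN is left out of the average
mx = max(x);               % 4: max ignores NaN values by default
```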

We first helped you figure out if you have missing values (ismissing), and later added the ability to fill and remove them (fillmissing, rmmissing). More recently, we added the ability to mark missing values (missing), even if you don't know the datatype of the array. This makes it easier to supply NaN, NaT (not a time) values, and the equivalent placeholders for categorical and string arrays, without needing to know which one is appropriate, as may happen with different columns in a table.
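Here is a minimal sketch of those capabilities in order: detecting, filling, removing, and marking missing values. The sample values and variable names are made up for illustration.

```matlab
% Hypothetical sample data with two gaps
x = [1 NaN 3 NaN 5];

tf = ismissing(x);              % logical mask: [0 1 0 1 0]
xf = fillmissing(x, 'linear');  % gaps filled by linear interpolation: [1 2 3 4 5]
xr = rmmissing(x);              % gaps removed: [1 3 5]

% Assigning missing picks the right placeholder for each datatype
d  = [1 2 3];                        d(2)  = missing;  % becomes NaN
s  = ["a" "b" "c"];                  s(2)  = missing;  % becomes <missing>
dt = datetime(2018, 1, 1:3);         dt(2) = missing;  % becomes NaT
c  = categorical({'x', 'y', 'z'});   c(2)  = missing;  % becomes <undefined>
```

Because the same assignment works for every type shown here, code that loops over the variables of a table does not need a separate branch per datatype.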

What Else is Missing?

Do you use the functionality to deal with missing values? If so, tell us how. If not, please tell us what is missing!

You can let us know here.


Published with MATLAB® R2018b

15 Comments

David Barry replied on : 1 of 15
The ability to mark as missing looks like a really useful feature which I didn't know existed so thanks for sharing. Must read those release notes more!
Brad Stiritz replied on : 3 of 15
Hi Loren, very timely post, thank you. I was just talking with a colleague about this topic a few days ago. We would like our tabular objects to be extended as necessary, with "missing". Currently, this doesn't seem to be the default behavior, per example below. Any comments or suggestions appreciated. Thank you for your consideration.
>> T = array2table(magic(2))

T =

  2×2 table

    Var1    Var2
    ____    ____

     1       3  
     4       2  

% Ideally, the following operation would result in T{3,1} having the value "missing"
>> T{3,2} = 1
Warning: The assignment added rows to the table, but did not assign
values to all of the table's existing variables. Those variables are
extended with rows containing default values. 
> In tabular/subsasgnParens (line 415)
  In tabular/subsasgnBraces (line 154)
  In tabular/subsasgn (line 64) 

T =

  3×2 table

    Var1    Var2
    ____    ____

     1       3  
     4       2  
     0       1  
@Brad, I understand what you are asking for and why. And I think it's a great idea. There's not a direct way to do that currently from an existing table. You can do it with detectImportOptions and readtable if you are reading data in, rather than extending it. I can explain what is happening, but without a good workaround for you (yet!).
The expansion happens column by column, and when you grow column one without supplying values, MATLAB inserts 0 because the column is numeric.
This is the same as having
A = magic(2)
A(3,2) = 1;

MATLAB then fills column 1 with a zero. And has forever.
I recommend you place an enhancement request for tables to have a way to fill data after being read in that allows something more than the current datatype default.
@Brad- A perhaps less than exciting workaround might go something like this:
T = array2table(magic(2))
newEntryT = array2table(NaN(1, size(T,2)))
newEntryT{1,2} = 17;  % for the ones where you want to set values.
newEntryT.Properties.VariableNames = T.Properties.VariableNames;
T = [T; newEntryT]
Still trying to figure out how to make it more elegant and how to incorporate some of the missing functionality naturally.
Sean de Wolski replied on : 6 of 15
I'd recommend against growing the table. In recent releases, you can preallocate it with the variable types you need and then use standardizeMissing to turn the default fill value into a missing value.
t = table('size',[3 2],'VariableTypes',{'double', 'string'})
standardizeMissing(t,0)

  3×2 table
    Var1      Var2   
    ____    _________
    NaN     <missing>
    NaN     <missing>
    NaN     <missing>
Rob W replied on : 7 of 15
The NaN is very useful (especially for plotting!), as is being able to omit NaNs from mean/median/stddev/min/max calculations (I think all of those can be told to ignore NaNs). However, when I read in my spacecraft data, the Missing_Constant (or Fill-Value) is often a number, which I then replace with NaNs. Could the functions above be told to treat, for example, -1 (or another user-defined value) as if it were a NaN, so that it is excluded from means, medians, etc.?
The second thing that would be great: knowing when data is missing is useful, but for analysis I often have to put in something so I can still process the data (e.g. with Fast Fourier Transforms), so I end up interpolating over the gap to get a regularly spaced dataset. I usually interpolate to the nearest value, e.g. using interp1 with the 'nearest' option. This is normally fine, but if I have data gaps of days, nearest is a bad choice. What I'd really want is 'nearest if within a threshold of x units, else leave as NaN', e.g. use the value from the nearest record in time, as long as that record is less than 1 hour away.
Brad Stiritz replied on : 8 of 15
Hi Sean, yes I have found as well that growing tabulars via indexing can be expensive in time (as of R2018b). I only mentioned this in my first comment b/c it's occasionally convenient and affordable. So are you suggesting preallocation in all cases, even when the tabular height is unknown at the outset? I have read discussion elsewhere on the site, suggesting to initially build out larger-scale tabular data as a struct vector or cell array, and then convert via xxx2table(). This performs nicely, at least for my needs (50K - 100K rows).
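For readers who have not seen the struct-then-convert pattern mentioned above, here is a minimal sketch. The field names id and value and the row count are hypothetical, and NaN stands in for an entry that was never measured.

```matlab
% Build the records as a struct array first (field names are made up)
rows = struct('id', {}, 'value', {});   % empty struct array with two fields
for k = 1:5
    rows(k).id = k;
    if k == 3
        rows(k).value = NaN;            % NaN marks a record with no measurement
    else
        rows(k).value = 10 * k;
    end
end

% Convert once at the end; rows(:) makes it a column so each element is a row
T = struct2table(rows(:));
```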
Brad Stiritz replied on : 9 of 15
p.s. Loren, there may be an issue with the commenting software. Your blog used to acknowledge submitted comments with "Your post is awaiting moderation." Lately, I haven't been finding any acknowledgement at all. This is a bit disconcerting.
@Brad- I don't know Sean's opinion, but sometimes it's not a big deal to grow something if it's convenient and not your bottleneck. A mixed strategy that generally works ok in MATLAB is to grow whatever you need in chunks - so you don't do so much reallocation. And use a marker so you know when you are done which final rows, if any, are ready to delete. --loren
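A rough sketch of that chunked strategy, under some assumptions: the chunk size, the loop bound, and the row contents are all made up, and a simple row counter plays the role of the marker that says which trailing rows to delete.

```matlab
chunk = 1000;                  % hypothetical chunk size
data  = NaN(chunk, 2);         % NaN rows are allocated but not yet used
n     = 0;                     % marker: number of rows actually filled

for k = 1:2500                 % stand-in for "until the data runs out"
    if n == size(data, 1)
        data = [data; NaN(chunk, 2)]; %#ok<AGROW> grow by a whole chunk at a time
    end
    n = n + 1;
    data(n, :) = [k, k^2];
end

data = data(1:n, :);           % delete the unused rows at the end
T = array2table(data);
```

With these numbers the array is reallocated twice instead of on every new row.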
@Rob- If you know the value for missing, you can substitute it with NaNs after reading, and then proceed. You can't simply declare a different value to be the missing one. As for the gap filling, you could use ismissing, find how long each gap is, and do different things with different gap sizes. Not a 1-liner! --loren
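One possible way to put both suggestions together is sketched below. It is not a built-in one-liner, just an illustration: the times, the readings, the fill value of -1, and the 1 hour threshold are all assumptions. It uses the distance to the nearest good sample rather than the gap length, which matches the "nearest if within a threshold" request.

```matlab
% Hypothetical data: time in hours; -1 is the instrument's fill value
t = [0 0.4 1 1.5 2 4 8 8.5 9];
x = [10 -1 12 13 -1 -1 -1 20 21];

% Step 1: convert the fill value to NaN so NaN-aware functions can skip it
x = standardizeMissing(x, -1);

% Step 2: for every sample, find the index of the nearest good sample
good       = ~ismissing(x);
idxGood    = find(good);
nearestIdx = interp1(t(good), idxGood, t, 'nearest', 'extrap');
dist       = abs(t - t(nearestIdx));   % time distance to that good sample

% Step 3: fill only where the nearest good sample is close enough
maxDist  = 1;                          % threshold in hours (assumed)
fillable = ~good & dist <= maxDist;
filled   = x;
filled(fillable) = x(nearestIdx(fillable));
% filled is [10 10 12 13 13 NaN 20 20 21]; the sample at t = 4 stays NaN
```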
Hyori Pak replied on : 13 of 15
Thank you for introducing the feature. I had not realized it existed in MATLAB. It will be helpful in my daily job. Thanks for sharing!
@Brad- apparently the moderated comment stuff is a consequence of us upgrading to the latest WordPress release. And therefore, the “Your comment is awaiting moderation” confirmation message no longer shows after posting a comment. @Hyori- glad to know this is helpful!