Loren on the Art of MATLAB

From struct to dataset

Posted by Loren Shure,

When I got to work last Friday, I saw an email discussion, on behalf of a customer, trying to find a good way to add a new field to a struct array. So this post will start with that problem, and then show a different way to collect the same information, in a dataset array.

Initial struct and New Data

Let's create some information to store in a struct.

names = {'John'; 'Henri'};
ages = {26; 18};
initS = struct('Name', names, 'Age', ages);

Note that the ages data is a cell array. In addition to Name and Age, I have Height information in a numeric, not cell, array.

Heights = [168; 175];
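
The cell-versus-numeric distinction matters because struct treats the two kinds of input differently. Here is a minimal sketch of that behavior; the variable names sCell and sNum are illustrative and not part of the original example.

```matlab
% Cell array value: struct creates one element per cell (a 2-by-1 struct array).
sCell = struct('Age', {26; 18});
szCell = size(sCell);

% Numeric value: struct creates a single element whose field holds the whole vector.
sNum = struct('Age', [26; 18]);
szNum = size(sNum);

disp(szCell)
disp(szNum)
```

So a numeric array such as Heights has to be converted or distributed in some way before each struct element can get its own value.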

How do I add this information to my struct? What follows are a series of possibilities, definitely not exhaustive!

First Pass - for loop

Let's start with a for loop. I add Height information to each element of the struct array, one at a time.

S1 = initS;
for index = 1:length(S1)
    S1(index).Height = Heights(index);
end

Second Pass - arrayfun

I can use arrayfun to remove the loop.

S2 = initS;
F = @(S,h) setfield(S, 'Height', h);
S2 = arrayfun(F, S2, Heights);

Third Pass - deal

If the data were in a cell array, I could easily distribute it to multiple outputs. Here I store the height data in a cell and deal it out.

S3 = initS;
cH = num2cell(Heights);
[S3.Height] = deal(cH{:});

Fourth Pass - Comma-separated List

If the data is in a cell array already, I can skip the step with deal and just dish out different cells to different outputs.

S4 = initS;
cH = num2cell(Heights);
[S4.Height] = cH{:};
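
To see what the comma-separated list is doing, it may help to expand it by hand. In this sketch (the variables a and b are illustrative only), cH{:} on the right-hand side behaves as if I had typed cH{1}, cH{2}.

```matlab
cH = num2cell([168; 175]);

% The comma-separated list hands one cell's contents to each output,
% which is what deal(cH{1}, cH{2}) would also do.
[a, b] = cH{:};
[c, d] = deal(cH{1}, cH{2});

same = isequal([a b], [c d]);
disp(same)
```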

Same Results?

Let's quickly check that we get the same results with each technique.

allsame = isequal(S1,S2,S3,S4)
allsame =
     1

What's the Data Look Like?

It's hard to look at the data here (in, e.g., S1) because the contents of each struct element are completely at the user's disposal. So I can look at one array element at a time.

S1(1)
ans = 
      Name: 'John'
       Age: 26
    Height: 168

Or I can look at all of the data in a single field at once.

[S1.Age]
ans =
    26    18

But I don't get to see all of the data in one glance.
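
One partial workaround, offered as a sketch rather than a recommendation, is to convert the struct array to a cell array so that every record shows up as a row. The variable names S and C are illustrative, and the struct is rebuilt here so the snippet stands alone.

```matlab
% Rebuild the example struct array (2-by-1, with fields Name, Age, Height).
S = struct('Name', {'John'; 'Henri'}, 'Age', {26; 18}, 'Height', {168; 175});

% struct2cell returns one row per field and one column per struct element,
% so transpose to get one row per record.
C = struct2cell(S);
C = C.';

disp(C)
```

This shows everything at once, but the field names are no longer attached to the columns, so it is a view of the data rather than a better container for it.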

Completely Different View

And now for something completely different. I've blogged before about dataset arrays from Statistics Toolbox. Here's another instance where one might be useful. I treat the columns like individual fields, and the rows as individual records. Each column contains data of a single datatype. Here's the data.

names = {'John'; 'Henri'}
ages = [26; 18];
d1 = dataset({names, 'Name'}, {ages, 'Age'})
names = 
    'John'
    'Henri'
d1 = 
    Name           Age
    'John'         26 
    'Henri'        18 

Two things to note here in contrast to using a struct to contain the information. First, the arguments appear in a different order in the two solutions. Second, the numeric data doesn't need to be placed in a cell array for the dataset, making the data management more natural, in my opinion.

Let me make a new dataset with additional data, heights.

d2 = dataset({names, 'Name'}, {[168; 175], 'Height'})
d2 = 
    Name           Height
    'John'         168   
    'Henri'        175   

Concatenate dataset Arrays

Now let me collect the original dataset d1 with the new information in d2. Here are some ways to achieve this. First, just use square brackets ([]) as you would for regular array concatenation.

dnew1 = [d1 d2]
dnew1 = 
    Name           Age    Height
    'John'         26     168   
    'Henri'        18     175   

Another way to do this is to add the information in a struct-like way to the original dataset.

dnew2 = d1;
dnew2.Height = [168; 175]
dnew2 = 
    Name           Age    Height
    'John'         26     168   
    'Henri'        18     175   

Now let's make a different dataset with new information, but with the order of the 2 entries swapped.

d3 = dataset({{'Henri'; 'John'}, 'Name'}, {[175; 168], 'Height'})
d3 = 
    Name           Height
    'Henri'        175   
    'John'         168   

What happens if we try to collect d1 and d3 together into one dataset?

try
    dnew3 = [d1 d3];
catch ExcDataset
    disp(ExcDataset.message)
end
Duplicate variable names with distinct data.

As you can see, I can't just collect them together via concatenation. However, I can combine or join the two datasets correctly.

dnew3 = join(d1,d3,'Name')
dnew3 = 
    Name           Age    Height
    'John'         26     168   
    'Henri'        18     175   

Notice how easily I can see all the data at once here, compared to the struct array.
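
Conceptually, join lines rows up by the value of the key variable rather than by position. The sketch below mimics that matching with ismember on plain arrays. It is only an illustration of the idea, not a description of how join is implemented, and the names keys1, keys3, heights3, and heightsAligned are made up for this example.

```matlab
keys1    = {'John'; 'Henri'};    % Name column, in the order used by d1
keys3    = {'Henri'; 'John'};    % Name column, in the swapped order used by d3
heights3 = [175; 168];           % Height column that goes with keys3

% For each name in keys1, find the row of keys3 holding the same name.
[found, loc] = ismember(keys1, keys3);

% Reorder the heights so they line up with keys1.
heightsAligned = heights3(loc);
disp(heightsAligned)
```

Because the heights are reordered to follow the first set of keys, John still ends up with 168 and Henri with 175, even though the second set lists them in the opposite order.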

How Do You Arrange Your Data?

Do you use either of these strategies for arranging your data (struct or dataset arrays)? Or do you do something different? I'd love to hear your experiences here.


Published with MATLAB® 7.8

14 Comments

FYI: the function dataset is NOT part of basic MATLAB. It is in the statistics toolbox.
Duane

The dataset type is extremely useful. Is there any chance of it being included in base MATLAB in the near future?

Duane and Ben-

Yes, the dataset array is part of Statistics Toolbox, as stated in the post. I recommend you request it become part of MATLAB (you can do this via the support link on the right side of the blog).

Thanks.
–Loren

I have also requested this valuable feature be part of Matlab.

One set of free toolboxes that I have used in the past which have similar functionality (in some ways at least) can be found at http://www.bangor.ac.uk/~pss412/matlab_toolboxes.htm (don’t think it is on FileExchange). The utils toolbox has some dataset functions, and the pivottable has some useful dataset display functions. The dataset handling is more basic than Loren’s examples, but it might be useful to some.

Ms. Shure:

In the May 20 post “From struct to dataset”, what is the @ symbol in this line doing?

F = @(S,h) setfield(S, 'Height', h);

Chris-

The @ sign is letting me create an anonymous function in MATLAB. I then apply that function to each element in my array. It’s a great way to allow me to create and evaluate a function without using eval. There are some posts on this blog about them (under the category of Function Handles) and good information in the MATLAB documentation as well.

–Loren

I have been using a function to import csv files (with headers) as struct:

function data=z(filename)
datum=importdata(filename);
temp=num2cell(datum.data);
data=cell2struct(temp,datum.colheaders,2);

Now I’ll try to use datasets, as they seem easy to work with.

I recently discovered datasets in Matlab, and they seem appropriate for the data I handle, that is genomic annotations: a table with fields name, chromosome, coordinates, etc. However the dataset type is not so easy to handle, probably because it is not as widely supported by Matlab classic functions as cell arrays. For instance, strfind works on cell arrays but not on datasets. I contributed a short code which converts the dataset to cell array in order to perform a search in the dataset (strfind for datasets, at http://www.mathworks.com/matlabcentral/fileexchange/24690).

Importantly I would like to mention also that calling an element of a dataset in a loop is very slow (several minutes for 17000 iterations) and is less than one second with a cell array. It is probably because the just in time compilation doesn’t work with datasets. I think this problem may strongly discourage people from using it, and it would be a big plus to have the JIT compilation working on datasets.

Finally, it may be nice (but not urgent) if more functions were available for datasets. For instance, it would be convenient to be able to search fields with a syntax of the type d1(2,:)==3, or to assign with d(1:3,1:2)=X, like with usual arrays.

Cheers,

Arnaud

Arnaud, let me try to respond to each of your comments.

1) You’re right, there are not so many methods (so far) that work on a dataset array as a whole. Your example is strfind; let me try to explain the reasoning why strfind _doesn’t_ work, and what you might do instead.

A dataset array is intended to hold variables of different types. So, for example, you can’t add 1 to a dataset array, for the same reason you can’t add 1 to a cell array: addition would make no sense in general because the contents need not be numeric. You could argue that if all of the variables in the array were numeric, you should be able to add 1, analogous to the way various functions recognize a “cell array of strings” as a special case of cell arrays in general:

>> strfind({'abc' 'def' 'ghi'},'abc')
ans =
[1] [] []
>> strfind({'abc' 'def' 'ghi' 1:5},'abc')
??? Error using ==> cell.strfind at 35
If any of the input arguments are cell arrays, the first must be
a cell array of strings and the second must be a character array.

But the dataset array class is just not intended to be a surrogate for a numeric array, or for a cell array of strings in that way.

What you _can_ do, however, is to apply strfind to each variable (or to a subset of variables) in a dataset array using datasetfun, with the burden being on you to make sure that those variables are suitable. For example,

datasetfun(@strfind,ds,{'stringVar1' 'stringVar2' …}, 'uniformOutput',false)

2) You’re right, high frequency access of individual values in a dataset array is slower than for numeric, cell, or structure arrays, and you’ve put your finger on one of the reasons. However, the dataset array class is really designed more with large vectorized operations in mind, operations such as “find the mean height for all subjects over the age of 30”, or “log transform the weights of each subject.” For those kinds of operations, the access time difference from numeric arrays is not an issue.

3) The two examples you cite _can_ be done, just using different kinds of subscripting. The reason why the syntaxes you list _don’t_ work is that parenthesis subscripting in MATLAB preserves type, and the operations you’ve shown mix types, where no automatic conversion exists. However:

d1.Var2(:) == 3 % instead of d1(2,:)==3

and

d{3,2} = 3 % instead of d(3,2) = 3

do work. Admittedly, d{1:3,1:2} = X is not supported. You can write that in two lines as

d.Var1(1:3) = X(:,1);
d.Var2(1:3) = X(:,2);

and perhaps use a loop for a larger number of columns. Or,

d(1:3,1:2) = dataset({X,'Var1','Var2'})

Or, depending on what you have, it may be possible to restructure the array to have a variable with two columns, and rephrase this as

ds.Var(1:3,:) = X

Thanks for your comments; feedback like this is helpful.

Dear Loren,

Thank you for your informative article. I have a question

Is there an easy way to find the common elements of two datasets (a kind of ‘intersect’ function based on dataset.Properties.ObsNames)?

My digging in the documents and fiddling with ‘join’ hasn’t produced anything obvious.

Thank you!

Paul

Dear Loren,

I’m back, and I found the simple answer!

for two datasets, ds1 and ds2,

[c ia ib] = intersect(ds1.Properties.ObsNames,ds2.Properties.ObsNames)

gives indexes to the common observations; so ds1(ia,:) and ds2(ib,:) are matched row-by-row and can be concatenated horizontally.

paul

thank you Peter Perkins for your detailed answer.

Good to know that

 d1.Var2(:) == 3 

works. Actually, this is a fast trick to get a variable as a cell array, and helps solve the problem of using strfind:

strfind(ds.Var1,'a word')

works.

About the command

datasetfun(@strfind,ds,{'stringVar1' 'stringVar2' …}, 'uniformOutput',false)

I don’t see how to pass the expression ‘a word’ to strfind as an argument, but it does not matter now that I found the shorter way to use strfind.

About speed issues, I still recommend converting to a cell array first before using it in a loop.

Thank you for the help,

Arnaud

I was just searching for information about dataset in matlab and found this article.
Does anyone know if it’s possible to construct a nested dataset (dataset of dataset)?
It seems to work with 7.6 but not 7.8….

Thanks a lot for your help

Djames, can you be more specific about what you’re trying to do? There are at least a couple of different things that would fit the description “dataset of dataset”. Thanks.

These postings are the author's and don't necessarily represent the opinions of MathWorks.