# From struct to dataset

Posted by Loren Shure

When I got to work last Friday, I saw an email discussion, on behalf of a customer, trying to find a good way to add a new field to a struct array. So this post will start with that problem, and then show a different way to collect the same information, in a dataset array.

### Initial struct and New Data

Let's create some information to store in a struct.

names = {'John'; 'Henri'};
ages = {26; 18};
initS = struct('Name', names, 'Age', ages);

Note that the ages data is a cell array. In addition to Name and Age, I have Height information in a numeric, not cell, array.

Heights = [168; 175];

How do I add this information to my struct? What follows is a series of possibilities, definitely not exhaustive!

### First Pass - for loop

Let's start with a for loop. I add Height information to each element of the struct array, one at a time.

S1 = initS;
for index = 1:length(S1)
    S1(index).Height = Heights(index);
end

### Second Pass - arrayfun

I can use arrayfun to remove the loop.

S2 = initS;
F = @(S,h) setfield(S, 'Height', h);
S2 = arrayfun(F, S2, Heights);

### Third Pass - deal

If the data were in a cell array, I could easily distribute it to multiple outputs. Here I store the height data in a cell and deal it out.

S3 = initS;
cH = num2cell(Heights);
[S3.Height] = deal(cH{:});

### Fourth Pass - Comma-separated List

If the data is in a cell array already, I can skip the step with deal and just dish out different cells to different outputs.

S4 = initS;
cH = num2cell(Heights);
[S4.Height] = cH{:};
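
To make the comma-separated list mechanism more concrete, here is a minimal sketch using the same toy data as above. The variable names `S` and `heights` are only illustrative. It shows the list being used in both directions: dishing cell contents out to a field, and gathering a field back into a numeric array.

```matlab
% Toy data mirroring the example in this post.
S  = struct('Name', {'John'; 'Henri'});   % 2-by-1 struct array
cH = num2cell([168; 175]);                % cell array {168; 175}

% cH{:} expands into two separate values, one for each element of S.
[S.Height] = cH{:};

% S.Height is itself a comma-separated list, so square brackets
% gather the values back into a numeric row vector.
heights = [S.Height];
```

The number of values on the right must match the number of struct elements on the left, which is why the numeric array is first converted with num2cell.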

### Same Results?

Let's quickly check that we get the same results with each technique.

allsame = isequal(S1,S2,S3,S4)
allsame =
1


### What's the Data Look Like?

It's hard to look at the data here (in, e.g., S1) because the contents of each struct element are completely at the user's disposal. So I can look at one array element at a time.

S1(1)
ans =
Name: 'John'
Age: 26
Height: 168


Or I can look at all of the data in a single field at once.

[S1.Age]
ans =
26    18


But I don't get to see all of the data in one glance.
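
One possible workaround, sketched here only as an illustration, is to flatten the struct array into a cell array with struct2cell and display that. The names `C`, `header`, and `T` are illustrative, and this assumes every element has the same fields (which is always true for a struct array).

```matlab
% Rebuild the example struct array with all three fields.
S = struct('Name', {'John'; 'Henri'}, 'Age', {26; 18}, 'Height', {168; 175});

% S(:) forces a column, so struct2cell returns a fields-by-elements cell array.
C = struct2cell(S(:));

% Put the field names on top and one row per struct element underneath.
header = fieldnames(S)';
T = [header; C'];
disp(T)
```

This gives a single view of all the records, although the result is only a generic cell array and carries none of the struct's field-based access.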

### Completely Different View

And now for something completely different. I've blogged before about dataset arrays from Statistics Toolbox. Here's another instance where one might be useful. I treat the columns like individual fields, and the rows as individual records. Each column contains data of a single datatype. Here's the data.

names = {'John'; 'Henri'}
ages = [26; 18];
d1 = dataset({names, 'Name'}, {ages, 'Age'})
names =
'John'
'Henri'
d1 =
Name           Age
'John'         26
'Henri'        18


Two things to note here in contrast to using a struct to contain the information. First, the arguments appear in a different order in the two solutions. Second, the numeric data doesn't need to be placed in a cell array for the dataset, making the data management more natural, in my opinion.

Let me make a new dataset with additional data, heights.

d2 = dataset({names, 'Name'}, {[168 ;175] 'Height'})
d2 =
Name           Height
'John'         168
'Henri'        175


### Concatenate dataset Arrays

Now let me collect the original dataset d1 with the new information in d2. Here are some ways to achieve this. First, just use square brackets ([]) as you would for regular array concatenation.

dnew1 = [d1 d2]
dnew1 =
Name           Age    Height
'John'         26     168
'Henri'        18     175


Another way to do this is to add the information in a struct-like way to the original dataset.

dnew2 = d1;
dnew2.Height = [168; 175]
dnew2 =
Name           Age    Height
'John'         26     168
'Henri'        18     175


Now let's make a different dataset with new information, but with the order of the two entries swapped.

d3 = dataset({{'Henri'; 'John'}, 'Name'}, {[175; 168] 'Height'})
d3 =
Name           Height
'Henri'        175
'John'         168


What happens if we try to collect d1 and d3 together into one dataset?

try
    dnew3 = [d1 d3];
catch ExcDataset
    disp(ExcDataset.message)
end
Duplicate variable names with distinct data.


As you can see, I can't just collect them together via concatenation. However, I can combine or join the two datasets correctly.

dnew3 = join(d1,d3,'Name')
dnew3 =
Name           Age    Height
'John'         26     168
'Henri'        18     175


Notice how easily I can see all the data at once here, compared to the struct array.
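
The key-based matching behind a join can be sketched in base MATLAB with ismember. This is only an illustration of the idea of aligning rows by a key, not a description of how join is implemented, and the variable names are illustrative.

```matlab
% First data set: names with ages.
nameA = {'John'; 'Henri'};
ageA  = [26; 18];

% Second data set: the same people in a different order, with heights.
nameB   = {'Henri'; 'John'};
heightB = [175; 168];

% For each name in A, find its row position in B.
[found, loc] = ismember(nameA, nameB);
assert(all(found), 'Every key in A must also appear in B.');

% Reorder B's heights so that they line up with A's rows.
heightForA = heightB(loc);
```

After this step, `ageA` and `heightForA` refer to the same person row by row, which is the property that plain concatenation cannot guarantee when the rows are ordered differently.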

### How Do You Arrange Your Data?

Do you use either of these strategies for arranging your data (struct or dataset arrays)? Or do you do something different? I'd love to hear your experiences here.


Published with MATLAB® 7.8

Duane Hanselman replied on : 1 of 14

FYI: the function dataset is NOT part of basic MATLAB. It is in the statistics toolbox.
Duane

Ben replied on : 2 of 14

The dataset type is extremely useful. Is there any chance of it being included in base MATLAB in the near future?

Loren replied on : 3 of 14

Duane and Ben-

Yes, the dataset array is part of Statistics Toolbox, as stated in the post. I recommend you request it become part of MATLAB (you can do this via the support link on the right side of the blog).

Thanks.
–Loren

Kieran Parsons replied on : 4 of 14

I have also requested this valuable feature be part of Matlab.

One set of free toolboxes that I have used in the past which have similar functionality (in some ways at least) can be found at http://www.bangor.ac.uk/~pss412/matlab_toolboxes.htm (don’t think it is on FileExchange). The utils toolbox has some dataset functions, and the pivottable has some useful dataset display functions. The dataset handling is more basic than Loren’s examples, but it might be useful to some.

Chris Eklund replied on : 5 of 14

Ms. Shure:

In the May 20 post “From struct to dataset”, what is the @ symbol in this line doing?

F = @(S,h) setfield(S, 'Height', h);

Loren replied on : 6 of 14

Chris-

The @ sign is letting me create an anonymous function in MATLAB. I then apply that function to each element in my array. It’s a great way to allow me to create and evaluate a function without using eval. There are some posts on this blog about them (under the category of Function Handles) and good information in the MATLAB documentation as well.

–Loren
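
As a small illustration of the anonymous functions described in this reply, the following sketch uses made-up names (`addOffset`, `words`, `hits`) that do not come from the post. It shows that an anonymous function captures variable values when it is created, and that it can be used to fix an extra argument of a function such as strfind.

```matlab
% An anonymous function captures the value of offset when it is created.
offset = 10;
addOffset = @(x) x + offset;
offset = 100;                 % changing offset later does not affect addOffset
y = addOffset(5);

% Apply an anonymous function to each element of an array.
squares = arrayfun(@(v) v^2, [1 2 3]);

% Fix the second argument of strfind by wrapping it in an anonymous function.
words = {'abc', 'def', 'cab'};
hits = cellfun(@(s) ~isempty(strfind(s, 'ab')), words);
```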

Marcelo replied on : 7 of 14

I have been using a function to import csv files (with headers) as struct:

function data=z(filename)
datum=importdata(filename);
temp=num2cell(datum.data);


Now I’ll try to use datasets, as they seem easy to work with.

Arnaud Amzallag replied on : 8 of 14

I recently discovered datasets in Matlab, and they seem appropriate for the data I handle, that is, genomic annotations: a table with fields name, chromosome, coordinates, etc. However, the dataset type is not so easy to handle, probably because it is not as widely supported by Matlab classic functions as cell arrays. For instance, strfind works on cell arrays but not on datasets. I contributed a short code which converts the dataset to cell array in order to perform a search in the dataset (strfind for datasets, at http://www.mathworks.com/matlabcentral/fileexchange/24690).

Importantly, I would also like to mention that accessing an element of a dataset in a loop is very slow (several minutes for 17000 iterations), whereas it takes less than one second with a cell array. It is probably because the just-in-time compilation doesn’t work with datasets. I think this problem may strongly discourage people from using it, and it would be a big plus to have the JIT compilation working on datasets.

Finally, it would be nice (but not urgent) if more functions were available for datasets. For instance, it would be convenient to be able to search fields with a syntax of the type d1(2,:)==3, or to assign with d(1:3,1:2)=X, as with usual arrays.

Cheers,

Arnaud

Peter Perkins replied on : 9 of 14

1) You’re right, there are not so many methods (so far) that work on a dataset array as a whole. Your example is strfind; let me try to explain the reasoning why strfind _doesn’t_ work, and what you might do instead.

A dataset array is intended to hold variables of different types. So, for example, you can’t add 1 to a dataset array, for the same reason you can’t add 1 to a cell array: addition would make no sense in general because the contents need not be numeric. You could argue that if all of the variables in the array were numeric, you should be able to add 1, analogous to the way various functions recognize a “cell array of strings” as a special case of cell arrays in general:

>> strfind({'abc' 'def' 'ghi'},'abc')
ans =
[1] [] []
>> strfind({'abc' 'def' 'ghi' 1:5},'abc')
??? Error using ==> cell.strfind at 35
If any of the input arguments are cell arrays, the first must be
a cell array of strings and the second must be a character array.

But the dataset array class is just not intended to be a surrogate for a numeric array, or for a cell array of strings in that way.

What you _can_ do, however, is to apply strfind to each variable (or to a subset of variables) in a dataset array using datasetfun, with the burden being on you to make sure that those variables are suitable. For example,

datasetfun(@strfind,ds,{'stringVar1' 'stringVar2' …}, 'uniformOutput',false)

2) You’re right, high frequency access of individual values in a dataset array is slower than for numeric, cell, or structure arrays, and you’ve put your finger on one of the reasons. However, the dataset array class is really designed more with large vectorized operations in mind, operations such as “find the mean height for all subjects over the age of 30”, or “log transform the weights of each subject.” For those kinds of operations, the access time difference from numeric arrays is not an issue.

3) The two examples you cite _can_ be done, just using different kinds of subscripting. The reason why the syntaxes you list _don’t_ work is that parenthesis subscripting in MATLAB preserves type, and the operations you’ve shown mix types, where no automatic conversion exists. However:

d1.Var2(:) == 3 % instead of d1(2,:)==3

and

d{3,2} = 3 % instead of d(3,2) = 3

do work. Admittedly, d{1:3,1:2} = X is not supported. You can write that in two lines as

d.Var1(1:3) = X(:,1);
d.Var2(1:3) = X(:,2);

and perhaps use a loop for a larger number of columns. Or,

d(1:3,1:2) = dataset({X,'Var1','Var2'})

Or, depending on what you have, it may be possible to restructure the array to have a variable with two columns, and rephrase this as

ds.Var(1:3,:) = X

Dear Loren,

Thank you for your informative article. I have a question.

Is there an easy way to find the common elements of two datasets (a kind of ‘intersect’ function based on dataset.Properties.ObsNames)?

My digging in the documents and fiddling with ‘join’ hasn’t produced anything obvious.

Thank you!

Paul

Dear Loren,

I’m back, and I found the simple answer!

for two datasets, ds1 and ds2,

[c ia ib] = intersect(ds1.Properties.ObsNames,ds2.Properties.ObsNames)


gives indexes to the common observations; so ds1(ia,:) and ds2(ib,:) are matched row-by-row and can be concatenated horizontally.

paul
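
The index outputs of intersect can be checked with plain cell arrays of names standing in for the ObsNames of two datasets. The names below are made up for illustration. The second output indexes into the first input and the third output indexes into the second input.

```matlab
% Hypothetical observation names for two data sets.
namesA = {'Henri'; 'John'; 'Maria'};
namesB = {'John'; 'Zoe'; 'Henri'};

% common is sorted; ia indexes into namesA and ib indexes into namesB.
[common, ia, ib] = intersect(namesA, namesB);

% Both selections list the common names in the same order.
matchedA = namesA(ia);
matchedB = namesB(ib);
```

Rows selected with `ia` from the first data set and with `ib` from the second are therefore aligned and can be placed side by side.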

Arnaud Amzallag replied on : 12 of 14

Good to know that

 d1.Var2(:) == 3

works. Actually, this is a fast trick to get a variable as a cell array, and helps solve the problem of using strfind:

strfind(ds.Var1,'a word')


works.

datasetfun(@strfind,ds,{'stringVar1' 'stringVar2' …}, 'uniformOutput',false)

I don’t see how to pass the expression ‘a word’ to strfind as an argument, but it does not matter now that I found the shorter way to use strfind.

About speed issues, I still recommend converting to a cell array first before using the data in a loop.

Thank you for the help,

Arnaud

Djames replied on : 13 of 14