Loren on the Art of MATLAB

July 23rd, 2008

Using MATLAB to Grade

Educators use MATLAB a lot. In addition to using MATLAB for research, many professors and instructors use MATLAB for teaching, including demonstrating and explaining concepts, creating class notes, and creating and collecting homework assignments and exams. Today I will show how you might use a dataset array from Statistics Toolbox to grade a set of student results that are already recorded.

What is a dataset Array?

A dataset array is basically a two-dimensional array where each column holds data represented in a single data type, but the columns together may comprise many different data types.

A cell array can do this, but does not enforce the idea that all the elements in a given column must have the same type. To have names with a cell array, you either need to carry around an extra array, or have a special row or column to contain the label.

A scalar structure can similarly perform this service, but it does not enforce the idea that each field must hold the same number of rows. Additionally, a dataset array can have labels that allow you to reference data not just by numeric or logical indexing, but by names as well.
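
To make the comparison concrete, here is a minimal sketch using made-up data. The variables names, Score, and Passed are hypothetical and are not part of the grading example; the sketch assumes Statistics Toolbox is installed.

```matlab
% Hypothetical toy data, not taken from the grading example.
names  = {'Ann'; 'Bob'; 'Cy'};
Score  = [90; 75; 82];
Passed = [true; true; true];

% Cell array: any element can hold any type, so a column can silently
% stop being all-numeric.
c = [names, num2cell(Score), num2cell(Passed)];
c{2,2} = 'oops';                 % allowed; column 2 is now mixed

% dataset array: one type per variable, with row and column names built in.
ds = dataset(Score, Passed, 'ObsNames', names);
bobScore = ds.Score('Bob');      % look up a value by observation name
```

The cell array needs its first column just to carry the labels, while the dataset array keeps them as observation names outside the data.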

Why Use a dataset Array?

You might want to use a dataset array if the data you have is natural to think of as a collection of different but related entities. The relationships between the collections are more constrained (in this case, 1:1) than the flexibility afforded by cells or structs. The example I'm showing here, grading an assignment, shows some of the simpler ways in which you might use a dataset array.

Creating a dataset

I have an assignment with 5 questions, 2 T/F and 3 multiple choice (a:d). The questions have extraordinarily imaginative names.

qnames = {'Q1' 'Q2' 'Q3' 'Q4' 'Q5'};

I have collected the students' results in a sheet of a spreadsheet labeled "students" and the truth in a sheet labeled "truth". I can read each of these directly into a dataset array so the results are available for analysis.

answers = dataset('xlsfile','class answers.xls',...
    'sheet','students','ReadObsNames',true,...
    'ReadVarNames',false,'VarNames',qnames)
answers = 
                   Q1    Q2         Q3    Q4         Q5     
    Chris          0     'a'        0     'c'        'b'    
    Christine      0     'b'        0     'c'        'a'    
    Christopher    1     'a'        0     'c'        'a'    
    Kris           1     'a'        1     'b'        'c'    
    Kristen        1     'a'        0     'd'        'a'    

As you can see, all my students have similar names. In fact, I believe the most common root name for employees at MathWorks is this same root. Each row represents the results for a given student, and the columns represent the results for a given question.

Here are the "real" answers.

truth = dataset('xlsfile','class answers.xls',...
    'sheet','truth','ReadObsNames',true,...
    'ReadVarNames',false,'VarNames',qnames)
truth = 
               Q1    Q2         Q3    Q4         Q5     
    Answers    1     'a'        0     'c'        'a'    
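
If you do not have the spreadsheet at hand, the same two arrays can be built from workspace variables. This is only a sketch, assuming the values displayed above are typed in by hand; students and both are my own hypothetical variable names.

```matlab
students = {'Chris'; 'Christine'; 'Christopher'; 'Kris'; 'Kristen'};

% First element of each variable is the answer key; the rest are the
% students' answers, in the order listed in students.
Q1 = [1; 0; 0; 1; 1; 1];
Q2 = {'a'; 'a'; 'b'; 'a'; 'a'; 'a'};
Q3 = [0; 0; 0; 0; 1; 0];
Q4 = {'c'; 'c'; 'c'; 'c'; 'b'; 'd'};
Q5 = {'a'; 'b'; 'a'; 'a'; 'c'; 'a'};

% Variable names are taken from the workspace names Q1..Q5.
both = dataset(Q1, Q2, Q3, Q4, Q5, 'ObsNames', [{'Answers'}; students]);
truth   = both(1,:);
answers = both(2:end,:);
```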

Merge All into a Single dataset

I am placing the answer key in the top row of my array so I have the truth handy for comparison later.

alldata = [truth; answers]
alldata = 
                   Q1    Q2         Q3    Q4         Q5     
    Answers        1     'a'        0     'c'        'a'    
    Chris          0     'a'        0     'c'        'b'    
    Christine      0     'b'        0     'c'        'a'    
    Christopher    1     'a'        0     'c'        'a'    
    Kris           1     'a'        1     'b'        'c'    
    Kristen        1     'a'        0     'd'        'a'    

Transform the Underlying Data

As you can see, the answers show up as a mixture of string values, 1s, and 0s. In fact, I prefer to think of the 1s and 0s as true and false. Also, the string values are the results from multiple choice questions, where the answers are limited to values a through d. I would prefer to see the results reflected in the way I want to think about them. So I now transform the data, column by column.

for q = qnames
    % Get all the values for a question.
    vals = alldata.(q{1});
    % Change numeric values to logical.
    if isnumeric(vals)
        alldata.(q{1}) = logical(vals);
    else
    % Change non-numeric values to the collection 'a':'d'
        alldata.(q{1}) = nominal(vals,[],{'a','b','c','d'});
    end
end

Here's a summary of the transformed data.

summary(alldata)
Q1: [6x1 logical]
     true      false 
        4          2 
Q2: [6x1 nominal]
     a      b      c      d 
     5      1      0      0 
Q3: [6x1 logical]
     true      false 
        1          5 
Q4: [6x1 nominal]
     a      b      c      d 
     0      1      4      1 
Q5: [6x1 nominal]
     a      b      c      d 
     4      1      1      0 

It shows me, by column, the size and type of the data and, in this case, a count of how many results there are for each possible value.

You may have noticed that I converted the multiple choice columns to a type called nominal. The idea of a nominal array is to constrain the values in the array to a specific collection of values. If all the acceptable values for the array are represented in the data, you often don't need more than the first input. Since some of my columns did not include all possible values a:d, I supplied these as the levels. Since they are strings, I chose to use them as the labels for the data as well as the values.
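
The effect of supplying the levels can be seen in a small sketch. The vals data here is hypothetical; getlabels and levelcounts are the Statistics Toolbox methods I am assuming are available for nominal arrays in your release.

```matlab
vals = {'a'; 'b'; 'a'};               % hypothetical answers; nobody chose c or d

% Levels inferred from the data: only the values that actually occur.
inferred = nominal(vals);
inferredLabels = getlabels(inferred);

% Levels supplied explicitly, as in the loop above.
full = nominal(vals, [], {'a','b','c','d'});
fullLabels = getlabels(full);
counts = levelcounts(full);           % one count per level, zeros included
```

With the explicit levels, the unused choices c and d still appear in summaries with a count of zero, which is what the summary output above shows for Q2.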

Gathering Information Per Question

I am now poised to gather information either by question or by student. Let's first look at the answers for questions 1 and 4. This is another dataset array.

q14Truth = alldata('Answers',{'Q1' 'Q4'})
q14Truth = 
               Q1       Q4
    Answers    true     c 

To get a single answer, I have another option. This returns a nominal value.

q4ans = alldata.Q4('Answers')
q4ans = 
     c 

I can get all the students' answers to a particular question. I get a nominal vector in return here since the Q4 column contains the answers to a multiple choice question.

q4all = alldata.Q4(2:end)
q4all = 
     c 
     c 
     c 
     b 
     d 

whos q4*
  Name       Size            Bytes  Class      Attributes

  q4all      5x1               314  nominal              
  q4ans      1x1               306  nominal              

And I can also find out all answers for one student, resulting in another dataset array.

ChrisAnswers = alldata('Chris',:)
ChrisAnswers = 
             Q1       Q2    Q3       Q4    Q5
    Chris    false    a     false    c     b 
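
The different indexing styles return different types, which is easy to lose track of. Here is a small sketch on a hypothetical three-row array (ds and its contents are made up for illustration) that records the class of each result.

```matlab
Q1 = [true; false; true];
Q4 = nominal({'c'; 'c'; 'b'}, [], {'a','b','c','d'});
ds = dataset(Q1, Q4, 'ObsNames', {'Answers'; 'Chris'; 'Kris'});

sub = ds('Chris', {'Q1','Q4'});   % parentheses: result is a dataset array
col = ds.Q4;                      % dot: the variable itself (nominal here)
one = ds.Q4('Kris');              % dot, then a row name: one nominal value
trueRows = ds(ds.Q1, :);          % logical row indexing keeps matching rows

classes = {class(sub), class(col), class(one)};
```

Parentheses always give back a dataset array, while dot indexing gives back the underlying variable.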

Which Questions are Hard?

Suppose I want to find out which questions are hardest for this set of students. I can use the datasetfun function, similar to cellfun and arrayfun, to apply a function to each variable in the data. For each question, I compare the students' answers to the truth (row 1) and compute the percentage of students who got it right.

f = @(x) 100*sum(x(1)==x(2:end))/(size(alldata,1)-1)
percentRight = datasetfun(f,alldata)
f = 
    @(x)100*sum(x(1)==x(2:end))/(size(alldata,1)-1)
percentRight =
    60    80    80    60    60
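
For readers new to datasetfun, the call above is roughly equivalent to looping over the variables yourself. The sketch below uses a hypothetical two-question array (ds, with students S1 to S4) so that it runs on its own.

```matlab
Q1 = [true; false; true; true; true];
Q2 = nominal({'a'; 'a'; 'b'; 'a'; 'c'}, [], {'a','b','c','d'});
ds = dataset(Q1, Q2, 'ObsNames', {'Answers'; 'S1'; 'S2'; 'S3'; 'S4'});

varNames  = ds.Properties.VarNames;
nStudents = size(ds,1) - 1;           % row 1 is the answer key
pct = zeros(1, numel(varNames));
for k = 1:numel(varNames)
    x = ds.(varNames{k});             % one column at a time
    pct(k) = 100*sum(x(1) == x(2:end))/nStudents;
end

% The same computation expressed with datasetfun.
viaDatasetfun = datasetfun(@(x) 100*sum(x(1)==x(2:end))/nStudents, ds);
```

Here three of four students match the key on Q1 and two of four on Q2, so both approaches give 75 and 50.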

Score the Assignments for Each Student

I use datasetfun again, this time comparing all the elements in a column for the students (i.e., 2:end) with the first element of that column, the right answer. I make an effort to label the rows with the students' names here.

f = @(x) x(1)==x(2:end)
rightWrong = datasetfun(f,alldata,'DatasetOutput',true,...
    'ObsNames',alldata.Properties.ObsNames(2:end))
f = 
    @(x)x(1)==x(2:end)
rightWrong = 
                   Q1       Q2       Q3       Q4       Q5   
    Chris          false    true     true     true     false
    Christine      false    false    true     true     true 
    Christopher    true     true     true     true     true 
    Kris           true     true     false    false    false
    Kristen        true     true     true     false    true 

Next I sum across the rows to get scores for each student and add the score as the last column to the dataset. Remember: the first score is for the answer key so I set that score to 100.

alldata.grade = [100; ...
    100*sum(double(rightWrong),2)/size(alldata,2)]
alldata = 
                   Q1       Q2    Q3       Q4    Q5    grade
    Answers        true     a     false    c     a     100  
    Chris          false    a     false    c     b      60  
    Christine      false    b     false    c     a      60  
    Christopher    true     a     false    c     a     100  
    Kris           true     a     true     b     c      40  
    Kristen        true     a     false    d     a      80  

Notice that alldata now contains a new column with numeric values in addition to the nominal and logical ones.
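
Once the grades are in the array, ordinary indexing gives quick class statistics. The following sketch copies the grade column from the table above into a small stand-alone array; studentsOnly, below70, and ranked are hypothetical names of my choosing.

```matlab
% Grades copied from the table above, with the answer key in row 1.
grade = [100; 60; 60; 100; 40; 80];
ds = dataset(grade, 'ObsNames', ...
    {'Answers'; 'Chris'; 'Christine'; 'Christopher'; 'Kris'; 'Kristen'});

studentsOnly = ds(2:end, :);                         % drop the answer key
classMean = mean(studentsOnly.grade);
below70   = studentsOnly(studentsOnly.grade < 70, :);
ranked    = sortrows(studentsOnly, 'grade', 'descend');
```

Dropping row 1 first matters, since the key's artificial score of 100 would otherwise inflate the class mean.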

How Can You See Using a dataset Array?

Can you see applications in which you'd be able to take advantage of working with your data as a dataset array? Let me know here.


Published with MATLAB® 7.6

23 Responses to “Using MATLAB to Grade”

  1. Jessee replied on :

    I could potentially see myself using dataset for casually looking at data, but from an application standpoint where you might be processing large data sets I think I’d stick with structures and matrices.

    Is there any way to associate units with a column in the data set?

  2. Peter Perkins replied on :

    Jessee, there is a property that you can use to tag variables with units. For example,

    >> load hospital % sample file in Statistics Toolbox
    >> hospital.Properties.Units
    ans =
        ''    ''    'Yrs'    'Lbs'    ''    'mm Hg'    'Counts'
    >> hospital.Weight = hospital.Weight/2.2;
    >> hospital.Properties.Units{4} = 'kg';

    The units also show up in the description of each dataset variable if you use the summary method on a dataset array. Note that these are just for the purpose of labelling; there is no provision for conversions or units checking in math or anything like that.

    I’m curious about your comment about large datasets. Certainly if you have data that are homogeneous, you are better off using a matrix. But if your columns have different types, the dataset array is every bit as efficient as a scalar structure, and a good deal more efficient (memory-wise) than a structure array that has one element for every row of your data. It has the benefit over a scalar structure that you can easily subscript across dataset variables, which correspond to fields for the scalar structure solution. And it allows you to use names for both dataset variables and observations.

  3. Dimitri Shvorob replied on :

    Kudos to Loren for highlighting this neat (relatively) new feature of Statistics Toolbox. I hope dataset arrays’ functionality will be expanding in forthcoming releases.

  4. Jessee replied on :

    Peter, I suppose the data I typically use is homogeneous in the sense that the columns are all doubles. You’re right though, the dataset array is better for mixed data types.

  5. jasmine replied on :

    Hi Loren,

    I am trying to store both numerical and categorical values in a dataset array. As the data size is big, I want to initialize the dataset array so that I can speed up the operation. How can I do that?

    Many thanks!

  6. Peter Perkins replied on :

    Jasmine, I’m not exactly sure of the situation that you’re describing. There’s no reason to preallocate an array in MATLAB _just_ because it’s big. However, it is good practice to preallocate an array if, for example, you are going to fill it in one row at a time in a loop, especially if it will be big. So I’m guessing you mean something like, “I will be storing rows of data one at a time and end up with a large dataset array, so I want to preallocate it.”

    The way to do that is more or less just as with any other array: create variables using, for example, ZEROS, and create a dataset array from those. Then overwrite each row as you get the real data.

    For categorical variables, it’s hard to say what your code should look like without knowing what data you are converting, but it will definitely be advantageous to “pre-define” the levels you care about, using the third input to the NOMINAL/ORDINAL constructor.

    Hope this helps.
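
A minimal sketch of the preallocation approach described above, under my own assumptions: the variable names score and choice, the row count n, and the stand-in values written in the loop are all hypothetical.

```matlab
n = 8;                               % hypothetical number of rows
score  = zeros(n,1);                 % numeric placeholder
choice = repmat({''}, n, 1);         % placeholder for categorical labels
data = dataset(score, choice);       % full-size array created up front

letters = 'abcd';
for k = 1:n
    % Stand-ins for the real data arriving one row at a time.
    data.score(k)  = 10*k;
    data.choice{k} = letters(mod(k-1,4) + 1);
end

% Convert once at the end, pre-defining the levels of interest.
data.choice = nominal(data.choice, [], {'a','b','c','d'});
```

Keeping the labels as strings during the loop and converting once afterwards is one possible design; assigning directly into a preallocated nominal variable is another.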

  7. Sung Soo replied on :

    Honestly, I welcome this ‘dataset’ feature. BTW when was this introduced? I haven’t been aware of it.

    At first glance, it really looks like the basic data type in ‘R statistics language’. Though R language is (in my personal opinion) not modern at all and unintuitive, it has a great advantage on handling data. It is mainly because of its data structure, which is almost identical to ‘dataset’ introduced in this blog.

    Another advantage of R is its huge user base (most of them are previous users of SAS or SPSS), and their contribution to R with so many tools.

    I don’t want R’s not-modern programming style to creep into MATLAB’s statistical toolbox, but I really hope MATLAB can offer the essential features of the R language. ‘dataset’ looks like a very good addition. The next thing I want from MATLAB is a broad range of essential functions that properly deal with ‘NaN’, which represents missing data. If most functions in the Statistics Toolbox had a ‘NaN’ version, it would be a great help to most statisticians.

  8. Peter Perkins replied on :

    Sung, the Statistics Toolbox has functions such as NANMEAN and NANSTD with “NAN” explicitly in the name, and those all treat NaNs as “missing values” and remove them. But many other Statistics Toolbox functions also treat NaN as a missing value flag, even if “NAN” is not in the name.
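
As a small illustration of the functions named above, here is a sketch on a made-up vector x with one missing value.

```matlab
x = [1 NaN 3];             % hypothetical data with one missing value

plainMean = mean(x);       % NaN propagates through the ordinary mean
nanAware  = nanmean(x);    % NaN is treated as missing and removed
nanSpread = nanstd(x);     % standard deviation of the remaining values
```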

  9. Bassam replied on :

    I am trying out dataset.

    There doesn’t seem to be a way to change individual or subsets of VarNames in a dataset.

    You can create a dataset as documented

    d(1,2) = dataset({2,'name2'})
    
    d =
        name1
        1
    

    but if you try to add to it in an intuitive fashion the VarName specified is not used:

    d(1,2) = dataset({2,'name2'})
    d =
        name1    Var2
        1        2
    

    Furthermore, this doesn’t work either:

    d = dataset({1,'name1'},{2,'name2'})
    d(1,2)   =dataset('VarNames',{'name3'})
    ??? Error using ==> setvarnames at 21
    NEWNAMES must have one name for each variable in A.
    
    Error in ==> dataset.dataset>dataset.dataset at 274
    

    Is there any way to specify one VarName, or a subset of VarNames?

  10. Peter Perkins replied on :

    Bassam, I can’t tell from your description exactly what you intend to do, partially because there’s a cut-and-paste mistake (your first line). But let me take a shot at explaining what’s happening, and what you might do. First, set up some arrays:

    >> d = dataset({1,'name1'})
    d =
        name1
        1
    >> e = dataset({2,'name2'})
    e =
        name2
        2
    

    (These are 1×1 to make things short.)

    Now assign e *into an existing subset* of d.

    >> d(1,1) = e
    d =
        name1
        2
    

    Notice that d’s names don’t change. That’s intentional — it only assigns values. You were doing something more like this:

    >> d(:,2) = e
    d =
        name1    Var2
        2        2
    

    Even here, e’s name hasn’t carried over, because the same rule applies, for consistency — the names are not propagated from the RHS to the LHS if you *assign into*. So how to do what (I think) you want? You can explicitly specify a new name (even multiple names) as part of the assignment:

    >> d(:,'name2') = e
    d =
        name1    name2
        1        2
    

    Or you can concatenate:

    >> f = [d e]
    f =
        name1    name2
        1        2
    

    Both create the name you want as part of the assignment.

    If you want to change the name of an existing variable in a dataset array, you can assign directly to the name:

    >> d.Properties.VarNames{2} = 'name2'
    d =
        name1    name2
        1        2    
    

    The list of variable names is a cell array of strings, so you can assign to them all, or one, or even a single character of one. There’s also a SET method, similar to what you’d do with Handle Graphics.
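
The renaming options mentioned above can be sketched as follows. The new names ('second', 'x', 'y') are arbitrary choices for illustration, and the set call assumes the property-name/value form of the dataset SET method.

```matlab
d = dataset({1,'name1'}, {2,'name2'});

d.Properties.VarNames{2} = 'second';     % rename one variable
d.Properties.VarNames{1}(1) = 'N';       % change a single character
namesAfterEdits = d.Properties.VarNames;

d = set(d, 'VarNames', {'x','y'});       % replace all names at once
namesAfterSet = d.Properties.VarNames;
```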

    It’s apparent that the documentation was not sufficiently clear here; I’ll make a note to have that looked into. In the meantime, I hope this helps.

  11. Jun replied on :
    
    Dear Master,
    
    I have a dataset with 4 variables. Var1 and Var2 are for grouping, Var3 contains integer values as weights, and Var4 contains integers values as time in minutes. I need to calculate the weighted MEDIAN of Var4 by using Var3 as the weight in each group by grouping Var1 and Var2. Here is an example of the dataset:
    
    Var1  Var2   Var3   Var4
    AA     BB     23      5
    AA     BB     12      7
    AA     BB     50     10
    CC     DD      3    100
    CC     DD     10     59
    CC     DD      7     76
    CC     DD      5     10
    
    Could you please give me some help on that. Thank you very much in advance for your precious time!!!
    
    Jun
    
  12. Peter Perkins replied on :

    Jun, I can’t tell if you’re asking how to compute a weighted median, or if you’re asking how to compute a function on groups of data in a dataset.

    For the first, there are a couple of things on the MATLAB Central File Exchange.

    For the second, there is a function in the Statistics Toolbox, GRPSTATS, that computes summary statistics for groups in data, but unfortunately it works on one variable at a time, and your weighted median combines two variables into one statistic. I can’t think of anything that’s a “one-line” solution. You might try one of two things:

    1) Create a new variable to indicate which “group” a given observation is in, e.g., “AA_BB” or “AA_DD” or whatever. If Var1 and Var2 are character, I’d recommend concatenating data.Var1 and data.Var2 with some separator character, and then calling GRP2IDX. If they are nominal vectors, you can use the .* operator on data.Var1 and data.Var2 and then call GRP2IDX.

    Then just loop over the groups and use logical indexing on your new variable to pick out rows of the array and pass that to your weighted median function:

    [groupIndicator,gnames] = grp2idx(…);
    ngroups = length(gnames);
    for g = 1:ngroups
        i = (groupIndicator == g);
        wmed(g) = weightedMedian(data.Var4(i),data.Var3(i));
    end

    2) It is also possible to create a new dataset that has, instead of separate column vectors for Var3 and Var4, a single variable with two columns. You can then use GRPSTATS to work on that, but it involves some tricks that may be more trouble than they’re worth.
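
Option 1 can be sketched end to end on Jun's sample data. Since weightedMedian above is a placeholder, this sketch computes the statistic inline, assuming one particular convention: the smallest value at which the cumulative weight reaches half of the total. File Exchange implementations may use a different convention, and the variable names key, cw, and order are my own.

```matlab
Var1 = {'AA';'AA';'AA';'CC';'CC';'CC';'CC'};
Var2 = {'BB';'BB';'BB';'DD';'DD';'DD';'DD'};
Var3 = [23; 12; 50; 3; 10; 7; 5];        % weights
Var4 = [5; 7; 10; 100; 59; 76; 10];      % times in minutes
data = dataset(Var1, Var2, Var3, Var4);

% One label per observation, e.g. 'AA_BB', then integer group indices.
key = strcat(data.Var1, '_', data.Var2);
[groupIndicator, gnames] = grp2idx(key);

ngroups = length(gnames);
wmed = zeros(ngroups, 1);
for g = 1:ngroups
    i = (groupIndicator == g);
    [x, order] = sort(data.Var4(i));     % values in ascending order
    w = data.Var3(i);
    w = w(order);                        % weights in the same order
    cw = cumsum(w)/sum(w);               % cumulative weight fraction
    wmed(g) = x(find(cw >= 0.5, 1));     % first value reaching one half
end
```

Under this convention, the weighted medians are 10 for AA_BB and 59 for CC_DD.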

    Good luck. Hope this helps.

  13. Jun replied on :
    
    Dear Peter,
    
    You wrote exactly the code for my question (the second one). I followed your instructions to create a new group variable by combining Var1 and Var2. I have tried it on my small test data and it works very well. My actual data has millions of records on a server; I will try it later to see if it still works efficiently. If I have any more questions, I will let you know.
    
    Thank you very much for your kind help. You make my life full of sunshine now. Thanks again, Peter!!! Have a nice weekend!
    
    Best regards,
    
    Jun
    
  14. Roman replied on :

    I use dataset with a large mixed matrix of values and strings
    I have one variable with 14’500 maturity dates of bonds. Unfortunately the maturity dates come in 2 varieties:
    1.) dd.mm.yyyy
    2.) dd-mmm-yyyy
    changing to more appropriate date formats works fine:
    data.MaturityDate=arrayfun(@(x)strrep(x,'.','/'),data.MaturityDate);
    data.MaturityDate=arrayfun(@(x)strrep(x,'-','/'),data.MaturityDate);
    However, trying to be efficient with dates by using datenum does not work. I tried many variants, even the following loop does not work:
    for i=1,size(data,1);
        if length(data.MaturityDate{i})==10;
            data.MaturityDate{i}=datenum(data.MaturityDate{i},'dd/mm/yyyy');
        elseif length(data.MaturityDate{i})==11;
            data.MaturityDate{i}=datenum(data.MaturityDate{i},'dd/mmm/yyyy');
        else
            t='X'
        end
    end

    Reading the blog I tried to use datasetfun

    However I cannot figure out how to define (if possible) a datasetfun like:

    f=@(x) length(x)==10, x=datenum(x,'dd/mm/yyyy'),...
    length(x)==11, x=datenum(x,'dd/mmm/yyyy')

    data.MaturityDate=datasetfun(f,dataMaturityDate);

    Is there an efficient solution?

    Regards Roman

  15. Peter Perkins replied on :

    Roman, I’m not exactly sure what’s what in your code snippets, so I’ll have to take a guess. data looks like it might be a dataset array, with MaturityDate as one of its variables, and MaturityDate looks like it might be a cell array of datestrs where some are like '01.02.2003' and some are like '02-Mar-2004'. It appears that you’ve applied arrayfun to that cell array to turn all the separators into "/" and reassign in place.

    But I’m not sure what you’re up to with datasetfun. It is intended to apply the same function to several variables in a dataset array, and since you only seem to be working on one variable, there’d be no point in using it.

    If all you want to do is to convert MaturityDate from datestr to datenum, you could use the same arrayfun strategy. But the following is a vectorized version:

    MaturityDate = {'01.02.2003'; '02-Mar-2004'; '03.04.2005'; '04-May-2006'; '05.06.2007'};
    data = dataset(MaturityDate)
    
    MaturityDateDN = zeros(length(data.MaturityDate),1);
    fmt10 = cellfun(@(x)length(x)==10,data.MaturityDate);
    MaturityDateDN(fmt10) = datenum(data.MaturityDate(fmt10),'dd.mm.yyyy');
    fmt11 = ~fmt10;
    MaturityDateDN(fmt11) = datenum(data.MaturityDate(fmt11),'dd-mmm-yyyy');
    
    data.MaturityDate = MaturityDateDN
    

    When I run this, data starts out as

    data =
    MaturityDate
    '01.02.2003'
    '02-Mar-2004'
    '03.04.2005'
    '04-May-2006'
    '05.06.2007'

    and ends up as

    data =
    MaturityDate
    7.3161e+05
    7.3201e+05
    7.324e+05
    7.328e+05
    7.332e+05

  16. Roman replied on :

    Thanks, Peter

    Yes, I do use and need dataset, because my data is 14514 rows and 36 columns with different types of bond variables:
    - many currencies
    - Price, Yield, Duration, ISIN number information of different bonds
    - Ratings, Sectors, Names, Maturities
    I try to replicate the webinar about dataset arrays. Goal is: extract out of large universe subsamples of bonds with predefined characteristics such as certain ratings, sectors, within certain time-to-maturity brackets. See following code:

    data( data.Rating >= 'AA3', {'Ticker', 'YldToWorst', 'Rating'})

    data( data.Rating >= 'BBB3' & data.Rating <= 'A1' & data.MLIndustryLvl2 == 'Industrials' ...
    & data.MLIndustryLvl3 == 'Healthcare' & data.ISOCurrency == 'USD' & data.MaturityDate <= datenum('Dec-01-2014'), ...
    {'Ticker', 'YldToWorst', 'ModDurToWorst', 'Rating'})

    Kind regards Roman

  17. Roman replied on :

    May I further ask:

    % Extract all bonds with currency EUR and rating between BBB3 and A1 and not being Financials:
    data( data.Rating >= 'BBB3' & data.Rating <= 'A1' & ~data.MLIndustryLvl2 == 'Financial' ...
    & data.ISOCurrency == 'EUR', {'Description', 'Ticker', 'MLIndustryLvl3', 'YldToWorst', 'OAS', 'ModDurToWorst', 'Rating'})
    Matlab error message:
    ??? Undefined function or method 'not' for input arguments of type 'nominal'.

    How can I efficiently code and then screen my database for all bonds but the ‘financial’ ones (in MLIndustryLvl2 I have 7 different bond types, in ML IndustryLvl3 it’s 31 and Lvl4 is 88)

  18. Peter Perkins replied on :

    Roman, it looks like you’ve got the hang of using nominal/ordinal variables in a dataset for creating subsets of your data. The only problem I see in your code is this:

    ~data.MLIndustryLvl2 == 'Financial'
    

    Consider the precedence of the operators “~” and “==”. data.MLIndustryLvl2 is not a logical, so you want either

    ~(data.MLIndustryLvl2 == 'Financial')
    

    or

    data.MLIndustryLvl2 ~= 'Financial'
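
Both points can be tried on a tiny made-up universe. In this sketch the tickers, ratings, and sectors are hypothetical, and the rating order (BBB3 lowest, AA3 highest) is an assumption I set through the third input to ordinal.

```matlab
Ticker = {'T1'; 'T2'; 'T3'; 'T4'};
Rating = ordinal({'A1'; 'BBB3'; 'AA3'; 'AA3'}, [], {'BBB3','A1','AA3'});
Sector = nominal({'Financial'; 'Industrials'; 'Financial'; 'Utilities'});
data = dataset(Ticker, Rating, Sector);

% Parenthesized negation and ~= give the same logical mask.
maskA = ~(data.Sector == 'Financial');
maskB = data.Sector ~= 'Financial';
sameMask = isequal(maskA, maskB);

% Ordinal variables support order comparisons against a level label.
keep = data.Rating >= 'A1' & data.Sector ~= 'Financial';
selected = data(keep, {'Ticker','Rating'});
```

Only T4 is rated A1 or better and is not a financial, so selected has a single row.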
    
  19. Patrick de Man replied on :

    What if I want to use as a variable name ‘RT-DA’?
    The dataset heading becomes: RT0x2FDA, as the – seems to be an invalid character for the dataset.

  20. Loren replied on :

    Patrick-

    Since RT-DA is not a valid identifier name in MATLAB, the name gets replaced with one that can be used in MATLAB.

    –Loren
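
You can check a candidate name before using it. The sketch below uses isvarname and genvarname; whether dataset applies exactly the same replacement rule as genvarname is an assumption on my part, not something stated above.

```matlab
requested = 'RT-DA';

ok = isvarname(requested);            % false: '-' is not allowed in identifiers
safeName = genvarname(requested);     % a valid identifier derived from the request
safeOk = isvarname(safeName);
```

Choosing a valid name yourself, such as RT_DA, avoids the automatic replacement altogether.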

  21. Bob replied on :

    Where can I find good reference for using dataset arrays beyond the Matlab help?

    Specifically, how can I add two variables in a dataset?

  22. Loren Shure replied on :

    Bob-

    I am not aware of other documentation. There are some other posts in this overall blog that use dataset arrays as well. Perhaps those would be useful to you.

    –Loren

  23. Peter Perkins replied on :

    Bob, the Statistics Toolbox User Guide does have a section on dataset arrays if you haven’t looked there:

    http://www.mathworks.com/help/toolbox/stats/bqziht7-1.html#bqzihxq

    To answer your specific question, you can add two variables in a dataset array by just adding them to create a new variable:

    d = dataset(randn(10,1),randn(10,1));
    d.Var3 = d.Var1 + d.Var2;

    Hope this helps.


Loren Shure works on design of the MATLAB language at MathWorks. She writes here about once a week on MATLAB programming and related topics.

These postings are the author's and don't necessarily represent the opinions of The MathWorks.