Loren on the Art of MATLAB

July 23rd, 2008

Using MATLAB to Grade

Educators use MATLAB a lot. In addition to using MATLAB for research, many professors and instructors use MATLAB for teaching, including demonstrating and explaining concepts, creating class notes, and creating and collecting homework assignments and exams. Today I will show how you might use a dataset array from Statistics Toolbox to grade a set of student results that have already been recorded.


What is a dataset Array?

A dataset array is basically a two-dimensional array where each column holds data of a single data type, but the columns together may comprise many different data types.

A cell array can do this, but does not enforce the idea that all the elements in a given column must have the same type. To have names with a cell array, you either need to carry around an extra array, or have a special row or column to contain the label.

A scalar structure can similarly perform this service, but it does not enforce the idea that each field must hold the same number of rows. Additionally, a dataset array can have labels that allow you to reference data not just by numeric or logical indexing, but by names as well.
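To make the contrast concrete, here is a minimal sketch of name-based referencing in a dataset array. The variable names (score, passed) and the three observations are made up for illustration; they are not part of the grading example that follows.

```matlab
% Hypothetical data: three observations, two variables of different types.
score  = [90; 75; 82];          % numeric column
passed = [true; true; false];   % logical column

% Build a dataset array with named variables (columns) and
% named observations (rows).
ds = dataset({score,'score'}, {passed,'passed'}, ...
    'ObsNames', {'Ann'; 'Bob'; 'Cy'});

% Reference data by name rather than by position.
bobScore  = ds.score('Bob');                     % one value from one variable
annRow    = ds('Ann', :);                        % a 1-by-2 dataset array
whoPassed = ds.Properties.ObsNames(ds.passed);   % names where passed is true
```

With a cell array or a scalar structure, the row labels would have to live in a separate array that you keep in sync yourself.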

Why Use a dataset Array?

You might want to use a dataset array if the data you have is natural to think of as a collection of different but related entities. The relationships between these entities are more tightly constrained (in this case, 1:1) than anything cells or structs enforce. The example I'm showing here, grading an assignment, shows some of the simpler ways in which you might use a dataset array.

Creating a dataset

I have an assignment with 5 questions, 2 T/F and 3 multiple choice (a:d). The questions have extraordinarily imaginative names.

qnames = {'Q1' 'Q2' 'Q3' 'Q4' 'Q5'};

I have collected the students' results in a sheet of a spreadsheet labeled "students" and the truth in a sheet labeled "truth". I can read each of these directly into a dataset array so the results are available for analysis.

answers = dataset('xlsfile','class answers.xls',...
    'sheet','students','ReadObsNames',true,...
    'ReadVarNames',false,'VarNames',qnames)
answers = 
                   Q1    Q2         Q3    Q4         Q5     
    Chris          0     'a'        0     'c'        'b'    
    Christine      0     'b'        0     'c'        'a'    
    Christopher    1     'a'        0     'c'        'a'    
    Kris           1     'a'        1     'b'        'c'    
    Kristen        1     'a'        0     'd'        'a'    

As you can see, all my students have similar names. In fact, I believe the most common root name for employees at MathWorks is this same root. Each row represents the results for a given student, and the columns represent the results for a given question.

Here are the "real" answers.

truth = dataset('xlsfile','class answers.xls',...
    'sheet','truth','ReadObsNames',true,...
    'ReadVarNames',false,'VarNames',qnames)
truth = 
               Q1    Q2         Q3    Q4         Q5     
    Answers    1     'a'        0     'c'        'a'    
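If you don't have the spreadsheet handy, the same two dataset arrays can be typed in directly. This is only a sketch for following along: it assumes the {data, name} form of the dataset constructor and copies the values from the displays above.

```matlab
% Same data as the spreadsheet, typed in directly (no class answers.xls needed).
qnames   = {'Q1' 'Q2' 'Q3' 'Q4' 'Q5'};
students = {'Chris'; 'Christine'; 'Christopher'; 'Kris'; 'Kristen'};

% Each {data, name} pair defines one variable (column).
answers = dataset( ...
    {[0; 0; 1; 1; 1],            'Q1'}, ...
    {{'a'; 'b'; 'a'; 'a'; 'a'},  'Q2'}, ...
    {[0; 0; 0; 1; 0],            'Q3'}, ...
    {{'c'; 'c'; 'c'; 'b'; 'd'},  'Q4'}, ...
    {{'b'; 'a'; 'a'; 'c'; 'a'},  'Q5'}, ...
    'ObsNames', students);

% The answer key is a one-row dataset with the same variable names.
truth = dataset({1,'Q1'}, {{'a'},'Q2'}, {0,'Q3'}, {{'c'},'Q4'}, {{'a'},'Q5'}, ...
    'ObsNames', {'Answers'});
```

The T/F columns are numeric and the multiple choice columns are cell arrays of strings, matching what the spreadsheet import produced.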

Merge All into a Single dataset

I am placing the answer key in the top row of my array so I have the truth handy for comparison later.

alldata = [truth; answers]
alldata = 
                   Q1    Q2         Q3    Q4         Q5     
    Answers        1     'a'        0     'c'        'a'    
    Chris          0     'a'        0     'c'        'b'    
    Christine      0     'b'        0     'c'        'a'    
    Christopher    1     'a'        0     'c'        'a'    
    Kris           1     'a'        1     'b'        'c'    
    Kristen        1     'a'        0     'd'        'a'    

Transform the Underlying Data

As you can see, the answers show up as a mixture of string values, 1s, and 0s. In fact, I prefer to think of the 1s and 0s as true and false. Also, the string values are the results from multiple choice questions, where the answers are limited to values a through d. I would prefer to see the results reflected in the way I want to think about them. So I now transform the data, column by column.

for q = qnames
    % Get all the values for a question.
    vals = alldata.(q{1});
    % Change numeric values to logical.
    if isnumeric(vals)
        alldata.(q{1}) = logical(vals);
    else
        % Change non-numeric values to the collection 'a':'d'
        alldata.(q{1}) = nominal(vals,[],{'a','b','c','d'});
    end
end

Here's a summary of the transformed data.

summary(alldata)
Q1: [6x1 logical]
     true      false 
        4          2 
Q2: [6x1 nominal]
     a      b      c      d 
     5      1      0      0 
Q3: [6x1 logical]
     true      false 
        1          5 
Q4: [6x1 nominal]
     a      b      c      d 
     0      1      4      1 
Q5: [6x1 nominal]
     a      b      c      d 
     4      1      1      0 

It shows me, by column, the size and type of the data and, in this case, a count of how many results there are for each possible value.

You may have noticed that I converted the multiple choice columns to a type called nominal. The idea of a nominal array is to constrain the values in the array to a specific collection of values. If all the acceptable values for the array are represented in the data, you often don't need more than the first input. Since some of my columns did not include all possible values a:d, I supplied these as the levels. Since they are strings, I chose to use them as the labels for the data as well as the values.
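Here is a small sketch of the difference the levels input makes. The four responses are made up for illustration; getlabels and levelcounts are the Statistics Toolbox functions for inspecting a nominal array's labels and per-level counts.

```matlab
% Responses to one multiple-choice question; nobody chose 'c' or 'd'.
vals = {'a'; 'b'; 'a'; 'a'};

% Levels inferred from the data: only the values that actually occur.
inferred = nominal(vals);

% Levels supplied explicitly: all four permitted choices, which also
% serve as the labels because the labels input is left empty.
fixed = nominal(vals, [], {'a','b','c','d'});

inferredLabels = getlabels(inferred);   % labels for the levels found in vals
fixedLabels    = getlabels(fixed);      % labels for all four choices
counts         = levelcounts(fixed);    % responses per level, zeros included
```

Supplying the levels is what lets the summary above report a count of 0 for choices that no one picked.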

Gathering Information Per Question

I am now poised to gather information either by question or by student. Let's first look at the answers for questions 1 and 4. This is another dataset array.

q14Truth = alldata('Answers',{'Q1' 'Q4'})
q14Truth = 
               Q1       Q4
    Answers    true     c 

To get a single answer, I have another option. This returns a nominal value.

q4ans = alldata.Q4('Answers')
q4ans = 
     c 

I can get all the students' answers to a particular question. I get a nominal vector in return here since the Q4 column contains the answers to a multiple choice question.

q4all = alldata.Q4(2:end)
q4all = 
     c 
     c 
     c 
     b 
     d 

whos q4*
  Name       Size            Bytes  Class      Attributes

  q4all      5x1               314  nominal              
  q4ans      1x1               306  nominal              

And I can also find out all answers for one student, resulting in another dataset array.

ChrisAnswers = alldata('Chris',:)
ChrisAnswers = 
             Q1       Q2    Q3       Q4    Q5
    Chris    false    a     false    c     b 
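Name-based and logical indexing also combine nicely. As a sketch, the following rebuilds just the Q4 column from above (so it runs on its own) and pulls out the students whose answer matches the key; the variable names ds, matchesKey, and correctNames are mine.

```matlab
% Rebuild just the Q4 column from the post (answer key in the first row).
obs = {'Answers'; 'Chris'; 'Christine'; 'Christopher'; 'Kris'; 'Kristen'};
q4  = nominal({'c'; 'c'; 'c'; 'c'; 'b'; 'd'}, [], {'a','b','c','d'});
ds  = dataset({q4,'Q4'}, 'ObsNames', obs);

% Compare every row with the key, then exclude the key row itself.
matchesKey = ds.Q4 == ds.Q4('Answers');
matchesKey(1) = false;

% Logical row indexing returns a dataset; its ObsNames are the students.
correctRows  = ds(matchesKey, :);
correctNames = correctRows.Properties.ObsNames;
```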

Which Questions are Hard?

Suppose I want to find out which questions are hardest for this set of students. I can use the datasetfun function, similar to cellfun and arrayfun, to apply a function to each variable in the data. For each question, the function compares the students' answers (rows 2:end) to the truth (row 1) and returns the percentage of students who answered correctly.

f = @(x) 100*sum(x(1)==x(2:end))/(size(alldata,1)-1)
percentRight = datasetfun(f,alldata)
f = 
    @(x)100*sum(x(1)==x(2:end))/(size(alldata,1)-1)
percentRight =
    60    80    80    60    60
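One detail worth knowing: an anonymous function captures the value of alldata at the moment it is created, so the row count inside f would not follow later changes to the dataset. A possible variation, sketched here on a small made-up dataset, uses mean on the logical comparison so no row count is needed at all.

```matlab
% Small stand-in for alldata: key in row 1, four students below it.
ds = dataset({logical([1; 0; 1; 1; 1]),'Q1'}, ...
    {nominal({'a'; 'a'; 'b'; 'a'; 'b'}, [], {'a','b','c','d'}),'Q2'}, ...
    'ObsNames', {'Answers'; 'S1'; 'S2'; 'S3'; 'S4'});

% The mean of a logical vector is the fraction of true values, so the
% function depends only on the column it is handed.
pct = @(x) 100*mean(x(1) == x(2:end));
percentRight = datasetfun(pct, ds);
```

For this stand-in data, three of four students match the key on Q1 and two of four on Q2.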

Score the Assignments for Each Student

I use datasetfun again, this time comparing all the students' elements in a column (i.e., 2:end) with the first element of that column, the right answer. I make an effort to label the rows with the students' names here.

f = @(x) x(1)==x(2:end)
rightWrong = datasetfun(f,alldata,'DatasetOutput',true,...
    'ObsNames',alldata.Properties.ObsNames(2:end))
f = 
    @(x)x(1)==x(2:end)
rightWrong = 
                   Q1       Q2       Q3       Q4       Q5   
    Chris          false    true     true     true     false
    Christine      false    false    true     true     true 
    Christopher    true     true     true     true     true 
    Kris           true     true     false    false    false
    Kristen        true     true     true     false    true 

Next I sum across the rows to get scores for each student and add the score as the last column to the dataset. Remember: the first score is for the answer key so I set that score to 100.

alldata.grade = [100; ...
    100*sum(double(rightWrong),2)/size(alldata,2)]
alldata = 
                   Q1       Q2    Q3       Q4    Q5    grade
    Answers        true     a     false    c     a     100  
    Chris          false    a     false    c     b      60  
    Christine      false    b     false    c     a      60  
    Christopher    true     a     false    c     a     100  
    Kris           true     a     true     b     c      40  
    Kristen        true     a     false    d     a      80  

Notice that alldata now contains a new column with numeric values in addition to the nominal and logical ones.
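Once the grade column is present, functions applied with datasetfun would see it too. As a sketch on a small made-up dataset, the 'DataVars' option restricts datasetfun to the question columns, and sortrows ranks the students by grade. The stand-in names (S1 to S4) and values are hypothetical.

```matlab
% Stand-in for the graded dataset: two T/F questions plus a grade column.
ds = dataset({logical([1; 0; 1; 1; 1]),'Q1'}, ...
    {logical([0; 0; 1; 0; 0]),'Q2'}, ...
    {[100; 50; 50; 100; 100],'grade'}, ...
    'ObsNames', {'Answers'; 'S1'; 'S2'; 'S3'; 'S4'});

% Apply the function to the question columns only, skipping grade.
pct = @(x) 100*mean(x(1) == x(2:end));
percentRight = datasetfun(pct, ds, 'DataVars', {'Q1','Q2'});

% Rank the students (rows 2:end) from highest to lowest grade.
ranked = sortrows(ds(2:end, :), 'grade', 'descend');
```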

How Can You See Using a dataset Array?

Can you see applications in which you'd be able to take advantage of working with your data as a dataset array? Let me know here.



Published with MATLAB® 7.6

8 Responses to “Using MATLAB to Grade”

  1. Jessee replied on :

    I could potentially see myself using dataset for casually looking at data, but from an application standpoint where you might be processing large data sets I think I’d stick with structures and matrices.

    Is there any way to associate units with a column in the data set?

  2. Peter Perkins replied on :

    Jessee, there is a property that you can use to tag variables with units. For example,

    >> load hospital % sample file in Statistics Toolbox
    >> hospital.Properties.Units
    ans =
    '' '' 'Yrs' 'Lbs' '' 'mm Hg' 'Counts'
    >> hospital.Weight = hospital.Weight/2.2;
    >> hospital.Properties.Units{4} = 'kg';

    The units also show up in the description of each dataset variable if you use the summary method on a dataset array. Note that these are just for the purpose of labelling, there is not any provision for conversions or units checking in math or anything like that.

    I’m curious about your comment about large datasets. Certainly if you have data that are homogeneous, you are better off using a matrix. But if your columns have different types, the dataset array is every bit as efficient as a scalar structure, and a good deal more efficient (memory-wise) than a structure array that has one element for every row of your data. It has the benefit over a scalar structure that you can easily subscript across dataset variables, which correspond to fields for the scalar structure solution. And it allows you to use names for both dataset variables and observations.

  3. Dimitri Shvorob replied on :

    Kudos to Loren for highlighting this neat (relatively) new feature of Statistics Toolbox. I hope dataset arrays’ functionality will be expanding in forthcoming releases.

  4. Jessee replied on :

    Peter, I suppose the data I typically use is homogeneous in the sense that the columns are all doubles. You’re right though, the dataset array is better for mixed data types.

  5. jasmine replied on :

    Hi Loren,

    I am trying to store both numerical and categorical values in a dataset array. As the data size is big, I want to initialize the dataset array so that I can speed up the operation. How can I do that?

    Many thanks!

  6. Peter Perkins replied on :

    Jasmine, I’m not exactly sure of the situation that you’re describing. There’s no reason to preallocate an array in MATLAB _just_ because it’s big. However, it is good practice to preallocate an array if, for example, you are going to fill it in one row at a time in a loop, especially if it will be big. So I’m guessing you mean something like, “I will be storing rows of data one at a time and end up with a large dataset array, so I want to preallocate it.”

    The way to do that is more or less just as with any other array: create variables using, for example, ZEROS, and create a dataset array from those. Then overwrite each row as you get the real data.

    For categorical variables, it’s hard to say what your code should look like without knowing what data you are converting, but it will definitely be advantageous to “pre-define” the levels you care about, using the third input to the NOMINAL/ORDINAL constructor.

    Hope this helps.

  7. Sung Soo replied on :

    Honestly, I welcome this ‘dataset’ feature. BTW when was this introduced? I haven’t been aware of it.

    At first glance, it really looks like the basic data type in ‘R statistics language’. Though R language is (in my personal opinion) not modern at all and unintuitive, it has a great advantage on handling data. It is mainly because of its data structure, which is almost identical to ‘dataset’ introduced in this blog.

    Another advantage of R is its huge user base (most of them are previous users of SAS or SPSS), and their contribution to R with so many tools.

    I don’t want R’s not-modern programming style to creep into MATLAB’s Statistics Toolbox, but I really hope MATLAB can handle the essential features of the R language. ‘dataset’ looks like a very good addition. The next thing I want from MATLAB is a broad range of essential functions that properly deal with ‘NaN’, which represents missing data. If most functions in the Statistics Toolbox had a ‘NaN’ version, it would be a great help to most statisticians.

  8. Peter Perkins replied on :

    Sung, the Statistics Toolbox has functions such as NANMEAN and NANSTD with “NAN” explicitly in the name, and those all treat NaNs as “missing values” and remove them. But many other Statistics Toolbox functions also treat NaN as a missing value flag, even if “NAN” is not in the name.



Loren Shure works on design of the MATLAB language at The MathWorks. She writes here about once a week on MATLAB programming and related topics.


These postings are the author's and don't necessarily represent the opinions of The MathWorks.