## Loren on the Art of MATLAB

Turn ideas into MATLAB

Educators use MATLAB a lot. In addition to using MATLAB for research, many professors and instructors use MATLAB for teaching, including demonstrating and explaining concepts, creating class notes, and creating and collecting homework assignments and exams. Today I will show how you might use a dataset array from Statistics Toolbox to grade a set of student results that are already recorded.

### What is a dataset Array?

A dataset array is basically a two-dimensional array where each column holds data represented in a single data type, but the columns together may comprise many different data types.

A cell array can do this, but does not enforce the idea that all the elements in a given column must have the same type. To have names with a cell array, you either need to carry around an extra array, or have a special row or column to contain the label.

A scalar structure can similarly perform this service, but does not enforce the idea that each field must hold the same number of rows. Additionally, a dataset array can have labels that allow you to reference data not just by numeric or logical indexing, but by names as well.
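To make the comparison concrete, here is a minimal sketch with made-up data (the names and scores are hypothetical, not taken from the class spreadsheet used later). It builds a small dataset array whose columns have different types and then picks out values by name rather than by position.

```matlab
% Hypothetical toy data: one numeric column and one logical column.
score  = [90; 75; 82];
passed = score >= 80;

% Each {data,'name'} pair becomes one named column (variable);
% 'ObsNames' supplies the row labels.
ds = dataset({score,'score'}, {passed,'passed'}, ...
             'ObsNames', {'Ann','Bob','Cy'});

% Index by name instead of by position.
bobRow   = ds('Bob',:);                 % 1-by-2 dataset array
annScore = double(ds('Ann','score'));   % plain numeric value
```

Both columns must have three rows here; that is the constraint a struct would not enforce for you.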

### Why Use a dataset Array?

You might want to use a dataset array if the data you have is natural to think of as a collection of different but related entities. The relationships between the collections are more constrained (in this case, 1:1) than the flexibility afforded by cells or structs. The example I'm showing here, grading an assignment, shows some of the simpler ways in which you might use a dataset array.

### Creating a dataset

I have an assignment with 5 questions, 2 T/F and 3 multiple choice (a:d). The questions have extraordinarily imaginative names.

qnames = {'Q1' 'Q2' 'Q3' 'Q4' 'Q5'};

I have collected the students' results in a sheet of a spreadsheet labeled "students" and the truth in a sheet labeled "truth". I can read each of these directly into a dataset array so the results are available for analysis.

answers = dataset('xlsfile','class answers.xls','Sheet','students',...
    'ReadVarNames',false,'VarNames',qnames,'ReadObsNames',true)
answers =
               Q1    Q2         Q3    Q4         Q5
Chris          0     'a'        0     'c'        'b'
Christine      0     'b'        0     'c'        'a'
Christopher    1     'a'        0     'c'        'a'
Kris           1     'a'        1     'b'        'c'
Kristen        1     'a'        0     'd'        'a'


As you can see, all my students have similar names. In fact, I believe the most common root name for employees at MathWorks is this same root. Each row represents the results for a given student, and the columns represent the results for a given question.

truth = dataset('xlsfile','class answers.xls','Sheet','truth',...
    'ReadVarNames',false,'VarNames',qnames,'ReadObsNames',true)
truth =
           Q1    Q2         Q3    Q4         Q5
Answers    1     'a'        0     'c'        'a'
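The spreadsheet itself is not included with this post, so if you want to follow along, one option is to build the same two dataset arrays from literals. This is only a sketch: it assumes the true/false columns arrive as doubles and the multiple-choice columns as cell arrays of strings, which is what the displays above suggest.

```matlab
% Row labels copied from the displays above.
students = {'Chris','Christine','Christopher','Kris','Kristen'};

% Student results: numeric 0/1 for T/F, cell strings for multiple choice.
answers = dataset({[0;0;1;1;1],'Q1'}, ...
                  {{'a';'b';'a';'a';'a'},'Q2'}, ...
                  {[0;0;0;1;0],'Q3'}, ...
                  {{'c';'c';'c';'b';'d'},'Q4'}, ...
                  {{'b';'a';'a';'c';'a'},'Q5'}, ...
                  'ObsNames', students);

% The answer key is a one-row dataset with the same variable names.
truth = dataset({1,'Q1'}, {{'a'},'Q2'}, {0,'Q3'}, ...
                {{'c'},'Q4'}, {{'a'},'Q5'}, ...
                'ObsNames', {'Answers'});
```

Because both arrays use the same variable names, they can be stacked vertically.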


### Merge All into a Single dataset

I am placing the answer key in the top row of my array so I have the truth handy for comparison later.

alldata = [truth; answers]
alldata =
               Q1    Q2         Q3    Q4         Q5
Answers        1     'a'        0     'c'        'a'
Chris          0     'a'        0     'c'        'b'
Christine      0     'b'        0     'c'        'a'
Christopher    1     'a'        0     'c'        'a'
Kris           1     'a'        1     'b'        'c'
Kristen        1     'a'        0     'd'        'a'


### Transform the Underlying Data

As you can see, the answers show up as a mixture of string values, 1s, and 0s. In fact, I prefer to think of the 1s and 0s as true and false. Also, the string values are the results from multiple choice questions, where the answers are limited to values a through d. I would prefer to see the results reflected in the way I want to think about them. So I now transform the data, column by column.

for q = qnames
    % Get all the values for a question.
    vals = alldata.(q{1});
    % Change numeric values to logical.
    if isnumeric(vals)
        alldata.(q{1}) = logical(vals);
    else
        % Change non-numeric values to the collection 'a':'d'
        alldata.(q{1}) = nominal(vals,[],{'a','b','c','d'});
    end
end

Here's a summary of the transformed data.

summary(alldata)
Q1: [6x1 logical]
    true      false
       4          2
Q2: [6x1 nominal]
    a      b      c      d
    5      1      0      0
Q3: [6x1 logical]
    true      false
       1          5
Q4: [6x1 nominal]
    a      b      c      d
    0      1      4      1
Q5: [6x1 nominal]
    a      b      c      d
    4      1      1      0


It shows me, by column, the size and type of the data and, in this case, a count of how many results there are for each possible value.

You may have noticed that I converted the multiple choice columns to a type called nominal. The idea of a nominal array is to constrain the values in the array to a specific collection of values. If all the acceptable values for the array are represented in the data, you often don't need more than the first input. Since some of my columns did not include all possible values a:d, I supplied these as the levels. Since they are strings, I chose to use them as the labels for the data as well as the values.
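Here is a small sketch of the difference the levels input makes, using a hypothetical question where nobody chose b or d. The variable names are mine, for illustration only.

```matlab
% Raw answers to one hypothetical question; 'b' and 'd' never appear.
raw = {'a'; 'c'; 'a'};

% Without levels, only the values present in the data become levels.
seenOnly = nominal(raw);

% With explicit levels, all four choices exist even if unused.
allChoices = nominal(raw, [], {'a','b','c','d'});

seenLabels = getlabels(seenOnly);      % labels for 'a' and 'c' only
allLabels  = getlabels(allChoices);    % labels for all four choices
counts     = levelcounts(allChoices);  % how many answers fall in each level
```

With the levels supplied, the unused choices still show up with a count of zero, which is why the summary above lists all of a through d for every multiple choice question.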

### Gathering Information Per Question

I am now poised to gather information either by question or by student. Let's first look at the answers for questions 1 and 4. This is another dataset array.

q14Truth = alldata('Answers',{'Q1' 'Q4'})
q14Truth =
           Q1       Q4
Answers    true     c


To get a single answer, I have another option. This returns a nominal value.

q4ans = alldata.Q4('Answers')
q4ans =
c



I can get all the students' answers to a particular question. I get a nominal vector in return here since the Q4 column contains the answers to a multiple choice question.

q4all = alldata.Q4(2:end)
q4all =
c
c
c
b
d


whos q4*
  Name       Size            Bytes  Class      Attributes

  q4all      5x1               314  nominal
  q4ans      1x1               306  nominal



And I can also find out all answers for one student, resulting in another dataset array.

ChrisAnswers = alldata('Chris',:)
ChrisAnswers =
         Q1       Q2    Q3       Q4    Q5
Chris    false    a     false    c     b


### Which Questions are Hard?

Suppose I want to find out which questions are hardest for this set of students. I can use the datasetfun function, similar to cellfun and arrayfun, to apply a function to each variable in the data. For each question, I compare the students' answers (rows 2:end) to the truth (row 1) and compute the percentage of students who answered correctly.

f = @(x) 100*sum(x(1)==x(2:end))/(size(alldata,1)-1)
percentRight = datasetfun(f,alldata)
f =
@(x)100*sum(x(1)==x(2:end))/(size(alldata,1)-1)
percentRight =
60    80    80    60    60
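To turn those percentages into an answer to the question in the heading, one simple approach is to pick out the questions tied for the lowest score. This sketch copies the values from the output above so it runs on its own; the name hardest is just for illustration.

```matlab
% Values copied from the output above.
qnames       = {'Q1' 'Q2' 'Q3' 'Q4' 'Q5'};
percentRight = [60 80 80 60 60];

% Questions tied for the lowest percentage of correct answers.
hardest = qnames(percentRight == min(percentRight));
```

For this class, that selects Q1, Q4, and Q5, each answered correctly by 60% of the students.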


### Score the Assignments for Each Student

I use datasetfun again, this time comparing all elements in a column for students (i.e., 2:end) with the first element of that column, the right answer. I make an effort to label the rows with the students' names here.

f = @(x) x(1)==x(2:end)
rightWrong = datasetfun(f,alldata,'DatasetOutput',true,...
'ObsNames',alldata.Properties.ObsNames(2:end))
f =
@(x)x(1)==x(2:end)
rightWrong =
               Q1       Q2       Q3       Q4       Q5
Chris          false    true     true     true     false
Christine      false    false    true     true     true
Christopher    true     true     true     true     true
Kris           true     true     false    false    false
Kristen        true     true     true     false    true


Next I sum across the rows to get scores for each student and add the score as the last column to the dataset. Remember: the first score is for the answer key so I set that score to 100.

alldata.grade = [100; ...
100*sum(double(rightWrong),2)/size(alldata,2)]
alldata =
               Q1       Q2    Q3       Q4    Q5    grade
Answers        true     a     false    c     a     100
Chris          false    a     false    c     b      60
Christine      false    b     false    c     a      60
Christopher    true     a     false    c     a     100
Kris           true     a     true     b     c      40
Kristen        true     a     false    d     a      80


Notice that alldata now contains a new column with numeric values in addition to the nominal and logical ones.
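Once the grades are in a column, ordinary row indexing and sorting apply to them. As a sketch, the following rebuilds just the grade column from the table above (leaving out the answer key) and pulls out the students below a cutoff. The cutoff of 70 is a hypothetical passing mark, not something from the assignment.

```matlab
% Grades copied from the table above; the answer key row is left out.
students = {'Chris','Christine','Christopher','Kris','Kristen'};
grades   = dataset({[60; 60; 100; 40; 80],'grade'}, 'ObsNames', students);

% Logical indexing on rows keeps only students below the cutoff.
cutoff   = 70;   % hypothetical passing mark
needHelp = grades(grades.grade < cutoff, :);

% Sort from highest to lowest grade.
ranked = sortrows(grades, 'grade', 'descend');
```

The row labels travel with the rows, so needHelp and ranked still identify each student by name.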

### How Can You See Using a dataset Array?

Can you see applications in which you'd be able to take advantage of working with your data as a dataset array? Let me know here.

Published with MATLAB® 7.6
