## Loren on the Art of MATLAB

Turn ideas into MATLAB

Note

Loren on the Art of MATLAB has been archived and will not be updated.

# Handling Discrete Data

Discrete data arise in many applications. The data may be numeric or non-numeric; non-numeric discrete data are often referred to as categorical. Not all data are strictly numeric, and other characteristics can be pertinent or useful. You can use a variety of techniques and data representations in MATLAB for storing and manipulating discrete data.

### Example: Periodic Table of Elements

The periodic table of elements provides a rich basis for this discussion. If you remember your chemistry class (I do, very imperfectly), you probably remember that each element has a fixed number of protons associated with it, also called the atomic number. You might also remember that the periodic table is arranged so that certain columns of elements have certain characteristics. For example, apart from hydrogen, the left-most column holds the alkali metals (such as sodium and potassium), and the left side generally contains metallic elements. The right-most column holds the noble gases, with the next column to their left holding the halogens. Nonmetals are towards the right side of the table. At standard temperature and pressure (known as STP), each element is in one of three states: solid, liquid, or gas.

We've just talked about three different kinds of data:

• atomic number (the number of protons in the nucleus), a positive integer and therefore numeric
• state (solid, liquid, or gas), which can be considered ordinal (that is, solid < liquid < gas)
• element category (metals, nonmetals, etc.), which is simply a label or name and therefore nominal (notice how these categories aren't strictly columnar in the periodic table)

### Discrete Data Representation Options

I will try to use the periodic table example for each of the possible discrete data representations, but some of these will be forced examples.

### logical

You could imagine using a logical variable to mark elements in the noble gas category as true and all the remaining elements as false.
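Here is a minimal sketch of that idea for the first 10 elements, using only base MATLAB. The variable names `isNoble`, `nobleNames`, and `numOther` are mine, chosen for illustration.

```matlab
% First ten elements, ordered by atomic number.
elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
             'carbon','nitrogen','oxygen','fluorine','neon'}';

% isNoble is an illustrative name: true for noble gases, false otherwise.
isNoble = false(size(elemNames));
isNoble([2 10]) = true;          % helium (2) and neon (10)

nobleNames = elemNames(isNoble)  % logical indexing picks out the noble gases
numOther   = nnz(~isNoble)       % everything collapsed into "not noble"
```

The logical array doubles as an index, so no call to find is needed to pull out the matching names.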

### String or Coded Integer

logical variables are fine if you are representing exactly two possible categories. However, looking at the periodic table, you see that I collapsed a subset of categories into "not noble" (false). Rather than collapsing the set of categories into noble and not noble, the entire collection might be better represented by strings, or by integers that are arbitrarily given meaning. For example, I could code alkali metals as 'alkali metals' or as 1, and so on. With one such arbitrary coding, the first 3 elements of the table would be represented as [8 10 1] or as {'other nonmetals' 'noble gases' 'alkali metals'}.

Coded integers are useful for doing comparisons, for example, subsetting the data, and are memory-efficient. However, integers can cause you some mental overhead trying to remember the mapping between them and their labels.

Strings are good for representing the data in a readable form, but are harder to manipulate, especially for an ordered list such as phase states. Also you may prefer to use numeric type operations such as ==, ~=, and < since they are more direct and show intent. In addition, strings take more memory, especially when your data comprise many repetitions of the same value.
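To make the trade-off concrete, here is a small sketch that stores the same ten categories both ways. The label list and the code-to-label mapping are assumptions of mine for illustration (code k stands for `catLabels{k}`), and the exact byte counts will depend on your MATLAB version, so I only print them.

```matlab
% Labels and codes are illustrative; code k stands for catLabels{k}.
catLabels = {'metals','metalloids','other nonmetals','halogens', ...
             'noble gases','unknown'};
codes = uint8([3 5 1 1 2 3 3 3 4 5]');      % coded integers, one byte each
strs  = reshape(catLabels(codes), [], 1);   % same data as a cell array of strings

% Subsetting: == on the codes versus strcmp on the strings.
isMetalByCode = (codes == 1);
isMetalByStr  = strcmp(strs, 'metals');
sameAnswer    = isequal(isMetalByCode, isMetalByStr)

% Storage: compare the bytes reported by whos for each representation.
infoCodes = whos('codes');
infoStrs  = whos('strs');
fprintf('codes: %d bytes, strings: %d bytes\n', infoCodes.bytes, infoStrs.bytes)
```

Notice that `codes == 1` only makes sense if you remember that 1 means metals, while `strcmp(strs, 'metals')` reads clearly but costs more storage.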

### nominal Array

You have additional choices with nominal and ordinal arrays from Statistics Toolbox.

Let's imagine that we create an array representing the element categories by atomic number. The first 10 elements of this array look like this. I collapse some of the traditional categories for brevity.

catLabels = {'metals', 'metalloids','other nonmetals','halogens', ...
'noble gases', 'unknown'};
elemCats = nominal([3 5 1 1 2 3 3 3 4 5]', catLabels, 1:length(catLabels))
elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
'carbon','nitrogen','oxygen','fluorine','neon'}';
elemCats =
other nonmetals
noble gases
metals
metals
metalloids
other nonmetals
other nonmetals
other nonmetals
halogens
noble gases



Here is the list of all possible values for the categories. Notice that this nominal array carries that information with it. (You can modify this by adding, renaming, and deleting as necessary.)

getlabels(elemCats)
ans =
Columns 1 through 5
'metals'    'metalloids'    'other nonmetals'    'halogens'    'noble gases'
Column 6
'unknown'
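As a sketch of what modifying the labels might look like, the following uses the droplevels, addlevels, and setlabels methods for nominal arrays in Statistics Toolbox. The new level name 'lanthanides' and the renamed label 'nonmetals' are made up for illustration; check the calling syntax against the documentation for your release.

```matlab
% Requires Statistics Toolbox (nominal arrays).
catLabels = {'metals', 'metalloids','other nonmetals','halogens', ...
             'noble gases', 'unknown'};
elemCats = nominal([3 5 1 1 2 3 3 3 4 5]', catLabels, 1:length(catLabels));

elemCats = droplevels(elemCats, 'unknown');       % delete a level no element uses
elemCats = addlevels(elemCats, {'lanthanides'});  % add a level with no members yet
elemCats = setlabels(elemCats, 'nonmetals', 'other nonmetals');  % rename a level

newLabels = getlabels(elemCats)
```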


Let's find all the 'other nonmetals' in the list.

nonmets = find(elemCats == 'other nonmetals')
elemNames(nonmets)
nonmets =
1
6
7
8
ans =
'hydrogen'
'carbon'
'nitrogen'
'oxygen'
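The comparison itself returns a logical array, so you can also combine several tests and index directly, without find. Here is a short sketch; `isHalOrNoble` and `notMetal` are illustrative names.

```matlab
% Requires Statistics Toolbox (nominal arrays).
catLabels = {'metals', 'metalloids','other nonmetals','halogens', ...
             'noble gases', 'unknown'};
elemCats  = nominal([3 5 1 1 2 3 3 3 4 5]', catLabels, 1:length(catLabels));
elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
             'carbon','nitrogen','oxygen','fluorine','neon'}';

% Combine two equality tests with a logical OR.
isHalOrNoble = (elemCats == 'halogens') | (elemCats == 'noble gases');
halOrNoble   = elemNames(isHalOrNoble)

% ~= works the same way: everything that is not a metal.
notMetal = elemNames(elemCats ~= 'metals')
```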


### ordinal Array

Let's investigate the states of the first 10 elements now. For some purposes, the states can be regarded as nominal. However, they can also be viewed as an ordered set, solid < liquid < gas. (At STP, only mercury and bromine are liquid, so no liquids appear among the first 10 elements.)

stLabels = {'solid' 'liquid' 'gas'};
elemSts = ordinal([3 3 1 1 1 1 3 3 3 3]', stLabels, 1:length(stLabels))
elemSts =
gas
gas
solid
solid
solid
solid
gas
gas
gas
gas



Here is the list of all possible values for the states.

getlabels(elemSts)
ans =
'solid'    'liquid'    'gas'


Let's find all the gases in the list.

gases = find(elemSts > 'liquid') % or == 'gas'
elemNames(gases)
gases =
1
2
7
8
9
10
ans =
'hydrogen'
'helium'
'nitrogen'
'oxygen'
'fluorine'
'neon'
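Because the levels are ordered, the other relational operators work too. A brief sketch, with `notGas` and `atLeastLiquid` as illustrative names:

```matlab
% Requires Statistics Toolbox (ordinal arrays).
stLabels  = {'solid' 'liquid' 'gas'};
elemSts   = ordinal([3 3 1 1 1 1 3 3 3 3]', stLabels, 1:length(stLabels));
elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
             'carbon','nitrogen','oxygen','fluorine','neon'}';

notGas        = elemNames(elemSts < 'gas')   % solid or liquid
atLeastLiquid = nnz(elemSts >= 'liquid')     % number of liquids and gases
```

The comparisons follow the order of the levels given when the array was created, not the alphabetical order of the labels.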


Let's sort the element list by state.

[sts, elemNo] = sort(elemSts)
[cellstr(sts) elemNames(elemNo)]
sts =
solid
solid
solid
solid
gas
gas
gas
gas
gas
gas

elemNo =
3
4
5
6
1
2
7
8
9
10
ans =
'solid'    'lithium'
'solid'    'beryllium'
'solid'    'boron'
'solid'    'carbon'
'gas'      'hydrogen'
'gas'      'helium'
'gas'      'nitrogen'
'gas'      'oxygen'
'gas'      'fluorine'
'gas'      'neon'


### Integer

For the given list of elements, the atomic number happens to match exactly with the index into the elemNames array. So I only need to create the integer values 1:10 to represent the atomic number. This list is itself clearly ordered and has numerical meaning, unlike an ordinal array, whose levels are ordered but have no defined spacing between them.
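Here is a small sketch of what that numerical meaning buys you: lookups by value, and arithmetic that makes sense. The names `evenZ` and `gap` are illustrative.

```matlab
elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
             'carbon','nitrogen','oxygen','fluorine','neon'}';
atomicNo  = (1:length(elemNames))';   % atomic number doubles as the index here

element8 = elemNames(atomicNo == 8)            % look up by atomic number
evenZ    = elemNames(mod(atomicNo, 2) == 0)    % elements with an even atomic number
gap      = atomicNo(10) - atomicNo(2)          % neon has this many more protons than helium
```

A difference such as `gap` has no counterpart for the ordinal states: gas minus solid is not defined.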

### Going All the Way

To continue this example, you can collect all the information together into a single dataset array (from Statistics Toolbox). The advantages include being able to "index" by the element name!

atomicNo = (1:length(elemNames))';
periodicTable = dataset(atomicNo, elemCats, elemSts, 'obsnames', elemNames)
periodicTable =
             atomicNo    elemCats           elemSts
hydrogen      1          other nonmetals    gas
helium        2          noble gases        gas
lithium       3          metals             solid
beryllium     4          metals             solid
boron         5          metalloids         solid
carbon        6          other nonmetals    solid
nitrogen      7          other nonmetals    gas
oxygen        8          other nonmetals    gas
fluorine      9          halogens           gas
neon         10          noble gases        gas
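Here is a sketch of what indexing by element name can look like, along with subsetting on one of the variables. The block rebuilds the dataset so it runs on its own; `oxygenRow`, `neonState`, and `gasTable` are illustrative names.

```matlab
% Requires Statistics Toolbox (nominal, ordinal, dataset).
elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
             'carbon','nitrogen','oxygen','fluorine','neon'}';
catLabels = {'metals', 'metalloids','other nonmetals','halogens', ...
             'noble gases', 'unknown'};
stLabels  = {'solid' 'liquid' 'gas'};
elemCats  = nominal([3 5 1 1 2 3 3 3 4 5]', catLabels, 1:length(catLabels));
elemSts   = ordinal([3 3 1 1 1 1 3 3 3 3]', stLabels, 1:length(stLabels));
atomicNo  = (1:length(elemNames))';
periodicTable = dataset(atomicNo, elemCats, elemSts, 'obsnames', elemNames);

oxygenRow = periodicTable('oxygen', :)        % one observation, all variables
neonState = periodicTable{'neon', 'elemSts'}  % contents of a single entry
gasTable  = periodicTable(periodicTable.elemSts == 'gas', :)  % only the gases
```

Parentheses return a smaller dataset array, while curly braces return the stored value itself.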