Handling Discrete Data
Discrete data arise in many applications. The data may be numeric or non-numeric; non-numeric discrete data are often referred to as categorical. Even when data are not strictly numeric, their other characteristics can be pertinent or useful. You can use a variety of techniques and data representations in MATLAB for storing and manipulating discrete data.
Example: Periodic Table of Elements
The periodic table of elements provides a rich basis for this discussion. If you remember your chemistry class (I do, very imperfectly), you probably remember that each element has a fixed number of protons associated with it, also called the atomic number. You might also remember that the periodic table is arranged so that certain columns of elements share certain characteristics. For example, apart from hydrogen, the left-most column holds the alkali metals (such as sodium and potassium), and the left side generally contains metallic elements. The right-most column holds the noble gases, with the next column to their left holding the halogens. Nonmetals are towards the right side of the table. At standard temperature and pressure (known as STP), each element is in one of three states: solid, liquid, or gas.
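We've just talked about 3 different things:
- the atomic number
- the element category (for example, alkali metal, halogen, or noble gas)
- the state at STP (solid, liquid, or gas)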
Discrete Data Representation Options
I will try to use the periodic table example for each of the possible discrete data representations, but some of these will be forced examples.
logical
You could imagine using a logical variable to mark the elements in the noble gas category as true and all the remaining elements as false.
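As a minimal sketch of that idea, the snippet below builds such a logical variable for the first 10 elements. The variable names (isNoble, nobleNames) are my own choices for illustration, and the hard-coded atomic numbers 2 and 10 are helium and neon.

```matlab
% First 10 elements, listed in order of atomic number.
elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
             'carbon','nitrogen','oxygen','fluorine','neon'}';
atomicNo = (1:numel(elemNames))';

% true for the noble gases (helium = 2, neon = 10), false for the rest.
isNoble = ismember(atomicNo, [2 10]);

% Logical indexing then selects the matching names directly.
nobleNames = elemNames(isNoble)
```

The logical vector doubles as an index, so no call to find is needed to pull out the matching elements.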
String or Coded Integer
logical variables are fine if you are representing exactly two possible categories. However, looking at the periodic table, you see that I collapsed a subset of categories into "not Noble" (false). Rather than collapsing the set of categories into Noble and not Noble, the entire collection might be better represented as integers that are arbitrarily given meaning, or by strings. For example, I could code alkali metals as 'alkali metals' or as 1, and so on. The first 3 elements of the table would be represented with [8 10 1] or as {'other nonmetals' 'noble gases' 'alkali metals'}.
Coded integers are useful for comparisons (for example, when subsetting the data) and are memory-efficient. However, they impose some mental overhead, since you have to remember the mapping between the codes and their labels.
Strings are good for representing the data in a readable form, but are harder to manipulate, especially for an ordered list such as phase states. Also you may prefer to use numeric type operations such as ==, ~=, and < since they are more direct and show intent. In addition, strings take more memory, especially when your data comprise many repetitions of the same value.
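To make the trade-off concrete, here is a small sketch that selects the alkali metals from the first 3 elements under both representations. The integer codes are the arbitrary ones used above; the variable names are hypothetical.

```matlab
% Two representations of the first three elements (hydrogen, helium, lithium).
codes  = [8 10 1];   % arbitrary integer codes, 1 meaning alkali metals
labels = {'other nonmetals' 'noble gases' 'alkali metals'};

% Coded integers: relational operators apply directly.
isAlkaliByCode = (codes == 1);

% Strings: the comparison has to go through strcmp, because == on
% character arrays compares character by character.
isAlkaliByLabel = strcmp(labels, 'alkali metals');

% Both approaches pick out the same element.
sameAnswer = isequal(isAlkaliByCode, isAlkaliByLabel)
```

With the integer codes, the reader has to know that 1 stands for alkali metals; with the strings, the intent is visible but an ordered comparison such as < is not available.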
nominal Array
You have additional choices with nominal and ordinal arrays from Statistics Toolbox.
Let's imagine that we create an array representing the element categories by atomic number. The first 10 elements of this array look like this. I collapse some of the traditional categories for brevity.
catLabels = {'metals', 'metalloids','other nonmetals','halogens', ...
    'noble gases', 'unknown'};
elemCats = nominal([3 5 1 1 2 3 3 3 4 5]', catLabels, 1:length(catLabels))
elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
    'carbon','nitrogen','oxygen','fluorine','neon'}';
elemCats =
     other nonmetals
     noble gases
     metals
     metals
     metalloids
     other nonmetals
     other nonmetals
     other nonmetals
     halogens
     noble gases
Here is the list of all possible values for the categories. Notice that this nominal array carries that information with it. (You can modify this by adding, renaming, and deleting as necessary.)
getlabels(elemCats)
ans =
  Columns 1 through 5
    'metals'    'metalloids'    'other nonmetals'    'halogens'    'noble gases'
  Column 6
    'unknown'
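The adding, renaming, and deleting mentioned above might look like the following sketch. It assumes the addlevels, setlabels, and droplevels functions that Statistics Toolbox provides for nominal and ordinal arrays; the labels 'lanthanides' and 'nonmetals' are hypothetical, chosen only for illustration, so check the documentation for your release before relying on the exact calling syntax.

```matlab
% Rebuild the category array so this snippet stands on its own.
catLabels = {'metals','metalloids','other nonmetals','halogens', ...
             'noble gases','unknown'};
elemCats = nominal([3 5 1 1 2 3 3 3 4 5]', catLabels, 1:length(catLabels));

% Add a level that no element uses yet (hypothetical label).
elemCats = addlevels(elemCats, {'lanthanides'});

% Rename an existing level (hypothetical new label).
elemCats = setlabels(elemCats, {'nonmetals'}, {'other nonmetals'});

% Delete two levels that no element in this list belongs to.
elemCats = droplevels(elemCats, {'unknown','lanthanides'});

newLabels = getlabels(elemCats)
```

Only unused levels are dropped here; dropping a level that still has members would leave those elements undefined.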
Let's find all the 'other nonmetals' in the list.
nonmets = find(elemCats == 'other nonmetals')
elemNames(nonmets)
nonmets =
     1
     6
     7
     8
ans =
    'hydrogen'
    'carbon'
    'nitrogen'
    'oxygen'
ordinal Array
Let's investigate the states of the first 10 elements now. For some purposes, the states can be regarded as nominal. However, they can also be viewed as an ordered set, solid < liquid < gas. (At STP, only mercury and bromine are in liquid state.)
stLabels = {'solid' 'liquid' 'gas'};
elemSts = ordinal([3 3 1 1 1 1 3 3 3 3]', stLabels, 1:length(stLabels))
elemSts =
     gas
     gas
     solid
     solid
     solid
     solid
     gas
     gas
     gas
     gas
Here is the list of all possible values for the states.
getlabels(elemSts)
ans = 'solid' 'liquid' 'gas'
Let's find all the gases in the list.
gases = find(elemSts > 'liquid') % or == 'gas'
elemNames(gases)
gases =
     1
     2
     7
     8
     9
    10
ans =
    'hydrogen'
    'helium'
    'nitrogen'
    'oxygen'
    'fluorine'
    'neon'
Let's sort the element list by state.
[sts, elemNo] = sort(elemSts)
[cellstr(sts) elemNames(elemNo)]
sts =
     solid
     solid
     solid
     solid
     gas
     gas
     gas
     gas
     gas
     gas
elemNo =
     3
     4
     5
     6
     1
     2
     7
     8
     9
    10
ans =
    'solid'    'lithium'
    'solid'    'beryllium'
    'solid'    'boron'
    'solid'    'carbon'
    'gas'      'hydrogen'
    'gas'      'helium'
    'gas'      'nitrogen'
    'gas'      'oxygen'
    'gas'      'fluorine'
    'gas'      'neon'
Integer
For the given list of elements, the atomic number happens to match the index into the elemNames array exactly. So I only need to create the integer values 1:10 to represent the atomic number. This list is itself clearly ordered and has numerical meaning (unlike an ordinal array, which is ordered but leaves the spacing between its levels undefined).
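A short sketch of what that numerical meaning buys you: because atomic numbers are genuine integers, both indexing and arithmetic make sense. The variable names protonGap and atomicNo8 are hypothetical, and the uint8 conversion is just one optional way to store small integers compactly.

```matlab
% First 10 elements, listed in order of atomic number.
elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
             'carbon','nitrogen','oxygen','fluorine','neon'}';
atomicNo = (1:numel(elemNames))';

% For this list, the atomic number doubles as an index into elemNames.
sixth = elemNames{atomicNo(6)}

% Differences are meaningful: oxygen has two more protons than carbon.
protonGap = atomicNo(8) - atomicNo(6)

% Optionally store the values in an integer class to save memory.
atomicNo8 = uint8(atomicNo);
```

The same subtraction on an ordinal array of states would have no defined meaning, which is the distinction drawn above.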
Going All the Way
To continue this example, you can collect all the information together into a single dataset array (from Statistics Toolbox). The advantages include being able to "index" by the element name!
atomicNo = (1:length(elemNames))';
periodicTable = dataset(atomicNo, elemCats, elemSts, 'obsnames', elemNames)
periodicTable =
                 atomicNo    elemCats           elemSts
    hydrogen      1          other nonmetals    gas
    helium        2          noble gases        gas
    lithium       3          metals             solid
    beryllium     4          metals             solid
    boron         5          metalloids         solid
    carbon        6          other nonmetals    solid
    nitrogen      7          other nonmetals    gas
    oxygen        8          other nonmetals    gas
    fluorine      9          halogens           gas
    neon         10          noble gases        gas
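Indexing by element name could look like the sketch below. It rebuilds a 3-row version of the dataset array so that it runs on its own; the names pt, heRow, and gasRows are hypothetical, and the snippet assumes the dataset and ordinal classes from Statistics Toolbox.

```matlab
% A three-element version of the table, enough to show the indexing.
elemNames = {'hydrogen','helium','lithium'}';
atomicNo  = (1:numel(elemNames))';
elemSts   = ordinal([3 3 1]', {'solid' 'liquid' 'gas'}, 1:3);
pt = dataset(atomicNo, elemSts, 'obsnames', elemNames);

% Select one row by its observation name instead of its row number.
heRow = pt('helium', :)

% Dot indexing on the one-row result extracts the stored value.
heNumber = heRow.atomicNo

% A logical condition on one variable selects whole rows.
gasRows = pt(pt.elemSts == 'gas', :)
```

The observation names travel with the rows, so the subset in gasRows still identifies its elements by name.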
Your Data or Experiment
Do you have a situation where some of your data can be represented with nominal or ordinal arrays? How do you manage that information now? Let me know by posting here.