Loren on the Art of MATLAB

Note

Loren on the Art of MATLAB has been archived and will not be updated.

Handling Discrete Data

Discrete data arise in many applications. The data may be numeric or non-numeric, and non-numeric data are often referred to as categorical. Not all data are strictly numeric, and other characteristics can be pertinent or useful. You can use a variety of techniques and data representations in MATLAB for storing and manipulating discrete data.

Example: Periodic Table of Elements

The periodic table of elements provides a rich basis for this discussion. If you remember your chemistry class (I do, very imperfectly), you probably remember that each element has a fixed number of protons associated with it, also called the atomic number. You might also remember that the periodic table is arranged so that certain columns of elements share certain characteristics. For example, apart from hydrogen, the left-most column holds the alkali metals (such as sodium and potassium), and the left side generally contains metallic elements. The right-most column holds the noble gases, with the next column to their left holding the halogens. Nonmetals are towards the right side of the table. At standard temperature and pressure (known as STP), each element is in one of three states: solid, liquid, or gas.

We've just talked about 3 different things:

  • atomic number (number of protons in the nucleus), positive integers and therefore numeric
  • solid, liquid, gas states, which can be considered ordinal (that is, solid < liquid < gas)
  • element categories (metals, nonmetals, and so on), which are simply labels or names and therefore nominal (notice how these categories aren't strictly columnar in the periodic table)

Discrete Data Representation Options

    I will try to use the periodic table example for each of the possible discrete data representations, but some of these will be forced examples.

    logical

    You could imagine using a logical variable to mark elements in the noble gases category as true and all the remaining elements as false.
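As a minimal sketch of the logical approach, here is what that might look like for the first 10 elements. The name list is the same one used later in this post; isNoble, nobleNames, and numNotNoble are names I made up for illustration.

```matlab
% Names of the first 10 elements, indexed by atomic number.
elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
    'carbon','nitrogen','oxygen','fluorine','neon'}';

% isNoble is a hypothetical variable name: true marks a noble gas.
isNoble = false(size(elemNames));
isNoble([2 10]) = true;      % helium and neon

% Logical indexing pulls out the matching names directly.
nobleNames = elemNames(isNoble)

% Everything else has been collapsed into "not noble".
numNotNoble = sum(~isNoble)
```

Notice that the mask says nothing about which category the other 8 elements belong to.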

    String or Coded Integer

    logical variables are fine if you are representing exactly two possible categories. However, looking at the periodic table, you see that I collapsed a subset of categories into "not Noble" (false). Rather than collapsing the set of categories into Noble and not Noble, the entire collection might be better represented as integers that are arbitrarily given meaning, or by strings. For example, I could code alkali metals as 'alkali metals' or as 1, and so on. The first 3 elements of the table would be represented with [8 10 1] or as {'other nonmetals' 'noble gases' 'alkali metals'}.

    Coded integers are useful for doing comparisons, for example, subsetting the data, and are memory-efficient. However, integers can cause you some mental overhead trying to remember the mapping between them and their labels.

    Strings are good for representing the data in a readable form, but are harder to manipulate, especially for an ordered list such as phase states. Also you may prefer to use numeric type operations such as ==, ~=, and < since they are more direct and show intent. In addition, strings take more memory, especially when your data comprise many repetitions of the same value.
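The trade-offs between coded integers and strings can be sketched side by side. The coding below is arbitrary and chosen only for this illustration (it is not the [8 10 1] coding mentioned above), and all variable names are hypothetical.

```matlab
% Hypothetical coding, chosen arbitrarily for this sketch.
codeLabels = {'metals','metalloids','other nonmetals','halogens','noble gases'};
elemCodes = uint8([3 5 1 1 2 3 3 3 4 5]');   % first 10 elements

% The same data stored as strings, by looking each code up in the label list.
elemStrs = codeLabels(elemCodes)';

% Coded integers: compare with ==, but you must remember that 5 means noble gases.
nobleByCode = find(elemCodes == 5)

% Strings: readable, but the comparison goes through strcmp instead of ==.
nobleByStr = find(strcmp(elemStrs, 'noble gases'))

% Sorting strings is alphabetical, which is not the physical order
% solid < liquid < gas.
sortedStates = sort({'solid','liquid','gas'})

% Compare the storage each representation needs, in bytes.
infoCodes = whos('elemCodes');
infoStrs = whos('elemStrs');
[infoCodes.bytes infoStrs.bytes]
```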

    nominal Array

    You have additional choices with nominal and ordinal arrays from Statistics Toolbox.

    Let's imagine that we create an array representing the element categories by atomic number. The first 10 elements of this array look like this. I collapse some of the traditional categories for brevity.

    catLabels = {'metals', 'metalloids','other nonmetals','halogens', ...
        'noble gases', 'unknown'};
    elemCats = nominal([3 5 1 1 2 3 3 3 4 5]', catLabels, 1:length(catLabels))
    elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
        'carbon','nitrogen','oxygen','fluorine','neon'}';
    elemCats = 
         other nonmetals 
         noble gases 
         metals 
         metals 
         metalloids 
         other nonmetals 
         other nonmetals 
         other nonmetals 
         halogens 
         noble gases 
    
    

    Here is the list of all possible values for the categories. Notice that this nominal array carries that information with it. (You can modify this by adding, renaming, and deleting as necessary.)

    getlabels(elemCats)
    ans = 
      Columns 1 through 5
        'metals'    'metalloids'    'other nonmetals'    'halogens'    'noble gases'
      Column 6
        'unknown'
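Here is a hedged sketch of modifying the set of labels, assuming the droplevels, setlabels, and addlevels methods that Statistics Toolbox provides for nominal and ordinal arrays. It works on a copy (cats2, a name I made up) so that elemCats itself is left unchanged, and the 'lanthanides' label is only a made-up addition for illustration.

```matlab
% Requires Statistics Toolbox. Rebuild the array from the post so this runs alone.
catLabels = {'metals', 'metalloids','other nonmetals','halogens', ...
    'noble gases', 'unknown'};
elemCats = nominal([3 5 1 1 2 3 3 3 4 5]', catLabels, 1:length(catLabels));

% Work on a copy (cats2 is a hypothetical name) so elemCats stays unchanged.
cats2 = droplevels(elemCats, 'unknown');                   % delete an unused level
cats2 = setlabels(cats2, 'nonmetals', 'other nonmetals');  % rename a level
cats2 = addlevels(cats2, 'lanthanides');                   % add a level no element uses yet

getlabels(cats2)
```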
    

    Let's find all the 'other nonmetals' in the list.

    nonmets = find(elemCats == 'other nonmetals')
    elemNames(nonmets)
    nonmets =
         1
         6
         7
         8
    ans = 
        'hydrogen'
        'carbon'
        'nitrogen'
        'oxygen'
    

    ordinal Array

    Let's investigate the states of the first 10 elements now. For some purposes, the states can be regarded as nominal. However, they can also be viewed as an ordered set, solid < liquid < gas. (At STP, only mercury and bromine are in the liquid state, so none of the first 10 elements is a liquid.)

    stLabels = {'solid' 'liquid' 'gas'};
    elemSts = ordinal([3 3 1 1 1 1 3 3 3 3]', stLabels, 1:length(stLabels))
    elemSts = 
         gas 
         gas 
         solid 
         solid 
         solid 
         solid 
         gas 
         gas 
         gas 
         gas 
    
    

    Here is the list of all possible values for the states.

    getlabels(elemSts)
    ans = 
        'solid'    'liquid'    'gas'
    

    Let's find all the gases in the list.

    gases = find(elemSts > 'liquid') % or == 'gas'
    elemNames(gases)
    gases =
         1
         2
         7
         8
         9
        10
    ans = 
        'hydrogen'
        'helium'
        'nitrogen'
        'oxygen'
        'fluorine'
        'neon'
    

    Let's sort the element list by state.

    [sts, elemNo] = sort(elemSts)
    [cellstr(sts) elemNames(elemNo)]
    sts = 
         solid 
         solid 
         solid 
         solid 
         gas 
         gas 
         gas 
         gas 
         gas 
         gas 
    
    elemNo =
         3
         4
         5
         6
         1
         2
         7
         8
         9
        10
    ans = 
        'solid'    'lithium'  
        'solid'    'beryllium'
        'solid'    'boron'    
        'solid'    'carbon'   
        'gas'      'hydrogen' 
        'gas'      'helium'   
        'gas'      'nitrogen' 
        'gas'      'oxygen'   
        'gas'      'fluorine' 
        'gas'      'neon'     
    

    Integer

    For the given list of elements, the atomic number happens to match the index into the elemNames array exactly. So I only need to create the integer values 1:10 to represent the atomic number. This list is clearly ordered and also has numerical meaning, unlike an ordinal array, which is ordered but gives no defined spacing between its levels.
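A small sketch of what "numerical meaning" buys you: differences between atomic numbers are themselves meaningful quantities, which is not true of ordinal levels. The variable names neonNo, heNo, and protonDiff are hypothetical.

```matlab
elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
    'carbon','nitrogen','oxygen','fluorine','neon'}';

% The index into elemNames doubles as the atomic number for these 10 elements.
atomicNo = (1:length(elemNames))';

% Look up positions by name; ismember returns the location in elemNames.
[tf, neonNo] = ismember('neon', elemNames);
[tf, heNo] = ismember('helium', elemNames);

% Subtraction is meaningful here: neon has this many more protons than helium.
protonDiff = atomicNo(neonNo) - atomicNo(heNo)
```

By contrast, there is no sensible answer to "gas minus solid" for the ordinal array of states.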

    Going All the Way

    To continue this example, you can collect all the information together into a single dataset array (from Statistics Toolbox). The advantages include being able to "index" by the element name!

    atomicNo = (1:length(elemNames))';
    periodicTable = dataset(atomicNo ,elemCats, elemSts, 'obsnames', elemNames)
    periodicTable = 
                     atomicNo    elemCats           elemSts
        hydrogen      1          other nonmetals    gas    
        helium        2          noble gases        gas    
        lithium       3          metals             solid  
        beryllium     4          metals             solid  
        boron         5          metalloids         solid  
        carbon        6          other nonmetals    solid  
        nitrogen      7          other nonmetals    gas    
        oxygen        8          other nonmetals    gas    
        fluorine      9          halogens           gas    
        neon         10          noble gases        gas    
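To make the "index by name" idea concrete, here is a minimal sketch, assuming the dataset array supports observation names and variable names as subscripts in the way I remember. The names carbonRow, nobleStates, solids, and solidNames are made up for illustration.

```matlab
% Requires Statistics Toolbox. Rebuild the dataset array from the post.
elemNames = {'hydrogen','helium','lithium','beryllium','boron', ...
    'carbon','nitrogen','oxygen','fluorine','neon'}';
catLabels = {'metals', 'metalloids','other nonmetals','halogens', ...
    'noble gases', 'unknown'};
elemCats = nominal([3 5 1 1 2 3 3 3 4 5]', catLabels, 1:length(catLabels));
stLabels = {'solid' 'liquid' 'gas'};
elemSts = ordinal([3 3 1 1 1 1 3 3 3 3]', stLabels, 1:length(stLabels));
atomicNo = (1:length(elemNames))';
periodicTable = dataset(atomicNo, elemCats, elemSts, 'obsnames', elemNames);

% Index one observation by name; the result is a 1-row dataset array.
carbonRow = periodicTable('carbon', :)

% Index several observations and a single variable, all by name.
nobleStates = periodicTable({'helium','neon'}, 'elemSts')

% Combine with a logical condition on a variable to pull out all the solids.
solids = periodicTable(periodicTable.elemSts == 'solid', :);
solidNames = solids.Properties.ObsNames
```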
    

    Your Data or Experiment

    Do you have a situation where some of your data can be represented with nominal or ordinal arrays? How do you manage that information now? Let me know by posting here.




    Published with MATLAB® 7.9

