Analyzing Baby Names with MATLAB R2015b

저자 Loren Shure, November 3, 2015

3 회 조회 (최근 30일) | 0 좋아요 | 1 댓글

Today's guest blogger is Matt Tearle, from the Training Services group here at MathWorks. Matt puts two data analysis functions introduced in R2015b through their paces with some unexpected uses.

Introducing two new functions
Popular baby names, from Aagot to Zzyzx
Putting theory into practice
Preaching the good news
Converting counts to percentages
Where will findgroups and splitapply take you?

Introducing two new functions

Our tireless MATLAB developers keep adding great new functionality for data analysis. I've been having fun playing with two new functions introduced in R2015b: findgroups and splitapply. These functions make it easy to find all unique combinations of key variables, assign a group number to each observation in your data based on those key variables, then apply a function to each of those groups.

The documentation and the release video show examples of the standard, expected use, such as finding the median blood pressure measurement of patients in a trial, grouped by the patients' gender and the drug dosage they received. Let's make a table of some simulated data for this drug trial example.

rng(1234)
numpatients = 15;
gender = categorical(randi(2,numpatients,1),1:2,{'F','M'});
drugdose = 50*randi(3,numpatients,1);
bp = 100 + 40*rand(numpatients,1);
patientdata = table(gender,drugdose,bp,'VariableNames',{'Gender','Dose','BP'})

patientdata = 
    Gender    Dose      BP  
    ______    ____    ______
    F         100     134.77
    M         100     117.45
    F          50     132.09
    M         150     105.75
    M         150     128.17
    F         100     128.18
    F         100     108.75
    M          50     136.99
    M         100     117.69
    M         150     136.37
    F         100     102.39
    M         100     107.37
    M         150     101.89
    M          50        127
    F         100     123.78

First use findgroups to assign a group number to each patient, where the groups are defined by the combination of Gender and Dose. It's then easy to calculate the median of BP for each group using splitapply.

group = findgroups(patientdata(:,{'Gender','Dose'}));
medianbp = splitapply(@median,patientdata.BP,group)

Note that there are five values, even though there are six possible combinations of gender and drug dose. You can use the second output from findgroups to obtain the list of groups. Because the input is a table, the group names will also be returned as a table. You can use dot notation indexing to add the median blood pressure values to the table.

[group,gender_and_dose] = findgroups(patientdata(:,{'Gender','Dose'}));
gender_and_dose.MedianBP = splitapply(@median,patientdata.BP,group)

gender_and_dose = 
    Gender    Dose    MedianBP
    ______    ____    ________
    F          50     132.09  
    F         100     123.78  
    M          50     131.99  
    M         100     117.45  
    M         150     116.96

One of best things about my job is that I get to see these new features take shape and think about how MATLAB users (like you!) can put them to work. What's really fun is when I get to use the features myself for my own work. Or play. And what's really really fun is when that usage employs a little "out-of-the-box" thinking. In this case, I think I discovered applications for findgroups and splitapply that were not quite what the developers had in mind.

Popular baby names, from Aagot to Zzyzx

Maybe there's something in the water in our building, but a lot of coworkers are having babies. This naturally leads to many coffee-machine conversations about baby stuff: sleep deprivation, busybody in-laws, and, of course, names.

One particular conversation turned to common and uncommon baby names. (My own kids are in the former category, and a coworker's is in the latter.) Being the data-driven MATLAB nerds that we are, we figured there must be a way to quantify just how common a given name is. I recalled that, for the U.S., the Social Security Administration keeps records of baby names. A minute or two with a well-known internet search engine and I had my data source -- a zip file containing names and numbers for each year from 1880 to 2014. All I had to do was read the data from these files into MATLAB, and I'd be able to find out where any given name ranked in all-time popularity.

Each file in the downloaded folder contained data for a given year and consisted of three columns: name, gender, and the number of children (of that gender) registered with that name in that year. The first few lines of the file for 1880 look like this:

dbtype('names\yob1880.txt','1:8')

1     Mary,F,7065
2     Anna,F,2604
3     Emma,F,2003
4     Elizabeth,F,1939
5     Minnie,F,1746
6     Margaret,F,1578
7     Ida,F,1472
8     Alice,F,1414

And 1980 looks like this:

dbtype('names\yob1980.txt','1:8')

1     Jennifer,F,58385
2     Amanda,F,35820
3     Jessica,F,33920
4     Melissa,F,31631
5     Sarah,F,25741
6     Heather,F,19971
7     Nicole,F,19916
8     Amy,F,19832

Note that the names are not in the same order from year to year (they are in order of popularity for that year). Different files do not even contain the same list of names. In 2010 there were 5 boys called Zzyzx, and in 1915 there were 5 girls called Aagot; in no other years were either of these names popular enough to reach the 5-name cutoff.

Reading any one file into MATLAB is easy -- it's just delimited text. But how should I compile the data from all years into one array? This actually depends on how I plan to use the data. If I want to study patterns through time, I'd need to make a table of names with annual counts, like this:

      |  Mary   Anna  ...  Aagot  ...
------|-------------------------------
 1880 |  7065   2604  ...      0  ...
 1881 |  6919   2698  ...      0  ...
 ...  |               ...
 1915 | 58187  15120  ...      5  ...
 ...  |               ...

There are ways to accomplish this using tables, which are another (somewhat) new feature of MATLAB that I love and frequently use for my own work. But in this case, I didn't need the individual annual values; I just needed the all-time totals. It occurred to me that if I had all the data concatenated into one big list, then I'd just need to combine all the repeated name/gender combinations:

Mary,F,7065      <------|
Anna,F,2604             |
Emma,F,2003             |
...                     |
Mary,F,6919      <------|
Anna,F,2698             |
...                     |--- sum --> Total number of girls called Mary
Mary,F,58187     <------|
Helen,F,30866           |
Dorothy,F,25154         |
...                     |
Liliana,F,2624          |
Mary,F,2611      <------|
Elena,F,2598
...

By now hopefully you're seeing what I saw: findgroups and splitapply can actually be applied to collate my data. The groups will be the combinations of name and gender. The function to apply will be sum and the data to apply it to will be the number.

Putting theory into practice

Now all I need is to get all the data in as one big table. It turns out that this is very easy using another recent MATLAB feature: datastores. Datastores are designed to make it easy to read data from multiple files with the same formatting. In this case, all the files are comma-delimited text with three columns.

dat = datastore('names\yob*.txt',...
    'ReadVariableNames',false,'VariableNames',{'Name','Gender','Number'});

This creates a datastore that refers to all the data files. The options specify the variable names for the three columns, rather than obtain them by reading a header line in the file (these files do not contain any header information). The format for each column has been determined automatically. But rather than read the gender as a string, I can modify the format to read it as a categorical variable:

dat.SelectedFormats{2} = '%C';

Categorical variables are easier and more efficent to work with than strings, when dealing with a large number of instances of a small number of unique strings ('F' and 'M', in this case).

Now I'm ready to read the data. I can read each file individually with the read function, but I can also read everything in one go with the readall function:

rawnamedata = readall(dat);
whos rawnamedata

  Name                   Size                Bytes  Class    Attributes

  rawnamedata      1825433x3             243420481  table

Wow. Importing data has never been so easy! Now that I have all the data in one table, I can use findgroups to find all name/gender combinations and assign groups accordingly:

[group,names] = findgroups(rawnamedata(:,{'Name','Gender'}));

And, finally, splitapply to collate the totals:

names.Number = splitapply(@sum,rawnamedata.Number,group);
whos names

  Name            Size               Bytes  Class    Attributes

  names      104110x3             13943724  table

Now names is a table like rawnamedata but with all the repeats for each name totalled into a single observation.

It's now straightforward to obtain the all-time number for any given name. I can leave the names as strings, which allows for all sorts of string matching ("names starting with 'Aa'"), or convert them to categorical, which allows only exact matching but with == instead of strcmp.

names.Name = categorical(names.Name);
names(names.Name == 'Loren',:)
names(names.Name == 'Cleve',:)

ans = 
    Name     Gender    Number
    _____    ______    ______
    Loren    F         11862 
    Loren    M         45806 
ans = 
    Name     Gender    Number
    _____    ______    ______
    Cleve    F           23  
    Cleve    M         4503

Preaching the good news

You may have noticed that I've leveraged a lot of new features: tables, categoricals, and datastores, as well as findgroups and splitapply. I'm definitely a big fan of using new capabilities whenever possible to do more things more easily. For comparison's sake, here's what I came up with for how to do this same task in 2012:

dbtype('oldschool.m')

1     fnm = cellstr(ls('names\yob*.txt'));
2     nf = length(fnm);
3     alldata = cell(nf,3);
4     for k = 1:nf
5         fid = fopen(['names\',fnm{k}]);
6         alldata(k,:) = textscan(fid,'%s%s%f','Delimiter',',');
7         fid = fclose(fid);
8     end
9     
10    names = cat(1,alldata{:,1});
11    gender = cat(1,alldata{:,2});
12    number = cat(1,alldata{:,3});
13    
14    [~,idx1,idx2] = unique(strcat(names,',',gender));
15    number = accumarray(idx2,number);
16    
17    names = [names(idx1) gender(idx1) num2cell(number)];
18    
19    names(strcmp(names(:,1),'Loren'),:)
20    names(strcmp(names(:,1),'Cleve'),:)

I think it's a testament to our dedicated developers that only three years later we can now cut that script in half, lose the for-loop, not have to worry about textscan or accumarray, and end up with our data in a much neater container:

dbtype('analyzeNameData.m')

1     dat = datastore('names\yob*.txt',...
2         'ReadVariableNames',false,'VariableNames',{'Name','Gender','Number'});
3     dat.TextscanFormats{2} = '%C';
4     
5     rawnamedata = readall(dat);
6     
7     [group,names] = findgroups(rawnamedata(:,{'Name','Gender'}));
8     names.Number = splitapply(@sum,rawnamedata.Number,group);
9     
10    names.Name = categorical(names.Name);
11    names(names.Name == 'Loren',:)
12    names(names.Name == 'Cleve',:)

Converting counts to percentages

I can now get the numbers on a given name, but it's hard to really get a feel for the popularity of a name without the context of the total number of people represented by the data. In other words, I'd like to normalize the counts to percentages. Calculating the percentage of the total population would be easy, but it's more typical to think of percentage by gender. Immediately that sounds like another grouped operation, but can I use splitapply here? Normally splitapply takes a group of data and returns a single value from a statistical operation such as sum or mean. What if the return is not a single value?

Here's a function that takes a vector of nonnegative values as input and returns a normalized version of the vector as output:

nrm = @(x) 100*x/sum(x);

What happens if I use that as my splitapply function, with the number as the data and the gender as the group? As written, this will actually produce an error. However, it's a very helpful error that suggests an alternative:

nrm = @(x) {100*x/sum(x)};

This returns the same result, but as the contents of a cell. So my function does return a single value, except it's a cell instead of a number. Applying this to each group with splitapply will therefore return a cell array. Note that I already have the groups defined for me in names.Gender, but this is a categorical array and splitapply wants numeric group values. So I'll use findgroups here, but it's interesting to note that you don't always have to -- sometimes you have the group values from somewhere else, and that's fine.

gender = findgroups(names.Gender);
pctcell = splitapply(nrm,names.Number,gender)

pctcell = 
    [64911x1 double]
    [39199x1 double]

Because I have two groups, I get two outputs. But I actually want all 1825433 individual values corresponding to my original data. This isn't really the typical, intended use of splitapply, but I can make it work for me. I can extract the contents of the cell array and concatenate them together, but I don't know if the result will be in the same order as the original data. A simple way to avoid that is to sort my data by group first.

names = sortrows(names,'Gender');
gender = findgroups(names.Gender);
pctcell = splitapply(nrm,names.Number,gender);
names.Percent = cat(1,pctcell{:});
names(names.Name == 'Loren',:)
names(names.Name == 'Cleve',:)

ans = 
    Name     Gender    Number    Percent 
    _____    ______    ______    ________
    Loren    F         11862       0.0071
    Loren    M         45806     0.026934
ans = 
    Name     Gender    Number     Percent  
    _____    ______    ______    __________
    Cleve    F           23      1.3767e-05
    Cleve    M         4503       0.0026478

How far can I push this? One neat trick is to use a plotting function as my grouped operation. For example, how are names distributed? To avoid repeating calculations, I'll first calculate the cumulative distribution, using the same approach as above.

names = sortrows(names,{'Gender','Percent'},{'ascend','descend'});
grpaccumulate = @(x) {cumsum(x)};
pctcell = splitapply(grpaccumulate,names.Percent,gender);
names.CumulativeDist = cat(1,pctcell{:});

And now I can make a grouped plot by using splitapply with plot as my function. Again, plot isn't the kind of function you'd normally use with splitapply, but does it work? Yes! (Just remember to hold on so that the repeated calls to plot don't overwrite each other.)

figure
hold on
splitapply(@plot,names.CumulativeDist,gender);

legend(categories(names.Gender),'Location','southeast')
axis([0 5000 0 100])
xlabel('Number of unique names')
ylabel('Percentage of (male or female) population')
grid on
hold off

From this you can see that about 3/4 of girls (in the U.S. since 1880) have shared the set of the most popular 500 names. And a little over 85% of boys have shared the same number. Let's zoom in...

axis([0 250 0 60])

Only about 60 names are needed to address half the men in the U.S! With a little creativity, I can probably get the exact number...

qntl = @(x) find(x >= 50,1,'first');
splitapply(qntl,names.CumulativeDist,gender)

ans =
   155
    63

This splitapply function is quite useful! Let's find the 25% quartile number.

qntl = @(x) find(x >= 25,1,'first');
splitapply(qntl,names.CumulativeDist,gender)

ans =
    43
    14

How about that? One in every four males you know (in the U.S., ignoring changing trends over time) has one of these 14 names:

boys = names.Name(names.Gender == 'M');
disp(boys(1:14,:))

     James 
     John 
     Robert 
     Michael 
     William 
     David 
     Joseph 
     Richard 
     Charles 
     Thomas 
     Christopher 
     Daniel 
     Matthew 
     George

And there I am on that list. Along with my brother, father, and grandfathers. Oh well. At least I have a weird last name to make up for our lack of imagination!

Interestingly, girls' names show a greater diversity. I don't know why that might be or what (if anything) it means, but it does align with my highly unscientific anecdotal evidence from parents that there are more "good" options when choosing names for girls than boys.

Where will `findgroups` and `splitapply` take you?

There is an old saying that if all you have is a hammer, everything looks like a nail. Maybe I'm guilty of falling into that trap, but I prefer to think of it as seeing just what this new "hammer" thing can do. And in this case, it looks like a pretty handy tool. From conversations about baby names I've already found some interesting applications of findgroups and splitapply. In return they've led me to some intriguing findings about how boys and girls names are distributed.

How about you? Have you done some analysis where splitapply might have helped? Let us know here.

Published with MATLAB® R2015b