Analyzing Addresses Using Different Data Structures

Posted by Loren Shure, November 19, 2008

Recently, at MathWorks, Seth decided to analyze the email domains for comments on his blog. He had fun writing the code and posting it internally. Within a short period of time, several other solutions emerged, including one that takes advantage of a new feature in R2008b: containers.Map. Let's first check out the problem itself.

List of email addresses
Seth's Solution Using regexp and cellfun
Dave's Solution Using containers.Map
What IS containers.Map?
Steve's Solution Using hist and unique
Loren's Solution Using strfind and accumarray
Other Ways, Other Data Structures?

List of email addresses

Seth wanted to know how many comments on this blog came from each domain, i.e., the part of each address starting at the '@' symbol.

Here's a list of made-up email addresses for comments that came to Seth.

s = {'0123456789@uni.ac.za'
'alphabet_a@yahoo.com'
'alphabet_b@gmail.com'
'alpahbet_c@hotmail.com'
'dAlphabet@xyzabc.com'
'e_alphabet@yahoo.co.in'
'falphabet@gmail.com'
'loren.shure@mathworks.com'
'loren@mathworks.com'
'alphabetG@hanme.com'
'alphabetH@bname.com'
'alphabet_I@yahoo.co.in'
'andJ@gmail.com'
'Khere@gmail.com'
'L-Student@erau.edu'
'M@gmx.de'};

Seth's Solution Using regexp and cellfun

Seth first uses regexp to extract the domain from each address: the '@' symbol together with the word characters and periods that follow it.

s1 = regexp(s,'@[\w.]+','match','once');
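To make the pattern concrete, here is a small sketch on two hypothetical addresses (they are not from Seth's list). It shows that the match keeps the '@' symbol and that, because \w covers only letters, digits, and underscores, a hyphenated domain would be cut short. The second pattern, which also allows '-', is an assumption added for illustration and is not part of Seth's code.

```matlab
% Hypothetical sample addresses, for illustration only
sample = {'someone@example.com'; 'someone@my-site.org'};

% Original pattern: '@' followed by word characters or periods
d1 = regexp(sample, '@[\w.]+', 'match', 'once');
% d1{1} is '@example.com'; d1{2} stops at the hyphen and is '@my'

% Variant pattern (an assumption, not Seth's): also allow an escaped '-'
d2 = regexp(sample, '@[\w.\-]+', 'match', 'once');
% d2{2} is '@my-site.org'
```

None of the addresses in the list above has a hyphen in its domain, so the original pattern is sufficient for this data.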

Next, Seth finds the unique domain names and the length of the longest name (used later to align the printed results).

sunique = unique(lower(s1));
n = cell(length(sunique),1);
mx = max(cellfun(@length,sunique));

Finally, Seth prints the results.

for i=1:length(sunique)
    % 'exact' keeps a domain from also matching longer names that start with it
    n{i} = length(strmatch(sunique{i},lower(s1),'exact'));
    disp([' ' sunique{i} repmat(' ',[1 mx-length(sunique{i})]) ...
        '  ' num2str(n{i})])
end

 @bname.com      1
 @erau.edu       1
 @gmail.com      4
 @gmx.de         1
 @hanme.com      1
 @hotmail.com    1
 @mathworks.com  2
 @uni.ac.za      1
 @xyzabc.com     1
 @yahoo.co.in    2
 @yahoo.com      1
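The column alignment above comes from padding each name with repmat. As an alternative sketch (not Seth's code), the same alignment can be had by building a left-justified format string with sprintf. The names and counts below are hypothetical stand-ins for sunique and n.

```matlab
% Hypothetical domain names and counts, for illustration only
names  = {'@gmx.de'; '@mathworks.com'};
counts = [1; 2];

% Width of the longest name, computed as in the original code
mx = max(cellfun(@length, names));

% Build a format such as ' %-14s  %d'; the '-' left-justifies the name
fmt = sprintf(' %%-%ds  %%d', mx);

for i = 1:length(names)
    txt = sprintf(fmt, names{i}, counts(i));
    disp(txt)
end
```

Every printed line then has the same length, so the counts line up in one column.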

Dave's Solution Using containers.Map

The next solution I show is from Dave Tarkowski, and uses a containers.Map object. Dave starts by reusing Seth's code to extract the domains.

s1 = regexp(s,'@[\w.]+','match','once');

Next, create an object of the class containers.Map.

m = containers.Map

m = 
  containers.Map handle
  Package: containers

  Properties:
        Count: 0
      KeyType: 'char'
    ValueType: 'any'

Now loop through the domains and check whether each domain name is already a key.

If the key doesn't exist yet, create it and set its value to 1; otherwise, increase the value of the existing key by 1.

for i = 1:length(s1)
    if m.isKey(s1{i})
        m(s1{i}) = m(s1{i}) + 1;
    else
        m(s1{i}) = 1;
    end
end

Finally, print out the statistics.

for k = keys(m)
    disp([k{1} ' ' num2str(m(k{1}))])
end

@bname.com 1
@erau.edu 1
@gmail.com 4
@gmx.de 1
@hanme.com 1
@hotmail.com 1
@mathworks.com 2
@uni.ac.za 1
@xyzabc.com 1
@yahoo.co.in 2
@yahoo.com 1
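Unlike the first solution, the loop above does not call lower, so domains that differ only in case would be counted under separate keys. That makes no difference for this list, where every domain is already lowercase. A minimal sketch of a case-insensitive variant follows; the sample addresses and the variable names (addresses, domains, tally) are hypothetical.

```matlab
% Hypothetical sample data with mixed-case domains
addresses = {'a@Gmail.com'; 'b@gmail.com'; 'c@yahoo.com'};

% Extract the domains as before, then normalize the case
domains = lower(regexp(addresses, '@[\w.]+', 'match', 'once'));

% Map from domain name (char) to count (double)
tally = containers.Map('KeyType', 'char', 'ValueType', 'double');
for i = 1:numel(domains)
    d = domains{i};
    if isKey(tally, d)
        tally(d) = tally(d) + 1;   % seen before: increment
    else
        tally(d) = 1;              % first occurrence
    end
end
```

Here both Gmail addresses end up under the single key '@gmail.com'.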

What IS containers.Map?

containers.Map is a data structure often called a hash table or map. Previously, you could use a Java hash table if you were set up to work with Java in MATLAB. Now you no longer need to cross the MATLAB-Java interface for this functionality. The containers.Map class provides a memory-efficient implementation of this data structure. While similar in feel to a MATLAB struct, a containers.Map object does not require the key (similar to the field name of a struct) to be a valid MATLAB identifier. The key can instead be any one of the following:

scalar integer (signed or unsigned)
scalar single or double
1xn character array, even with embedded spaces
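As a brief sketch of that flexibility, the example below uses keys that could never be struct field names: strings with embedded spaces and non-integer numbers. The data and variable names (capitals, squares) are hypothetical.

```matlab
% Character keys may contain spaces, unlike struct field names
capitals = containers.Map();
capitals('New York') = 'Albany';
capitals('South Dakota') = 'Pierre';

% Numeric keys: state the key type when constructing the map
squares = containers.Map('KeyType', 'double', 'ValueType', 'double');
for x = [1.5 2 3]
    squares(x) = x^2;
end

disp(capitals('New York'))   % look up a value by its key
disp(squares(1.5))
```

The default constructor gives char keys, as the display of m earlier shows, so a numeric-keyed map needs the 'KeyType' option.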

And there are a handful of methods you can use on these objects.

methods(m)

Methods for class containers.Map:

Map          findobj      isKey        length       remove       values       
addlistener  findprop     isvalid      lt           size         
delete       ge           keys         ne           subsasgn     
eq           gt           le           notify       subsref
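Here is a short sketch that exercises a few of these methods on a small map. The two entries are hypothetical and only serve to show the calls.

```matlab
m2 = containers.Map();            % default: char keys, values of any type
m2('@gmail.com') = 4;
m2('@yahoo.com') = 1;

k = keys(m2);                     % cell array of keys, in sorted order
v = values(m2);                   % cell array of values, ordered like keys(m2)
n = length(m2);                   % number of key/value pairs, same as m2.Count

tf = isKey(m2, '@hotmail.com');   % false: this key was never added
remove(m2, '@yahoo.com');         % handle object, so m2 is modified in place
```

Because containers.Map is a handle class, remove changes the map itself without the result being assigned back to m2.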

Steve's Solution Using hist and unique

Steve Lord came along with a different solution. In his words:

Rather than using regexp (since I can never remember the regular expression language) and cellfun, I used strtok, hist and a second unique call.

Split the addresses at the '@' symbol.

[T, R] = strtok(s, '@');

Use the fact that unique gives you b and k such that b(k) reproduces its input, here lower(R).

[suffixes, ignore, k] = unique(lower(R));
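To see what the third output of unique holds, here is a sketch on a small hypothetical cell array: each entry of k is the position in the sorted, duplicate-free list b of the corresponding input element.

```matlab
% Hypothetical input with repeated entries
c = {'b'; 'a'; 'b'; 'c'; 'a'};

[b, ignore, k] = unique(c);
k = k(:);            % force a column, whatever orientation unique returns

% b is {'a'; 'b'; 'c'} and k is [2; 1; 2; 3; 1]
rebuilt = b(k);      % indexing b with k reproduces the original input
```

Counting how often each value occurs in k is therefore the same as counting how often each unique string occurs in the input.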

Use the fact that unique sorts its output as well as removing duplicates. Send the relevant outputs to hist.

[counts, indices] = hist(k, unique(k));

Print the results.

[suffixes, num2cell(counts.')]

ans = 
    '@bname.com'        [1]
    '@erau.edu'         [1]
    '@gmail.com'        [4]
    '@gmx.de'           [1]
    '@hanme.com'        [1]
    '@hotmail.com'      [1]
    '@mathworks.com'    [2]
    '@uni.ac.za'        [1]
    '@xyzabc.com'       [1]
    '@yahoo.co.in'      [2]
    '@yahoo.com'        [1]
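The counting step can be looked at in isolation. In this sketch, with a hypothetical index vector in place of k, hist is given the unique index values as bin centers, so each bin collects exactly one index value.

```matlab
% Hypothetical index vector, like the third output of unique
kdemo = [2; 1; 2; 3; 1; 2];

% Bin centers 1, 2, 3: one bin per distinct index
centers = unique(kdemo);
democounts = hist(kdemo, centers);

% Index 1 occurs twice, index 2 three times, index 3 once
disp(democounts(:).')
```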

Loren's Solution Using strfind and accumarray

I came late to the game on this one. Before seeing any solutions except Seth's, I thought about using strfind to locate the domain, and accumarray to gather the data.

inds = strfind(s,'@');
allSuffixes = cellfun(@(str,ind) str(ind:end), s, inds, ...
    'UniformOutput', false);
[uniqueSuffixes,ignore,uind] = unique(lower(allSuffixes));

% I'd be remiss if there was no solution using accumarray
counts = accumarray(uind,1,[length(uniqueSuffixes) 1]);
[uniqueSuffixes num2cell(counts)]

ans = 
    '@bname.com'        [1]
    '@erau.edu'         [1]
    '@gmail.com'        [4]
    '@gmx.de'           [1]
    '@hanme.com'        [1]
    '@hotmail.com'      [1]
    '@mathworks.com'    [2]
    '@uni.ac.za'        [1]
    '@xyzabc.com'       [1]
    '@yahoo.co.in'      [2]
    '@yahoo.com'        [1]
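To see what accumarray is doing in the solution above, here is a sketch with a hypothetical index vector. accumarray adds the value 1 into output position uind(i) for every i, which amounts to counting how often each index occurs. The optional size argument pads the result with zeros for indices that never appear.

```matlab
% Hypothetical index vector, like the third output of unique
udemo = [2; 1; 2; 3; 1; 2];

% Sum a 1 into position udemo(i) for each i
c1 = accumarray(udemo, 1);            % [2; 3; 1]

% Same counts, with the output size fixed at 5-by-1
c2 = accumarray(udemo, 1, [5 1]);     % [2; 3; 1; 0; 0]
```

In the solution above, the size argument [length(uniqueSuffixes) 1] guarantees one count per unique suffix.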

Other Ways, Other Data Structures?

Yes, there are, no doubt, other ways to complete this same task, including some ideas for the counting such as those in this post. Do you have other favorite ways to do this sort of task? Besides containers.Map, what other data structures would you like to see in MATLAB? As usual, please post your thoughts here.

Published with MATLAB® 7.7