Loren on the Art of MATLAB

November 19th, 2008

Analyzing Addresses Using Different Data Structures

Recently, at MathWorks, Seth decided to analyze the email domains for comments on his blog. He had fun writing the code and posting it internally. Within a short period of time, several other solutions emerged, including one that takes advantage of a new feature in R2008b: containers.Map. Let's first check out the problem itself.

Contents

List of email addresses

Seth wanted to know how many comments on this blog came from each domain, e.g., each location after the '@' symbol.

Here's a list of made up email addresses for comments that came to Seth.

s = {'0123456789@uni.ac.za'
'alphabet_a@yahoo.com'
'alphabet_b@gmail.com'
'alpahbet_c@hotmail.com'
'dAlphabet@xyzabc.com'
'e_alphabet@yahoo.co.in'
'falphabet@gmail.com'
'loren.shure@mathworks.com'
'loren@mathworks.com'
'alphabetG@hanme.com'
'alphabetH@bname.com'
'alphabet_I@yahoo.co.in'
'andJ@gmail.com'
'Khere@gmail.com'
'L-Student@erau.edu'
'M@gmx.de'};

Seth's Solution Using regexp and cellfun

Seth first uses regexp to identify the locations of the '@' symbol in each address.

s1 = regexp(s,'@[\w.]+','match','once');

Next Seth finds the unique domain names, and finds the length of the longest name (used later to help him print the results).

sunique = unique(lower(s1));
n = cell(length(sunique),1);
mx = max(cellfun(@length,sunique));

Finally, Seth prints the results.

for i=1:length(sunique)
    n{i} = length(strmatch(sunique{i},lower(s1)));
    disp([' ' sunique{i} repmat(' ',[1 mx-length(sunique{i})]) ...
        '  ' num2str(n{i})])
end
 @bname.com      1
 @erau.edu       1
 @gmail.com      4
 @gmx.de         1
 @hanme.com      1
 @hotmail.com    1
 @mathworks.com  2
 @uni.ac.za      1
 @xyzabc.com     1
 @yahoo.co.in    2
 @yahoo.com      1

Dave's Solution using containers.Map

The next solution I show is from Dave Tarkowski, and uses a containers.Map object. Dave starts by adopting Seth's code to find the matches.

s1 = regexp(s,'@[\w.]+','match','once');

Next, create an object of the class containers.Map.

m = containers.Map
m = 
  containers.Map handle
  Package: containers

  Properties:
        Count: 0
      KeyType: 'char'
    ValueType: 'any'

And now loop through each domain, check if the domain name is already a key name.

If the key doesn't exist yet, create it and set its value to 1, otherwise increase the value of the existing key by 1.

for i = 1:length(s1)
    if m.isKey(s1{i})
        m(s1{i}) = m(s1{i}) + 1;
    else
        m(s1{i}) = 1;
    end
end

Finally, print out the statistics.

for k = keys(m)
    disp([k{1} ' ' num2str(m(k{1}))])
end
@bname.com 1
@erau.edu 1
@gmail.com 4
@gmx.de 1
@hanme.com 1
@hotmail.com 1
@mathworks.com 2
@uni.ac.za 1
@xyzabc.com 1
@yahoo.co.in 2
@yahoo.com 1

What IS containers.Map?

containers.Map is a data structure often called a hash table or map. You could use a Java hash table previously if you were set up to work with Java in MATLAB. Now you now longer need to bridge the MATLAB-Java interface for this functionality. The containers.Map class provides a memory-efficient implementation of this data structure. While similar in feel to a MATLAB struct, a containers.Map object does not limit the key (similar to the field of a struct to be a valid MATLAB identifier. The key can instead be any one of the following:

  • scalar integer (signed or unsigned)
  • scalar single or double
  • 1xn character array, even with embedded spaces

And there are a handful of methods you can use on these objects.

methods(m)
Methods for class containers.Map:

Map          findobj      isKey        length       remove       values       
addlistener  findprop     isvalid      lt           size         
delete       ge           keys         ne           subsasgn     
eq           gt           le           notify       subsref      

Steve's Solution Using hist and unique

Steve Lord came along with a different solution. In his words:

Rather than using regexp (since I can never remember the regular expression language) and cellfun, I used strtok, hist and a second unique call.

Split the addresses at the AT symbol

[T, R] = strtok(s, '@');

Use the fact that unique gives you b and k such that b{k} == R.

[suffixes, ignore, k] = unique(lower(R));

Use the fact that unique sorts its output as well as removing duplicates. Send the relevant outputs to hist.

[counts, indices] = hist(k, unique(k));

Print the results.

[suffixes, num2cell(counts.')]
ans = 
    '@bname.com'        [1]
    '@erau.edu'         [1]
    '@gmail.com'        [4]
    '@gmx.de'           [1]
    '@hanme.com'        [1]
    '@hotmail.com'      [1]
    '@mathworks.com'    [2]
    '@uni.ac.za'        [1]
    '@xyzabc.com'       [1]
    '@yahoo.co.in'      [2]
    '@yahoo.com'        [1]

Loren's Solution Using strfind and accumarray

I came late to the game on this one. Before seeing any solutions except Seth's, I thought about using strfind to locate the domain, and accumarray to gather the data.

inds = strfind(s,'@');
allSuffixes = cellfun(@(str,ind) str(ind:end), s, inds, ...
    'UniformOutput', false);
[uniqueSuffixes,ignore,uind] = unique(lower(allSuffixes));

% I'd be remiss if there was no solution using accumarray
counts = accumarray(uind,1,[length(uniqueSuffixes) 1]);
[uniqueSuffixes num2cell(counts)]
ans = 
    '@bname.com'        [1]
    '@erau.edu'         [1]
    '@gmail.com'        [4]
    '@gmx.de'           [1]
    '@hanme.com'        [1]
    '@hotmail.com'      [1]
    '@mathworks.com'    [2]
    '@uni.ac.za'        [1]
    '@xyzabc.com'       [1]
    '@yahoo.co.in'      [2]
    '@yahoo.com'        [1]

Other Ways, Other Data Structures?

Yes, there are, no doubt, other ways to complete this same task, including some ideas for the counting such as those in this post. Do you have other favorite ways to do this sort of task? Besides containers.Map, what other data structures would you like to see in MATLAB? As usual, please post your thoughts here.


Get the MATLAB code

Published with MATLAB® 7.7

5 Responses to “Analyzing Addresses Using Different Data Structures”

  1. Marc replied on :

    Been wanting a container/map/hashtable to be native in ML for a while. Thanks to the devs.

  2. Dave Tarkowski replied on :

    I was interested to see the relative timing for these methods. I wrote up a little script to generate random e-mail address in MATLAB. Here are the results for 10,000 e-mail addresses with 100 unique domains:

    >> doCompare(10000)

    ans =

    seth: 1.5446
    dave: 2.2605
    steve: 0.4783
    loren: 0.1521

    What I found interesting is how the timing changes if you change the number of unique domains. Here are the results with 10,000 e-mail addresses, but only 10 unique domains:

    >> doCompare(10000)

    ans =

    seth: 0.1973
    dave: 2.2455
    steve: 0.4725
    loren: 0.1494

    Of course, running time, although relatively easy to measure, is only one aspect to consider. One big difference between these algorithms is their memory usage.

  3. Seth Popinchalk replied on :

    Dave - great analysis. Interesting - the length of my loop changes with the number of unique domains. I like the vectorized methods because of the use of tricks like additional outputs from UNIQUE or ACCUMARRAY.

  4. Andrew Mullhaupt replied on :

    I need something more like an associative array, since my keys are integer vectors.

    Cell arrays could in principle, be used, except they appear to use way too much memory. (Mine would be very sparse).

  5. OysterEngineer replied on :

    I love this type of post since each solution shows something about the background of the developer. Kind of like, “If the only tool you have is a hammer, every problem looks like a nail.”

    Still, Loren’s solution is the easiest to follow.

    Although I can follow a lot of Seth’s solution, I am stumped by the regexp call & the guts of the for loop.

    I understand the power of regular expressions & know that they are complex enough that several books have been published on just them. But, I’ve always had trouble understanding the syntax & I’ve never found a complete, comprehensive FRP for just the arguments for regexp.

    Looking at Seth’s command, & the MatLab FRP for regexp, he is clearly using the 3rd syntax option. I understand the s is the cell array from above & under Remarks in the FRP, it does explain why s can be a cell array.

    Further, I see that the expression is the ‘@[\w.]+’ string. But, I can’t find any documentation that allows the use of the square brackets in this syntax. In reading the FRP of regexp, it seems to me that ‘@’ all by itself should work.

    Next, I see that ‘match’ is an allowable qualifier that forces the returned result to be the matching string. But, that isn’t what he wants here. He wants to return the index of the ‘@’, not the ‘@’ itself. And, frankly, since you know you are searching for the ‘@’ character, you could populate the cell array sl with the ‘@’ characters without needing to use the regexp function.

    Finally, I see how the how ‘once’ is an allowable option.

    This illustrates why MatLab either needs a more user friendly pattern matching function or it needs a much better written FRP for regexp.

    I understand that given the history of regular expressions, it clearly is important for MatLab to include it. But, since the documentation out there in the world of Unix, C and C similar languages for regular expressions is generally inadequate for a user to master it, MatLab must not just base its documentation on these flawed sources & must write its own comprehensive set.

Leave a Reply

Wrap code fragments inside <pre> tags, like this:

<pre class="code">
a = magic(3);
sum(a)
</pre>

If you have a "<" character in your code, either follow it with a space or replace it with "&lt;" (including the semicolon).


Loren Shure works on design of the MATLAB language at The MathWorks. She writes here about once a week on MATLAB programming and related topics.

  • Loren: Hi Ghazanfar- Please look at the help for the function hist (and perhaps histc). They will do what you want...
  • Ghazanfar Ali: Hi Miss Loren I am in need of an algo to count cosecutive duplicate values in a one dimensional...
  • OysterEngineer: Clearly this is a complex topic & these pair of blogs show that The MathWorks are masters of the...
  • Tim Davis: I’ve often been puzzled about how sub-matrix-assignmen t works for full and sparse arrays: clear A =...
  • Loren: Paul- There *are* issues depending on the sizes of ii and jj. And it’s a bit complicated, but really...
  • Loren: Bob- You don’t say what happens when you run your code. Can you please explain more. It looks like you...
  • Loren: Kishore- It is not clear to me what you are trying to actually achieve. If you want to concatenate the 4...
  • Kishore: sorry, in the previous code mat2cell(c,[19 121],[19 134],[19 84],[19 107])
  • Kishore: Hi Loren, Why does the following not work? data_classwise = [19x121 double] [19x134 double] [19x84 double]...
  • Paul Jackson: Loren, Are there any aspects of empty matrices that may be tricky when they are used as indices into...

These postings are the author's and don't necessarily represent the opinions of The MathWorks.