Recently, at MathWorks, Seth decided to analyze the email domains for comments on his blog. He had fun writing the code and posting it internally. Within a short period of time, several other solutions emerged, including one that takes advantage of a new feature in R2008b: containers.Map. Let's first check out the problem itself.
Contents
List of email addresses
Seth wanted to know how many comments on this blog came from each domain, e.g., each location after the '@' symbol.
Here's a list of made up email addresses for comments that came to Seth.
s = {'0123456789@uni.ac.za'
'alphabet_a@yahoo.com'
'alphabet_b@gmail.com'
'alpahbet_c@hotmail.com'
'dAlphabet@xyzabc.com'
'e_alphabet@yahoo.co.in'
'falphabet@gmail.com'
'loren.shure@mathworks.com'
'loren@mathworks.com'
'alphabetG@hanme.com'
'alphabetH@bname.com'
'alphabet_I@yahoo.co.in'
'andJ@gmail.com'
'Khere@gmail.com'
'L-Student@erau.edu'
'M@gmx.de'};Seth's Solution Using regexp and cellfun
Seth first uses regexp to identify the locations of the '@' symbol in each address.
s1 = regexp(s,'@[\w.]+','match','once');
Next Seth finds the unique domain names, and finds the length of the longest name (used later to help him print the results).
sunique = unique(lower(s1)); n = cell(length(sunique),1); mx = max(cellfun(@length,sunique));
Finally, Seth prints the results.
for i=1:length(sunique) n{i} = length(strmatch(sunique{i},lower(s1))); disp([' ' sunique{i} repmat(' ',[1 mx-length(sunique{i})]) ... ' ' num2str(n{i})]) end
@bname.com 1 @erau.edu 1 @gmail.com 4 @gmx.de 1 @hanme.com 1 @hotmail.com 1 @mathworks.com 2 @uni.ac.za 1 @xyzabc.com 1 @yahoo.co.in 2 @yahoo.com 1
Dave's Solution using containers.Map
The next solution I show is from Dave Tarkowski, and uses a containers.Map object. Dave starts by adopting Seth's code to find the matches.
s1 = regexp(s,'@[\w.]+','match','once');
Next, create an object of the class containers.Map.
m = containers.Map
m =
containers.Map handle
Package: containers
Properties:
Count: 0
KeyType: 'char'
ValueType: 'any'
And now loop through each domain, check if the domain name is already a key name.
If the key doesn't exist yet, create it and set its value to 1, otherwise increase the value of the existing key by 1.
for i = 1:length(s1) if m.isKey(s1{i}) m(s1{i}) = m(s1{i}) + 1; else m(s1{i}) = 1; end end
Finally, print out the statistics.
for k = keys(m) disp([k{1} ' ' num2str(m(k{1}))]) end
@bname.com 1 @erau.edu 1 @gmail.com 4 @gmx.de 1 @hanme.com 1 @hotmail.com 1 @mathworks.com 2 @uni.ac.za 1 @xyzabc.com 1 @yahoo.co.in 2 @yahoo.com 1
What IS containers.Map?
containers.Map is a data structure often called a hash table or map. You could use a Java hash table previously if you were set up to work with Java in MATLAB. Now you now longer need to bridge the MATLAB-Java interface for this functionality. The containers.Map class provides a memory-efficient implementation of this data structure. While similar in feel to a MATLAB struct, a containers.Map object does not limit the key (similar to the field of a struct to be a valid MATLAB identifier. The key can instead be any one of the following:
- scalar integer (signed or unsigned)
- scalar single or double
- 1xn character array, even with embedded spaces
And there are a handful of methods you can use on these objects.
methods(m)
Methods for class containers.Map: Map findobj isKey length remove values addlistener findprop isvalid lt size delete ge keys ne subsasgn eq gt le notify subsref
Steve's Solution Using hist and unique
Steve Lord came along with a different solution. In his words:
Rather than using regexp (since I can never remember the regular expression language) and cellfun, I used strtok, hist and a second unique call.
Split the addresses at the AT symbol
[T, R] = strtok(s, '@');Use the fact that unique gives you b and k such that b{k} == R.
[suffixes, ignore, k] = unique(lower(R));
Use the fact that unique sorts its output as well as removing duplicates. Send the relevant outputs to hist.
[counts, indices] = hist(k, unique(k));
Print the results.
[suffixes, num2cell(counts.')]
ans =
'@bname.com' [1]
'@erau.edu' [1]
'@gmail.com' [4]
'@gmx.de' [1]
'@hanme.com' [1]
'@hotmail.com' [1]
'@mathworks.com' [2]
'@uni.ac.za' [1]
'@xyzabc.com' [1]
'@yahoo.co.in' [2]
'@yahoo.com' [1]
Loren's Solution Using strfind and accumarray
I came late to the game on this one. Before seeing any solutions except Seth's, I thought about using strfind to locate the domain, and accumarray to gather the data.
inds = strfind(s,'@'); allSuffixes = cellfun(@(str,ind) str(ind:end), s, inds, ... 'UniformOutput', false); [uniqueSuffixes,ignore,uind] = unique(lower(allSuffixes)); % I'd be remiss if there was no solution using accumarray counts = accumarray(uind,1,[length(uniqueSuffixes) 1]); [uniqueSuffixes num2cell(counts)]
ans =
'@bname.com' [1]
'@erau.edu' [1]
'@gmail.com' [4]
'@gmx.de' [1]
'@hanme.com' [1]
'@hotmail.com' [1]
'@mathworks.com' [2]
'@uni.ac.za' [1]
'@xyzabc.com' [1]
'@yahoo.co.in' [2]
'@yahoo.com' [1]
Other Ways, Other Data Structures?
Yes, there are, no doubt, other ways to complete this same task, including some ideas for the counting such as those in this post. Do you have other favorite ways to do this sort of task? Besides containers.Map, what other data structures would you like to see in MATLAB? As usual, please post your thoughts here.
Get
the MATLAB code
Published with MATLAB® 7.7



Been wanting a container/map/hashtable to be native in ML for a while. Thanks to the devs.
I was interested to see the relative timing for these methods. I wrote up a little script to generate random e-mail address in MATLAB. Here are the results for 10,000 e-mail addresses with 100 unique domains:
>> doCompare(10000)
ans =
seth: 1.5446
dave: 2.2605
steve: 0.4783
loren: 0.1521
What I found interesting is how the timing changes if you change the number of unique domains. Here are the results with 10,000 e-mail addresses, but only 10 unique domains:
>> doCompare(10000)
ans =
seth: 0.1973
dave: 2.2455
steve: 0.4725
loren: 0.1494
Of course, running time, although relatively easy to measure, is only one aspect to consider. One big difference between these algorithms is their memory usage.
Dave - great analysis. Interesting - the length of my loop changes with the number of unique domains. I like the vectorized methods because of the use of tricks like additional outputs from UNIQUE or ACCUMARRAY.
I need something more like an associative array, since my keys are integer vectors.
Cell arrays could in principle, be used, except they appear to use way too much memory. (Mine would be very sparse).
I love this type of post since each solution shows something about the background of the developer. Kind of like, “If the only tool you have is a hammer, every problem looks like a nail.”
Still, Loren’s solution is the easiest to follow.
Although I can follow a lot of Seth’s solution, I am stumped by the regexp call & the guts of the for loop.
I understand the power of regular expressions & know that they are complex enough that several books have been published on just them. But, I’ve always had trouble understanding the syntax & I’ve never found a complete, comprehensive FRP for just the arguments for regexp.
Looking at Seth’s command, & the MatLab FRP for regexp, he is clearly using the 3rd syntax option. I understand the s is the cell array from above & under Remarks in the FRP, it does explain why s can be a cell array.
Further, I see that the expression is the ‘@[\w.]+’ string. But, I can’t find any documentation that allows the use of the square brackets in this syntax. In reading the FRP of regexp, it seems to me that ‘@’ all by itself should work.
Next, I see that ‘match’ is an allowable qualifier that forces the returned result to be the matching string. But, that isn’t what he wants here. He wants to return the index of the ‘@’, not the ‘@’ itself. And, frankly, since you know you are searching for the ‘@’ character, you could populate the cell array sl with the ‘@’ characters without needing to use the regexp function.
Finally, I see how the how ‘once’ is an allowable option.
This illustrates why MatLab either needs a more user friendly pattern matching function or it needs a much better written FRP for regexp.
I understand that given the history of regular expressions, it clearly is important for MatLab to include it. But, since the documentation out there in the world of Unix, C and C similar languages for regular expressions is generally inadequate for a user to master it, MatLab must not just base its documentation on these flawed sources & must write its own comprehensive set.