Analyzing Addresses Using Different Data Structures
Recently, at MathWorks, Seth decided to analyze the email domains for comments on his blog. He had fun writing the code and posting it internally. Within a short period of time, several other solutions emerged, including one that takes advantage of a new feature in R2008b: containers.Map. Let's first check out the problem itself.
List of email addresses
Seth wanted to know how many comments on this blog came from each domain, i.e., the part of each address after the '@' symbol.
Here's a list of made-up email addresses standing in for the comments that came to Seth.
s = {'0123456789@uni.ac.za' ...
     'alphabet_a@yahoo.com' ...
     'alphabet_b@gmail.com' ...
     'alpahbet_c@hotmail.com' ...
     'dAlphabet@xyzabc.com' ...
     'e_alphabet@yahoo.co.in' ...
     'falphabet@gmail.com' ...
     'loren.shure@mathworks.com' ...
     'loren@mathworks.com' ...
     'alphabetG@hanme.com' ...
     'alphabetH@bname.com' ...
     'alphabet_I@yahoo.co.in' ...
     'andJ@gmail.com' ...
     'Khere@gmail.com' ...
     'L-Student@erau.edu' ...
     'M@gmx.de'};
Seth's Solution Using regexp and cellfun
Seth first uses regexp to extract the domain from each address, that is, the '@' symbol together with the letters, digits, underscores, and periods that follow it.
s1 = regexp(s,'@[\w.]+','match','once');
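The character class [\w.] covers letters, digits, underscores, and periods, so a hyphen ends the match early. Here is a small sketch of that edge case, using a hypothetical address that is not in Seth's list, along with one possible variant of the pattern that also accepts hyphens:

```matlab
% Hypothetical address with a hyphenated domain (not part of Seth's list).
addr = {'someone@my-school.edu'};

% Seth's pattern stops at the hyphen.
short = regexp(addr, '@[\w.]+', 'match', 'once');

% Adding a hyphen to the character class keeps the whole domain.
full = regexp(addr, '@[\w.-]+', 'match', 'once');

disp(short{1})   % @my
disp(full{1})    % @my-school.edu
```

None of the domains in the list above contain a hyphen, so both patterns give the same answer for this data.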
Next Seth finds the unique domain names, and finds the length of the longest name (used later to help him print the results).
sunique = unique(lower(s1));
n = cell(length(sunique),1);
mx = max(cellfun(@length,sunique));
Finally, Seth prints the results.
for i=1:length(sunique)
    n{i} = length(strmatch(sunique{i},lower(s1)));
    disp([' ' sunique{i} repmat(' ',[1 mx-length(sunique{i})]) ...
        ' ' num2str(n{i})])
end
 @bname.com     1
 @erau.edu      1
 @gmail.com     4
 @gmx.de        1
 @hanme.com     1
 @hotmail.com   1
 @mathworks.com 2
 @uni.ac.za     1
 @xyzabc.com    1
 @yahoo.co.in   2
 @yahoo.com     1
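One thing to keep in mind: strmatch finds strings that begin with the given string, so Seth's count is really a prefix count. That is harmless for this list, where no domain is a prefix of another, but it could overcount elsewhere. The sketch below uses hypothetical domains and swaps in strncmp to mimic the prefix behavior and strcmp for an exact comparison; it is an illustration of the difference, not Seth's code.

```matlab
% Hypothetical domains where one is a prefix of another.
d = {'@yahoo.co', '@yahoo.com', '@yahoo.co'};
target = '@yahoo.co';

% Prefix comparison (what a begins-with match counts).
nPrefix = sum(strncmp(target, d, length(target)));

% Exact comparison.
nExact = sum(strcmp(target, d));

disp([nPrefix nExact])   % 3 2
```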
Dave's Solution using containers.Map
The next solution I show is from Dave Tarkowski, and uses a containers.Map object. Dave starts by adopting Seth's code to find the matches.
s1 = regexp(s,'@[\w.]+','match','once');
Next, create an object of the class containers.Map.
m = containers.Map
m = 
  containers.Map handle
  Package: containers

  Properties:
        Count: 0
      KeyType: 'char'
    ValueType: 'any'
Now loop through the domains, checking whether each domain name is already a key in the map.
If the key doesn't exist yet, create it and set its value to 1; otherwise, increase the value of the existing key by 1.
for i = 1:length(s1)
    if m.isKey(s1{i})
        m(s1{i}) = m(s1{i}) + 1;
    else
        m(s1{i}) = 1;
    end
end
Finally, print out the statistics.
for k = keys(m)
    disp([k{1} ' ' num2str(m(k{1}))])
end
@bname.com 1
@erau.edu 1
@gmail.com 4
@gmx.de 1
@hanme.com 1
@hotmail.com 1
@mathworks.com 2
@uni.ac.za 1
@xyzabc.com 1
@yahoo.co.in 2
@yahoo.com 1
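Unlike Seth's code, the loop above does not call lower, so '@Gmail.com' and '@gmail.com' would land under separate keys. That makes no difference here because every domain in the list is already lowercase. The following is a minimal sketch, assuming hypothetical mixed-case input, that lowercases each key and also declares the key and value types when the map is constructed:

```matlab
% Hypothetical mixed-case domains (not part of the list above).
domains = {'@gmail.com', '@Gmail.com', '@mathworks.com'};

% Declare the types up front instead of using the default 'any' values.
m = containers.Map('KeyType', 'char', 'ValueType', 'double');

for i = 1:numel(domains)
    key = lower(domains{i});      % fold case before using it as a key
    if isKey(m, key)
        m(key) = m(key) + 1;
    else
        m(key) = 1;
    end
end

disp(m('@gmail.com'))   % 2
```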
What IS containers.Map?
containers.Map is a data structure often called a hash table or map. Previously, you could use a Java hash table if you were set up to work with Java in MATLAB. Now you no longer need to cross the MATLAB-Java interface for this functionality. The containers.Map class provides a memory-efficient implementation of this data structure. While similar in feel to a MATLAB struct, a containers.Map object does not require the key (similar to the field name of a struct) to be a valid MATLAB identifier. The key can instead be any one of the following:
- scalar integer (signed or unsigned)
- scalar single or double
- 1xn character array, even with embedded spaces
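To make the contrast with struct field names concrete, here is a short sketch. The variable names and values are hypothetical and chosen only for illustration.

```matlab
% A string with spaces is not a valid identifier, so it cannot be a field name.
okAsField = isvarname('New York City');     % false

% It works fine as a map key. The default KeyType is 'char'.
population = containers.Map();
population('New York City') = 8.4;          % hypothetical value

% Numeric keys need a numeric KeyType chosen at construction.
byNumber = containers.Map('KeyType', 'double', 'ValueType', 'any');
byNumber(3.5) = 'three and a half';
byNumber(-2)  = [1 2 3];

disp(byNumber.Count)   % 2
```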
And there are a handful of methods you can use on these objects.
methods(m)
Methods for class containers.Map:

Map          findobj      isKey        length       remove       values
addlistener  findprop     isvalid      lt           size
delete       ge           keys         ne           subsasgn
eq           gt           le           notify       subsref
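A few of these methods do most of the everyday work. Here is a brief sketch of keys, values, length, isKey, and remove on a small map built from hypothetical counts:

```matlab
% Build a map directly from a set of keys and a matching set of values.
m = containers.Map({'@gmail.com', '@gmx.de'}, [4 1]);   % hypothetical counts

k = keys(m);       % cell array of keys, returned in sorted order
v = values(m);     % cell array of values, in the same order as the keys
n = length(m);     % number of key-value pairs, same as m.Count

hasGmx = isKey(m, '@gmx.de');   % true

% m is a handle object, so remove changes m in place.
remove(m, '@gmx.de');
disp(m.Count)      % 1
```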
Steve's Solution Using hist and unique
Steve Lord came along with a different solution. In his words:
Rather than using regexp (since I can never remember the regular expression language) and cellfun, I used strtok, hist and a second unique call.
Split the addresses at the AT symbol
[T, R] = strtok(s, '@');
Use the fact that unique gives you b and k such that b(k) reproduces its input (here, lower(R)).
[suffixes, ignore, k] = unique(lower(R));
Use the fact that unique sorts its output as well as removing duplicates. Send the relevant outputs to hist.
[counts, indices] = hist(k, unique(k));
Print the results.
[suffixes, num2cell(counts.')]
ans = 
    '@bname.com'        [1]
    '@erau.edu'         [1]
    '@gmail.com'        [4]
    '@gmx.de'           [1]
    '@hanme.com'        [1]
    '@hotmail.com'      [1]
    '@mathworks.com'    [2]
    '@uni.ac.za'        [1]
    '@xyzabc.com'       [1]
    '@yahoo.co.in'      [2]
    '@yahoo.com'        [1]
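The key to Steve's approach is the third output of unique: it labels each input element with the position of its value in the sorted, de-duplicated list, so counting labels is the same as counting domains. Here is a small sketch on hypothetical data. The (:) calls force column vectors, on the assumption that output orientation may differ between MATLAB releases.

```matlab
% Hypothetical suffixes (not the list from the post).
R = {'@b.com', '@a.com', '@b.com'};

[b, ignore, k] = unique(R);
b = b(:);          % force column orientation
k = k(:);

% Indexing the unique values with k rebuilds the original list.
rebuilt = b(k);

% Counting how often each label occurs gives the per-domain counts.
counts = hist(k, unique(k));
counts = counts(:);

disp([b, num2cell(counts)])
```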
Loren's Solution Using strfind and accumarray
I came late to the game on this one. Before seeing any solutions except Seth's, I thought about using strfind to locate the domain, and accumarray to gather the data.
inds = strfind(s,'@');
allSuffixes = cellfun(@(str,ind) str(ind:end), s, inds, ...
    'UniformOutput', false);
[uniqueSuffixes,ignore,uind] = unique(lower(allSuffixes));
% I'd be remiss if there was no solution using accumarray
counts = accumarray(uind,1,[length(uniqueSuffixes) 1]);
[uniqueSuffixes num2cell(counts)]
ans = 
    '@bname.com'        [1]
    '@erau.edu'         [1]
    '@gmail.com'        [4]
    '@gmx.de'           [1]
    '@hanme.com'        [1]
    '@hotmail.com'      [1]
    '@mathworks.com'    [2]
    '@uni.ac.za'        [1]
    '@xyzabc.com'       [1]
    '@yahoo.co.in'      [2]
    '@yahoo.com'        [1]
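If accumarray is unfamiliar, it may help to see it next to the loop it replaces. In this sketch the index vector is hypothetical, and uind(:) forces a column because accumarray treats each row of its first input as one subscript.

```matlab
% Hypothetical group index for five items that fall into three groups.
uind = [2 1 2 3 2];

% Add 1 to counts(uind(i)) for every i.
counts = accumarray(uind(:), 1, [3 1]);

% The same computation written as an explicit loop.
loopCounts = zeros(3, 1);
for i = 1:numel(uind)
    loopCounts(uind(i)) = loopCounts(uind(i)) + 1;
end

disp(counts.')   % 1 3 1
```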
Other Ways, Other Data Structures?
Yes, there are, no doubt, other ways to complete this same task, including some ideas for the counting such as those in this post. Do you have other favorite ways to do this sort of task? Besides containers.Map, what other data structures would you like to see in MATLAB? As usual, please post your thoughts here.