Doug's MATLAB Video Tutorials
April 8th, 2010
Puzzler: Benford’s law
I had a question come in recently from Denmark about wanting to implement Benford’s law.
I first heard about this law on the Radiolab episode about Numbers. Radiolab is one of my favorite podcasts, highly recommended!
This law basically says that
“in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 almost one third of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than one time in twenty. This distribution of first digits arises whenever a set of values has logarithms that are distributed uniformly, as is approximately the case with many measurements of real-world values.”
(wikipedia)
Apparently this law can be applied to measurements of natural phenomenon like coastal windfarm energy output, numbers of geese migrating, and various areas of waterflow if they follow the above described distribution.
Where this law becomes more than a novelty is looking at accounting expense reports. Don’t try and sneak those tickets to see the OB play or the Mighty Boosh in among your travel expenses (even if you are about to leave the company!) When people are making up fake expenses, they will choose numbers from a normal distributions. However, they should be choosing them with more ones, and less nines as the leading digit!
Anyways, I found it surprisingly challenging to extract the leading non-zero digit from a number. Remember, the number can be a decimal, negative or both. I am always amazed at the elegant solutions people come up with for simple problems like this.
SAfe to say that if you post your function in the comments, the most elegant (whatever that means at the time I am looking them over…) get’s some MATLAB swag.
0.001345 –> 1
3452.3 –> 3
-582.3 –> 5
etc…
By
Doug Hull
Doug first used MATLAB in 1994, could not figure it out until he got some help in 1995. He is now dedicated to making sure that no one else wastes a year of their life not knowing MATLAB like he did.
15:58 UTC |
Posted in Topic: Puzzler |
Permalink |
21 Comments »
You can follow any responses to this entry through the RSS 2.0 feed.
You can skip to the end and leave a response. Pinging is currently not allowed.
Leave a Reply
|
The denominator normalizes x by dividing by the highest power of 10 that is less than x. This yeilds a number between 1 and 9.9999999. Then we just use floor to get the leading digit. To handle negative numbers, we just use abs(x) instead of x. This will break down for x=0, but then I’m not sure what the best return value is in that case.
Guess this works, but has four function calls at least:
This was the best I could come up with, during a quick lunchtime diversion. I would NOT classify this as elegant, but at least it sets a bar:
>> d = abs(fix(x./(10.^floor(log10(abs(x))))))
lol … I guess I was 8 minutes too slow (but I swear it was independent)
s = sprintf('%.16e',abs(x)); d = s(1); % or d = s(1) - '0';Doug’s is pretty interesting. Won’t work on vector x, though.
I was thinking along Doug’s lines… but was too late.
Anyway, this one is vectorable.
function leadingDigits = extractLeadingDigit(x)
str = num2str(abs(x(:)),’%.16e’);
leadingDigits = reshape(str2num(str(:,1)),size(x));
end
I see solutions have allready been presented. I wrote something quite similar to Mark Pattersons solution in my first edition of cloudPlot. Knowing the first digit of your ranges is critical to drawing pretty axis with appropriate limits and grid sizes.
My first thought was:
… but alas, it does not work for vectors.
However, Doug’s could be vectorized as follows…
str = sprintf('%1.16e',abs(x))'; firstDigit = reshape(str2num(str(1:23:end)),size(x))Here’s are vectorized solution using REGEXP:
One other interesting tidbit: the string-based solutions will give the right answer for x = eps(realmin), while the log-based solutions give Inf.
Oops! Forgot to include “,16″ in the call to NUM2STR.
Keep values numerical instead of string-based.
—————-
function first_dig_input(x) y=abs(x); if y < 1 % for input < 1 while y 10 % for input > 10 while y > 10 y=y./10; end else end soln=floor(y); sprintf('The first digit of %f is %d.',x,soln)‘<’ wordpress error fixed
—————
function first_dig_input(x) y=abs(x); if y < 1 while y < 1 y=10*y; end elseif y > 10 while y > 10 y=y./10; end else end soln=floor(y); sprintf('The first digit of %f is %d.',x,soln) endThe decimal part can be extracted with mod( X, 1 ), so:
>> X = [ 0.001345 3452.3 -582.3 ]; >> D = floor( 10.^mod( log10( abs(X) ), 1 ) ) D = 1 3 5>> X = 5000; >> D = floor( 10.^mod( log10( abs(X) ), 1 ) ) D = 4Welcome to sick sad world. Second version, less pretty, more functional:
>> X = [ 0.001345 3452.3 -582.3 5000 ]; >> D = floor( round( 10.^( 1 + mod( log10( abs(X) ), 1 ) ) ) / 10 ) D = 1 3 5 5My second solution is worth than the first.
@Mark Patterson,
All the solutions were good, yours was the first, so send me your t-shirt size and address. Will get something out to you soon.
Thanks all!
Doug
Doug,
I learned of Benford’s Law years ago and was facintated by it. It has been used to find fake tax returns among other things. The history of the discovery of this is also fascinating, and dates back to (paper) books of logarithm values used by scientists before calculators (or MATLAB) was available.
To test the law, I used the script below on real earthquake data which is shipped with MATLAB. I feel there are much more elegant and efficient methods to program this, but this was my first thought.
Besides the earthquake data, I also tried it with other shipping data: ‘seamount’, ‘penny’ (2-D), and ‘wind’ (3-D)
%% Load Data load quake v = e; vdim = length(v); %% Find digit statistics stat(1:9) = 0; for i = 1:vdim s = sprintf('%0.5e', abs(v(i))); v2(i) = str2num(s(1,1)); switch v2(i) case 1 stat(1) = stat(1) +1; case 2 stat(2) = stat(2) +1; case 3 stat(3) = stat(3) +1; case 4 stat(4) = stat(4) +1; case 5 stat(5) = stat(5) +1; case 6 stat(6) = stat(6) +1; case 7 stat(7) = stat(7) +1; case 8 stat(8) = stat(8) +1; case 9 stat(9) = stat(9) +1; end end %% Plot results statn = stat / vdim; bar(statn); grid on;@Sam,
Interesting. There is a weird spike in the ’5′ bin, but it fits pretty well otherwise!
Doug