I'd like to introduce this week's guest blogger Sam Mirsky. Sam is an Application Engineer here at MathWorks who focuses on real-time testing applications using Simulink. However, in this post he will talk about a non-intuitive characteristic of large data sets, and test the idea with a data set which ships with MATLAB.
In a large data set, it seems that a number would be as likely to start with 1 as with any other digit. However, this is not true: the first digit is far more likely to be a 1. This pattern is known as Benford's Law.
Since the first significant digit cannot be zero, the intuitive probability of a number starting with 1 (or any other digit) would be 1/9 ≈ 11%. According to Wikipedia: "The first digit is 1 about 30% of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than 5% of the time."
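For reference, Benford's Law puts the expected share of leading digit d at log10(1 + 1/d). The short sketch below tabulates those shares so they can be set against a measured distribution. The variable names `d` and `benford` are chosen here for illustration and are not part of the original post.

```matlab
% Expected leading-digit shares under Benford's Law:
% P(d) = log10(1 + 1/d) for d = 1..9.
d       = 1:9;
benford = log10(1 + 1 ./ d);

% Print each digit with its expected share in percent.
for k = d
    fprintf('%d: %5.1f%%\n', k, 100 * benford(k));
end
```

The nine shares sum to 1, because the product of (1 + 1/d) over d = 1..9 is exactly 10.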
Let us test this with a data set which ships with MATLAB: quake.mat. This is a data set with accelerometer data from an earthquake in California.
load quake   % quake.mat ships with MATLAB; v is one of the variables it contains

stat(1:9) = 0;
for i = 1:length(v)
    % scientific notation puts the first significant digit in position 1
    str = sprintf('%0.5e', abs(v(i)));
    firstDigit = str2double(str(1));
    if firstDigit >= 1   % zeros are not counted
        stat(firstDigit) = stat(firstDigit) + 1;
    end
end

statPercent = 100 * stat / sum(v ~= 0); %only use non-zero numbers for stats
bar(statPercent);
grid on;
xlabel('First digit');
ylabel('Percent');
This is one of the tests used to check whether a data set is real or fabricated. For example, if you collect all the numbers on a federal income tax return, they should also obey Benford's Law.
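One simple way to put a number on "obeys Benford's Law" is a chi-square style distance between the observed digit counts and the expected ones. The sketch below is only an illustration of that idea, not a formal fraud test: the helper names `leadingDigit` and `benfordDistance` and both sample vectors are made up here, and no significance threshold is implied.

```matlab
% Leading digit of each positive value: scale into [1, 10) and truncate.
leadingDigit = @(x) floor(x ./ 10.^floor(log10(x)));

% Expected Benford shares for digits 1..9.
expected = log10(1 + 1 ./ (1:9));

% Chi-square style distance between observed digit counts and Benford's Law.
% accumarray tallies how often each digit 1..9 occurs.
benfordDistance = @(x) sum((accumarray(leadingDigit(x(:)), 1, [9 1])' ...
    - numel(x) * expected).^2 ./ (numel(x) * expected));

% Hypothetical samples, invented for illustration only:
spread  = 10 .^ ((0:2999) / 1000);   % evenly spread over three decades on a log scale
uniform = 100:999;                   % every leading digit equally common

fprintf('log-spread sample:    %.2f\n', benfordDistance(spread));
fprintf('uniform-digit sample: %.2f\n', benfordDistance(uniform));
```

The log-spread sample should score close to zero, while the sample with equally common leading digits should score far higher.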
As is typical with MATLAB, there are many ways to derive the same answer:
- What MATLAB commands would you use to analyze the first digit of numbers in a data set?
- Does Benford's Law apply to a data set you have (or not)? Show us your results here.
Published with MATLAB® 7.14
13 Comments
There is a much faster way to calculate statistics of the first digit:
v=abs(v(v~=0)); % clean up
fv=floor(v./(10.^floor(log10(v)))); % calculate first digit
n = histc(fv,1:9); % perform count
bar(n/length(fv)); % plot
One minor comment on floating point accuracy for Benford’s law. For some numbers the result will differ depending on data type, because floating point numbers are stored with limited accuracy. For example, 0.3 is not stored as exactly 0.3 on the computer, so depending on the algorithm its first digit might show up differently. E.g. v=v+eps will have slightly different values than v.
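The point about 0.3 can be checked directly by running the two first-digit methods from this page side by side. This is a small sketch with helper names (`digitByLog`, `digitByString`, `firstChar`) invented for illustration. On typical IEEE double arithmetic the quotient 0.3/0.1 may land just below 3, in which case the arithmetic method reports 2, so the printed values are worth checking on your own machine.

```matlab
% Two ways to extract the leading digit of a positive number.
digitByLog    = @(x) floor(x ./ 10.^floor(log10(x)));        % arithmetic approach
firstChar     = @(s) s(1);                                   % helper: first character
digitByString = @(x) firstChar(sprintf('%0.5e', x)) - '0';   % string approach

x = 0.3;
fprintf('stored value of 0.3 : %.20f\n', x);
fprintf('scaled into [1, 10) : %.20f\n', x / 10^floor(log10(x)));
fprintf('digit via log10     : %d\n', digitByLog(x));
fprintf('digit via sprintf   : %d\n', digitByString(x));
```

The string method rounds to six significant digits before reading the first character, which is why it is less sensitive to the last bit of the stored value. The flip side is that a value such as 9.9999999 is rounded up and counted as a leading 1.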
Yakov and Christian,
Thank you for your 1st digit algorithms! I appreciate your compact, efficient code.
For fun, I once computed the first digits of the first 3000 Fibonacci numbers. They too obey Benford’s law nicely, as they should.
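The 3000th Fibonacci number is far too large for a double, so a direct loop would overflow. One way around this, sketched below, is to keep the two running terms divided by a shared power of ten so that only the leading digits are tracked. This is one possible approach and not necessarily how the commenter did it; the variable names are chosen for illustration.

```matlab
% Leading digits of the first N Fibonacci numbers without overflow.
% a and b hold consecutive Fibonacci numbers divided by a common power
% of ten, so a always stays in [1, 10) and floor(a) is its leading digit.
N = 3000;
a = 1;  b = 1;                 % F(1) and F(2)
counts = zeros(1, 9);
for k = 1:N
    counts(floor(a)) = counts(floor(a)) + 1;
    next = a + b;
    a = b;
    b = next;
    if a >= 10                 % rescale both terms together
        a = a / 10;
        b = b / 10;
    end
end

% Compare observed shares with log10(1 + 1/d).
for d = 1:9
    fprintf('%d: observed %.3f, Benford %.3f\n', d, counts(d) / N, log10(1 + 1/d));
end
```

Because each term is at most twice the previous one, a single division by ten per step is enough to keep `a` below 10.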
Knowing Benford’s Law, I now know how to escape detection when I cheat. Thanks! :)
Weather data doesn’t follow Benford’s law. The frequencies of the leading digits of the maximum temperature in Los Angeles between 1970 and 2008 are:
Sambridge, M., H. Tkalčić, and A. Jackson (2010), Benford’s law in the natural sciences, Geophys. Res. Lett., 37, L22301, doi:10.1029/2010GL044830.
This is a nice little paper about Benford’s law in the natural sciences.
I used the same method as Yakov. On my humble computer this was over 100x faster: 0.004 s versus 0.6 s for the computational piece (excluding the graph plotting).
%tidy up the dataset by excluding zero entries and take the absolute value
%to invert negative entries
dataset = abs(v(v~=0));
%conversion process uses log10 to determine a factor that divides
%the value down so that it is >= 1 and < 10. This value can be
%rounded down to access the first significant digit
conversion = 10.^(floor(log10(dataset)));
first_digit = floor(dataset ./ conversion);
%use the histogram function to count the entries
num = histc(first_digit,1:9);
%plot the chart
bar(100 .* (num./length(dataset)))
The weather data has some extra “constraints” that many other data sets don’t have, including the units associated with them. And the way the temperature scale is set up, in most places on Earth, “1” is definitely not expected as the leading digit more than others – you’d either be very cold or very hot, in Fahrenheit at least!
About data sets that display this type of behavior:
the main characteristic it signals is that the data is “scale-free”, i.e. there is no “typical” scale.
We tend to think that most distributions are “normal” (good P.R. calling it normal), and we give examples of height, weight, intelligence, etc.
These all have a specific “scale”: you could have someone particularly smart, or tall, or heavy, but it will not be by orders of magnitude.
In the real world, many distributions are in fact scale-free. Some examples (besides earthquakes) are:
populations of cities
market cap of companies
# links pointing to a site
size of meteors
size of avalanches
market returns of stocks
all of these span at least 6 orders of magnitude
there is no “typical” size.
so this works.
People have mentioned the Fibonacci sequence. This is also true of the sequence of the powers of two (or of almost any other number).
Now, if there is no scale involved, the original intuition of 11.1% (recurring) per digit fails, and is replaced by the width each digit occupies on a logarithmic scale.
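The "length on the logarithmic scale" idea and the powers-of-two claim can both be checked in a few lines. The sketch below is an illustration with variable names chosen here: it computes the width of each interval [d, d+1) on a log10 axis and compares it with the leading digits of 2^n for n = 0..999, read off the fractional part of n*log10(2) so that no huge numbers have to be formed.

```matlab
% Width of each leading-digit interval [d, d+1) on a log10 axis.
logWidth = diff(log10(1:10));            % nine widths that sum to 1

% Leading digits of 2^n for n = 0..999 via the fractional part of n*log10(2).
n        = 0:999;
mantissa = 10 .^ mod(n * log10(2), 1);   % each value lies in [1, 10)
observed = accumarray(floor(mantissa(:)), 1, [9 1])' / numel(n);

for d = 1:9
    fprintf('%d: log-scale width %.3f, powers of two %.3f\n', ...
        d, logWidth(d), observed(d));
end
```

The two columns should agree closely, which is the sense in which a sequence with no typical scale fills the logarithmic axis evenly.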
John D, Frederik, Manny, and Vish,
Thank you all for your additional comments and info! I have really enjoyed learning the additional applications of the law.