Loren on the Art of MATLAB

Benford’s Law – What are the odds that the first digit is a '1'?

Posted by Loren Shure,

I'd like to introduce this week's guest blogger Sam Mirsky. Sam is an Application Engineer here at MathWorks who focuses on real-time testing applications using Simulink. However, in this post he will talk about a non-intuitive characteristic of large data sets, and test the idea with a data set which ships with MATLAB.

In a large set of data, it seems that a number should be just as likely to start with 1 as with any other digit. However, this is not true. There is a much higher probability that the first digit is a 1.

Since the first significant digit is never zero, the intuitive probability of a number starting with 1 (or any other digit) would be 1/9 ≈ 11%. According to Wikipedia: "The first digit is 1 about 30% of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than 5% of the time."
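The percentages in that quote match the usual statement of Benford's Law, P(d) = log10(1 + 1/d) for d = 1 to 9. The formula is not given in this post, so treat the following as a minimal sketch that tabulates it; the variable names are only illustrative.

```matlab
% Expected leading-digit probabilities under Benford's Law.
% Assumption: the standard formula P(d) = log10(1 + 1/d), which is not
% stated in the post itself.
d = 1:9;
benford = log10(1 + 1 ./ d);   % probability of each leading digit

% Print a small table of digit versus expected percentage.
fprintf('%d: %4.1f%%\n', [d; 100 * benford]);
```

The nine probabilities sum to 1, because the terms telescope to log10(10).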

Load Data

Let us test this with a data set which ships with MATLAB: quake.mat. This is a data set with accelerometer data from an earthquake in California. The code below works on the variable v that loading this file creates.

load quake

Find digit statistics

stat = zeros(1, 9);                       % counts for first digits 1 through 9
for k = 1:length(v)
    str = sprintf('%0.5e', abs(v(k)));    % scientific notation, e.g. 1.23450e-02
    firstDigit = str2double(str(1));      % first character is the leading digit
    if firstDigit >= 1                    % skip zeros, which have no leading digit
        stat(firstDigit) = stat(firstDigit) + 1;
    end
end

Plot results

statPercent = 100 * stat / sum(v ~= 0);  % percentages; only non-zero numbers count
bar(statPercent);
grid on;
xlabel('First digit');
ylabel('Percent');
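To judge the fit by eye, it can help to draw the expected percentages on top of the bars. The following is a sketch, not the code used for this post: it assumes quake.mat provides the vector v as above, extracts the digit with a logarithm instead of the sprintf loop, and takes the expected curve from the standard formula P(d) = log10(1 + 1/d).

```matlab
% Sketch: compare observed leading-digit percentages with Benford's Law.
% Assumptions: quake.mat provides the vector v used in the post, and the
% expected curve is the standard formula P(d) = log10(1 + 1/d).
load quake
x = abs(v(v ~= 0));                            % drop zeros, ignore sign
firstDigit = floor(x ./ 10.^floor(log10(x)));  % leading digit, 1..9
firstDigit = min(max(firstDigit, 1), 9);       % guard against round-off at decade edges
observed = 100 * accumarray(firstDigit(:), 1, [9 1]) / numel(x);
expected = 100 * log10(1 + 1 ./ (1:9)');

bar(observed);
hold on;
plot(1:9, expected, 'r-o', 'LineWidth', 1.5);
hold off;
grid on;
xlabel('First digit');
ylabel('Percent');
legend('Observed', 'Benford''s Law');
```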

How this is used

Benford's Law is one test used to check whether a data set is real or fabricated. For example, if you collect all the numbers on a federal income tax return, they should also obey Benford's Law.
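One common way to turn this idea into a number is a chi-square goodness-of-fit check of the digit counts against the Benford proportions. The sketch below is only an illustration: the counts are made up, the 5% significance level is an arbitrary choice, and the critical value 15.507 is the tabulated chi-square value for 8 degrees of freedom at that level.

```matlab
% Sketch: chi-square goodness-of-fit check against Benford's Law.
% Assumptions: the digit counts below are made-up illustrative numbers,
% not real data, and the 5% significance level is an arbitrary choice.
counts = [301 176 125 97 79 67 58 51 46];     % hypothetical counts for digits 1..9
n = sum(counts);
expected = n * log10(1 + 1 ./ (1:9));         % counts Benford's Law would predict
chi2 = sum((counts - expected).^2 ./ expected);

critical = 15.507;   % chi-square critical value, 8 degrees of freedom, 5% level
if chi2 > critical
    disp('Leading digits deviate from Benford''s Law at the 5% level.');
else
    disp('Leading digits are consistent with Benford''s Law at the 5% level.');
end
```

A large statistic only flags a data set for a closer look. Plenty of legitimate data sets do not follow Benford's Law in the first place.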

How would you use MATLAB to calculate these statistics?

As is typical with MATLAB, there are many ways to derive the same answer:

  • What MATLAB commands would you use to analyze the first digit of numbers in a data set?
  • Does Benford's Law apply to a data set you have (or not)? Show us your results here.


Published with MATLAB® 7.14

13 Comments

There is a much faster way to calculate the statistics of the first digit:

v=abs(v(v~=0)); % clean up
fv=floor(v./(10.^floor(log10(v)))); % calculate first digit
n = histc(fv,1:9); % perform count
bar(n/length(fv)); % plot

One minor comment on floating-point accuracy for Benford’s law: for some numbers the result will differ depending on the data type, because floating-point representation has limited accuracy. For example, 0.3 is not stored as exactly 0.3 on the computer, so depending on the algorithm its first digit might show up differently. E.g. v=v+eps will have slightly different values than v.
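The 0.3 example can be tried directly. This is a sketch that runs the two digit-extraction approaches from this page on the same value. The two can disagree when the stored value sits just below a digit boundary, but the result depends on the platform's arithmetic, so no output is claimed here.

```matlab
% Sketch: two ways of extracting the leading digit of the same value.
% Assumption: 0.3 is used only as an illustration; results depend on the
% platform's floating-point arithmetic, so no output is claimed here.
x = 0.3;
fprintf('Stored value: %.20f\n', x);

% Method 1: scale by a power of ten and truncate.
digitByLog = floor(x / 10^floor(log10(x)));

% Method 2: format in scientific notation and read the first character.
s = sprintf('%0.5e', x);
digitByString = str2double(s(1));

fprintf('log10 method: %d, string method: %d\n', digitByLog, digitByString);
```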

stat=zeros(9,1);
str=num2str(v);
for t=1:size(str,2)
    for s=1:9
        i=strfind(str(:,t).',num2str(s));
        str(i,:)=[];
        stat(s)=stat(s)+length(i);
    end
end

Yakov and Christian,
Thank you for your 1st digit algorithms! I appreciate your compact, efficient code.

For fun, I once computed the first digits of the first 3000 Fibonacci numbers. They too nicely obey Benford’s law, as they should.
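The commenter's own code is not shown, so here is one possible sketch of that experiment. The 3000th Fibonacci number is far too large for a double, so this version rescales both terms by a power of ten whenever they grow large. That is an assumption of the sketch: it keeps the leading digit but makes the values approximate.

```matlab
% Sketch: leading digits of the first 3000 Fibonacci numbers.
% Assumption: the values are rescaled by powers of ten to stay within
% double range, so the digits are approximate (round-off could flip a
% digit for a value extremely close to a digit boundary).
N = 3000;
lead = zeros(N, 1);
a = 1; b = 1;                          % F(1) and F(2)
for k = 1:N
    lead(k) = floor(a / 10^floor(log10(a)));
    [a, b] = deal(b, a + b);           % advance the recurrence
    if b > 1e100                       % rescale both terms together
        a = a / 1e100;
        b = b / 1e100;
    end
end
lead = min(max(lead, 1), 9);           % guard against round-off at decade edges
observed = accumarray(lead, 1, [9 1]) / N;
expected = log10(1 + 1 ./ (1:9)');     % standard Benford formula
disp([observed expected]);
```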

Weather data doesn’t follow Benford’s law. The frequencies of the leading digits of the maximum temperature in Los Angeles between 1970 and 2008 are:

1: 78.8%
2: 2.8%
3: 0.3%
4: 0.3%
5: 1.2%
6: 2.6%
7: 4.0%
8: 6.2%
9: 3.7%

I used the same method as Yakov. On my humble computer this was over 100x faster for the computational piece: 0.004s versus 0.6s (excluding the graph plotting).

%tidy up the dataset by excluding zero entries and taking the absolute value
%to remove the sign of negative entries
dataset = abs(v(v~=0));
%conversion process uses the log10 to determine a factor that will divide
%the value down such that it is >= 1 and < 10. This value can be
%rounded down to obtain the first significant digit
conversion = 10.^(floor(log10(dataset)));
first_digit = floor(dataset ./ conversion);
%use the histogram function to count the entries
num = histc(first_digit,1:9);

%plot the chart
bar(100 .* (num./length(dataset)))
grid on;
xlabel('First digit');
ylabel('Percent');
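In newer MATLAB releases the documentation marks histc as not recommended and points to histcounts instead. A sketch of the same count with histcounts follows. It assumes MATLAB R2014b or later, and the sample values are made up purely for illustration.

```matlab
% Sketch: the same count using histcounts instead of histc.
% Assumptions: histcounts is available (MATLAB R2014b or later) and the
% sample values below are made up purely for illustration.
dataset = [0.012; 1.5; 19; 230; -3.7; 0; 4.2e5; 0.0081; 95; 1.1];
dataset = abs(dataset(dataset ~= 0));            % drop zeros, ignore sign
first_digit = floor(dataset ./ 10.^floor(log10(dataset)));

% Edges at 0.5, 1.5, ..., 9.5 put each integer digit in its own bin.
num = histcounts(first_digit, 0.5:1:9.5);
disp(num);
```

Placing the bin edges halfway between the integers avoids any question about which bin a value sitting exactly on an edge falls into.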

Matt-

The weather data has some extra “constraints” that many other data sets don’t have, including the units associated with them. And the way the temperature scale is set up, in most places on Earth, “1” is definitely not expected as the leading digit more than others – you’d either be very cold or very hot, in Fahrenheit at least!

–Loren

Matt, Loren,

About data sets that display this type of behavior:
the main characteristic it signals is that the data is “scale-free”, i.e. there is no “typical” scale.

We tend to think that most distributions are “normal” (good P.R., calling it normal), and we give examples such as height, weight, intelligence, etc.

These all have a specific “scale”: you could have someone particularly smart, or tall, or heavy, but it will not be by orders of magnitude.

In the real world, many distributions are in fact scale-free. Some examples (besides earthquakes) are:

populations of cities
market cap of companies
# links pointing to a site
personal wealth
size of meteors
size of avalanches
market returns of stocks

All of these span at least 6 orders of magnitude.
There is no “typical” size,
so this works.
People have mentioned the Fibonacci sequence. This is also true of the sequence of the powers of two (or of almost any number; powers of 10 are an obvious exception).

Now, if there is no scale involved, the original intuition of 11.1% per digit fails, and is replaced by the length each digit occupies on the logarithmic scale.
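The powers of two make a quick check of this picture. In the sketch below, the length each digit d occupies on the log scale is log10(d+1) - log10(d) = log10(1 + 1/d), and the leading digit of 2^n is read off from the fractional part of n*log10(2). The range of n and the small tolerance are assumptions of the sketch.

```matlab
% Sketch: leading digits of 2^n for n = 0..999, computed via logarithms.
% Assumptions: the range of n is an arbitrary choice, and a small
% tolerance is added so that exact values such as 2, 4 and 8 are not
% truncated downward by round-off.
n = (0:999)';
x = n * log10(2);                     % log10 of 2^n
mantissa = x - floor(x);              % fractional part, in [0, 1)
lead = floor(10 .^ mantissa + 1e-9);  % leading digit, 1..9
lead = min(max(lead, 1), 9);          % keep the result inside 1..9
observed = accumarray(lead, 1, [9 1]) / numel(n);
expected = log10(1 + 1 ./ (1:9)');    % interval lengths on the log scale
disp([observed expected]);
```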

fun

John D, Frederik, Manny, and Vish,

Thank you all for your additional comments and info! I have really enjoyed learning the additional applications of the law.

These postings are the author's and don't necessarily represent the opinions of MathWorks.