# Benford’s Law – What are the odds that the first digit is a ‘1’?

Posted by **Loren Shure**

I'd like to introduce this week's guest blogger Sam Mirsky. Sam is an Application Engineer here at MathWorks who focuses on real-time testing applications using Simulink. However, in this post he will talk about a non-intuitive characteristic of large data sets, and test the idea with a data set which ships with MATLAB.

In a large set of data, it seems intuitive that a number would be as likely to start with 1 as with any other digit. However, this is not true: the probability that the first digit is a 1 is much higher.

Since the first significant digit cannot be zero, the intuitive probability of a number starting with 1 (or any other digit) would be 1/9 ≈ 11%. According to Wikipedia: "The first digit is 1 about 30% of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than 5% of the time."
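The figures in the quote match the usual statement of Benford's Law, in which the leading digit d occurs with probability log10(1 + 1/d). This formula is not given in the post itself; the following is a minimal sketch of it, and the variable names `d` and `benford` are illustrative.

```matlab
% Expected leading-digit probabilities under Benford's Law.
d = 1:9;                         % possible leading digits
benford = log10(1 + 1 ./ d);     % P(d) = log10(1 + 1/d)

% Show each digit next to its expected percentage.
disp([d; 100 * benford].');
```

The nine probabilities sum to 1, with the value for 1 close to 30% and the value for 9 below 5%.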

### Load Data

Let us test this with a data set which ships with MATLAB: quake.mat. This data set contains accelerometer data from an earthquake in California; the code below uses its vertical-acceleration vector `v`.

`load quake`

### Find digit statistics

    stat(1:9) = 0;
    for i = 1:length(v)
        string = sprintf('%0.5e', abs(v(i)));
        firstDigit = str2double(string(1));
        switch firstDigit
            case 1
                stat(1) = stat(1) +1;
            case 2
                stat(2) = stat(2) +1;
            case 3
                stat(3) = stat(3) +1;
            case 4
                stat(4) = stat(4) +1;
            case 5
                stat(5) = stat(5) +1;
            case 6
                stat(6) = stat(6) +1;
            case 7
                stat(7) = stat(7) +1;
            case 8
                stat(8) = stat(8) +1;
            case 9
                stat(9) = stat(9) +1;
        end
    end

### Plot results

    statPercent = 100 * stat / sum(v ~= 0); % only use non-zero numbers for stats
    bar(statPercent);
    grid on;
    xlabel('First digit');
    ylabel('Percent');

### How this is used

Checking first digits against Benford's Law is one of the tests used to judge whether a data set is real or fabricated. For example, if you collect all the numbers on a federal income tax return, they should also obey Benford's Law.
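As a rough illustration of such a check, the sketch below measures how far a set of leading-digit frequencies is from the Benford expectation. The "fabricated" data (every three-digit integer once) and the largest-deviation measure are assumptions chosen for this example, not a method described in the post.

```matlab
% Illustrative made-up data: every three-digit integer exactly once.
fabricated = 100:999;
leadDigit = floor(fabricated / 100);     % leading digit of a 3-digit integer

% Observed share of each leading digit (1..9).
observed = accumarray(leadDigit(:), 1, [9 1]).' / numel(fabricated);

% Expected share under Benford's Law.
expected = log10(1 + 1 ./ (1:9));

% A simple distance between the two distributions.
maxDeviation = max(abs(observed - expected));
fprintf('Largest deviation: %.1f percentage points\n', 100 * maxDeviation);
```

Here every digit appears equally often, so the largest gap is at the digit 1. Real screening tools may use a formal goodness-of-fit statistic instead.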

### How would you use MATLAB to calculate these statistics?

As is typical with MATLAB, there are many ways to derive the same answer:

- What MATLAB commands would you use to analyze the first digit of numbers in a data set?
- Does Benford's Law apply to a data set you have (or not)? Show us your results here.

Published with MATLAB® 7.14

**Category:** Fun

## 13 Comments

**1** of 13

There is a much faster way to calculate the statistics of the first digit:

    v=abs(v(v~=0)); % clean up
    fv=floor(v./(10.^floor(log10(v)))); % calculate first digit
    n = histc(fv,1:9); % perform count
    bar(n/length(fv)); % plot

**2** of 13

One minor comment on floating-point accuracy for Benford’s law: for some numbers the result will differ depending on the data type, because of the accuracy of the floating-point representation. For example, 0.3 is not stored as exactly 0.3 on the computer, and depending on the algorithm its first digit might come out differently. E.g. v=v+eps will have slightly different values than v.
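The effect can be seen on the value 0.3 itself. The sketch below compares the log10-and-divide approach with the `sprintf` approach from the post. It assumes IEEE 754 double precision and that `10.^-1` evaluates to the same double as `0.1`, which is typical but not guaranteed on every platform.

```matlab
x = 0.3;

% log10-and-divide approach: scale x into [1, 10) and take the floor.
% In double precision 0.3 / 0.1 is slightly less than 3.
digitLog = floor(x ./ 10.^floor(log10(x)));

% Text approach: print in exponential notation and read the first character.
str = sprintf('%0.5e', x);
digitStr = str2double(str(1));

fprintf('log10 approach: %d, sprintf approach: %d\n', digitLog, digitStr);
```

Under those assumptions the two approaches disagree for this input, so values that sit exactly on a digit boundary deserve a second look.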

**3** of 13

    stat=zeros(9,1);
    str=num2str(v);
    for t=1:size(str,2)
        for s=1:9
            i=strfind(str(:,t).',num2str(s));
            str(i,:)=[];
            stat(s)=stat(s)+length(i);
        end
    end

**4** of 13

Yakov and Christian,

Thank you for your 1st digit algorithms! I appreciate your compact, efficient code.

**5** of 13

For fun, I once computed the first digits of the first 3000 Fibonacci numbers. They too nicely obey Benford’s law, as they should.
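The commenter's code is not shown. One possible way to reproduce the experiment is sketched below. Because Fibonacci numbers overflow double precision well before the 3000th term, the sketch keeps only a rescaled mantissa of each term; this rescaling trick is an assumption of the example, and accumulated rounding could in principle misplace a term that lies extremely close to a digit boundary.

```matlab
nTerms = 3000;
a = 1; b = 1;                 % F(1) and F(2), kept as mantissas in [1, 10)
counts = zeros(1, 9);
counts(1) = 2;                % F(1) = 1 and F(2) = 1 both start with 1

for k = 3:nTerms
    c = a + b;                % next term, on the current scale
    a = b;
    b = c;
    if b >= 10                % drop one decimal order of magnitude
        a = a / 10;
        b = b / 10;
    end
    lead = floor(b);          % leading digit of F(k)
    counts(lead) = counts(lead) + 1;
end

expected = log10(1 + 1 ./ (1:9));
disp([counts / nTerms; expected].');
```

The two displayed columns are the observed and expected shares for the digits 1 to 9.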

**6** of 13

Knowing Benford’s Law, I now know how to escape detection when I cheat. Thanks! :)

**7** of 13

Weather data doesn’t follow Benford’s law. The frequencies of the leading digits of the maximum temperature in Los Angeles between 1970 and 2008 are:

- 1: 78.8%
- 2: 2.8%
- 3: 0.3%
- 4: 0.3%
- 5: 1.2%
- 6: 2.6%
- 7: 4.0%
- 8: 6.2%
- 9: 3.7%

**8** of 13

Sambridge, M., H. Tkalčić, and A. Jackson (2010), Benford’s law in the natural sciences, Geophys. Res. Lett., 37, L22301, doi:10.1029/2010GL044830.

This is a nice little paper about Benford’s law in the natural sciences.

**9** of 13

I used the same method as Yakov. On my humble computer it was over 100x faster: 0.004 s versus 0.6 s for the computational piece (excluding the graph plotting).

    % tidy up the dataset by excluding zero entries and taking the absolute value
    % to invert negative entries
    dataset = abs(v(v~=0));

    % the conversion uses log10 to determine a factor that divides the value
    % down so that it is >= 1 and < 10; this value can then be rounded down
    % to access the first significant digit
    conversion = 10.^(floor(log10(dataset)));
    first_digit = floor(dataset ./ conversion);

    % use the histogram function to count the entries
    num = histc(first_digit,1:9);

    % plot the chart
    bar(100 .* (num./length(dataset)))
    grid on;
    xlabel('First digit');
    ylabel('Percent');

**10** of 13

Also see:

http://stackoverflow.com/questions/2602365/how-to-implement-benfords-law-in-matlab

**11** of 13

Matt-

The weather data has some extra “constraints” that many other data sets don’t have. Including units associated with them. And the way the temperature scale is set up, in most places on Earth, “1” is definitely not expected as the leading digit more than others – you’d either be very cold or very hot, in Fahrenheit at least!

–Loren

**12** of 13

Matt, Loren,

About data sets that display this type of behavior: the main characteristic this is a sign of is that the data is “scale-free”, i.e. there is no “typical” scale.

We tend to think that most distributions are “normal” (good P.R., calling it normal), and we give examples of height, weight, intelligence, etc.

These all have a specific “scale”: you could have someone particularly smart, or tall, or heavy, but it will not be by orders of magnitude.

In the real world, many distributions are in fact scale-free. Some examples (besides earthquakes) are:

- populations of cities
- market cap of companies
- number of links pointing to a site
- personal wealth
- size of meteors
- size of avalanches
- market returns of stocks

All of these span at least 6 orders of magnitude; there is no “typical” size. So this works.

People have mentioned the Fibonacci sequence. This is also true of the sequence of the powers of two (or any number...).

Now, if there is no scale involved, the original intuition of 11.1% (recurring) per digit fails, and is replaced by the length of each digit's interval on the logarithmic scale.

Fun.
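The "length on the logarithmic scale" idea can be checked against the powers of two mentioned above. The sketch below is one possible way to do it: the range of exponents is an arbitrary choice, and the leading digit is read from the printed form of each power, since `2^k` is stored exactly as a double in this range.

```matlab
n = 0:999;                           % exponents (illustrative range)
lead = zeros(size(n));
for k = 1:numel(n)
    s = sprintf('%.15e', 2 ^ n(k));  % e.g. 1.024...e+03 for 2^10
    lead(k) = s(1) - '0';            % first character is the leading digit
end
counts = accumarray(lead(:), 1, [9 1]).';

% Width of the interval [d, d+1) on a log10 axis, for d = 1..9.
expected = diff(log10(1:10));

disp([counts / numel(n); expected].');
```

The widths in `expected` are the same numbers as log10(1 + 1/d), which is why the share of each digit tracks the space it occupies on a logarithmic axis.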

**13** of 13

John D, Frederik, Manny, and Vish,

Thank you all for your additional comments and info! I have really enjoyed learning the additional applications of the law.