The other day I was part of a discussion regarding how to detect fraud in large datasets and I was reminded of Benford’s Law… it really is quite amazing.

Essentially the law states that in a large enough collection of naturally occurring statistics, small leading digits are far more common than large ones: “1” turns up as the first digit about 30% of the time, while “9” manages less than 5%. This holds true for things as diverse as house numbers and stock prices.

If a large set of numbers is generated by a pseudo-random number generator to, say, fake atmospheric pollution data, the resulting set will most likely not adhere to Benford’s Law, which is a strong hint that it is fake. Of course, it is a fairly trivial exercise to analyse a large set of fake data and modify it so that it does adhere to Benford’s Law.
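
As a rough sketch of what such a check might look like, the shell function below tallies the leading digits of a list of numbers (one per line on standard input) and prints each digit's share. The function name `first_digit_share` and the input layout are my own assumptions for illustration, not part of any standard tool.

```shell
# Hypothetical helper: print the share of each leading digit (1-9)
# for the numbers arriving on stdin, one number per line.
first_digit_share() {
  # Drop signs, decimal points and leading zeros, so 0.042 counts as a "4".
  sed -e 's/[^0-9]//g' -e 's/^0*//' |
    awk 'length($0) > 0 { n[substr($0, 1, 1)]++; total++ }
         END {
           for (d = 1; d <= 9; d++)
             printf "%d %.1f%%\n", d, (total ? 100 * n[d] / total : 0)
         }'
}

# Tiny made-up sample, only to show the call.
printf '%s\n' 12 150 0.019 23 -3.5 | first_digit_share
```

On a real dataset you would call it as `first_digit_share < readings.txt` (a hypothetical file) and compare the shares with the expected Benford percentages.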

The table from the Wikipedia article shows the logarithmic nature of the expected leading-digit frequencies.

| Leading digit | Probability |
|---|---|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |
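
These percentages follow from the usual statement of Benford's Law, where the probability of leading digit d is log10(1 + 1/d). A small sketch that recomputes them with `awk` (the function name `benford_expected` is just an illustrative choice):

```shell
# Print the expected Benford share for each leading digit d,
# using P(d) = log10(1 + 1/d). awk's log() is the natural log,
# so divide by log(10) to get a base-10 logarithm.
benford_expected() {
  awk 'BEGIN {
    for (d = 1; d <= 9; d++)
      printf "%d %.1f%%\n", d, 100 * log(1 + 1 / d) / log(10)
  }'
}

benford_expected
```

Rounded to one decimal place, the output should match the table.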

What fascinates me even more is the question: would a large set of random numbers produced by an atmospheric-noise random number generator adhere to Benford’s Law, and if not, why? Surely if pollution data does adhere to the law then atmospheric noise, which is much like pollution, should too.

I think I might generate a large set of random numbers from atmospheric noise and have a look at it while I’m disconnected at the river this weekend.
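
A rough dry run of that experiment can be sketched without any atmospheric noise at all. The version below is only a stand-in: it swaps in `awk`'s pseudo-random `rand()` for the noise source, and it assumes the generator hands back integers spread uniformly over a fixed range (1 to 999999 here). The function name, the range and the seed are all my own choices for illustration.

```shell
# Stand-in experiment: awk's pseudo-random rand() replaces the
# atmospheric-noise source. $1 = how many numbers to draw, $2 = seed.
uniform_first_digits() {
  awk -v n="$1" -v seed="$2" 'BEGIN {
    srand(seed)
    for (i = 0; i < n; i++) {
      x = int(rand() * 999999) + 1   # uniform integer in 1..999999
      c[substr(x, 1, 1)]++           # tally its leading digit
    }
    for (d = 1; d <= 9; d++)
      printf "%d %.1f%%\n", d, 100 * c[d] / n
  }'
}

uniform_first_digits 100000 42
```

Under that uniform-range assumption each leading digit covers exactly one ninth of the range, so the shares should hover around 11% rather than follow Benford's curve. Whether real atmospheric-noise output behaves the same way depends on how the generator maps the noise to numbers.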

I hope this puzzles you as much as it does me…

j.


    $ for i in `seq 1 160` ; do echo $i | cut -c1; done | sort | uniq -c
    72 1
    11 2
    11 3
    11 4
    11 5
    11 6
    11 7
    11 8
    11 9

*gasp* The first 160 numbers stick to Benford’s law!

Let’s go higher:

    $ for i in `seq 1 260` ; do echo $i | cut -c1; done | sort | uniq -c
    111 1
    72 2
    11 3
    11 4
    11 5
    11 6
    11 7
    11 8
    11 9

*GASP*!

It’s a truism, dude. Obviously you find more numbers beginning with 1, because, like, 1 is the beginning of a larger number of lower numbers. We don’t start numbering houses at 56, we start at 1, and when we get to 100, then *every* house number begins with 1.

In statistical data, the numbers are not random, so there’s no reason to expect them to be randomly distributed.

You know, this morning before you left for work you said you were hoping to get Neil to comment on this. In my head I was thinking, “I bet Jonathan will be the first person to comment on this”. Tada, I was right.

That maths gene really runs in the family, eh Jonathan H?

(I on the other hand, still need my fingers to count….)