The other day I was part of a discussion regarding how to detect fraud in large datasets and I was reminded of Benford’s Law… it really is quite amazing.
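Essentially the law states that in a large enough collection of naturally occurring statistics the first digits are not evenly distributed: “1” is the leading digit about 30% of the time, and each higher digit is progressively less common, down to “9” at under 5%. This holds true for things as diverse as house numbers and stock prices.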
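For the curious, the expected share of each leading digit d is log10(1 + 1/d). Here is a minimal Python sketch that prints those shares; the function name benford_expected is just my own label, not anything standard.

```python
import math


def benford_expected():
    """Return the expected leading-digit probabilities for digits 1..9."""
    # Benford's Law: P(d) = log10(1 + 1/d)
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


if __name__ == "__main__":
    for digit, p in benford_expected().items():
        print(f"{digit}: {p:.1%}")
```

The nine probabilities sum to 1, because the product of (1 + 1/d) for d from 1 to 9 is exactly 10.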
If a large set of numbers is generated by a pseudo-random number generator to, say, fake atmospheric pollution data, the resulting set of numbers will typically not adhere to Benford’s Law, which marks it as suspect and worth a closer look. That said, it is a fairly trivial exercise for a faker to analyse a large set of data and modify it to adhere to Benford’s Law.
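To make that concrete, here is a rough Python sketch of the kind of check I have in mind. It tallies the leading digits of a data set and reports the largest gap between the observed shares and the Benford shares. The helper names, the sample sizes and the two test distributions are my own assumptions for illustration; a real fraud check would use a proper statistical test rather than a simple maximum gap.

```python
import math
import random
from collections import Counter


def leading_digit(x):
    """Return the first significant digit of a non-zero number."""
    # Scientific notation puts the first significant digit first,
    # e.g. 314.0 is formatted as '3.14000000000000000e+02'.
    return int(f"{abs(x):.17e}"[0])


def leading_digit_frequencies(values):
    """Observed share of each leading digit 1..9, ignoring zeros."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


def benford_expected():
    """Expected share of each leading digit under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def max_deviation(values):
    """Largest absolute gap between observed and expected shares."""
    observed = leading_digit_frequencies(values)
    expected = benford_expected()
    return max(abs(observed[d] - expected[d]) for d in range(1, 10))


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable

    # Assumption: a naive faker draws values uniformly from a fixed range.
    uniform = [rng.uniform(1, 1000) for _ in range(100_000)]

    # Assumption: values spread evenly across six orders of magnitude.
    spread = [10 ** rng.uniform(0, 6) for _ in range(100_000)]

    print("uniform draws, max deviation:", round(max_deviation(uniform), 3))
    print("log-spread draws, max deviation:", round(max_deviation(spread), 3))
```

Uniform draws over a fixed range give every leading digit roughly the same share, so they sit far from Benford. Values spread evenly across several orders of magnitude land very close to it, which is also why a determined faker can pass this check without much effort.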
The table from the Wikipedia article shows the logarithmic nature of these frequencies.
What fascinates me even further is the question: would a large set of random numbers produced by an atmospheric-noise random number generator adhere to Benford’s Law, and if not, why? Surely if pollution data does adhere to the law then atmospheric noise, which is much like pollution, should too.
I think I might generate a large set of random numbers from atmospheric noise and have a look at it while I’m disconnected at the river this weekend.
I hope this puzzles you as much as it does me…