This is the point when I’m probably supposed to break out the dictionary definition of statistics, but I’m not going to because it means pretty much what you probably think it does. What I am going to do though, in somewhat bewildered obeisance to the fact that there are now even specific clichés associated with stats tutorials, is break out the quotation.
“There are three kinds of lies: lies, damned lies, and statistics.”
Everyone has heard this one. It is usually attributed to Mark Twain, but the Internet informs me that its creation should instead be associated with one of a much broader range of historical people who, as with most famous quotations, almost certainly stole it from another witty but significantly less famous person anyway.
This is a great line, and you should be sure to share it with any statisticians that you know, especially if you are bigger than them and it wouldn’t be considered socially acceptable for them to hit you at that specific point in time.
And there is certainly an element of truth to it. Although there is something very wrong about the thought that even mathematicians get opportunities to use their powers for evil, statistics can be, and often is, abused. A lot of problems in this world can be attributed to the fact that statistics allows people to replace “because I said so” with the smug presentation of pages of very complicated sums.
When someone makes an argument with words, anyone can respond with more words, but when they make it with maths most people find themselves helpless. People don’t like this. Statistics allows disreputable people to mislead, and stupid or lazy people to make mistakes that will go unnoticed because checking their work would involve a lot of work.
All too often then, we take this quotation to heart, repeat it to every statistician that we meet until they flee sobbing into the night, and use it to kindle a smug sense of superiority because, if there is one thing that we love, it’s opportunities to feel smugly superior about people we are worried are smarter than us.
There is just one huge problem with all of the above, and that is that all of the problems I just mentioned are actually problems with people.
The terrible truth
If there is a dark secret concerning statistics, it is simply this: statistics works. Statistics does exactly what it says on the tin, even if people seldom bother to actually read that tin, and even if the labels can be swapped round by unscrupulous people to sell you tins of caviar that will later turn out to be composed entirely of beans.
Happily, as you may have noted, this doesn’t really let the statisticians off the hook, so you can probably carry on feeling superior to them if you really want to.
If you can’t beat them…
So while I appreciate that at this point you may not be experiencing any upswing in warm fuzzy feelings about statistics, I hope you can see a few reasons to actually learn more about it.
First of all, you were absolutely correct to be suspicious of other people’s statistics, and I hate to break it to you, but if you want to find yourself in a position to identify and fend off bad data as it is presented to you, you will have to study this stuff yourself. In this respect, and only this respect, statistics are exactly like Jedi force powers.
Secondly, if you find yourself with data of your own to analyse, you have a lot to gain by using proper stats. Stats do work, using them is nothing to be ashamed of (honest), and no one is going to make you buy a pocket protector unless you really want to.
Stats will tell you whether your data tells you what you think it does. This is important. And whilst, as I have previously suggested, finding out that your data doesn’t tell you what you want it to be telling you is heartbreaking, you may wish to consider that finding that out subsequent to the expenditure of another two years of budget can be heart-stopping. And you need all the help you can get with this, because you have a brain that is terrible at this stuff.
Yes, I just insulted your brain.
And not just your brain, everybody’s brain. Your brain is a miracle of evolution (or if you are American, sometimes theology) that has happened to develop in such a way as to be really bad at this kind of data analysis.
Specifically, your brain is good at searching for patterns in things, because finding patterns has been critical to the vast majority of all of the things that people have ever done. In the evolutionary sense patterns mean prizes, and in many cases these prizes involve not being eaten or poisoned by things that have eaten or poisoned other people in the past, thus establishing very important trends. Your brain is much better at finding patterns than it is at not finding patterns. Conversely, data manipulation often involves ignoring what seem to be obvious patterns in favour of voluntarily doing arithmetic. We can’t really blame Nature for not anticipating this.
People are especially bad at recognising or producing random data. You can investigate this yourself: if you ask people to give you a random number between one and five they will tend to give you a three, and between one and ten, a seven, because these numbers feel the most random.
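If you do try this on your friends, it helps to know what genuinely random answers would look like first. The following is a minimal sketch, not a proper analysis: the function names are my own invention, the "people" are simulated by Python's random number generator, and the chi-square figure is only a rough measure of how far a tally sits from "every number equally likely".

```python
import random
from collections import Counter


def tally_uniform_picks(n_picks, low=1, high=10, seed=0):
    """Simulate n_picks genuinely random choices and count each value."""
    rng = random.Random(seed)
    return Counter(rng.randint(low, high) for _ in range(n_picks))


def chi_square_uniform(counts, low=1, high=10):
    """Chi-square statistic against 'every value is equally likely'."""
    n_values = high - low + 1
    total = sum(counts.values())
    expected = total / n_values
    return sum(
        (counts.get(value, 0) - expected) ** 2 / expected
        for value in range(low, high + 1)
    )


if __name__ == "__main__":
    counts = tally_uniform_picks(1000)
    for value in range(1, 11):
        print(value, counts[value])
    print("chi-square:", round(chi_square_uniform(counts), 2))
```

With genuinely random picks each number should turn up roughly a tenth of the time and the statistic stays small. Feed `chi_square_uniform` a tally of real answers that pile up on seven and it will be much larger.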
Also, you’re biased
People also tend to seek out results or patterns that confirm their own prejudices or expectations. Taking the previous example, you’ll notice that I’ve not given a cite. The above factoid is a much discussed one, and it ties in with my personal experience, but there is a danger here. I can find citations that offer experimental tests of it, but none of them actually meets the standards for data-set size or results presentation that I’m going to use in the rest of this article, and so I’m not going to use them.
The danger is that once something has become acknowledged as “fact” it tends to be reinforced. People will start running their own experiments and only reporting the results if they match their expectations. I started off searching for an example here by using search terms that would lead me to experimental confirmation of trends involving 3 and 7 but would probably miss other people whose experiments had instead found a bias for 1 and 2.
In this particular case I’m fairly confident the pattern is real and does exist, but it’s important to consider this issue: it’s much easier to find something that you expect to find. And it’s very tempting to use data that you don’t really trust because it supports your own position, especially if you can blame someone else if it turns out to be bad.
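To see how reporting only the results you expected can manufacture a "fact", here is a small simulation. It is a sketch under loudly stated assumptions: every simulated person answers completely at random (so there is no real preference for seven at all), each survey asks twenty people, and the function names are hypothetical ones I made up for the illustration.

```python
import random
from collections import Counter


def run_experiment(rng, n_people=20):
    """One small survey in which every answer from 1 to 10 is equally likely."""
    answers = Counter(rng.randint(1, 10) for _ in range(n_people))
    top = max(answers.values())
    # Return every value that tied for most popular answer.
    return {value for value, count in answers.items() if count == top}


def selective_reporting(n_experiments=1000, expected_favourite=7, seed=0):
    """Run many surveys but 'publish' only those that match expectations."""
    rng = random.Random(seed)
    published = 0
    for _ in range(n_experiments):
        if expected_favourite in run_experiment(rng):
            published += 1
    return published, n_experiments


if __name__ == "__main__":
    published, total = selective_reporting()
    print(f"{published} of {total} surveys 'confirmed' that 7 is the favourite")
    print("All of the published surveys agree; the rest were never reported.")
```

Anyone who reads only the published surveys sees unanimous support for seven, even though the simulated answers had no preference whatsoever.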
This leads us neatly to the really important bit.
The really important bit
You’ll notice that I’m distinguishing between data analysis and stats. In my mind (and I don’t really care if the dictionary agrees with me) data analysis is what you do; stats is one of the things that you do it with. Stats are tools, and they are only a subset of the tools that are available to you.
Picture the scene: you are in a room, and in front of you are a very complicated machine, a set of mostly recognisable tools and an engineer. That engineer is going to stand silently watching you use his very expensive tools to disassemble the machine. You’ll most likely pick up something that looks vaguely like a wrench and move towards something that looks vaguely like a bolt, but first, please look carefully over towards the engineer. You’ll probably notice that the man is wincing.
Statistics comprises a lot of techniques that are used to perform a lot of tasks. Most of these are things that you couldn’t or shouldn’t do yourself, and that’s fine, because you are unlikely to need to. But even with the basic stuff it’s not enough to have a tool; you need to know how and when to use it.
The process of data analysis needs to follow the golden rules and the rules are very simple.
- Maths (if you really have to)
You’ll notice that thinking comes first. The mistake that a lot of people make is assuming that it’s the maths that you need to be thinking about. It’s very easy to obsess about selecting the appropriate test to use and performing it properly. This is what a lot of statistics guides obsess over. That is certainly important, but there are a lot of other places that you can go wrong first.
This is why I told you at the start of this article that searching for a simple guide to the statistical test that you think you should be using is a bad idea, and that is why I’m going to spend a lot of time in the next couple of chapters talking not about statistics, but about the starting point: the data.
So join me next week for Part 3, That stuff you hated in school