What this is
This article is aimed at the average person who wants to know more about evaluating their data. It is an examination of the theory behind statistics and a guide to some of the most obvious pitfalls and problems.
It is intended to form an accessible introduction to data analysis for people who don’t come from an analytical background, such as journalists, social media account managers, and new graduate students. Essentially, this is the guide that I wish I had had when I was starting my PhD.
I will be presenting some walkthroughs on basic statistical techniques, but I’m also going to spend quite a lot more time talking about them. If you are the kind of person who is currently consolidating all your data into two means on a single bar chart and feeling vaguely guilty about it, then this is the guide for you, and I will do my best to make you feel considerably more guilty by the end of all this.
What this isn’t
Statistics is a very complicated subject and this is not an exhaustive examination of it. I have worked quite a bit with stats, and I have worked hard to check this article and point out the inevitable oversimplifications as I make them. However, my background is in science; I am not a statistician or a mathematician.
Obviously the standard disclaimer applies: if you need to do something absolutely critical with data, you probably shouldn’t be relying solely on random blog articles. I’ll pause now so you can contemplate the wisdom of this, possibly make a mental note to double check anything you learn at some vaguely undefined point in the future, and then proceed as normal, because, after all, we all seem to learn everything from random blog articles these days.
And I should read this because?
A reasonable question to ask at this point is why exactly you should read this guide, rather than looking for something written by some distinguished statistician who is most certainly not going to make any mistakes, or at least none that won’t later form the subject of someone else’s thesis.
This is going to come down to what may seem to be a bewildering admission. I do not like stats.
I am not a natural mathematician, and I do not enjoy sums. I am a biologist. I have had to work quite a lot with data and, like many other biologists, I did not enjoy doing so. My natural inclination when approaching data analysis is to do as little actual maths as possible. Like many other scientists, I have sat at my keyboard, heart sinking, as the application of a treacherous sum tells me that the results I needed are suddenly beyond my reach again.
I do not like stats.
Why is this good?
Well, for a start, when I tell you that you need to know about this, you can really believe me. I am not someone who is quick to resort to numbers, but I’m about to spend quite a lot of time selling you on their importance. When I tell you that you need to apply a statistical test to something, you can be assured that, in the same situation, I have probably devoted far too much mental energy to trying to avoid doing so.
It also means that I have something of an appreciation of the stumbling blocks for those amongst you who don’t intuitively understand numbers, which, let’s face it, is pretty much everyone apart from the specific subset of people who typically write statistics guides.
I also have an appreciation for the second part of this picture, which far too many people miss. Data analysis is about much more than statistics and maths. They are important, but they are only part of the picture. More than anything else, the most effective way of reducing the amount of time you will have to spend doing stats is to make sure that, when you do, you are always doing the right stats using the right data.
The bad news
I want to start off with some honesty. This is a basic guide to a complicated subject. There is only so much material that can be skipped over before this whole endeavor becomes meaningless.
This isn’t a short series of articles, and some of the concepts are sufficiently complicated that they will likely require some breaks and re-reading, but if they are in the guide, it’s because I honestly believe that you will need to understand them to perform useful data analysis.
Above all else, statistics is a subject that requires a solid grounding in the fundamentals, and the failure to achieve this will often doom your analysis before you have even finished assembling your data.
So, please try to work through these articles in the order in which they are presented; the subjects are arranged that way for a reason. There will probably be some repetition of concepts for a lot of readers, but I hope that there will be valuable material throughout for just about everyone.
1) This Introduction – The one that you are reading now. In which I describe what I am trying to achieve with these articles.
2) So what are statistics, and what did they ever do for me? – A discussion of exactly what statistics is and why you should care about it. It is also going to talk about why data analysis is about a lot more than maths, charts and stats.
3) That stuff you hated in school – Is a reintroduction to basic statistical concepts such as means, medians, and modes (oh my), as well as the much neglected Range.
4) Assumptions, the root of all applicable evil – Is going to talk about the importance of understanding your data, knowing where it came from and cultivating an awareness of the assumptions that you are making.
5) Size Matters – Why big groups are better. Talks about sample size, as well as distribution and normality, and formally introduces you to Variance.
6) More stuff that you hoped you’d never come across again – Is going to talk about basic data analysis, error bars, chart usage and spreadsheet techniques, as well as problems with the above. In a rare act of mercy, there will be an option to skip past the charts and spreadsheets bits for people who are already familiar with them.
7) How to lie with statistics and why you shouldn’t – Is going to contain quite a lot of examples about how data can be presented in a misleading fashion, and why you shouldn’t do this.
8) I bet you never thought you’d look forward to stats – Is going to involve actual statistics, primarily a discussion of t-tests and a walkthrough of performing them with Excel. It is possible but unlikely that the sense of anticipation that you will feel as a result of the delay in getting to this part will overrule the small detail that maths will have to happen.
9) Numbers in their natural environment – Will provide a series of randomised datasets that people can use to practice these techniques and get a feel for realistic datasets, as well as instructions for generating your own.
So, starting off
Why should you even care about statistics?
Join me in part 2 to find out.