I am hopeful that I can get things settled down into a more normal schedule at some point in the immediate future, but until that happens I’d prefer not to get too carried away with filler content to the point of drowning out the writing resources that I’m intending to be the main focus of the site. My interest in the site is not waning, and I will still be updating the site as often as possible; it’s just that the updates are unlikely to adhere to a rigid schedule.

My plan for this website has always involved the long haul, and I hope that my readers can have some patience.

I would very much welcome any feedback that people can give. It’s easier for me to justify devoting time to the site right now if you can give me some reassurance that people are actually finding useful content here. Conversely, if people aren’t finding the site helpful, then I *need* to know that, especially if you can offer constructive suggestions as to how to improve things.

Thanks

This is because they are *important*.

Random data is random. People forget this, *a lot*.

This is especially important when you are using mean values, simply because nothing about a sample mean offers any useful information about its reliability as an estimate of the population mean. It’s very easy for them to mislead.

The above chart illustrates the mean values of 5 separate groups of data. Each set comprised ten *entirely* random numbers between 1 and 10. The final bar indicates the mean of all 50 values.

In this case you can treat the *population mean* as 5, and each set as a separate sample group the mean of which is obviously the *sample mean*. The final bar treats all 5 sets as one single group of 50 values, and indicates the *sample mean* for that case.

As you can see, even with ten data-points in a group, some of the *sample mean* values differ substantially from the expected *population mean* of 5.

Conventional wisdom suggests a minimum of three samples in each group before you start to work with data, but even with 10 samples a deviation of up to 20% from the expected population mean is apparent. If we treat all 50 values as a single sample, the sample mean almost exactly matches the population mean.
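The experiment behind the chart can be sketched in a few lines of Python, standing in for the Excel sheet the article actually used. (One pedantic note folded into a comment: the true expected mean of integers drawn uniformly between 1 and 10 is 5.5.)

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# Five groups of ten uniformly random integers between 1 and 10,
# mirroring the data behind the chart.
groups = [[random.randint(1, 10) for _ in range(10)] for _ in range(5)]

for i, group in enumerate(groups, start=1):
    print(f"Group {i} sample mean: {statistics.mean(group):.1f}")

# Pooling all 50 values gives a sample mean that tends to sit much
# closer to the true population mean (5.5 for this particular draw).
pooled = [value for group in groups for value in group]
print(f"Pooled sample mean: {statistics.mean(pooled):.2f}")
```

Re-run it without the seed and the group means jump around from run to run; the point is how far an individual group of ten can stray while the pooled mean stays comparatively stable.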

This is another set of mean values. There are 20 values here, representing the *deviation* from the expected *population mean* (50), of the *sample mean* of a set of 20 groups, each containing between 5 and 100 values. Each individual value was randomly generated between 1 and 100.

In other words, the graph does not show the *sample means*, it shows the extent to which they deviate from the expected *population mean.*

There are a couple of important things to take on board here.

- A small group is not *always* associated with a high deviation.
- The larger the group is, the less *likely* it is that the results will be skewed significantly by chance.

It is important to realise that it is still possible for a large group of samples to return a mean that deviates substantially from the true population mean; it just becomes *less likely* as the sample size is increased. The issue is one of *variance*, which is one of the next concepts that I’m going to discuss. But before I do that we need to discuss something else.

You may have noticed that the examples of random data that I gave in the previous example seemed to result in more variability in the sample means than seemed intuitive. This is because of how it was *distributed*.

It’s probably worth taking the time here to be clear on the difference between random distribution and random generation. Randomly *generated* numbers are those produced as a result of an ostensibly random process, in this case Excel’s number generation function. The randomly generated values that I showed you before are of *uniform* distribution, that is to say, the numbers have an equal chance of falling anywhere in the specified range.

One of the reasons that people are so bad at dealing with randomness is that it is not a common occurrence in real life. In reality, data tends to cluster around the sample mean. In fact, a lot of statistical procedures *assume* that data will follow what is called a normal distribution.

In the above chart each line of points represents a separate set of randomly generated data. The first line consists of 50 values that are *uniformly* distributed; the second, 50 *normally* distributed values. You can see that the first set spreads out fairly evenly between 1 and 100, whereas the second set tends to cluster nearer the expected mean of 50, although it still represents a considerable *range* of values.

The pattern is not as neat as you might expect, because the values are still *generated* randomly, which means that any pattern of data could have occurred. Excel could have delivered me 50 identical values of 100, for example, in *both* cases; it’s just (very) unlikely.
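A sketch of how the two sets could be generated with Python’s standard library instead of Excel. The standard deviation of 15 for the normal set is my own assumption, chosen purely for illustration:

```python
import random
import statistics

random.seed(1)  # fixed seed for reproducibility

# 50 uniformly distributed values: equal chance of landing
# anywhere between 1 and 100.
uniform_sample = [random.uniform(1, 100) for _ in range(50)]

# 50 normally distributed values: clustered around a mean of 50,
# with an assumed standard deviation of 15.
normal_sample = [random.gauss(50, 15) for _ in range(50)]

print(f"Uniform spread: {min(uniform_sample):.0f} to {max(uniform_sample):.0f}")
print(f"Normal spread:  {min(normal_sample):.0f} to {max(normal_sample):.0f}")
print(f"Means: {statistics.mean(uniform_sample):.1f} vs {statistics.mean(normal_sample):.1f}")
```

Both sets end up with similar means; the difference is that the normal set bunches its values much more tightly around 50.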

Most statistics programs will inform you if your data is not normally distributed, although they will often allow you to proceed anyway.

You need to be aware if your data is not normal. It may prevent the test from working properly, but it may also reveal problems with your data that you are not aware of.

Possible reasons why your data does *not* follow a normal distribution –

- There is a genuine and reasonable expectation that data would be randomly distributed – For example, you are looking for evidence of bias in what *should* be a random system, such as a roulette wheel.

- Subgroups within your data – If your data shows distinct clusters of normality it suggests that the factor that you are investigating is not consistent across the sample group or target population. It might also imply that the data was collected or processed inconsistently or conglomerated from separate sets.

- Exponential growth – A good example of this would be the way a story or joke is shared by Twitter users. Because each retweet increases the pool of people who can make further re-tweets, the most successful Twitter posts will tend to be dramatically more shared than the least. You can expect to see this pattern a lot with social media data.

- Fraud – You need to be careful here, but it is not uncommon for people who are falsifying data, but who do not have an in-depth knowledge of statistics, to do this by generating a *uniformly* random number series within a specified range, rather than a normally distributed one. You should be suspicious of *truly* uniform distributions within a large set or subset of numbers unless there is a very good explanation. The scarcity of uniform distributions within nature means that, with large groups of data, it is difficult for them to happen by accident or as a result of innocent mistakes that don’t happen to involve random number generators. But…

- It *can* still happen by chance – As previously stated, you will never be able to be absolutely certain about data unless you have all of it, and if you have all of it, you don’t need to make predictions about it.

You can never get away from chance when you are working with samples. Statistics does not allow you to make definitive statements about what is happening, but it does allow you to determine how *confident* you can be in your predictions.

Start by figuring out why your data isn’t normally distributed.

If your data shows distinct clusters, the best approach is going to involve trying to untangle the subgroups from each other; obviously this may reduce the sample size below the level required for good quality data.

In some cases, especially with data associated with exponential growth, data that isn’t normally distributed can be “transformed” to meet a normal distribution; however, this is very context-specific.
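As a sketch of one common transformation, a log transform often pulls exponentially grown data into a much more symmetric shape. The share counts below are invented purely for illustration:

```python
import math
import statistics

# Hypothetical share counts for ten social media posts: heavily
# right-skewed, as exponential sharing tends to produce.
shares = [1, 2, 3, 5, 8, 15, 40, 120, 800, 5000]

# Taking log10 compresses the long tail, spreading the values
# far more evenly around their centre.
log_shares = [math.log10(s) for s in shares]

print(f"Raw: mean {statistics.mean(shares):.1f}, median {statistics.median(shares)}")
print(f"Log: mean {statistics.mean(log_shares):.2f}, median {statistics.median(log_shares):.2f}")
```

On the raw scale the mean (about 599) is wildly different from the median (11.5), a classic sign of skew; after the transform the two sit far closer together. Note that `math.log10` requires strictly positive values, so any zeros in real data need handling before a transform like this.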

As you should now realise, the problems associated with small data-sets are a function of their increased exposure to chance.

You should also understand that representing data from a group of samples, just by indicating the mean values, provides no information about the level of variability *within* that group.

*Variance* describes the extent to which individual values fall close to the sample mean.

With the examples of random numbers given above, the numbers come from a machine, and machines suck at random. In real data, the values will result from a natural process of some type, which, while not random either, will in turn be influenced by other processes, and so on, until we have split enough hairs that quantum processes, which may actually be random, are involved, and at any rate it’s long since become impossible to keep track.

Remember that variance doesn’t spring into existence just because the universe hates you (although that’s my go-to explanation for a lot of other stuff). It comes from those aforementioned external influences, which we call *factors*.

Generally, if you are doing statistics, it’s because you wish to investigate one or more of these *factors*, but trying to do this doesn’t make all the other factors go away and stop influencing your samples, because even if the universe doesn’t hate you, it’s not about to start doing you any favours.

So…

*Normality* describes the pattern that data most often falls in, and *variance* describes how precisely this actually occurred. The variance is determined by all of the *factors* acting on your sample values other than the one that you are using to define your target population.

Considering the *assumptions* that you have made will often allow you to identify some of the most important *factors* ahead of time, understand their interaction with the one you wish to investigate and help you to assemble sample data that is more representative of the target population.

Variance reduces your ability to make accurate predictions about the target population from your sample values, and this in turn decreases your *confidence* in those predictions.

If you can’t identify or correct for the influence of additional factors on your data, you can oppose the resulting variance by increasing the number of samples that you take.

So how do we indicate variance and quantify confidence?

Come back for the next article in this series, in which I will finally start to talk about the actual process of data manipulation, starting with calculation and *appropriate use* of standard deviation and standard error.

So, we’ve had our little refresher on means, medians, and modes. Now we are going to look at the *data*: the starting point for your analysis, and very often the starting point for your screw-ups.

Data can be broken down into a broad range of different categories, all of which have distinct names, and this has a big impact on what you can do with it.

In deference to my lack of faith in your ability to remember names and my continued scepticism as to the actual real world value of remembering what things are called, I’m *not* going to approach this in the traditional manner and list them all.

What’s important is that you *think* about your data. If you think about your data, what it represents, and what you are intending to do with it, there are actually only a few situations in which you are likely to run into trouble.

This is mostly intuitive stuff. For example, a lot of the potential problems involve non-numerical data. Consider the previously used example of data regarding the mode of transport used to get to a hospital: it is not appropriate to perform most statistical operations on this data, but it is also conveniently *impossible* to perform most mathematical operations on it, which makes this very easy to remember.

You can calculate a *mode* for this data, but we’ve already discussed why you probably won’t need reminding to do this.

There are some data *types* that you need to watch out for however.

1) **Numbers that aren’t really numbers** – for example, numbered groups. Consider a school that has divided its students into ten separate reading groups *arbitrarily*.

|                    | Class 1             | Class 2              | Class 3                |
| ------------------ | ------------------- | -------------------- | ---------------------- |
| Student allocation | 1,3,5,8,4,7,3,4,2,9 | 6,3,4,2,6,1,2,1,7,10 | 9,2,6,3,7,4,10,3,6,2,4 |

As there is no discernible relationship between the groups, it is obviously not possible to manipulate this data mathematically. You can’t add the numbers up for each category and you can’t take a mean. Just remember to treat the numbers as if they aren’t numbers.

2) **Groups that are related in an irregular manner** – Consider the following, *completely different*, example.

|                    | Class 1             | Class 2              | Class 3                |
| ------------------ | ------------------- | -------------------- | ---------------------- |
| Student allocation | 1,3,5,8,4,7,3,4,2,9 | 6,3,4,2,6,1,2,1,7,10 | 9,2,6,3,7,4,10,3,6,2,4 |

This represents a similar situation to before, but in this case the students *have* been placed in groups according to *relative* ability, still with no consistent mathematical relationship between groups. E.g. the students in group 2 should be better readers than those in group 1, but it is impossible to suggest that they are twice as good. You can start to examine this data and make judgements using it, but you need to be careful. It’s not appropriate to indicate a mean: you couldn’t assume that group 5 represents average ability, and you can’t make reliable judgements by adding numbers.

3) **Groups that are related in a regular manner, within a separately defined range** –

In the real world, this kind of situation is more likely to occur with things like percentages of defined values, such as maximum possible height. Because these numbers are quantitatively related to each other you can take means and you can perform statistics.

You need to be aware however, that when values are expressed in relation to something else, that value will *change* as it does. It is impossible for a person to be at 110% of the *current* maximum human height, but they can be at 110% of what *was* the maximum human height in the year 1800 or 110% of the *average* human height. A bear can always be at 110% of the maximum human height because he isn’t a human.

4) **Groups that are related in a regular manner, but with an inconsistent zero. –**

The best example of this is temperature. Both the Centigrade and Fahrenheit scales have values that relate to each other in a regular way but have arbitrary zero values. It is therefore impossible to use such a scale to make assumptions about the relative absolute magnitude of values: 2°C is *not* twice as hot as 1°C. In these cases you *cannot* express values as ratios of each other and you *cannot* multiply and divide values (although you could convert the values to a scale on which it is permissible to do this, such as Kelvin for temperature). You *can* still calculate values for standard deviation and standard error and you *can* still perform a t-test.
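A two-line illustration of why the conversion matters, sketched in Python:

```python
# Celsius has an arbitrary zero, so ratios of Celsius values are
# meaningless: 2°C is not "twice as hot" as 1°C. Converting to
# Kelvin, an absolute scale, makes ratios legitimate.

def celsius_to_kelvin(celsius):
    return celsius + 273.15

print(2 / 1)                                        # naive ratio: 2.0
print(celsius_to_kelvin(2) / celsius_to_kelvin(1))  # true ratio: ~1.004
```

On the absolute scale, 2°C is barely warmer than 1°C at all.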

5) **Values that are expressed in multiple units or irregular units. –**

Once again, this is common sense. The usual examples of this are the typical expressions of times and dates: 8:05 is not the same as 8.05 hours, and 0.30 of an hour is not equal to 30 minutes. You just need to make sure that you have converted data to a single unit for calculations, that you remember to convert back again if necessary, and that you don’t convert units inappropriately. In particular you want to avoid using units that can have a variable relationship with others, such as months, or converting such values to other units without stating any assumptions that you are making in doing so.
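A minimal sketch of converting clock times to a single unit before calculating. The `clock_to_minutes` helper is a hypothetical name, written purely for illustration:

```python
# "8:05" means 8 hours and 5 minutes, which is not 8.05 hours.
# Convert everything to one unit (minutes) before doing arithmetic.

def clock_to_minutes(clock):
    hours, minutes = clock.split(":")
    return int(hours) * 60 + int(minutes)

print(clock_to_minutes("8:05"))  # 485 minutes
print(8.05 * 60)                 # roughly 483 -- not the same thing
```

Two minutes of difference looks trivial until it’s compounded across a few thousand rows of data.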

Speaking of assumptions…

You may have noticed that in the previous section I used the same set of data for two separate examples. This was not entirely a result of laziness. What I wanted to illustrate is that in all of these examples, the information that you have available to you is completely eclipsed by the information that you don’t. When you don’t have information available to you, you are forced to replace it with assumptions. Assumptions are *bad*, and it’s not always obvious when you are making them.

|                    | Class 1             | Class 2              | Class 3                |
| ------------------ | ------------------- | -------------------- | ---------------------- |
| Student allocation | 1,3,5,8,4,7,3,4,2,9 | 6,3,4,2,6,1,2,1,7,10 | 9,2,6,3,7,4,10,3,6,2,4 |

We’ve already discussed why it’s critical that you understand what each value represents and how they relate to the other values.

Some other important things that we don’t know about this data yet –

- Whether it encompasses all the classes in the school and all the students in each class.
- If it doesn’t, it doesn’t tell us what criteria were used to select the data we’ve been given.
- It doesn’t tell us whether the students were assigned to their groups by the same person and whether the criteria that should have been used to make the assignments were properly enacted.
- It doesn’t tell us when the allocation was made, if all the allocations were made at the same time, and whether they are still valid.
- It doesn’t tell us how the separate classes relate to each other, for example, it doesn’t tell us whether all the children are in the same year group.
- If classes in the school are *usually* distinguished by number, it doesn’t confirm for us that the classes mentioned in the data are numbered using the same scheme.
- It doesn’t identify any unusual circumstances that might apply to individual groups.

These are all important factors in evaluating this data, and if we don’t have this information available to us then we are forced to make assumptions.

In terms of good practice we should always make the most conservative assumption about the quality of data available, but in reality we aren’t always going to be able to do that and will have to try and weigh the likelihood that an assumption is false.

You will almost never be given all of the information that you *need*, but that doesn’t excuse you from thinking about the information that you have, considering the assumptions that you are making, and obtaining further information if possible (and if you can’t get it from the client, that doesn’t necessarily mean you can’t get it from somewhere else).

If nothing else, when you have identified your assumptions you will be able to clearly *state* the assumptions that you are making in your work.

This right here is most of data analysis: *think about your data, make sure it means what you think it means, and identify the assumptions that you are making.* Almost all of the bad data analysis that I’ve seen was doomed from the very beginning because of failures to do this.

Next Friday – *Size Matters*, talking about sample size and rambling about randomness

Hopefully, after the last article, you know *why* you are here.

Now we are going to start learning stuff for real.

Now we embark on our magical mystery tour, through stuff that is so *awesomely* fun that you’ve probably relearned and forgotten it several times already.

*Obviously* this time will be different.

You just groaned in despair, didn’t you?

*Mean*, *median*, and *mode* are the three concepts which we get introduced to at school and then promptly mix up for the rest of our lives. They are also often associated with *range*, which we forget about completely because it doesn’t begin with M.

*Median* and *mode* represent concepts that most of us fail to absorb in school, and in most cases our lives have notably failed to be adversely affected. At this point you may even have come to the conclusion that you are better off leaving them well alone. If this is a prejudice that you have somehow acquired, then I’m right here to tell you that, well, you are pretty much right… mostly…

Let’s start things off with the *mean*. Means *are* unambiguously important, and if you somehow failed to get this one straight in school, we need to rectify that.

The *mean* is what most people think of as an *average*. It’s the sum of all the values in a dataset divided by the number of separate values.

If you are having trouble visualising this, imagine a set of containers with different quantities of liquid inside. When you take the mean, you are mixing all the liquid from all the containers together and then splitting it back up equally. Means are intrinsically unhygienic.
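The container analogy translates directly into code; the volumes below are invented purely for illustration:

```python
# Four containers with different quantities of liquid.
volumes = [3, 7, 5, 9]

total = sum(volumes)         # mix all the liquid together...
mean = total / len(volumes)  # ...then split it back up equally

print(mean)  # 6.0
```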

Because statistics is nothing if not needlessly confusing, whilst the mean is always the average, the *average* is not always a mean.

Strictly defined “average” refers to the most appropriate measurement of *central tendency*; it therefore can *also* refer to the *median* or *mode*.

*You don’t need to worry about this*: if you can legitimately calculate the *mean*, then the *average* is the *mean* anyway, and the chances that you will come across the term “*average*” used for something else, without this being explicitly indicated, are small. Just try to make sure that you personally use “*mean*” unless you feel that your audience will be more comfortable with “*average*”, or you happen to be playing Scrabble.

More confusion can arise from the way that, usually, when we are talking about *means*, we are referring to one of two *different* numbers; they are both means, but they are means *of* different things.

The *sample mean* is the mean of the subset of data you are actually working on. The *population mean* is the true mean of the entire population that you drew that sample from. It’s possible for these to be the same if you have sampled the entire population, but most often you will find yourself calculating a sample mean in order to try and predict the value of the population mean.

Like a lot of stuff with stats, the key to remembering this is to understand what the things are, rather than try to memorize the terms.

- Population mean = mean of the whole *population*
- Sample mean = mean of your *sample*
- Often you will already have the *sample* mean, and be using it to predict the value of the *population* mean

When other people use these terms it should be clear from context what they are talking about. Just try to ensure that your own work is clear in indicating *which* data the mean is associated with, and be ready to request clarification if other people haven’t done the same.

The *range*, as defined in maths, is chiefly troublesome on account of being difficult for blog authors to clearly describe without using the word range in its everyday sense, thus risking confusion.

This is because they are the same thing.

You can work out the *range* by subtracting the smallest value in a dataset from the largest.

20,30,90,40

MEAN = 45

RANGE = 70
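The worked example above, checked in a couple of lines of Python:

```python
values = [20, 30, 90, 40]

mean = sum(values) / len(values)         # (20 + 30 + 90 + 40) / 4
value_range = max(values) - min(values)  # 90 - 20

print(mean)         # 45.0
print(value_range)  # 70
```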

If you regularly use the term range correctly in other contexts, you are almost certainly going to use it correctly in this context too.

This is the point where many people start to struggle.

The problem with the mode is that it is an intensely intuitive concept, which is often taught in an extremely unintuitive way. People will tend to use the mode automatically when appropriate, they just won’t realise that.

A lot of the problems stem from the way that the mode is taught, as it is usually bundled together with the median and mean in order to use the same examples.

The mode is the value that occurs *most often* in a dataset. For example, in the number sequence below, the mode is 2.

1,1,3,4,2,5,2,4,3,2

This is easy to *understand*, but doesn’t do a particularly good job of telling us *why* we need to use the *mode.*

That is because we *wouldn’t*.

It’s important to realise that while *mean*, *median*, and *mode* are *taught* together, they aren’t usually *used* together. For the sequence above, the *mean* is far more useful than the *mode*.

Consider the following example, however.

taxi, taxi, bus, motorbike, car, helicopter, car, motorbike, bus, car

This is more likely to be recorded in the form of a table, like so:

| Taxi | Bus | Motorbike | Car | Helicopter |
| ---- | --- | --------- | --- | ---------- |
| 2    | 2   | 2         | 3   | 1          |

Knowing that “car” occurs most often is of obvious value, but most people will automatically use data like this, without putting a term to what they are doing.
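Python’s standard library will tally categorical data like this for you; a quick sketch using the journey data above:

```python
from collections import Counter

journeys = ["taxi", "taxi", "bus", "motorbike", "car",
            "helicopter", "car", "motorbike", "bus", "car"]

# Counter tallies each category; most_common(1) picks out the mode.
counts = Counter(journeys)
print(counts.most_common(1))  # [('car', 3)] -- "car" is the mode
```

Which is exactly the “use the table, notice the biggest number” process people perform automatically, just with a name attached.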

It’s worthwhile to try and remember what the *mode* refers to, but it’s unlikely you will come across the term used much in the wild; it’s really just a descriptive term for something that is probably too simple to need one.

The median represents another straightforward concept. It’s the middle value of a sequence of numbers (or a group of numbers that can be placed into sequence). If the sequence has an even number of values then the median is usually taken to be the *mean* of the two middle numbers, although it can also be given as a pair of numbers.

E.g. for the sequence 1,2,3,4,5 the *median* value is 3; for 1,2,3,4,5,6 it is 3.5.
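Both cases, using Python’s standard `statistics` module:

```python
import statistics

odd_median = statistics.median([1, 2, 3, 4, 5])      # middle value
even_median = statistics.median([1, 2, 3, 4, 5, 6])  # mean of 3 and 4

print(odd_median)   # 3
print(even_median)  # 3.5
```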

The problem here is that, whilst determining the median is fairly straightforward, it is not a terribly useful value in and of itself, and it can be very easy for people to confuse a *median* value for a *mean*.

Some statistical operations will require a median value, so it is useful to know what one represents, but a median value should not be presented by itself.

Ensure that you are clear on the precise meanings of *mean*, and *range*. Remember to specify *which data* a *mean* value refers to. Try not to use the term *average*, unless you believe that your audience may be unfamiliar with *mean*, as it is less precise.

Don’t worry about the *mode*, it’s just not a very helpful term for most audiences, as it represents a concept that is much more straightforwardly represented in a chart or table. In theory, it may be helpful to remember the definition in case you come across someone else referring to it, but I’d tend to suggest that it might be more productive to just pelt them with rotten vegetables instead.

It’s quite possible that you will need to determine a *median* value at some point, but unless you are confident that you will be able to recall the precise definition, it might be sensible to look it up when you need it. *Do not* present a *median* value unless you have a clear reason to do so.

OK?

Now that I should be able to talk about mean numbers, without people worrying that I am projecting a tad too much, we can proceed to the next article, where we are going to talk about the starting point of analysis.

The *data*

This is, traditionally, also the point at which analysis goes horribly, horribly wrong.


This is the point when I’m probably supposed to break out the dictionary definition of statistics, but I’m not going to, because it means pretty much what you probably think it does. What I am going to do, though, in somewhat bewildered obeisance to the fact that there are now even specific clichés associated with stats tutorials, is break out *the* quotation.

“*There are three kinds of lies: lies, damned lies, and statistics.*”

Everyone has heard this one. It is usually attributed to Mark Twain, but the Internet informs me that its creation should instead be associated with one of a much broader range of historical people who, as with most famous quotations, almost certainly stole it from another witty but significantly less famous person anyway.

This is a great line, and you should be sure to share it with any statisticians that you know, especially if you are bigger than them and it wouldn’t be considered socially acceptable for them to hit you at that specific point in time.

And there is certainly an element of truth to it. Although there is something very *wrong* about the thought that even mathematicians get opportunities to use their powers for evil, statistics can be, and often is, abused. A lot of problems in this world can be attributed to the fact that statistics allows people to replace “because I said so” with the smug presentation of pages of very complicated sums.

When someone makes an argument with words then anyone can respond with more words, but when they make it with maths most people find themselves helpless. People don’t like this. Statistics allow disreputable people to mislead and stupid or lazy people to make mistakes that will go unnoticed because checking their work would involve a lot of work.

All too often then, we take this quotation to heart, repeat it to every statistician that we meet until they flee sobbing into the night, and use it to kindle a smug sense of superiority because, if there is one thing that we love, it’s opportunities to feel smugly superior about people we are worried are smarter than us.

There is just one huge problem with all of the above, and that is that all of the problems I just mentioned, are actually problems with people.

If there is a dark secret concerning statistics, it is simply this, *statistics works*. Statistics does exactly what it says on the tin, even if people seldom bother to actually read that tin and the labels can be swapped round by other unscrupulous people to sell you tins of caviar that will later turn out to be composed entirely of beans.

Happily, as you may have noted, this doesn’t really let the *statisticians* off the hook, so you can probably carry on feeling superior to them if you really want to.

So while I appreciate that at this point you may not be experiencing any upswing in warm fuzzy feelings about statistics, I hope you can see a few reasons to actually learn more about it.

First of all, you were *absolutely* correct to be suspicious of other people’s statistics, and I hate to break it to you, but if you want to find yourself in a position to identify and fend off bad data as it is presented to you, you will have to study this stuff yourself. In this respect, and only this respect, statistics are exactly like Jedi force powers.

Secondly, if you find yourself with data of your own to analyse, you have a lot to gain by using proper stats. Stats do work, using them is nothing to be ashamed of (honest), and no one is going to make you buy a pocket protector unless you really want to.

Stats will tell you whether your data tells you what you think it does. *This is important*. And whilst, as I have previously suggested, finding out that your data doesn’t tell you what you want it to be telling you is heart breaking, you may wish to consider that finding that out subsequent to the expenditure of another two years of budget can be heart *stopping*. And you need all the help you can get with this, because you have a brain that is terrible at this stuff.

And not just your brain, everybody’s brains. Your brain is a miracle of evolution (or, if you are American, sometimes theology) that has happened to develop in such a way as to be really bad at this kind of data analysis.

Specifically, your brain is good at searching for patterns in things, because finding patterns has been critical to the vast majority of all of the things that people have ever done. In the evolutionary sense patterns mean prizes, and in many cases these prizes involve not being eaten or poisoned by things that have eaten or poisoned other people in the past, thus establishing *very important trends*. Your brain is much better at finding patterns than it is at *not* finding patterns. Conversely, data manipulation often involves ignoring what *seem* to be obvious patterns in favour of voluntarily doing arithmetic. We can’t really blame Nature for not anticipating this.

People are especially bad at recognising or producing random data. You can investigate this yourself: if you ask people to give you a random number between one and five, they will tend to give you a three; between one and ten, a seven, because these numbers feel the most random.

People also tend to seek out results or patterns that confirm their own prejudices or expectations. Taking the previous example, you’ll notice that I’ve not given a citation. The above factoid is a much discussed one, and it ties in with my personal experience, but there is a danger here. I *can* find citations which offer experimental tests of this, but none of them actually meet the standards for data-set size or results presentation that I’m going to use in the rest of this article, and so I’m not going to use them.

The danger is that once something has become acknowledged as “fact” it tends to be reinforced. People will start running their own experiments and only reporting the results if they match their expectations. I started off searching for an example here by using search terms that would lead me to experimental confirmation of trends involving 3 and 7 but would probably miss other people whose experiments had instead found a bias for 1 and 2.

In this particular case I’m fairly confident that the pattern is real, but it’s important to consider this issue: it’s much easier to find something that you expect to find. And it’s very tempting to use data that you don’t really trust because it supports your own position, especially if you can blame someone else if it turns out to be bad.

This leads us neatly to the really important bit.

You’ll notice that I’m distinguishing between data analysis and stats. In my mind (and I don’t really care if the dictionary agrees with me) data analysis is what you do, stats is one of the things that you do it with. Stats are tools, and they are only a subset of the tools that are available to you.

Picture the scene: you are in a room, in front of you is a very complicated machine, a set of mostly recognisable tools, and an engineer. That engineer is going to stand silently watching you use *his* very expensive tools to disassemble the machine. You’ll most likely pick up something that looks vaguely like a wrench and move towards something that looks vaguely like a bolt, but first, look carefully over at the engineer. You’ll probably notice that the man is wincing.

Statistics comprises a lot of techniques that are used to perform a lot of tasks. Most of these are things that you couldn’t or shouldn’t do yourself, and that’s fine, because you are unlikely to need to. But even with the basic stuff it’s not enough to have a tool; you need to know how and when to use it.

The process of data analysis needs to follow the golden rules, and the rules are very simple:

- Stop
- Think
- Maths (if you really have to)

You’ll notice that thinking comes before the maths. The mistake that a lot of people make is assuming that it’s the maths that you need to be thinking about. It’s very easy to obsess over selecting the appropriate test and performing it properly, and this is what a lot of statistics guides focus on. That is certainly important, but there are a lot of other places where you can go wrong first.

This is why I told you at the start of this article that searching for a simple guide to the statistical test you think you should be using is a bad idea, and why I’m going to spend a lot of time in the next couple of chapters talking not about statistics, but about the starting point: the data.

So join me next week for Part 3, *That stuff you hated in school*

This article is aimed at the average person who wants to know more about evaluating their data. It is an examination of the theory behind statistics and guide to some of the most obvious pitfalls and problems.

It is intended to form an accessible introduction to data analysis for people who don’t come from an analytical background such as journalists, social media account managers, and new graduate students. Essentially this is the guide that I wish that I had when I was starting my PhD.

I will be presenting some walkthroughs of basic statistical techniques, but I’m also going to spend quite a lot more time talking about them. If you are the kind of person that is currently consolidating all your data into two means on a single bar chart and feeling vaguely guilty about it, then this is the guide for you, and I will do my best to make you feel considerably more guilty by the end of all this.

Statistics is a very complicated subject and this is not an *exhaustive* examination of it. I have worked quite a bit with stats, and I have worked hard to check this article and point out the inevitable oversimplifications as I make them. However, my background is in science; I am not a statistician or a mathematician.

Obviously the standard disclaimer applies: if you need to do something absolutely critical with data, you probably shouldn’t be relying solely on random blog articles. I’ll pause now so you can contemplate the wisdom of this, possibly make a mental note to double-check anything you learn at some vaguely undefined point in the future, and then proceed as normal, because, after all, we all seem to learn everything from random blog articles these days.

A reasonable question to ask at this point, then, is why exactly you should read *this* guide, rather than looking for something written by some distinguished statistician who is most certainly not going to make any mistakes, or at least none that won’t later form the subject of someone else’s thesis.

This is going to come down to what may seem to be a bewildering admission. I do not like stats.

I am not a natural mathematician, and I do not enjoy sums. I am a biologist; I have had to work quite a lot with data and, like many other biologists, I did not enjoy doing so. My natural inclination when approaching data analysis is to do as little actual maths as possible. Like many other scientists, I have sat at my keyboard, heart sinking, as the application of a treacherous *sum* tells me that the results I *needed* are suddenly beyond my reach again.

I do not like stats.

Why is this good?

Well, for a start, when I tell you that you need to know about this, you can really believe me. I am not someone who is quick to resort to numbers, but I’m about to spend quite a lot of time selling you on their importance. When I tell you that you need to apply a statistical test to something, you can be assured that, in the same situation, I have probably devoted far too much mental energy to trying to avoid doing so.

It also means that I have something of an appreciation of the stumbling blocks for those amongst you who don’t intuitively understand numbers, which, let’s face it, is pretty much everyone apart from the specific subset of people who typically write statistics guides.

I have an appreciation for the second part of this picture, the part that far too many people miss. Data analysis is about much more than statistics and maths. They are important, but they are only part of the picture. More than anything else, the best and most effective way of reducing the amount of time you will have to spend doing stats is to make sure that when you do, you are always doing the *right* stats using the *right* data.

I want to start off with some honesty. This is a basic guide to a complicated subject, and there is only so much material that can be skipped over before the whole endeavour becomes meaningless.

This isn’t a short series of articles, and some of the concepts are sufficiently complicated that they will likely require some breaks and re-reading, but if they are in the guide, it’s because I honestly believe that you will need to understand them to perform useful data analysis.

Above all else, statistics is a subject that requires a solid grounding in the fundamentals, and failure to achieve this will often doom your analysis before you have even finished assembling your *data*.

So, please try to work through these articles in the order in which they are presented. The subjects are presented in a particular order for a reason. There will probably be some repetition of concepts for a lot of readers, but I hope that there will be valuable material throughout for just about everyone.

1) This Introduction – The one that you are reading now. In which I describe what I am trying to achieve with these articles.

2) So what are statistics, and what did they ever do for me? – A discussion of exactly what statistics is and why you should care about it. It is also going to talk about why data analysis is about a lot more than maths, charts and stats.

3) That stuff you hated in school – Is a reintroduction to basic statistical concepts such as means, medians, and modes (oh my), as well as the much-neglected Range.

4) Assumptions, the root of all applicable evil – Is going to talk about the importance of understanding your data, knowing where it came from, and cultivating an awareness of the assumptions that you are making.

5) Size Matters – Why big groups are better. Talks about sample size, as well as distribution and normality, and formally introduces you to Variance.

6) More stuff that you hoped you’d never come across again – Is going to talk about basic data analysis, error bars, chart usage and spreadsheet techniques, as well as problems with the above. In a rare act of mercy, there will be an option to skip past the charts and spreadsheets bits for people who are already familiar with them.

7) How to lie with statistics and why you shouldn’t – Is going to contain quite a lot of examples about how data can be presented in a misleading fashion, and why *you* shouldn’t do this.

8) I bet you never thought you’d look forward to stats – Is going to involve actual statistics, primarily a discussion of t-tests and a walkthrough of performing them with Excel. It is possible but unlikely that the sense of anticipation that you will feel as a result of the delay in getting to this part will overrule the small detail that maths will have to happen.

9) Numbers in their natural environment – Will provide a series of randomised datasets that people can use to practise these techniques and get a feel for realistic datasets, as well as instructions for generating your own.
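As a small taste of what Part 9 will cover, here is one way you might generate a practice dataset yourself. This is a minimal sketch in Python (the walkthroughs in this series use Excel, so treat this as an illustrative alternative, not the method the later article will use): five groups of ten uniform random integers, in the same shape as the charted example earlier in the series.

```python
import random

# Five groups of ten uniform random integers between 1 and 10,
# mirroring the kind of practice dataset described in this series.
random.seed(42)  # fixed seed so the "random" data is reproducible
groups = [[random.randint(1, 10) for _ in range(10)] for _ in range(5)]

# The sample mean of a small group can wander a long way from the true
# population mean; the mean of all 50 values usually sits much closer.
group_means = [sum(g) / len(g) for g in groups]
overall_mean = sum(sum(g) for g in groups) / 50
print(group_means, overall_mean)
```

Re-run without the fixed seed and watch how much the individual group means jump around compared with the overall mean; that instability is exactly the point this series keeps returning to.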

So, starting off

Why should you even care about statistics?

Join me in part 2 to find out.
