## Stats 3 – That stuff you hated in school

Part 3 of *An astonishingly useful guide to data analysis for people that don’t like maths*

Hopefully, after the last article, you know *why* you are here.

Now we are going to start learning stuff for real.

Now we embark on our magical mystery tour, through stuff that is so *awesomely* fun that you’ve probably relearned and forgotten it several times already.

*Obviously* this time will be different.

### Mean, Median, and Mode

You just groaned in despair, didn’t you?

*Mean*, *median*, and *mode* are the three concepts that we get introduced to at school and then promptly mix up for the rest of our lives. They are also often associated with “*range*”, which we forget about completely because it doesn’t begin with M.

*Median* and *mode* represent concepts that most of us fail to absorb in school, and in most cases our lives have notably failed to be adversely affected. At this point you may even have come to the conclusion that you are better off leaving them well alone. If this is a prejudice that you have somehow acquired, then I’m right here to tell you that, well, you are pretty much right… mostly…

Let’s start things off with the *mean*. Means *are* unambiguously important, and if you somehow failed to get this one straight in school, we need to rectify that.

#### A memorable memorandum, regarding the multifarious means by which a mean or malicious man could confuse your mind, mayhap mentioning the many meanings of the multiple, most meaningful, manifestations of the mathematical mean

The *mean* is what most people think of as an *average*. It’s the sum of all the values in a dataset divided by the number of separate values.

If you are having trouble visualising this, imagine a set of containers with different quantities of liquid inside. When you take the mean, you are mixing all the liquid from all the containers together and then splitting it back up equally. Means are intrinsically unhygienic.
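If the pour-and-split picture helps, here is a minimal Python sketch of it. The container volumes are made-up numbers, and the names `containers`, `total`, and `mean` are purely illustrative.

```python
# Hypothetical volumes (in millilitres) sitting in four containers.
containers = [200, 350, 50, 400]

# Pour everything into one big jug...
total = sum(containers)

# ...then split it back out equally between the same number of containers.
mean = total / len(containers)

print(mean)  # 250.0
```

Python’s built-in `statistics.mean(containers)` does the same job in one call, minus the mess.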

Because statistics is nothing if not needlessly confusing, whilst the mean is always the average, the *average* is not always a mean.

Strictly defined, “average” refers to the most appropriate measurement of *central tendency*; it can therefore *also* refer to the *median* or *mode*.

*You don’t need to worry about this*. If you can legitimately calculate the *mean*, then the *average* is the *mean* anyway, and the chances that you will come across the term “*average*” used for something else, without this being explicitly indicated, are small. Just try to make sure that you personally use “*mean*” unless you feel that your audience will be more comfortable with “*average*” or you happen to be playing Scrabble.

More confusion can arise from the way that, usually, when we are talking about *means*, we are referring to one of two *different* numbers. They are both means, but they are means *of* different things.

The *sample mean* is the mean of the subset of data you are actually working on. The *population mean* is the true mean of the entire population that you drew that sample from. It’s possible for these to be the same if you have sampled the entire population, but most often you will find yourself calculating a sample mean in order to try to predict the value of the population mean.

Like a lot of stuff with stats, the key to remembering this is to understand what the things are, rather than trying to memorise the terms.

- Population mean = mean of the whole *population*
- Sample mean = mean of your *sample*
- Often you will already have the *sample* mean, and be using it to predict the value of the *population* mean
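The difference between the two means can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the “population” is just the numbers 1 to 100, the sample size of 10 is arbitrary, and the seed is only there so that the run is repeatable.

```python
import random
import statistics

# A made-up "population": the whole numbers from 1 to 100.
population = list(range(1, 101))
population_mean = statistics.mean(population)  # mean of the whole population

# Draw a random sample of 10 values from that population.
rng = random.Random(42)  # fixed seed, purely so the run is repeatable
sample = rng.sample(population, 10)
sample_mean = statistics.mean(sample)  # mean of your sample

print("population mean:", population_mean)
print("sample mean:", sample_mean)

# "Sampling" the entire population gives back the population mean exactly.
everyone = rng.sample(population, len(population))
print(statistics.mean(everyone) == population_mean)
```

The sample mean will usually land somewhere near the population mean without matching it, which is exactly why it is worth being clear about which one you are quoting.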

When other people use these terms, it should be clear from context what they are talking about. Just try to ensure that your own work is clear in indicating *which* data the mean is associated with, and be ready to request clarification if other people haven’t done the same.

#### Range

The *range*, as defined in maths, is chiefly troublesome on account of being difficult for blog authors to clearly describe without using the word “range” in its everyday sense, thus risking confusion.

This is because they are the same thing.

You can work out the *range* by subtracting the smallest value in a dataset from the largest.

20,30,90,40

MEAN = 45

RANGE = 70
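For anyone who would rather check the arithmetic than trust it, the same four numbers can be run through a short Python sketch. The variable is called `value_range` (a name chosen here, not a standard one) because `range` already means something else in Python, which rather proves the point about confusion.

```python
values = [20, 30, 90, 40]

# Mean: add everything up, divide by how many values there are.
mean = sum(values) / len(values)

# Range: largest value minus smallest value.
value_range = max(values) - min(values)

print(mean)         # 45.0
print(value_range)  # 70
```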

If you regularly use the term range correctly in other contexts, you are almost certainly going to use it correctly in this context too.

#### The Mode, a concept too simple to teach?

This is the point where many people start to struggle.

The problem with the mode is that it is an intensely intuitive concept, which is often taught in an extremely unintuitive way. People will tend to use the mode automatically when appropriate; they just won’t realise that they are doing it.

A lot of the problem stems from the way that the mode is taught, as it is usually bundled together with the median and mean in order to use the same examples.

The mode is the value that occurs *most often* in a dataset. For example, in the number sequence below, the mode is 2:

1,1,3,4,2,5,2,4,3,2

This is easy to *understand*, but doesn’t do a particularly good job of telling us *why* we need to use the *mode.*

That is because we *wouldn’t*.

It’s important to realise that while *mean*, *median*, and *mode* are *taught* together, they aren’t usually *used* together. For the sequence above, the *mean* is far more useful than the *mode*.
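As a quick check on that number sequence, here is a small sketch using Python’s standard `statistics` module; the variable name `sequence` is just a label chosen for this example.

```python
import statistics

sequence = [1, 1, 3, 4, 2, 5, 2, 4, 3, 2]

# The value that turns up most often.
print(statistics.mode(sequence))  # 2

# The sum of the values divided by how many there are.
print(statistics.mean(sequence))  # 2.7
```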

Consider the following example, however.

taxi, taxi, bus, motorbike, car, helicopter, car, motorbike, bus, car

This is more likely to be recorded in the form of a table, like so:

| Taxi | Bus | Motorbike | Car | Helicopter |
|------|-----|-----------|-----|------------|
| 2    | 2   | 2         | 3   | 1          |

Knowing that “car” occurs most often is of obvious value, but most people will automatically use data like this, without putting a term to what they are doing.
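If you did want to put a name to it, tallying the transport list and picking the winner might look like the sketch below. It uses `collections.Counter` from Python’s standard library; the list name `journeys` is invented for the example, and the entries have been lower-cased so that “Taxi” and “taxi” count as the same thing.

```python
from collections import Counter

journeys = ["taxi", "taxi", "bus", "motorbike", "car",
            "helicopter", "car", "motorbike", "bus", "car"]

# Tally how often each kind of transport occurs.
counts = Counter(journeys)

# The mode is simply the entry with the biggest tally.
mode, frequency = counts.most_common(1)[0]

print(counts["taxi"], counts["bus"], counts["motorbike"],
      counts["car"], counts["helicopter"])  # 2 2 2 3 1
print(mode, frequency)  # car 3
```

Note that a mean makes no sense here at all: you cannot add a bus to a helicopter and divide by two.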

It’s worthwhile to try to remember what the *mode* refers to, but it’s unlikely you will come across the term used much in the wild; it’s really just a descriptive term for something that is probably too simple to need one.

#### The median, does not work well *alone*

The median represents another straightforward concept. It’s the middle value of a sequence of numbers (or group of numbers that can be placed into sequence). If the sequence contains an even number of values, then it’s usually taken to be the *mean* of the two middle numbers, although it can also be given as a pair of numbers.

For example, for the sequence 1,2,3,4,5 the *median* value is 3. For 1,2,3,4,5,6 the *median* value is 3.5.
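Those two sequences can be checked with a short sketch using Python’s standard `statistics` module. The list names `odd_count` and `even_count` are just labels for this example, and `median_low` and `median_high` are used here as one way of getting the “pair of numbers” version.

```python
import statistics

odd_count = [1, 2, 3, 4, 5]      # five values: one clear middle
even_count = [1, 2, 3, 4, 5, 6]  # six values: two middle numbers

print(statistics.median(odd_count))   # 3
print(statistics.median(even_count))  # 3.5 (the mean of 3 and 4)

# The alternative: report the two middle values themselves.
print(statistics.median_low(even_count),
      statistics.median_high(even_count))  # 3 4
```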

The problem here is that, whilst determining the median is fairly straightforward, it is not a terribly useful value in and of itself, and it can be very easy for people to confuse a *median* value for a *mean*.

Some statistical operations will require a median value, so it is useful to know what one represents, but a median value should not be presented by itself.

### In summary

Ensure that you are clear on the precise meanings of *mean* and *range*. Remember to specify *which data* a *mean* value refers to. Try not to use the term *average*, unless you believe that your audience may be unfamiliar with *mean*, as it is less precise.

Don’t worry about the *mode*; it’s just not a very helpful term for most audiences, as it represents a concept that is much more straightforwardly represented in a chart or table. In theory, it may be helpful to remember the definition in case you come across someone else referring to it, but I’d tend to suggest that it might be more productive to just pelt them with rotten vegetables instead.

It’s quite possible that you will need to determine a *median* value at some point, but unless you are confident that you will be able to recall the precise definition, it might be sensible to look it up when you need it. *Do not* present a *median* value unless you have a clear reason to do so.

OK?

Now that I should be able to talk about mean numbers, without people worrying that I am projecting a tad too much, we can proceed to the next article, where we are going to talk about the starting point of analysis.

The *data*

This is, traditionally, also the point at which analysis goes horribly, horribly wrong.