Stats 3 – That stuff you hated in school

Part 3 of An astonishingly useful guide to data analysis for people that don’t like maths

 

Hopefully, after the last article, you know why you are here.

Now we are going to start learning stuff for real.

Now we embark on our magical mystery tour, through stuff that is so awesomely fun that you’ve probably relearned and forgotten it several times already.

Obviously this time will be different.

 

Mean, Median, and Mode

You just groaned in despair, didn’t you?

Mean, median, and mode are the three concepts that we get introduced to at school and then promptly mix up for the rest of our lives. They are also often associated with “range”, which we forget about completely because it doesn’t begin with M.

Median and mode represent concepts that most of us fail to absorb in school, and in most cases our lives have notably failed to be adversely affected. At this point you may even have come to the conclusion that you are better off leaving them well alone. If this is a prejudice that you have somehow acquired, then I’m right here to tell you that, well, you are pretty much right… mostly…

Let’s start things off with the mean. Means are unambiguously important, and if you somehow failed to get this one straight in school, we need to rectify that.

 

A memorable memorandum, regarding the multifarious means by which a mean or malicious man could confuse your mind, mayhap mentioning the many meanings of the multiple, most meaningful, manifestations of the mathematical mean

The mean is what most people think of as an average. It’s the sum of all the values in a dataset divided by the number of values (counting every value, including any repeats).

If you are having trouble visualising this, imagine a set of containers with different quantities of liquid inside. When you take the mean, you are mixing all the liquid from all the containers together and then splitting it back up equally. Means are intrinsically unhygienic.
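If you prefer to see the pouring done in code, here is a minimal Python sketch of the same idea. The container amounts are made up purely for illustration.

```python
# Hypothetical amounts of liquid (in ml) in four containers.
containers = [250, 100, 400, 50]

# Pour everything into one big bucket...
total = sum(containers)

# ...then share it back out equally between the containers.
mean = total / len(containers)

print(total)  # 800
print(mean)   # 200.0
```

Every container ends up with 200 ml, which is the mean of the original amounts.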

Because statistics is nothing if not needlessly confusing, whilst the mean is always the average, the average is not always a mean.

Strictly defined, “average” refers to the most appropriate measurement of central tendency; it can therefore also refer to the median or mode.

You don’t need to worry about this. If you can legitimately calculate the mean, then the average is the mean anyway, and the chances that you will come across the term “average” used for something else, without this being explicitly indicated, are small. Just try to make sure that you personally use “mean” unless you feel that your audience will be more comfortable with “average” or you happen to be playing Scrabble.

 

More confusion can arise from the way that, usually, when we are talking about means, we are referring to one of two different numbers. They are both means, but they are means of different things.

The sample mean is the mean of the subset of data you are actually working on. The population mean is the true mean of the entire population that you drew that sample from. It’s possible for these to be the same if you have sampled the entire population, but most often you will find yourself calculating a sample mean in order to try and predict the value of the population mean.

Like a lot of stuff with stats, the key to remembering this is to understand what the things are, rather than try to memorize the terms.

  • Population mean = mean of the whole population
  • Sample mean = mean of your sample
  • Often you will already have the sample mean, and be using it to predict the value of the population mean

When other people use these terms, it should be clear from context what they are talking about. Just try to ensure that your own work is clear in indicating which data the mean is associated with, and be ready to request clarification if other people haven’t done the same.
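To make the difference between the two means concrete, here is a rough Python sketch. Everything in it is an assumption for the sake of illustration: the population is 10,000 imaginary people with randomly generated heights, and the sample is 100 of them picked at random.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

# A made-up population: heights (in cm) of 10,000 imaginary people.
population = [random.gauss(170, 10) for _ in range(10_000)]

# In real life we rarely get to measure everyone, so we take a sample.
sample = random.sample(population, 100)

population_mean = mean(population)  # the "true" value we usually can't see
sample_mean = mean(sample)          # the value we can actually calculate

print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
```

The two numbers should come out close to each other but not identical, which is exactly why it matters that you say which one you are reporting.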

 

Range

The range, as defined in maths, is chiefly troublesome on account of being difficult for blog authors to clearly describe without using the word range in its everyday sense, thus risking confusion.

This is because they are the same thing.

You can work out the range by subtracting the smallest value in a dataset from the largest.

20,30,90,40

MEAN = 45

RANGE = 70
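The same sums can be checked in a couple of lines of Python. This is just a sketch using the four numbers above; the variable names are my own.

```python
values = [20, 30, 90, 40]

# Mean: add everything up, then divide by how many values there are.
mean = sum(values) / len(values)

# Range: largest value minus smallest value.
value_range = max(values) - min(values)

print(mean)         # 45.0
print(value_range)  # 70
```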

If you regularly use the term range correctly in other contexts, you are almost certainly going to use it correctly in this context too.

 

The Mode, a concept too simple to teach?

This is the point where many people start to struggle.

The problem with the mode is that it is an intensely intuitive concept, which is often taught in an extremely unintuitive way. People will tend to use the mode automatically when appropriate; they just won’t realise that they are doing it.

A lot of the problem stems from the way that the mode is taught, as it is usually bundled together with the median and mean in order to use the same examples.

The mode is the value that occurs most often in a dataset. For example, in the number sequence below, the mode is 2:

1,1,3,4,2,5,2,4,3,2

This is easy to understand, but doesn’t do a particularly good job of telling us why we need to use the mode.

That is because we wouldn’t.

It’s important to realise that while mean, median, and mode are taught together, they aren’t usually used together. For the sequence above, the mean is far more useful than the mode.
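For the curious, Python’s standard `statistics` module will work out both values for the number sequence above. This is only a quick sketch to show the two side by side.

```python
from statistics import mean, mode

numbers = [1, 1, 3, 4, 2, 5, 2, 4, 3, 2]

most_frequent = mode(numbers)  # the value that turns up most often
average = mean(numbers)        # the sum divided by the count

print(most_frequent)  # 2
print(average)        # 2.7
```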

Consider the following example, however.

taxi, taxi, bus, motorbike, car, helicopter, car, motorbike, bus, car

This is more likely to be recorded in the form of a table, like so:

 

Taxi   Bus   Motorbike   Car   Helicopter
2      2     2           3     1

 

Knowing that “car” occurs most often is of obvious value, but most people will automatically use data like this, without putting a term to what they are doing.
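If you ever need a computer to do the tallying for you, the transport example above can be sketched in Python with `collections.Counter`. I have written all of the entries in lower case so that “Taxi” and “taxi” are counted as the same thing; that tidy-up is my assumption, not something the data does for you.

```python
from collections import Counter

journeys = [
    "taxi", "taxi", "bus", "motorbike", "car",
    "helicopter", "car", "motorbike", "bus", "car",
]

# Count how many times each kind of transport appears.
counts = Counter(journeys)

for transport, count in counts.items():
    print(transport, count)

# The mode is simply the entry with the highest count.
most_common, how_often = counts.most_common(1)[0]
print(most_common, how_often)  # car 3
```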

It’s worthwhile to try and remember what the mode refers to, but it’s unlikely you will come across the term used much in the wild; it’s really just a descriptive term for something that is probably too simple to need one.

 

The median does not work well alone

The median represents another straightforward concept. It’s the middle value of a sequence of numbers (or of a group of numbers once they have been placed in order). If the sequence contains an even number of values, then it’s usually taken to be the mean of the two middle numbers, although it can also be given as a pair of numbers.

e.g. for the sequence 1,2,3,4,5 the median value is 3. For 1,2,3,4,5,6 the median value is 3.5.
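Here is a small Python sketch of those two examples, using the standard `statistics` module. The `median_low` and `median_high` functions give the two middle numbers, for the occasions when the median is reported as a pair.

```python
from statistics import median, median_high, median_low

odd_sequence = [1, 2, 3, 4, 5]
even_sequence = [1, 2, 3, 4, 5, 6]

print(median(odd_sequence))   # 3
print(median(even_sequence))  # 3.5

# The two middle numbers of the even-length sequence.
middle_pair = (median_low(even_sequence), median_high(even_sequence))
print(middle_pair)  # (3, 4)

# The order you supply the numbers in doesn't matter; they get sorted first.
print(median([5, 1, 4, 2, 3]))  # 3
```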

The problem here is that, whilst determining the median is fairly straightforward, it is not a terribly useful value in and of itself, and it can be very easy for people to confuse a median value for a mean.

Some statistical operations will require a median value, so it is useful to know what one represents, but a median value should not be presented by itself.

 

In summary

Ensure that you are clear on the precise meanings of mean and range. Remember to specify which data a mean value refers to. Try not to use the term average, as it is less precise, unless you believe that your audience may be unfamiliar with mean.

Don’t worry about the mode. It’s just not a very helpful term for most audiences, as it represents a concept that is much more straightforwardly represented in a chart or table. In theory, it may be helpful to remember the definition in case you come across someone else referring to it, but I’d tend to suggest that it might be more productive to just pelt them with rotten vegetables instead.

It’s quite possible that you will need to determine a median value at some point, but unless you are confident that you will be able to recall the precise definition, it might be sensible to look it up when you need it. Do not present a median value unless you have a clear reason to do so.

OK?

 

Now that I should be able to talk about mean numbers, without people worrying that I am projecting a tad too much, we can proceed to the next article, where we are going to talk about the starting point of analysis.

The data

This is, traditionally, also the point at which analysis goes horribly, horribly wrong.

 

 
