## Stats 5 – Power to the people, why large groups are just better

Welcome back to *An astonishingly useful guide to data analysis for people that don’t like maths*. In last week’s *exciting* episode you learned how to evaluate your data before you start doing maths with it. This week I’m sure you’ll be thrilled to discover that I’m going to introduce you to some more important concepts.

This is because they are *important*.

### Random facts –

Random data is random. People forget this, *a lot*.

This is especially important when you are using mean values, simply because nothing about a sample mean offers any useful information about its reliability as an estimate of the population mean. It’s very easy for them to mislead.

The above chart illustrates the mean values of 5 separate groups of data. The data in each set comprised ten *entirely* random numbers between 1 and 10. The final bar indicates the mean of all 50 values.

In this case you can treat the *population mean* as 5, and each set as a separate sample group the mean of which is obviously the *sample mean*. The final bar treats all 5 sets as one single group of 50 values, and indicates the *sample mean* for that case.

As you can see, even with ten data-points in a group, some of the *sample mean* values differ substantially from the expected *population mean* of 5.

Conventional wisdom suggests a minimum of three samples in each group before you start to work with data, but even with 10 samples, a deviation of up to 20% from the expected population mean is apparent. If we treat all 50 values as a single sample, the sample mean almost exactly matches the population mean.
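If you want to try this for yourself, here’s a minimal Python sketch of the same experiment (the article’s charts came from Excel, but the idea is identical; the seed is an arbitrary choice that just makes the run repeatable):

```python
import random
import statistics

random.seed(42)  # arbitrary seed, fixed only so the run is repeatable

# Five groups of ten uniformly random integers between 1 and 10.
# (Strictly, the expected mean of integers 1-10 is 5.5; the article
# rounds this to 5.)
groups = [[random.randint(1, 10) for _ in range(10)] for _ in range(5)]

sample_means = [statistics.mean(g) for g in groups]
overall_mean = statistics.mean(v for g in groups for v in g)

print("group means:          ", sample_means)
print("mean of all 50 values:", overall_mean)
```

Run it a few times with different seeds and you will see individual group means wander around while the overall mean stays comparatively steady.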

This is another set of mean values. There are 20 values here, each representing the *deviation* of the *sample mean* of one of 20 groups, each containing between 5 and 100 values, from the expected *population mean* (50). Each individual value was randomly generated between 1 and 100.

In other words, the graph does not show the *sample means*, it shows the extent to which they deviate from the expected *population mean.*
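A rough Python equivalent of that experiment looks like this; the group sizes and seed are arbitrary choices, and strictly speaking the expected mean of uniform values between 1 and 100 is 50.5 rather than 50:

```python
import random
import statistics

random.seed(1)  # arbitrary seed for repeatability

# Groups of increasing size, filled with uniform random values between
# 1 and 100 (expected mean 50.5; the article rounds this to 50).
sizes = [5, 10, 20, 50, 100, 500]
deviations = []
for n in sizes:
    sample = [random.uniform(1, 100) for _ in range(n)]
    deviations.append(statistics.mean(sample) - 50.5)
    print(f"n={n:4d}  deviation from population mean: {deviations[-1]:+6.2f}")
```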

There are a couple of important things to take on board here.

- A small group is not *always* associated with a high deviation.
- The larger the group is, the less *likely* it is that the results will be skewed significantly by chance.

It is important to realise that it is still possible for a large group of samples to return a mean that deviates substantially from the true population mean; it just becomes *less likely* as the sample size is increased. The issue is one of *variance*, which is one of the next concepts that I’m going to discuss. But before I do that we need to discuss something else.

### Striving to be normal

You may have noticed that the random data in the previous examples produced more variability in the sample means than seemed intuitive. This is because of how it was *distributed*.

It’s probably worth taking the time here to be clear on the difference between random distribution and random generation. Randomly *generated* numbers are those produced as a result of an ostensibly random process, in this case Excel’s number generation function. The randomly generated values that I showed you before are of *uniform* distribution, that is to say, the numbers have an equal chance of falling anywhere in the specified range.

One of the reasons that people are so bad at dealing with randomness is that it is not a common occurrence in real life. In reality, data tends to cluster around the sample mean. In fact, a lot of statistical procedures *assume* that data will follow what is called a normal distribution.

In the above chart each line of points represents a separate set of randomly generated data. The first line consists of 50 values that are *uniformly* distributed, the second, 50 *normally* distributed values. You can see that the first set spreads out fairly evenly between 1 and 100, whereas the second set tends to cluster nearer the expected mean of 50, although it still represents a considerable *range* of values.

The pattern is not as neat as you might expect, because the values are still *generated* randomly, which means that any pattern of data could have occurred. Excel could have delivered me 50 identical values of 100 in *both* cases, for example; it’s just (very) unlikely.
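Here is a sketch of how those two sets could be generated in Python rather than Excel; the normal set’s standard deviation of 15 is my arbitrary choice, since the chart doesn’t specify one:

```python
import random
import statistics

random.seed(7)  # arbitrary seed for repeatability

# 50 uniformly distributed values between 1 and 100...
uniform_set = [random.uniform(1, 100) for _ in range(50)]
# ...and 50 normally distributed values centred on 50. The standard
# deviation of 15 is an arbitrary choice for illustration.
normal_set = [random.gauss(50, 15) for _ in range(50)]

for name, data in (("uniform", uniform_set), ("normal", normal_set)):
    print(f"{name:8s} min={min(data):6.1f}  max={max(data):6.1f}  "
          f"mean={statistics.mean(data):6.1f}")
```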

Most statistics programs will inform you if your data is not normally distributed, although they will often allow you to proceed anyway.

You need to be aware if your data is not normal. It may prevent the test from working properly, but it may also reveal problems with your data that you are not aware of.
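Formal normality tests (such as Shapiro-Wilk) are what a statistics program runs for you behind the scenes. As a crude standard-library illustration of the underlying idea, you can check how much of your data sits within one standard deviation of the mean; the seed and distribution parameters here are arbitrary:

```python
import random
import statistics

random.seed(3)  # arbitrary seed for repeatability

def share_within_one_sd(data):
    """Fraction of values within one standard deviation of the mean.
    A normal distribution puts roughly 68% of its values there; a
    uniform one only about 58%. A crude indicator, not a real test."""
    m, s = statistics.mean(data), statistics.stdev(data)
    return sum(abs(x - m) <= s for x in data) / len(data)

normal_data = [random.gauss(50, 15) for _ in range(1000)]
uniform_data = [random.uniform(1, 100) for _ in range(1000)]

print(f"normal:  {share_within_one_sd(normal_data):.2f}")
print(f"uniform: {share_within_one_sd(uniform_data):.2f}")
```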

Possible reasons why your data does *not* follow a normal distribution –

- There is a genuine and reasonable expectation that data would be randomly distributed – For example, you are looking for evidence of bias in what *should* be a random system, such as a roulette wheel.

- Subgroups within your data – If your data shows distinct clusters of normality it suggests that the factor that you are investigating is not consistent across the sample group or target population. It might also imply that the data was collected or processed inconsistently or conglomerated from separate sets.

- Exponential growth – A good example of this would be the way a story or joke is shared by Twitter users. Because each retweet increases the pool of people who can make further re-tweets, the most successful Twitter posts will tend to be dramatically more shared than the least. You can expect to see this pattern a lot with social media data.

- Fraud – You need to be careful here, but it is not uncommon for people who are falsifying data, but who do not have an in-depth knowledge of statistics, to do this by generating a *uniformly* random number series within a specified range, rather than a normally distributed one. You should be suspicious of *truly* uniform distributions within a large set or subset of numbers unless there is a very good explanation. The scarcity of uniform distributions within nature means that, with large groups of data, it is difficult for them to happen by accident or as a result of innocent mistakes that don’t happen to involve random number generators. But…

- It *can* still happen by chance – As previously stated, you will never be able to be absolutely certain about data unless you have all of it, and if you have all of it, you don’t need to make predictions about it.

#### Statistics = Confidence not certainty

You can never get away from chance when you are working with samples. Statistics does not allow you to make definitive statements about what is happening, but it allows you to determine how *confident* you can be in your predictions.

*“My data is not normally distributed, what do I do?”*

Start by figuring out why your data isn’t normally distributed.

If your data is showing clusters, the best approach is going to involve trying to untangle the subgroups from each other. Obviously this may reduce the sample size below the level required for good quality data.

In some cases, especially with data associated with exponential growth, data that isn’t normally distributed can be “transformed” to meet a normal distribution; however, this is very context specific.
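A minimal sketch of such a transform, assuming the exponential-growth case: the simulated “share counts” below are log-normally distributed (heavily right-skewed, the shape exponential sharing tends to produce), and taking logs pulls the long tail in. The parameters are arbitrary choices for illustration:

```python
import math
import random
import statistics

random.seed(5)  # arbitrary seed for repeatability

# Simulated "share counts": log-normally distributed, i.e. heavily
# right-skewed. The parameters (3, 1) are arbitrary.
counts = [random.lognormvariate(3, 1) for _ in range(1000)]

# Taking logs pulls the long right tail in. For log-normal data the
# result is, by construction, normally distributed.
logged = [math.log(c) for c in counts]

print(f"raw:    mean={statistics.mean(counts):6.1f}  median={statistics.median(counts):6.1f}")
print(f"logged: mean={statistics.mean(logged):6.2f}  median={statistics.median(logged):6.2f}")
```

Note how the raw mean sits well above the raw median (a classic sign of right skew), while the logged mean and median agree closely.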

### Variance

As you should now realise, the problems associated with small data-sets are a function of their increased exposure to chance.

You should also understand that representing data from a group of samples, just by indicating the mean values, provides no information about the level of variability *within* that group.

*Variance* describes the extent to which individual values fall close to the sample mean.
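Python’s standard library makes the distinction easy to see; the numbers here are made up for illustration:

```python
import statistics

# Two groups with the same mean (50) but very different spread.
tight = [49, 50, 50, 51, 50]
loose = [10, 30, 50, 70, 90]

print("means:    ", statistics.mean(tight), statistics.mean(loose))
# statistics.variance is the *sample* variance (divides by n - 1);
# statistics.pvariance is the population version (divides by n).
print("variances:", statistics.variance(tight), statistics.variance(loose))
```

Report only the means and the two groups look identical; the variances tell the rest of the story.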

#### How random happens

With the examples of random numbers given above, the numbers come from a machine, and because machines suck at random, these will result from a natural process of some type. That process, while not random either, will in turn be influenced by other processes, and so on, until we have split enough hairs that quantum processes, which may actually be random, are involved; at any rate, it has long since become impossible to keep track.

Remember that variance doesn’t spring into existence just because the universe hates you (although that’s my go-to explanation for a lot of other stuff). It comes from those aforementioned external influences, which we call *factors*.

Generally, if you are doing statistics, it’s because you wish to investigate one or more of these *factors*, but trying to do this doesn’t make all the other factors go away and stop influencing your samples, because even if the universe doesn’t hate you, it’s not about to start doing you any favours.

#### Fitting it together

So…

*Normality* describes the pattern that data most often falls into, and *variance* describes how precisely this actually occurred. The variance is determined by all of the *factors* acting on your sample values other than the one that you are using to define your target population.

Considering the *assumptions* that you have made will often allow you to identify some of the most important *factors* ahead of time, understand their interaction with the one you wish to investigate and help you to assemble sample data that is more representative of the target population.

Variance reduces your ability to make accurate predictions about the target population from your sample values, and this in turn decreases your *confidence* in those predictions.

If you can’t identify or correct for the influence of additional factors on your data, you can oppose the resulting variance by increasing the number of samples that you take.
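That last point can be demonstrated directly: repeat the sampling experiment many times at several sample sizes and watch the spread of the sample means shrink as the groups get bigger. The sizes, trial count and seed are arbitrary choices:

```python
import random
import statistics

random.seed(9)  # arbitrary seed for repeatability

def spread_of_sample_means(sample_size, trials=500):
    """Standard deviation of the sample mean across repeated samples
    of uniform values between 1 and 100."""
    means = [statistics.mean(random.uniform(1, 100) for _ in range(sample_size))
             for _ in range(trials)]
    return statistics.stdev(means)

for n in (5, 20, 80):
    print(f"n={n:3d}  spread of sample means: {spread_of_sample_means(n):5.2f}")
```

Bigger groups don’t change the individual values at all; they just make the *mean* of those values a steadier estimate.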

### Next time on Stats – Practical stuff will happen

So how do we indicate variance and quantify confidence?

Come back for the next article in this series, in which I will finally start to talk about the actual process of data manipulation, starting with calculation and *appropriate use* of standard deviation and standard error.