So, we’ve had our little refresher on means, medians, and modes. Now we are going to look at the data: the starting point for your analysis, and very often the starting point for your screw-ups.
Categories of data
Data can be broken down into a broad range of different categories, all of which have distinct names, and the category has a big impact on what you can do with the data.
In deference to my lack of faith in your ability to remember names and my continued scepticism as to the actual real-world value of remembering what things are called, I’m not going to approach this in the traditional manner and list them all.
What’s important is that you think about your data. If you think about your data, what it represents, and what you are intending to do with it, there are actually only a few situations in which you are likely to run into trouble.
This is mostly intuitive stuff. For example, a lot of the potential problems involve non-numerical data. Consider the previously used example of data on the mode of transport used to get to a hospital. It is not appropriate to perform most statistical operations on this data, but it is also conveniently impossible to perform most mathematical operations on it, which makes this very easy to remember.
You can calculate a mode for this data, but we’ve already discussed why you probably won’t need reminding to do this.
There are some data types that you need to watch out for, however.
1) Numbers that aren’t really numbers – for example, numbered groups. Consider a school that has arbitrarily divided its students into ten separate reading groups.
| Class 1 | Class 2 | Class 3 |
As there is no discernible relationship between the groups, it is obviously not possible to manipulate this data mathematically. You can’t add the numbers up for each category and you can’t take a mean. Just remember to treat the numbers as if they aren’t numbers.
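To make that concrete, here is a minimal Python sketch. The group numbers are made up for illustration (they are not the data from the table), and the variable names are my own. It shows that counting and the mode work on labels, that the mean happily produces nonsense, and that storing the labels as text stops you doing the nonsense by accident.

```python
from collections import Counter

# Hypothetical data: the arbitrary reading-group number given to each student.
groups = [3, 7, 7, 1, 10, 7, 3]

# Counting is fine, and so is the mode: both only ask "which label is this?"
most_common_group, size = Counter(groups).most_common(1)[0]
print(most_common_group, size)  # 7 3

# Arithmetic runs without complaint but describes nothing real:
# there is no sense in which group 10 is ten times group 1.
meaningless_mean = sum(groups) / len(groups)
print(meaningless_mean)

# Storing the labels as text means Python refuses the arithmetic for you.
labels = [f"group {g}" for g in groups]
try:
    sum(labels) / len(labels)
    arithmetic_refused = False
except TypeError:
    arithmetic_refused = True
print(arithmetic_refused)  # True
```

If your labels arrive as integers, converting them to text as early as possible is a cheap way of making sure nobody averages them later.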
2) Groups that are related in an irregular manner – Consider the following, completely different, example.
| Class 1 | Class 2 | Class 3 |
This represents a similar situation to before but, in this case, the students have been placed in groups according to relative ability, still with no consistent mathematical relationship between groups. For example, the students in group 2 should be better readers than those in group 1, but it is impossible to suggest that they are twice as good. You can start to examine this data and make judgements using it, but you need to be careful. It’s not appropriate to indicate a mean. You couldn’t assume that group 5 represents average ability and you can’t make reliable judgements by adding numbers.
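One way to convince yourself that the mean is unsafe here is to relabel the groups. If only the order matters, any numbering that keeps the order is equally valid. The sketch below uses made-up group numbers for two hypothetical classes and a hypothetical alternative labelling. I have also brought in the median as a comparison, since it depends only on order; that is my addition, not something covered above.

```python
from statistics import mean, median

# Hypothetical ability groups (1 = weakest readers) for two small classes.
class_a = [1, 1, 5]
class_b = [2, 2, 2]

# A second labelling that keeps the groups in the same order but changes the
# spacing. Because only the order is meaningful, it is just as valid.
relabel = {1: 1, 2: 2, 5: 3}
class_a_alt = [relabel[g] for g in class_a]
class_b_alt = [relabel[g] for g in class_b]

# The comparison of means flips with the labelling, so it was never meaningful.
print(mean(class_a) > mean(class_b))          # True
print(mean(class_a_alt) > mean(class_b_alt))  # False

# The median depends only on order, so the comparison survives relabelling.
print(median(class_a) < median(class_b))          # True
print(median(class_a_alt) < median(class_b_alt))  # True
```

Any conclusion that changes when you renumber the groups without reordering them is a conclusion about your numbering scheme, not about the students.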
3) Groups that are related in a regular manner, within a separately defined range –
In the real world, this kind of situation is more likely to occur with things like percentages of defined values, such as maximum possible height. Because these numbers are quantitatively related to each other you can take means and you can perform statistics.
You need to be aware, however, that when a value is expressed in relation to something else, it will change whenever that reference changes. It is impossible for a person to be at 110% of the current maximum human height, but they can be at 110% of what was the maximum human height in the year 1800 or 110% of the average human height. A bear can always be at 110% of the maximum human height because he isn’t a human.
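A small Python sketch of the same point, with the caveat that every number in it is invented purely for illustration (these are not real height statistics) and the function names are my own. The same measurement gives different percentages depending on the reference, so a percentage is only usable if you keep the reference with it.

```python
def percent_of(value, reference):
    """Express value as a percentage of reference."""
    return 100.0 * value / reference


def from_percent(percent, reference):
    """Recover the absolute value from a percentage and its reference."""
    return percent * reference / 100.0


# Hypothetical figures in centimetres, made up for illustration only.
height_cm = 180.0
references_cm = {"reference A": 200.0, "reference B": 160.0}

percentages = {name: percent_of(height_cm, ref)
               for name, ref in references_cm.items()}
print(percentages)  # {'reference A': 90.0, 'reference B': 112.5}

# The percentage alone is ambiguous; paired with its reference it is not.
print(from_percent(percentages["reference B"], references_cm["reference B"]))  # 180.0
```

In practice this usually means storing the reference value (and the date it applied) in the same record as the percentage, not in somebody's memory.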
4) Groups that are related in a regular manner, but with an arbitrary zero –
The best example of this is temperature. Both the Centigrade and Fahrenheit scales have values that relate to each other in a regular way but have arbitrary zero values. It is therefore impossible to use such a scale to make assumptions about the relative absolute magnitude of values: 2°C is not twice as hot as 1°C. In these cases you cannot express values as ratios of each other, and you cannot multiply and divide values (although you could convert the values to a scale on which this is permissible, such as Kelvin for temperature). You can still calculate values for standard deviation and standard error, and you can still perform a t-test.
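The arithmetic is easy to check for yourself. The following Python sketch (function names are my own) takes 1°C and 2°C, computes the "ratio" on three scales, and then compares the differences. The ratio changes with the scale, which is the giveaway that it means nothing on Centigrade or Fahrenheit; the difference does not.

```python
import math


def celsius_to_kelvin(c):
    return c + 273.15


def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


cold_c, warm_c = 1.0, 2.0

# The "ratio" depends entirely on which arbitrary zero you happen to use.
ratio_c = warm_c / cold_c
ratio_f = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)
print(ratio_c)            # 2.0
print(round(ratio_f, 3))  # 1.053
print(round(ratio_k, 4))  # 1.0036

# Differences are safe: one degree Centigrade is one kelvin.
diff_c = warm_c - cold_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)
print(math.isclose(diff_c, diff_k))  # True
```

Only the Kelvin ratio is physically meaningful, because Kelvin is the only one of the three with a zero that actually means "none".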
5) Values that are expressed in multiple units or irregular units –
Once again, this is common sense. The usual examples are the typical expressions of times and dates. 8:05 is not the same as 8.05 hours, and 0.30 of an hour is not equal to 30 minutes. You just need to make sure that you have converted data to a single unit for calculations, that you remember to convert it back again if necessary, and that you don’t convert units inappropriately. In particular, you want to avoid using units that have a variable relationship with others, such as months, and to avoid converting such values to other units without stating any assumption that you are making in doing so.
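Here is a short Python sketch of those conversions. The helper names are hypothetical and the input format ("H:MM") is an assumption on my part. It also shows why "a month" is a dangerous unit: the number of days in it depends on which month you mean.

```python
from datetime import date


def clock_to_hours(clock):
    """Convert an 'H:MM' clock reading to decimal hours."""
    hours, minutes = clock.split(":")
    return int(hours) + int(minutes) / 60


def hours_to_minutes(hours):
    """Convert decimal hours to minutes."""
    return hours * 60


# 8:05 is 8 hours plus 5/60 of an hour, which is a little over 8.08, not 8.05.
print(round(clock_to_hours("8:05"), 4))  # 8.0833

# 0.30 of an hour is 18 minutes, not 30.
print(round(hours_to_minutes(0.30), 6))  # 18.0

# "One month" is not a fixed number of days.
january_days = (date(2024, 2, 1) - date(2024, 1, 1)).days
february_days = (date(2024, 3, 1) - date(2024, 2, 1)).days
print(january_days, february_days)  # 31 29
```

Working in a single fixed unit (minutes, or days between two actual dates) and only formatting back to clock or calendar notation at the very end avoids most of these problems.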
Speaking of assumptions…
Assumptions are the root of some evil, and a lot of fuckups
You may have noticed that in the previous section I used the same set of data for two separate examples. This was not entirely a result of laziness. What I wanted to illustrate is that in all three examples, the information that you have available to you is completely eclipsed by the information that you don’t. When you don’t have information available to you, you are forced to replace it with assumptions. Assumptions are bad, and it’s not always obvious when you are making them.
| Class 1 | Class 2 | Class 3 |
We’ve already discussed why it’s critical that you understand what each value represents and how they relate to the other values.
Some other important things that we don’t know about this data yet –
- Whether it encompasses all the classes in the school and all the students in each class.
- If it doesn’t, what criteria were used to select the data we’ve been given.
- It doesn’t tell us whether the students were assigned to their groups by the same person and whether the criteria that should have been used to make the assignments were properly enacted.
- It doesn’t tell us when the allocation was made, if all the allocations were made at the same time, and whether they are still valid.
- It doesn’t tell us how the separate classes relate to each other, for example, it doesn’t tell us whether all the children are in the same year group.
- If classes in the school are usually distinguished by number, it doesn’t confirm for us that the classes mentioned in the data are numbered using the same scheme.
- It doesn’t identify any unusual circumstances that might apply to individual groups.
These are all important factors in evaluating this data, and if we don’t have this information available to us then we are forced to make assumptions.
In terms of good practice we should always make the most conservative assumption about the quality of data available, but in reality we aren’t always going to be able to do that and will have to try and weigh the likelihood that an assumption is false.
You will almost never be given all of the information that you need, but that doesn’t excuse you from thinking about the information that you have, considering the assumptions that you are making, and obtaining further information if possible (and if you can’t get it from the client, that doesn’t necessarily mean you can’t get it from somewhere else).
If nothing else, once you have identified your assumptions you will be able to state them clearly in your work.
This right here is most of data analysis: think about your data, make sure it means what you think it means, and identify the assumptions that you are making. Almost all of the bad data analysis that I’ve seen was doomed from the very beginning because of failures to do this.
Next Friday – Size Matters, talking about sample size and rambling about randomness