Posts tagged: stats

Useful resources for science fiction writers

The following links represent some of the best articles and resources I’ve come across while putting this blog together.

 

Articles about writing, by writers

Holly Lisle – writing blog with an extensive series of how to articles.

Charles Stross – Lots of useful material on his blog. Of particular note, “Common misconceptions about publishing”.

Catherynne Valente – Recent set of articles, also hosted at Charlie’s blog: “A Numerical List of Loosely-Connected Thoughts on Writing” (2, 3, 4), and an additional article about digital publishing. Her own website is here.

Neil Gaiman – Lots of advice on his site, including “Advice to authors”, and an article here about literary agents, which also has a lot of additional links.

J. Steven York – Excellent blog, which includes the article “Writers and other delusional people”.

 

Screenwriting

JohnAugust.com – Probably the best screenwriting blog. Has a great directory of its own resource articles.

Josh Friedman – Another excellent screenwriting blog.

GoIntoTheStory.com – Again, lots of very useful resources here, although not the most accessibly structured site.

Screenwritinginfo.org – A very nice overview of basic screenwriting concepts.

 

Technology

A trade secret, for those of you who aren’t irredeemably geeky. If you need solid, well-researched reference texts about future technology or sci-fi stuff, you can do a lot worse than check out Role Playing Game (RPG) source books. The best of these, from a reference viewpoint, are probably those written for the Eclipse Phase and Transhuman Space settings, Star Hero, and the other GURPS source books (especially Space, Ultra-Tech, and Bio-Tech). These aren’t free resources, but can be very worthwhile purchases for any science fiction author. A lot of them are available inexpensively in PDF format, or can be bought cheaply second-hand.

How Stuff Works – By no means exhaustive, but the articles that are present tend to be well researched.

DARPA – Some interesting insight here into what the US government is throwing money at.

SETI projects – Again, great info on their funded programs.

Wired technology blogs – An asset to any writer’s feed reader, especially Danger Room, Gadget Lab, and Wired Science.

Slashdot – Turns out 80’s cyberpunk novels were pretty bang on the money about a lot of stuff.

Technology Review – The MIT technology blog, yet another daily source of plot hooks.

TED – Lots of great stuff here, if you’ve got the bandwidth.

 

 

English and Grammar

Become a better writer – At Writing World, linked separately below.

The Elements of Style – The classic reference work by William Strunk Jr., available online at Bartleby.com

Grammar Handbook – From the University of Illinois at Urbana-Champaign

Guide to Grammar and Style – Jack Lynch

 

Other useful stuff

Overused story ideas – From Strange Horizons. Check here, ideally before you write the nine-novel series.

National statistics – Official data here for the UK, US, Canada, and the EU. Also the CDC. Nielsen also publishes some very useful data publicly.

UNTERM – The United Nations terminology database, which attempts to keep track of context-specific word usage in eight different languages. This is where they send the linguists and database administrators who have been very bad.

The Urban Dictionary – A lot of content is NSFW, so be warned!

WolframAlpha – A computational knowledge engine. If you need to put a value to just about anything, this is the place to start.

Kate’s Onomastikon – A listing of common names for different cultures.

Writer Beware – Helping writers avoid scams.

Writing World – Lots more useful resources.

TVTropes – Probably becoming less useful over time, as fans continue to figure out new ways to shoehorn Naruto into every single category. Still a very handy listing of common tropes, and useful for finding fiction that has used similar ideas to your own.

 

This document is a work in progress, so feel free to contact me or comment if you have links that you think I should include. I’m trying to be selective though, so please understand that I may not use your suggestion.

Stats 5 – Power to the people: why large groups are just better

Welcome back to An astonishingly useful guide to data analysis for people that don’t like maths. In last week’s exciting episode, you learned how to evaluate your data before you start doing maths with it. This week, I’m sure you’ll be thrilled to discover that I’m going to be introducing you to some more important concepts.

This is because they are important.

 

Random facts –

Random data is random. People forget this, a lot.

This is especially important when you are using mean values, simply because nothing about a sample mean offers any useful information about its reliability as an estimate of the population mean. It’s very easy for them to mislead.

 

The above chart illustrates the mean values of 5 separate groups of data. Each set consisted of ten entirely random numbers between 1 and 10. The final bar indicates the mean of all 50 values.

In this case, you can treat the population mean as 5, and each set as a separate sample group, the mean of which is the sample mean. The final bar treats all 5 sets as one single group of 50 values, and indicates the sample mean for that case.

As you can see, even with ten data-points in a group, some of the sample mean values differ substantially from the expected population mean of 5.

Conventional wisdom suggests a minimum of three samples in each group before you start to work with data, but even with 10 samples, a deviation of up to 20% from the expected mean is apparent. If we treat all 50 values as a single sample, the sample mean almost exactly matches the population mean.
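The charts in this post were produced in Excel, but the same experiment can be sketched in a few lines of Python. This is a rough equivalent, not the original workbook: I’ve used uniform values between 0 and 10 (so the population mean is exactly 5), and the group sizes and seed are my own choices.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def sample_mean(n):
    """Mean of n values drawn uniformly between 0 and 10 (population mean 5)."""
    values = [random.uniform(0, 10) for _ in range(n)]
    return sum(values) / n

# Five small groups of 10 scatter noticeably around 5;
# a single large group of 1000 lands much closer.
small_means = [sample_mean(10) for _ in range(5)]
large_mean = sample_mean(1000)
```

Run it a few times with different seeds and you will see the small-group means wander while the large-group mean stays put.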

 

This is another set of mean values. There are 20 values here, representing the deviation from the expected population mean (50) of the sample mean of each of 20 groups, each containing between 5 and 100 values. Each individual value was randomly generated between 1 and 100.

In other words, the graph does not show the sample means, it shows the extent to which they deviate from the expected population mean.

There are a couple of important things to take on board here.

  • A small group is not always associated with a high deviation.
  • The larger the group is, the less likely it is that the results will be skewed significantly by chance.

It is important to realise that it is still possible for a large group of samples to return a mean that deviates substantially from the true population mean; it just becomes less likely as the sample size increases. The issue is one of variance, which is one of the next concepts that I’m going to discuss. But before I do that, we need to discuss something else.

 

Striving to be normal

You may have noticed that the examples of random data in the previous section seemed to produce more variability in the sample means than intuition would suggest. This is because of how the data was distributed.

It’s probably worth taking the time here to be clear on the difference between random distribution and random generation. Randomly generated numbers are those produced as a result of an ostensibly random process, in this case Excel’s number generation function. The randomly generated values that I showed you before are uniformly distributed; that is to say, the numbers have an equal chance of falling anywhere in the specified range.

One of the reasons that people are so bad at dealing with randomness is that it is not a common occurrence in real life. In reality, data tends to cluster around the sample mean. In fact, a lot of statistical procedures assume that data will follow what is called a normal distribution.

In the above chart, each line of points represents a separate set of randomly generated data. The first line consists of 50 values that are uniformly distributed; the second, 50 normally distributed values. You can see that the first set spreads out fairly evenly between 1 and 100, whereas the second set tends to cluster nearer the expected mean of 50, although it still covers a considerable range of values.

The pattern is not as neat as you might expect, because the values are still generated randomly, which means that any pattern of data could have occurred. Excel could have delivered me 50 identical values of 100 in both cases; it’s just (very) unlikely.
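The uniform-versus-normal comparison above is easy to reproduce in Python rather than Excel. This is only an illustrative sketch; the standard deviation of 15 for the normal set is my assumption, since the chart doesn’t state one.

```python
import random
import statistics

random.seed(2)  # reproducible

# 50 uniformly distributed values between 1 and 100, and 50 normally
# distributed values centred on 50 (standard deviation of 15 is my choice).
uniform_vals = [random.uniform(1, 100) for _ in range(50)]
normal_vals = [random.gauss(50, 15) for _ in range(50)]

# The normal set clusters much more tightly around 50, which shows up
# as a smaller standard deviation.
uniform_spread = statistics.stdev(uniform_vals)
normal_spread = statistics.stdev(normal_vals)
```

For a 1–100 uniform spread the standard deviation works out at roughly 29, against roughly 15 for the normal set, which is the clustering visible in the chart.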

Most statistics programs will inform you if your data is not normally distributed, although they will often allow you to proceed anyway.

You need to be aware if your data is not normal. Not only may it prevent the test from working properly, it may also reveal problems with your data that you were not aware of.

 

Possible reasons why your data does not follow a normal distribution –

  • There is a genuine and reasonable expectation that data would be randomly distributed – For example, you are looking for evidence of bias in what should be a random system such as a roulette wheel.
  • Subgroups within your data – If your data shows distinct clusters, it suggests that the factor you are investigating is not consistent across the sample group or target population. It might also imply that the data was collected or processed inconsistently, or conglomerated from separate sets.
  • Exponential growth – A good example of this would be the way a story or joke is shared by Twitter users. Because each retweet increases the pool of people who can make further re-tweets, the most successful Twitter posts will tend to be dramatically more shared than the least. You can expect to see this pattern a lot with social media data.
  • Fraud – You need to be careful here, but it is not uncommon for people who are falsifying data, and who do not have an in-depth knowledge of statistics, to do so by generating a uniformly distributed random number series within a specified range, rather than a normally distributed one. You should be suspicious of truly uniform distributions within a large set or subset of numbers unless there is a very good explanation. The scarcity of uniform distributions in nature means that, with large groups of data, it is difficult for them to happen by accident, or as a result of innocent mistakes that don’t happen to involve random number generators. But…
  • It can still happen by chance – As previously stated, you will never be able to be absolutely certain about data unless you have all of it, and if you have all of it, you don’t need to make predictions about it.

Statistics = confidence, not certainty

You can never get away from chance when you are working with samples. Statistics does not allow you to make definitive statements about what is happening, but it allows you to determine how confident you can be in your predictions.

“My data is not normally distributed. What do I do?”

Start by figuring out why your data isn’t normally distributed.

If your data shows distinct clusters, the best approach is going to involve trying to untangle the subgroups from each other. Obviously, this may reduce the sample size below the level required for good-quality data.

In some cases, especially with data associated with exponential growth, data that isn’t normally distributed can be “transformed” to meet a normal distribution; however, this is very context-specific.
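A common transformation for exponential-growth data is to take the logarithm of each value. As a sketch under invented assumptions (the “share counts” below are synthetic, built so that their logs are normal by construction):

```python
import math
import random
import statistics

random.seed(3)

# Hypothetical share counts: each value is the exponential of a normally
# distributed number, so the raw data is heavily right-skewed.
shares = [math.exp(random.gauss(3, 1)) for _ in range(200)]

# Taking the log of each value recovers a normal-shaped distribution.
logged = [math.log(s) for s in shares]

# Skew symptom: the raw mean sits well above the raw median;
# after the transform, mean and median nearly coincide.
```

Whether a log (or square-root, or reciprocal) transform is appropriate depends entirely on why the data is skewed, which is the context-specific part.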

 

Variance

As you should now realise, the problems associated with small data-sets are a function of their increased exposure to chance.

You should also understand that representing data from a group of samples just by indicating the mean values provides no information about the level of variability within that group.

Variance describes the extent to which individual values fall close to the sample mean.
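As a minimal sketch of that definition (the numbers are arbitrary illustrations, not taken from the charts above):

```python
import statistics

data = [20, 30, 90, 40]
mean = sum(data) / len(data)  # 45.0

# Sample variance: the average squared distance of each value from the
# sample mean, divided by n - 1 rather than n (Bessel's correction),
# which compensates for estimating the mean from the same sample.
variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)

# statistics.variance applies the same formula.
```

The squaring means one far-flung value (the 90 here) dominates the result, which is exactly why a bare mean hides so much.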

How random happens

The examples of random numbers given above came from a machine, and machines suck at random. In the real world, your values will result from a natural process of some type, which, while not random either, will in turn be influenced by other processes, and so on, until we have split enough hairs that quantum processes (which may actually be random) are involved; at any rate, it has long since become impossible to keep track.

Remember that variance doesn’t spring into existence just because the universe hates you (although that’s my go-to explanation for a lot of other stuff). It comes from those aforementioned external influences, which we call factors.

Generally, if you are doing statistics, it’s because you wish to investigate one or more of these factors, but trying to do this doesn’t make all the other factors go away and stop influencing your samples. Even if the universe doesn’t hate you, it’s not about to start doing you any favours.

 

Fitting it together

So…

Normality describes the pattern that data most often falls in, and variance describes how precisely this actually occurred. The variance is determined by all of the factors acting on your sample values other than the one that you are using to define your target population.

Considering the assumptions that you have made will often allow you to identify some of the most important factors ahead of time, understand their interaction with the one you wish to investigate and help you to assemble sample data that is more representative of the target population.

Variance reduces your ability to make accurate predictions about the target population from your sample values, and this in turn decreases your confidence in those predictions.

If you can’t identify or correct for the influence of additional factors on your data, you can offset the resulting variance by increasing the number of samples that you take.

 

Next time on Stats – Practical stuff will happen

So how do we indicate variance and quantify confidence?

Come back for the next article in this series, in which I will finally start to talk about the actual process of data manipulation, starting with calculation and appropriate use of standard deviation and standard error.

Stats 3 – That stuff you hated in school

Part 3 of An astonishingly useful guide to data analysis for people that don’t like maths

 

Hopefully, after the last article, you know why you are here.

Now we are going to start learning stuff for real.

Now we embark on our magical mystery tour through stuff that is so awesomely fun that you’ve probably relearned and forgotten it several times already.

Obviously this time will be different.

 

Mean, Median, and Mode

You just groaned in despair, didn’t you?

Mean, median, and mode are the three concepts which we get introduced to at school and then promptly mix up for the rest of our lives. They are also often associated with “range”, which we forget about completely because it doesn’t begin with M.

Median and mode represent concepts that most of us fail to absorb in school, and in most cases our lives have notably failed to be adversely affected. At this point you may even have come to the conclusion that you are better off leaving them well alone. If this is a prejudice that you have somehow acquired, then I’m right here to tell you that, well, you are pretty much right… mostly…

Let’s start things off with the mean. Means are unambiguously important, and if you somehow failed to get this one straight in school, we need to rectify that.

 

A memorable memorandum, regarding the multifarious means by which a mean or malicious man could confuse your mind, mayhap mentioning the many meanings of the multiple, most meaningful, manifestations of the mathematical mean

The mean is what most people think of as an average. It’s the sum of all the values in a dataset divided by the number of separate values.

If you are having trouble visualising this, imagine a set of containers with different quantities of liquid inside. When you take the mean, you are mixing all the liquid from all the containers together and then splitting it back up equally. Means are intrinsically unhygienic.

Because statistics is nothing if not needlessly confusing, whilst the mean is always the average, the average is not always a mean.

Strictly defined “average” refers to the most appropriate measurement of central tendency; it therefore can also refer to the median or mode.

You don’t need to worry about this: if you can legitimately calculate the mean, then the average is the mean anyway, and the chances that you will come across the term “average” used for something else, without this being explicitly indicated, are small. Just try to make sure that you personally use “mean”, unless you feel that your audience will be more comfortable with “average”, or you happen to be playing Scrabble.

 

More confusion can arise from the way that, usually, when we are talking about means, we are referring to one of two different numbers. They are both means, but they are means of different things.

The sample mean is the mean of the subset of data you are actually working on. The population mean is the true mean of the entire population that you drew that sample from. It’s possible for these to be the same if you have sampled the entire population, but most often you will find yourself calculating a sample mean in order to try and predict the value of the population mean.

Like a lot of stuff with stats, the key to remembering this is to understand what the things are, rather than try to memorize the terms.

  • Population mean = mean of the whole population
  • Sample mean = mean of your sample
  • Often you will already have the sample mean, and be using it to predict the value of the population mean

When other people use these terms, it should be clear from context what they are talking about. Just try to ensure that your own work clearly indicates which data the mean is associated with, and be ready to request clarification if other people haven’t done the same.

 

Range

The range, as defined in maths, is chiefly troublesome on account of being difficult for blog authors to clearly describe without using the word range in its everyday sense, thus risking confusion.

This is because they are the same thing.

You can work out the range by subtracting the smallest value in a dataset from the largest.

20, 30, 90, 40

MEAN = 45

RANGE = 70
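The same worked example in Python, for the spreadsheet-averse:

```python
import statistics

data = [20, 30, 90, 40]

# Mean: sum of the values divided by how many there are.
mean = statistics.mean(data)   # (20 + 30 + 90 + 40) / 4 = 45

# Range: largest value minus smallest value.
rng = max(data) - min(data)    # 90 - 20 = 70
```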

If you regularly use the term range correctly in other contexts, you are almost certainly going to use it correctly in this context too.

 

The Mode, a concept too simple to teach?

This is the point where many people start to struggle.

The problem with the mode is that it is an intensely intuitive concept, which is often taught in an extremely unintuitive way. People will tend to use the mode automatically when appropriate; they just won’t realise that’s what they are doing.

A lot of the problem stems from the way that the mode is taught, as it is usually bundled together with the median and mean in order to use the same examples.

The mode is the value that occurs most often in a dataset. For example, in the number sequence below, the mode is 2.

1,1,3,4,2,5,2,4,3,2

This is easy to understand, but doesn’t do a particularly good job of telling us why we need to use the mode.

That is because we wouldn’t.

It’s important to realise that while mean, median, and mode are taught together, they aren’t usually used together. For the sequence above, the mean is far more useful than the mode.

Consider the following example, however.

taxi, taxi, bus, motorbike, car, helicopter, car, motorbike, bus, car

Which is more likely to be recorded in the form of a table, like so:

 

Taxi   Bus   Motorbike   Car   Helicopter
  2     2        2        3        1

 

Knowing that “car” occurs most often is of obvious value, but most people will automatically use data like this without putting a term to what they are doing.

It’s worthwhile to try and remember what the mode refers to, but it’s unlikely you will come across the term used much in the wild; it’s really just a descriptive term for something that is probably too simple to need one.
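For what it’s worth, building that tally and picking out the mode is a one-liner in Python:

```python
from collections import Counter

journeys = ["taxi", "taxi", "bus", "motorbike", "car",
            "helicopter", "car", "motorbike", "bus", "car"]

# Counter builds exactly the frequency table shown above.
counts = Counter(journeys)

# most_common(1) returns the single most frequent entry with its count.
mode, frequency = counts.most_common(1)[0]  # ("car", 3)
```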

 

The median does not work well alone

The median represents another straightforward concept. It’s the middle value of a sequence of numbers (or a group of numbers that can be placed into sequence). If the sequence has an even number of values, the median is usually taken to be the mean of the two middle numbers, although it can also be given as a pair of numbers.

e.g. – for the sequence 1, 2, 3, 4, 5, the median value is 3. For 1, 2, 3, 4, 5, 6, the median value is 3.5.

The problem here is that, whilst determining the median is fairly straightforward, it is not a terribly useful value in and of itself, and it can be very easy for people to confuse a median value for a mean.

Some statistical operations will require a median value, so it is useful to know what one represents, but a median value should not be presented by itself.
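Python’s standard library handles both the odd and even cases for you, matching the definitions above:

```python
import statistics

# Odd-length sequence: the single middle value.
odd_median = statistics.median([1, 2, 3, 4, 5])      # 3

# Even-length sequence: the mean of the two middle values.
even_median = statistics.median([1, 2, 3, 4, 5, 6])  # (3 + 4) / 2 = 3.5
```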

 

In summary

Ensure that you are clear on the precise meanings of mean and range. Remember to specify which data a mean value refers to. Try not to use the term average unless you believe that your audience may be unfamiliar with mean, as it is less precise.

Don’t worry about the mode; it’s just not a very helpful term for most audiences, as it represents a concept that is much more straightforwardly represented in a chart or table. In theory, it may be helpful to remember the definition in case you come across someone else referring to it, but I’d tend to suggest that it might be more productive to just pelt them with rotten vegetables instead.

It’s quite possible that you will need to determine a median value at some point, but unless you are confident that you will be able to recall the precise definition, it might be sensible to look it up when you need it. Do not present a median value unless you have a clear reason to do so.

OK?

 

Now that I should be able to talk about mean numbers without people worrying that I am projecting a tad too much, we can proceed to the next article, where we are going to talk about the starting point of analysis.

The data

This is, traditionally, also the point at which analysis goes horribly, horribly wrong.

 

 
