The cans of soup are a discrete variable (also, a nominal variable).
Now suppose you're doing an inventory of the Christmas trees you have for sale:
size of tree    number of trees
0-3 ft.         17
3-6 ft.         123
7-9 ft.         87
You'll notice that because the height of a tree is a continuous variable (also a ratio variable), we can't list every possible tree height, so we create arbitrary ranges.
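Sorting continuous heights into ranges can be sketched in Python like this. The heights are made up for illustration, and the half-open ranges (0 up to but not including 3, and so on) are an assumption chosen so that every height lands in exactly one range.

```python
from collections import Counter

# Hypothetical tree heights in feet (invented for this sketch).
heights_ft = [2.5, 4.0, 5.5, 8.0, 1.0, 6.5, 7.2, 3.3]

# Arbitrary ranges: each one includes its low end and excludes its high end.
bins = [(0, 3), (3, 6), (6, 9)]


def bin_label(height):
    """Return the label of the range that a height falls into."""
    for low, high in bins:
        if low <= height < high:
            return f"{low}-{high} ft."
    raise ValueError(f"height {height} is outside every range")


# f for each range: how many trees landed in it.
frequency = Counter(bin_label(h) for h in heights_ft)

for low, high in bins:
    label = f"{low}-{high} ft."
    print(label, frequency[label])
```

Different range choices give a different-looking table from the same trees, which is what "arbitrary" means here.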
f is a symbol that means frequency of occurrence, which means "how many times have we seen it" or "how many of these do we have". So, if you want to make a smartass distribution, instead of saying "number of cans" say "f". Another thing you can do to fancy up your table is to add a relative distribution. Relative just means compared to the whole. In a relative distribution, instead of just counting how many times each thing occurs, we work out what percentage of the total it is. Like this:
type of soup      f     relative f
Mushroom          15    50%
Chicken Noodle    11    36.67%
Clam Chowder      4     13.33%
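The relative column is just each f divided by the total. A minimal Python sketch using the soup counts from the table (the variable names are made up for illustration):

```python
# f for each type of soup, taken from the table.
soup_counts = {"Mushroom": 15, "Chicken Noodle": 11, "Clam Chowder": 4}

# The "whole" that each count is compared to.
total = sum(soup_counts.values())

# Relative f: each count as a percentage of the total.
relative = {soup: 100 * f / total for soup, f in soup_counts.items()}

for soup, f in soup_counts.items():
    print(f"{soup:<15} {f:>3} {relative[soup]:6.2f}%")
```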
If you want to make it even fancier, you can make it a cumulative frequency distribution. When you do this, you put in a column that says how many things are equal to or less than the current group. You can even add a cumulative percentage frequency, which is just what percentage the cumulative frequency is of the whole. As an example:
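Here is a minimal Python sketch that reuses the Christmas tree counts from the inventory above. Cumulative counts only make sense when the groups have a natural order, which tree sizes do. The variable names are made up for illustration.

```python
from itertools import accumulate

# Tree inventory, listed from smallest size group to largest.
sizes = ["0-3 ft.", "3-6 ft.", "7-9 ft."]
f = [17, 123, 87]

# Cumulative f: how many trees are in this group or any smaller group.
cumulative_f = list(accumulate(f))
total = cumulative_f[-1]

# Cumulative percentage: the cumulative f as a percentage of the whole.
cumulative_pct = [100 * c / total for c in cumulative_f]

for size, count, cum, pct in zip(sizes, f, cumulative_f, cumulative_pct):
    print(f"{size:<8} {count:>4} {cum:>4} {pct:7.2f}%")
```

The last row of a cumulative column always equals the total, and the last cumulative percentage is always 100%.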
The more times you flipped, the less jagged the curve would be. And if you used an infinite number of coins and flipped them an infinite number of times, you would get a perfect normal curve (sometimes called a bell curve).
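You can try the coin experiment without wearing out your thumb. The sketch below is a simulation under assumptions of my own choosing (10 fair coins, 2000 rounds of flipping, a fixed random seed); it counts how many heads come up in each round and prints a rough sideways bar chart.

```python
import random
from collections import Counter


def heads_distribution(num_coins, num_trials, seed=0):
    """Flip num_coins fair coins num_trials times; count heads per trial."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_trials):
        heads = sum(rng.random() < 0.5 for _ in range(num_coins))
        counts[heads] += 1
    return counts


dist = heads_distribution(num_coins=10, num_trials=2000)

# One '#' per 20 trials, so the bars fit on a line.
for heads in range(11):
    print(f"{heads:2d} {'#' * (dist[heads] // 20)}")
```

Raising num_trials smooths out the jagged edges; raising num_coins gives the curve more points to be smooth with.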
So why are normal curves so great? Normal curves (or at least approximations of normal curves) show up in a lot of places, from height to test scores. Think of something that occurs a lot, go out and measure it and graph it, and you'll probably end up with a normal curve. Statisticians have come up with tables that tell us all about normal curves. We can say that on this particular point of the curve, 95% of the population will be greater and 5% will be less.
In order to fit a normal curve to some real live thing, we have to know two things: the mean and the standard deviation. The mean tells us where the center of our curve is. If the average height of people is 5'2" then 5'2" will be dead in the center of our normal curve. The standard deviation tells us the scale we're working with. Let's take two normal curves: the height of American five year olds, and the height of all minors. They might have the same mean - the average height of these two groups might be exactly the same. Yet the variance is very different: 95% of all 5 year olds will probably be within 6 inches of the mean, while 95% of all minors will probably be within 2 ft. of the mean. On a normal curve graph, you've got little measurements on the bottom. When there's more variance, those measurements "stretch out" and when there's less they "squish together."
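To see the "stretch out" and "squish together" effect in numbers, here is a small sketch using Python's standard-library NormalDist. The shared mean of 44 inches is invented, and the standard deviations are back-calculated from the 6 inch and 2 ft. figures above, assuming that 95% of a normal curve lies within about 1.96 standard deviations of the mean.

```python
from statistics import NormalDist

shared_mean = 44.0  # hypothetical mean height in inches, same for both groups

# 95% of a normal curve lies within about 1.96 standard deviations of the mean.
five_year_olds = NormalDist(mu=shared_mean, sigma=6 / 1.96)
all_minors = NormalDist(mu=shared_mean, sigma=24 / 1.96)


def share_within(dist, distance):
    """Proportion of the curve lying within `distance` of the mean."""
    return dist.cdf(dist.mean + distance) - dist.cdf(dist.mean - distance)


print(round(share_within(five_year_olds, 6), 3))  # within 6 inches
print(round(share_within(all_minors, 24), 3))     # within 2 ft.
print(round(share_within(all_minors, 6), 3))      # same 6 inches, wider curve
```

Same center, different scale: the 6 inches that covers nearly all of the five year olds covers well under half of the minors.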
The most commonly used statistical test is the t-test. The purpose of a t-test is to figure out if the difference between two means is caused by random chance or by an actual difference between the two populations. Or, more precisely, the t-test determines whether the probability that the differences we found are due to simple random variation is more or less than our alpha.
Let's say we have two means. One is "average amount of improvement among heart patients who sat in a room by themselves," another is "average amount of improvement among heart patients who sat in a room with leeches on them." The first is 7, the second is 15. Are we 95% sure that this wasn't just random variation between the two groups of people that had nothing to do with leeches? A t-test will tell us.
Now, I can hear you asking "Wow, how does the t-test do that?" The t-test can do that because it knows a special magic trick that I'm going to reveal to you.
Zando the magician is at a party with a bunch of kids and adults. Now, let's say that the ages of the people at the party are not a normal curve. Let's say it looks something like this:
Now, what Zando does is put everyone's name in a hat. He picks out two names, averages the ages of the two people, then puts the names back in the hat. He graphs that average on a frequency distribution. Then he does it again and again, about 100 times. Now, no matter what the frequency graph of the original population looked like, what he's going to end up with will look a lot more like a normal curve (and the more names he averages on each draw, the closer it gets). He turned a non-normal curve into a normal curve. How does he do it? There's no trick, it's actual magic!
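Zando's hat trick is easy to simulate. The sketch below makes up a lumpy party (60 kids, 40 adults, nobody in between), then draws names and averages the ages over and over. All the numbers and names in the code are assumptions for illustration. With only two names per draw the graph of averages is still a bit lumpy; with thirty names per draw it tightens up around the true mean.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)

# Hypothetical party: a pile of kids and a cluster of adults (not normal).
ages = ([rng.randint(5, 9) for _ in range(60)]
        + [rng.randint(28, 45) for _ in range(40)])


def sample_means(population, sample_size, draws, rng):
    """Draw names from the hat, average their ages, put them back, repeat."""
    return [mean(rng.sample(population, sample_size)) for _ in range(draws)]


means_2 = sample_means(ages, 2, 5000, rng)    # Zando's two names at a time
means_30 = sample_means(ages, 30, 5000, rng)  # a bigger handful of names

print("party mean age:       ", round(mean(ages), 2))
print("mean of 30-name means:", round(mean(means_30), 2))
print("spread of ages:       ", round(pstdev(ages), 2))
print("spread of 2-name means: ", round(pstdev(means_2), 2))
print("spread of 30-name means:", round(pstdev(means_30), 2))
```

The averages center on the party's real mean age, and their spread shrinks as the number of names per draw grows.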
Say Zando wanted to prove that the people drinking Kool-Aid were younger than the people drinking tequila at the party. He takes a random sample of two Kool-Aid drinkers and two tequila drinkers. He finds that the mean age for the Kool-Aid sample (7) was less than the mean age for the tequila sample (30). It seems like he might be right, but what if the results he got are due to random variation? Assuming HO, that there is no difference between Kool-Aid and tequila drinkers, what is the chance of picking a sample with a mean of 7 and a sample with a mean of 30? All Zando has to do is look on his magical normal curve and see where 7 and 30 are. Using a normal curve table, he can see what proportion of cases lie below 7 and above 30. And, if you remember probabilities, the probability of something happening is the proportion of times it happens in an infinite number of tries, which is exactly what our normal curve is telling us. So, let's say that our normal curve tells us that the chance of getting a 7 or less and then a 30 or more (using the multiplication rule of probability) is .0371, less than our pre-set alpha of .05. Now we can say, at a .05 level of significance, that there is a difference in age between tequila and Kool-Aid drinkers.
This is pretty much what a t-test does: you feed it your samples, it assumes a normal curve for possible samples if the HO is true, then it goes on to find out how unlikely your results are. If they're really unlikely, you can reject HO and accept HA.
There are a few types of t-tests, which are all pretty much the same. The first is a one-sample t-test. We use this when we want to compare the mean of our sample with a population mean that we already know. For instance, what if we only wanted to prove that tequila drinkers are older than the mean age of the population of the party, and we've already figured out that the mean age for the population is 10? Our HO would be that $\mu_{tequila} = 10$ and our HA that $\mu_{tequila} > 10$. To run this test, we need to know the mean of X (our sample scores for tequila drinkers), the population mean we're comparing it to (10), and the estimated standard error of the mean (which is the sample's standard deviation divided by the square root of the n of the sample). Given all this, we can find the t, which is sort of like the z score of our sample mean in the normal curve of possible sample means. Once we have a t-score, we can go to a t-score table and find out what t-score we need at our sample size and alpha to be able to reject HO. If the t-score we calculated is larger than the t-score in the table, we can reject HO and accept HA. T-scores are listed on a t-score table by degrees of freedom. For single mean t-tests, the degrees of freedom (df) is equal to n-1.
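Our equations for this test are:

$$t = \frac{\bar{X} - \mu}{s_{\bar{X}}}$$

$$s_{\bar{X}} = \frac{s}{\sqrt{n}}$$

where $\bar{X}$ is the sample mean, $s$ is the sample's standard deviation, $s_{\bar{X}}$ is the estimated standard error of the mean, and $\mu$ equals the population mean and the mean our sample should be, as predicted by HO.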
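Here is a rough Python sketch of the one-sample test. The tequila drinkers' ages are invented, the function name is my own, and the critical value (2.132 for a one-tailed test at alpha .05 with 4 degrees of freedom) is the kind of number you would look up in a t-score table.

```python
import math
from statistics import mean, stdev


def one_sample_t(sample, population_mean):
    """Return (t, df) for a one-sample t-test against a known mean."""
    n = len(sample)
    # Estimated standard error of the mean: sample sd over sqrt(n).
    standard_error = stdev(sample) / math.sqrt(n)
    t = (mean(sample) - population_mean) / standard_error
    return t, n - 1


tequila_ages = [25, 30, 35, 28, 32]  # hypothetical sample
t, df = one_sample_t(tequila_ages, population_mean=10)

critical_t = 2.132  # table value: one-tailed, alpha = .05, df = 4
reject_h0 = t > critical_t

print(f"t = {t:.3f}, df = {df}, reject HO: {reject_h0}")
```

If you have a different sample size or alpha, the only thing that changes in the last step is which row and column of the table you read.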
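The next type of t-test we can do is a t-test for unrelated means. This is the same as our test to find out whether the tequila drinkers at the birthday party had a higher age than the Kool-Aid drinkers (or merely a different age, if we want to use a two-tailed test). In this test we compare the means of two samples to see if they're different enough to reject HO. To do this test, we need to know the mean of both samples ($\bar{X}_1$ and $\bar{X}_2$), the standard error of the difference between the means ($s_{\bar{X}_1 - \bar{X}_2}$), the sum of scores for each sample ($\sum X_1$ and $\sum X_2$), the sum of the squared scores of each sample ($\sum X_1^2$ and $\sum X_2^2$), and the pooled sum of squares ($SS_p$). The equations we need will look like this:

$$SS_p = \left(\sum X_1^2 - \frac{(\sum X_1)^2}{n_1}\right) + \left(\sum X_2^2 - \frac{(\sum X_2)^2}{n_2}\right)$$

$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{SS_p}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - 0}{s_{\bar{X}_1 - \bar{X}_2}}$$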
The zero you see above represents the difference between means predicted by the HO (usually none). The degrees of freedom for a t-test of unrelated means is the sum of the n's of both samples minus 2.
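A sketch of the unrelated-means test in Python, built from the sums and the pooled sum of squares described above. This is the standard pooled-variance version, which I'm assuming is the one intended; the ages and function names are made up.

```python
import math


def sum_of_squares(scores):
    """Sum of squared scores minus (sum of scores) squared over n."""
    n = len(scores)
    return sum(x * x for x in scores) - sum(scores) ** 2 / n


def unrelated_means_t(sample_1, sample_2):
    """Return (t, df) for a pooled-variance t-test on two samples."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1 = sum(sample_1) / n1
    mean_2 = sum(sample_2) / n2
    ss_pooled = sum_of_squares(sample_1) + sum_of_squares(sample_2)
    df = n1 + n2 - 2
    pooled_variance = ss_pooled / df
    # Standard error of the difference between the two means.
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    # The 0 is the difference between means predicted by HO.
    t = ((mean_1 - mean_2) - 0) / standard_error
    return t, df


kool_aid_ages = [6, 7, 8]     # hypothetical
tequila_ages = [28, 30, 32]   # hypothetical
t, df = unrelated_means_t(kool_aid_ages, tequila_ages)
print(f"t = {t:.3f}, df = {df}")
```

The t comes out negative here only because the younger group was listed first; for a two-tailed test you compare its absolute value with the table.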
The third type of t-test we'll consider right now is a t-test for related (dependent) means. This is where we reduce the variance by matching scores together into pairs. This can be done in two ways. The first is to use one subject and measure them twice under two conditions. If we want to find out if drinking jello shots changes intelligence, we can give subjects an intelligence test and record their score, give them a few jello shots, then test them again. The before and after scores for each subject would be a pair. The second is to match up different subjects. What if we want to find out if a person's intelligence is different while they're being hit on the head with a mallet? We find subjects who score the same on intelligence tests and match them up in pairs, then give one an intelligence test with no mallets, and give the other an intelligence test while we hit them on the head with a mallet. The non-mallet score and mallet score would be a pair of scores. To do a t-test for related means, we need to know the n (this time it's the number of pairs of scores), the difference between each pair of scores ($D$), the square of those differences ($D^2$), the sum of D ($\sum D$), the sum of the squares of D ($\sum D^2$) and the estimate of the standard error of D ($s_{\bar{D}}$). These are the equations we'll use:

$$\bar{D} = \frac{\sum D}{n}$$

$$s_D = \sqrt{\frac{\sum D^2 - \frac{(\sum D)^2}{n}}{n - 1}}$$

$$s_{\bar{D}} = \frac{s_D}{\sqrt{n}}$$

$$t = \frac{\bar{D} - 0}{s_{\bar{D}}}$$
Once again, the 0 is the difference between means predicted by the HO. The degrees of freedom is the number of pairs minus 1.
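And a matching sketch for related means, working from the differences between each pair. The before and after scores are invented (say, intelligence tests on either side of the jello shots), and the function name is my own.

```python
import math


def related_means_t(before, after):
    """Return (t, df) for a t-test on paired scores."""
    if len(before) != len(after):
        raise ValueError("every score needs a partner")
    differences = [a - b for a, b in zip(after, before)]  # D for each pair
    n = len(differences)                                  # number of pairs
    sum_d = sum(differences)
    sum_d_squared = sum(d * d for d in differences)
    mean_d = sum_d / n
    # Variance of the differences, from the sum and the sum of squares.
    variance_d = (sum_d_squared - sum_d ** 2 / n) / (n - 1)
    # Estimated standard error of the mean difference.
    standard_error = math.sqrt(variance_d / n)
    # The 0 is the mean difference predicted by HO.
    t = (mean_d - 0) / standard_error
    return t, n - 1


before_shots = [100, 105, 98, 110]  # hypothetical test scores
after_shots = [95, 101, 97, 104]    # same subjects, a few jello shots later
t, df = related_means_t(before_shots, after_shots)
print(f"t = {t:.3f}, df = {df}")
```

Because each subject is compared only with themselves (or their matched partner), the person-to-person variation drops out, which is the whole point of pairing.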
Test: You've developed a drug which lets people see beyond the curtain of this reality and see the inhuman monsters that stalk us and eat our souls when we die. You've taken 100 people, given half of them the drug, and asked them all to report the number of monsters they see in a week. The group who didn't take the drug saw, on average, .6 monsters each. The group who did take the drug saw, on average, 2.1 monsters each. Your HO is that the drug does not change monster sightings. Your HA is that the drug does change monster sightings. What are you hoping for?
A. That the sample size is small enough, and the variation between the two groups and within each group is large enough, that our T score will be higher than .95 so we can reject both the null and alternate hypotheses.
B. That the sample size is large enough, the variance between the two groups is small enough, the variance within the groups is small enough, that our T score will be less than that required by an alpha of .05 and we can reject the alternate hypothesis.
C. That the variance within the two groups is low enough, the difference between the two groups is high enough and the sample size is large enough to give us a T score that meets an alpha of .05 and we can reject the null hypothesis.
D. To burn all my research notes, destroy all samples of the drug and drink until I forget about this damned experiment.
If you liked this tutorial, there's a .0031 probability that you'll be willing to buy my sci-fi Role Playing Game Fates Worse Than Death, so get out your credit card now!