CHAPTER II-2
TESTING HYPOTHESES
The concept of sameness was said in Chapter 00 to be the
basic idea in all of knowledge-getting, and the use of the
assumption of sameness is the most valuable heuristic we have for
the process of inference. Following on this, it obviously is of
great importance that one be able to say with some confidence
whether two collections of things are or are not the same for
one's current purposes.
In most situations in which we find ourselves, it is quite
clear without refined inquiry whether or not it is sensible to
treat the collections or items as the same. When you see bags of
potato chips on the shelf of a supermarket, you don't bother to
examine which bag has the more or better-shaped chips, though
some discriminating purchasers examine individual pieces of fresh
fruit at the greengrocer; any possible difference between the
potato-chip bags is too small for you to be concerned with, you
are sure. But when you see a group of Japanese tourists born in
the 1930s or earlier, you are sure that their heights are far
less on average than the next group of Japanese tourists who were
born in the 1960s. And you do not need to collect data or
consult previous studies to know that ingested alcohol has a
powerful effect on the human constitution and behavior.
From time to time, however, situations arise wherein we are
in doubt about whether two collections or items should be
considered the same or different (the heights of two
groups of tourists, say) or whether one "outlier" should be
considered to be from the same collection as other observations
(of a planet, say, one of the earliest problems in statistical
inference). To help us make a determination, we call upon one of
the main procedures in statistical inference - the testing of
hypotheses.
The logic of hypothesis testing is the subject of this
chapter. This logic is relatively uncomplicated and
uncontroversial compared to the logic of confidence intervals,
discussed in Chapters 00. Yet it, too, cannot be done as a
matter of routine, and requires judgment.
The first published formal test of a hypothesis was made in 1710
by John Arbuthnot, doctor to Queen Anne of England, who thereby
became the father of formal statistical inference. He observed
that more boys than girls are born, which he assumed is necessary
for the survival of the species, and he wished to prove that
birth sex is indeed not a 50-50 probability. The records for
London showed that male births exceeded female 82 years in a row.
Arbuthnot therefore set forth to (in modern language) test the
hypothesis that a universe with a 50-50 probability of producing
males could result in 82 successive years with preponderantly
male births.
This is a canonical problem. You have some observed "sample"
data, and you want to connect them to some specified "population"
from which they may have come. The previous sentence was
purposely worded vaguely because statistical questions can be
stated in many different ways. But in this case statisticians
agree on how to proceed: Specify the universe, and compare its
behavior against the observed sample. If it is unlikely that a
sample as surprising as the observed sample should come from the
specified universe, conclude that the sample did not come from
that universe. Chapter III-1 describes how Arbuthnot went about
it.
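As a rough illustration of the kind of comparison involved (not
necessarily the calculation Arbuthnot himself made, which is left
to Chapter III-1), one can ask how often a 50-50 universe would
produce 82 male-preponderant years in a row. The Python sketch
below assumes that each year is an independent event with
probability 1/2 of showing more male than female births; that
assumption and the variable names are illustrative only.

```python
from fractions import Fraction

# Assumption: under the benchmark universe, each year independently has
# a 1/2 chance of showing more male than female births.
p_male_year = Fraction(1, 2)
years = 82

# Multiplication rule for independent events.
p_all_male_years = p_male_year ** years

print(f"Chance of {years} male-preponderant years in a row: "
      f"{float(p_all_male_years):.3e}")
```

A probability this small is what leads one to conclude that the
sample did not come from the 50-50 universe.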
The practical business of carrying out a statistical
inference begins with the translation of a general question into
a scientific question, and thence into a question amenable to
statistical treatment; this process of translation, common to all
scientific inference, is discussed in Chapter 00. The general
procedure for the probabilistic manipulation carried out in the
context of a statistical inference, which pertains to all
statistical inference and not just to testing hypotheses, is set
forth below. (The subject was introduced in Chapter I-1.) The
overall procedure for a statistical inference, from the
translation of the question to the conclusion, can be framed in a
long series of questions and answers about the nature of the
universe(s) and sample(s), the probabilistic manipulation, and
then interpretation. The canonical series of these questions and
answers for testing hypotheses is presented in this chapter, and
the series of questions for finding confidence intervals is in
Chapter 00.
THE STEPS IN STATISTICAL INFERENCE
These are the steps in conducting statistical inference:
1. Frame a question in the form of: What is the chance of
getting the observed sample s from some specified population S?
The postulated universe S bears some likeness to the model
created by the researcher against which to test the observed
data. But instead of deriving from theory, insight, hunch,
whatever, in inference the model derives from the sample (plus
perhaps a Bayesian prior).
Another difference from the "scientific" model is that the
postulated universe S has no causal connection to sample s except
the process of (random?) sampling.
Universe S is like a scientific model in that it is assumed
not to be a perfect picture of nature. But unlike a scientific
model, in the case of a finite universe we assume that larger and
larger samples can approach the actual universe.
2. Reframe the question in the form of: What kinds of
samples does population S produce, with which probabilities?
That is, what is the probability of the observed sample s given
that a population is S? Or, what is p(s|S)?
3. Actually investigate the behavior of S with respect to s
and other samples. This can be done in two ways:
a. Use the calculus of probability ("math"), perhaps
resorting to the Monte Carlo method if an appropriate formula
does not exist. Or
b. Resampling (in the larger sense), which equals the Monte
Carlo method minus its use for approximations, investigation of
complex functions in statistics and other theoretical
mathematics, and non-resampling uses elsewhere in science.
Resampling in the more restricted sense includes bootstrap,
permutation, and other non-parametric methods. More about the
resampling procedure follows the paragraphs to come, and then in
later chapters in the book.
4. Interpret the probabilities that result from step 3 in
terms of acceptance or rejection of hypotheses, surety of
conclusions, and as inputs to decision theory.
The following short definition of statistical inference
summarizes the previous four steps: statistical inference equals
the selection of a probabilistic model to resemble the process
you wish to investigate, the investigation of that model's
behavior, and the interpretation of the results.
STEPS IN ESTIMATION OF STATISTICAL PROBABILITIES BY RESAMPLING
Stating the steps to be followed in a procedure is an
operational definition of the procedure. My belief in the
clarifying power of this device is embodied in the several sets
of steps given in this chapter for the various aspects of
statistical inference. This section sets forth the steps for the
computation of the probabilities when the inference will be done
with resampling. More detail may be found in the rest of this
chapter, and in Chapter 00.
Let us define resampling in a fashion that will include not
only problems in inferential statistics but also problems in
probability, as follows: Using the entire set of data you have in
hand, or using the given data-generating mechanism (such as a
die) that is a model of the process you wish to understand,
produce new samples of simulated data, and examine the results of
those samples. In some cases, it may also be appropriate to
amplify this procedure with additional assumptions.
Problems in pure probability may at first seem different in
nature than problems in statistical inference. But the same
logic as stated in this definition applies to both varieties of
problems. The difference is that in probability problems the
"model" is known in advance -- say, the model implicit in a deck
of poker cards plus a game's rules for dealing and counting the
results -- rather than the model being assumed to be best
estimated by the observed data, as in resampling statistics.
The following general procedure simulates what we are doing
when we estimate a probability using resampling operations:
Step A. Construct a simulated "universe" of cards or dice
or some other randomizing mechanism whose composition is similar
to the universe whose behavior we wish to describe and
investigate. The term "universe" refers to the system that is
relevant for a single simple event. For example:
A coin with two sides, or two sets of random numbers "1-105"
and "106-205", simulates the system that produces a single male
or female birth, when we are estimating the probability of three
girls in the first four children or nine female calves in ten
births (the problem to be treated below). Notice that in this
universe the probability of a female remains the same from trial
event to trial event -- that is, the trials are independent --
demonstrating a universe from which we sample with replacement.
Step(s) B. Specify the procedure that produces a pseudo-
sample which simulates the real-life sample in which we are
interested. That is, specify the procedural rules by which the
sample is drawn from the simulated universe. These rules must
correspond to the behavior of the real universe in which you are
interested. To put it another way, the simulation procedure must
produce simple experimental events with the same probabilities as
the simple events have in the real world. For example:
In the case of three daughters in four children, or nine
female calves in ten births, you can draw a card and then replace
it if you are using a deck of red and black cards. Or if you are
using a random-numbers table, the random numbers automatically
simulate replacement. Just as the chances of having a female or
a male do not change depending on the sex of the preceding birth,
so we want to ensure through replacement that the chances do not
change each time we choose from the deck of cards.
Recording the outcome of the sampling must be indicated as
part of this step, e.g. "record `yes' if female, `no' if male."
Step(s) C. If several simple events must be combined into a
composite event, and if the composite event was not described in
the procedure in step B, describe it now. For example:
For the number of females in a sample of births, the
procedure for each simple event of a single birth was described
in step B. Now we must specify repeating the simple event four
times, and counting whether the outcome is or is not three girls
in the four childbirths or nine females in ten calves.
Recording of "three or more girls" or "two or fewer girls",
and "9 or more females" or "8 or fewer", is part of this step.
This record indicates the results of all the trials and is the
basis for a tabulation of the final result.
Step(s) D. Calculate the probability of interest from the
tabulation of outcomes of the resampling trials. For example:
the proportions of a) "yes" and "no", and b) "9 or more" and "8
or fewer", estimate the likelihood we wish to estimate in step C.
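Steps A through D can be sketched in a few lines of Python. This
is one possible rendering, not the only one: the function name,
the default of 10,000 trials, and the use of the computer's
random-number generator in place of cards or dice are all
assumptions made for illustration.

```python
import random

def estimate_probability(n_births, min_females, p_female=0.5,
                         n_trials=10000, seed=None):
    """Estimate P(at least min_females females in n_births) by resampling."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_trials):
        # Steps A and B: the "universe" is a random number compared with
        # p_female; each comparison is one birth, drawn with replacement.
        females = sum(rng.random() < p_female for _ in range(n_births))
        # Step C: combine the simple events into the composite event
        # and record whether it occurred on this trial.
        if females >= min_females:
            successes += 1
    # Step D: the proportion of trials showing the composite event.
    return successes / n_trials

print("3 or more girls in 4 children:", estimate_probability(4, 3, seed=1))
print("9 or more females in 10 calves:", estimate_probability(10, 9, seed=1))
```

Changing p_female (say, to 100/206) alters the universe in Step A
without touching the rest of the procedure.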
An Example [from Hodges and Lehmann, 1970]: Female calves
are more valuable than male calves. A bio-engineer claims to
have a method that can cause more females. He tests the
procedure on ten of your pregnant cows, and the result is nine
females. Should you believe that his method has some effect?
That is, what is the probability of a result this surprising or
more so occurring by chance?
The actual computation of probability may be done with
several formulaic or sample-space methods, and in several
resampling methods. I will first show a resampling method and
then several conventional methods. The following material that
allows one to compare resampling and conventional methods is more
germane to the explication of resampling in Chapters 00 and 00
than it is to the theory of hypothesis test discussed in this
chapter, but it is more expedient to present it here.
Computation of Probabilities with Resampling
We can do the problem by hand as follows:
1. Constitute an urn with either one blue and one pink
ball, or 106 blue and 100 pink balls.
2. Draw ten balls with replacement, count pinks, and
record.
3. Repeat step (2) say 400 times.
4. Calculate proportion of results with 9 or 10 pinks.
Or, we can take advantage of the speed and efficiency of the
computer as follows (also in ycha071):
REPEAT 15000
GENERATE 10 1,2 A
COUNT A =1 B
SCORE B Z
END
HISTOGRAM Z
COUNT Z >=9 K
DIVIDE K 15000 KK
PRINT KK
[Histogram of Z: frequency (vertical axis, 0 to 4000) by number
of females in ten births (horizontal axis, 0 to 10). The bin
counts are tabulated below.]
Resampling Stats - D:\...\CALVES.STA - Wed 7/ 7/93 13:40:59
Vector no. 1: Z
  Bin                        Cum
 Center     Freq     Pct     Pct
 --------------------------------
    0         22     0.1     0.1
    1        163     1.1     1.2
    2        650     4.3     5.6
    3       1801    12.0    17.6
    4       3075    20.5    38.1
    5       3717    24.8    62.9
    6       3035    20.2    83.1
    7       1739    11.6    94.7
    8        636     4.2    98.9
    9        145     1.0    99.9
   10         17     0.1   100.0
Note: Each bin covers all values within 0.1 of its center.
KK = 0.0108
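For readers without the Resampling Stats program, the same
computation might be sketched in Python as follows. The comments
tie each line to the corresponding command above; the seed and
the plain-text tally standing in for the histogram are choices
made here for illustration, and a different seed will give a
slightly different estimate.

```python
import random
from collections import Counter

rng = random.Random(0)       # arbitrary seed, only for repeatability
n_trials = 15000

z = []                                           # plays the role of vector Z
for _ in range(n_trials):                        # REPEAT 15000
    a = [rng.randint(1, 2) for _ in range(10)]   # GENERATE 10 1,2 A
    b = a.count(1)                               # COUNT A =1 B
    z.append(b)                                  # SCORE B Z

counts = Counter(z)
for females in range(11):                        # text stand-in for HISTOGRAM Z
    print(f"{females:2d} {counts[females]:5d}")

k = sum(1 for b in z if b >= 9)                  # COUNT Z >=9 K
kk = k / n_trials                                # DIVIDE K 15000 KK
print("KK =", kk)                                # PRINT KK
```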
CONVENTIONAL METHODS
Sample Space and First Principles
Assume for a moment that our problem is a smaller one and
therefore much easier - the probability of getting two females in
two calves if the probability of a female is .5. One could then
map out the sample space, and find the proportion of points that
correspond to a "success". We list all four possible combinations
- FF, FM, MF, MM. Now we look at the ratio of the number of
combinations that have 2 females to total, which is 1/4. We may
then interpret this probability.
We might also use this method for (say) five female calves
in a row. We can make a list such as FFFFF, MFFFF, MMFFF,
MMMFF...MFMFM...MMMMM. There will be 2*2*2*2*2 = 32
possibilities, and 64 and 128 possibilities for six and seven
calves respectively. But by the time we get as high as ten
calves, this method becomes very troublesome.
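The listing of the sample space is tedious by hand but easy to
sketch by machine. The Python fragment below (the function name
is hypothetical) writes out every sequence of F and M, assuming
as in the text that each sequence is equally likely, and counts
the points that qualify as a "success".

```python
from itertools import product

def sample_space(n_calves):
    """List every equally likely sequence of F (female) and M (male)."""
    return ["".join(seq) for seq in product("FM", repeat=n_calves)]

two = sample_space(2)
print(two)                                   # the four combinations
print(sum(s == "FF" for s in two) / len(two))

for n in (5, 6, 7, 10):
    print(n, "calves:", len(sample_space(n)), "possibilities")

# Points in the ten-calf sample space with nine or ten females.
ten = sample_space(10)
print(sum(s.count("F") >= 9 for s in ten), "of", len(ten))
```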
Sample Space Calculations
For two females in a row, we could use the well known, and
very simple, multiplication rule; we could do so even for ten
females in a row. But calculating the probability of nine
females in ten is a bit more complex.
Pascal's Triangle
One can use Pascal's Triangle to obtain binomial
coefficients for p = .5 and a sample size of 10, focusing on
those for 9 or 10 successes. Then calculate the proportion of
the total cases with 9 or 10 "successes" in one direction, to
find the proportion of cases that pass beyond the criterion of 9
females. The method of Pascal's Triangle requires more complete
understanding of the probabilistic system than does the
resampling simulation described above because Pascal's Triangle
requires that one understand the entire structure; simulation
requires only that you follow the rules of the model.
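As an illustration of the Triangle method, the Python sketch
below builds the tenth row of Pascal's Triangle by repeated
addition of neighbors and then takes the proportion of cases with
9 or 10 "successes". The function name is an assumption made for
the example.

```python
def pascal_row(n):
    """Return row n of Pascal's Triangle (row 0 is [1])."""
    row = [1]
    for _ in range(n):
        # Each new entry is the sum of the two entries above it.
        row = [a + b for a, b in zip([0] + row, row + [0])]
    return row

row10 = pascal_row(10)
print(row10)

favorable = row10[9] + row10[10]   # ways to get 9 or 10 females
total = sum(row10)                 # all equally likely cases when p = .5
print(favorable, "of", total, "=", favorable / total)
```

The proportion agrees closely with the resampling estimate KK
shown earlier.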
The Quincunx
The quincunx is more a simulation method than a theoretical
one, but it may be considered "conventional". Hence I include it here
for completeness.
Table of Binomial Coefficients
The Pascal Triangle becomes cumbersome or impractical with
large numbers - say, 17 females of 20 births - or with
probabilities other than .5. One might produce the binomial
coefficients by algebraic multiplication, but that, too, becomes
tedious even with small sample sizes. One can also use the
pre-computed table of binomial coefficients found in any standard
text. But the probabilities for n = 10 and 9 or 10 females are
too small to be shown.
Binomial Formula
For larger sample sizes, one can use the binomial formula.
(The binomial formula gives no deeper understanding of the
statistical structure than does the Triangle, but it does yield a
deeper understanding of the pure mathematics.) With very large
numbers, even the binomial formula is cumbersome.
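A minimal Python sketch of the binomial formula follows, applied
to the two cases mentioned in this chapter: 9 or more females in
10 births, and 17 or more in 20. The function name is
hypothetical, and math.comb supplies the binomial coefficient.

```python
from math import comb

def binomial_tail(n, k_min, p):
    """P(k_min or more successes in n trials, success probability p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

print("9 or more females of 10: ", binomial_tail(10, 9, 0.5))
print("17 or more females of 20:", binomial_tail(20, 17, 0.5))
```

Because p is an argument, the same function serves for
probabilities other than .5.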
The Normal Approximation
When the sample size becomes too large for any of the above
methods, one can then use the Normal approximation, which yields
results close to the binomial (as seen very nicely in the output
of the quincunx). But to employ the Normal distribution one
requires an estimate of the standard deviation, which can be
derived either by formula or by resampling. (See a more extended
parallel discussion in Chapter 00 on confidence intervals for an
election poll.)
The desired probability can be obtained from the Z formula
and a standard table for the Normal distribution found in
every elementary text.
The Z table can be made less mysterious if one generates it
with simulation, or with graph paper or Archimedes' method, using
as raw material (say) five "continuous" (that is, non-binomial)
distributions, many of which are skewed: 1) Draw samples of
(say) 50 or 100. 2) Plot the means to see that the Normal shape
is the outcome. Then 3) standardize with the standard deviation
by marking the standard deviations onto the histograms.
The aim of the above exercise and the heart of the
conventional parametric method is to compare the sample result -
the mean - to a standardized plot of the means of samples drawn
from the universe of interest to see how likely it is that that
universe produces means deviating as much from the universe mean
as does our observed sample mean. The steps are:
1. Establish the Normal shape - from the exercise above, or
from the quincunx or Pascal's Triangle or the binomial formula or
the formula for the Normal approximation or some other device.
2. Standardize that shape in standard deviations.
3. Compute the Z score for the sample mean - that is, its
deviation from the universe mean in standard deviations.
4. Examine the Normal (or really, tables computed from
graph paper, etc.) to find the likelihood of a mean being that
far by chance.
This is the canon of the procedure for most parametric work
in statistics. For some small samples, accuracy is improved by
using the t distribution, a matter discussed in Chapter 00.
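The four steps can be sketched in Python for the calves example,
with the caveat that a sample of ten is small for the Normal
approximation and is used here only because the exact answer is
known. The standard deviation is taken from the usual binomial
formula, the library's Normal distribution stands in for the
printed Z table, and the half-unit continuity correction is an
assumption added for the example rather than something the text
prescribes.

```python
from math import sqrt
from statistics import NormalDist

n, p = 10, 0.5

# Steps 1 and 2: Normal shape, standardized by the standard deviation.
universe_mean = n * p                    # expected number of females
universe_sd = sqrt(n * p * (1 - p))      # binomial standard deviation

# Step 3: Z score for "9 or more females", read as 8.5 or more
# (continuity correction, an assumption of this sketch).
z = (8.5 - universe_mean) / universe_sd

# Step 4: chance of a result that far out, from the Normal curve.
tail = 1 - NormalDist().cdf(z)

print("Z =", round(z, 3))
print("approximate probability =", round(tail, 4))
```

The approximation lands near, though not exactly on, the
binomial and resampling answers.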
CHOICE OF THE BENCHMARK UNIVERSE<1>
In the example of the ten calves, the choice of a benchmark
universe - a universe that (on average) produces equal
proportions of males and females - seems rather straightforward
and even automatic, requiring no difficult judgments. But in
other cases the process requires more judgments to be made.
Let's consider another case where the choice of a benchmark
universe requires no difficult judgments. Assume the U.S.
Department of Labor's Bureau of Labor Statistics takes a very
large sample - say, 20,000 persons - and finds a 10 percent
unemployment rate. At some later time another but smaller sample
is drawn - 2,000 persons - showing an 11 percent unemployment
rate. Should BLS conclude that unemployment has risen, or is
there a large chance that the difference between 10 percent and
11 percent is due to sample variability? In this case, it makes
rather obvious sense to ask how often a sample of 2,000 drawn
from a universe of 10 percent unemployment (ignoring the
variability in the larger sample) will show a rate as high as 11
percent just due to sample variability. This problem differs
from that of the calves only in the proportions and the sizes of
the samples.
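A resampling treatment of this question might be sketched as
follows in Python. The function name, the number of trials, and
the one-sided reading of "as different as 11 percent" (that is,
11 percent or higher) are assumptions made for illustration.

```python
import random

def share_at_least(n_sample, p_universe, observed_rate,
                   n_trials=2000, seed=None):
    """Share of simulated samples whose rate reaches observed_rate."""
    rng = random.Random(seed)
    threshold = round(observed_rate * n_sample)   # 220 unemployed of 2,000
    hits = 0
    for _ in range(n_trials):
        # One simulated sample drawn from the benchmark universe.
        unemployed = sum(rng.random() < p_universe for _ in range(n_sample))
        if unemployed >= threshold:
            hits += 1
    return hits / n_trials

estimate = share_at_least(2000, 0.10, 0.11, seed=1)
print("Share of samples showing 11 percent or more:", estimate)
```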
Let's change the facts and assume that a very large sample
had not been drawn and only a sample of 2,000 had been taken,
indicating 11 percent unemployment. A policy-maker asks the
likelihood that unemployment is above ten percent. It would
still seem rather straightforward to ask how often a universe of
10 percent unemployment would produce a sample of 2000 with a
proportion of 11 percent unemployed.
Still another problem where the choice of benchmark
hypothesis is relatively straightforward: Say that BLS takes two
samples of 2000 persons a month apart, and asks whether there is
a difference in the results. Pooling the two samples and
examining how often two samples drawn from the pooled universe
are as different as are observed seems obvious.
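The pooling procedure can be sketched in Python as below. The
text gives no counts for the two monthly samples, so the figures
of 200 and 220 unemployed are hypothetical, as are the number of
trials and the decision to draw from the pooled universe with
replacement and to count differences in either direction.

```python
import random

# Hypothetical data (not from the text): two samples of 2,000 persons.
n_each = 2000
unemployed_a, unemployed_b = 200, 220
observed_diff = abs(unemployed_b - unemployed_a)

# Pooled universe: 1 = unemployed, 0 = employed.
n_unemployed = unemployed_a + unemployed_b
pooled = [1] * n_unemployed + [0] * (2 * n_each - n_unemployed)

rng = random.Random(2)      # arbitrary seed, only for repeatability
n_trials = 1000
as_different = 0
for _ in range(n_trials):
    # Draw two samples from the same pooled universe, with replacement.
    first = sum(rng.choices(pooled, k=n_each))
    second = sum(rng.choices(pooled, k=n_each))
    if abs(first - second) >= observed_diff:
        as_different += 1

p_value = as_different / n_trials
print("Share of trial pairs at least as different as observed:", p_value)
```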
One of the reasons that the above cases - especially the
two-sample case - seem so clear-cut is that the variance of the
benchmark hypothesis is not an issue, being implied by the fact
that the samples deal with proportions. If the data were
continuous, however, this issue would quickly arise. Consider,
for example, that the BLS might take the same sorts of samples
and ask unemployed persons the lengths of time they had been
unemployed. Comparing a small sample to a very large one would be
easy to decide about. And even comparing two small samples might
be straightforward - simply pooling them as is.
But what if you have a sample of 2,000 with data on
lengths of unemployment spells with a mean of 30 days, and you
are asked the probability that it comes from a universe with a
mean of 25 days? Now there arises the question about the amount
of variability to assume for that benchmark universe. Should it
be the variability observed in the sample? That is probably an
overestimate, because a universe with a smaller mean would
probably have a smaller variance, too. So some judgment is
required; there cannot be an automatic "objective" process here,
whether one proceeds with the conventional or the resampling
method.
The example of the comparison of liquor retailing systems in
Chapter 00 provides more material on this subject.
THE CONCEPT OF STATISTICAL SIGNIFICANCE IN TESTING HYPOTHESES
Hypothesis tests using the concept of significance have been
misused almost since their origin; the flaws were pointed out
early on by my friend and editor, Hanan Selvin, and since then
have been discussed often and so well that no discussion is
needed here. This section offers only an interpretation of the
meaning of "significant" in connection with the logic of
significance tests.
1) Consider the nine-year-old who tells the teacher
that the dog ate the homework. Why does the teacher not accept
the excuse? Clearly it is because the event would be too
"unusual". But why do we think that way?
Let's speculate that you survey a million adults, and only
three report that they have ever heard of a real case where a dog
ate somebody's homework. You are a teacher, and a student comes
in without homework and says that a dog ate the homework. It
could have happened -- your survey reported that it really has
happened in three lifetimes out of a million. But it does not
happen very often.
Therefore, you probably conclude that because the event is
so unlikely, something else must have happened -- for example,
that the student did not do the homework. The logic is that if
something seems very unlikely, it would therefore surprise us
greatly if it were to actually happen, and therefore we assume
that there must be a better explanation. This is why we look
askance at unlikely coincidences when they are to someone's
benefit.
This is the logic of John Arbuthnot's hypothesis test about
the ratio of births by sex in the first published hypothesis test
(see Chapter 00), though his extension of his logic to God's
design goes beyond the standard modern framework. It is also the
implicit logic in the research on puerperal fever, cholera, and
beri-beri, the data for which were shown in Chapter 00, though no
explicit mention was made of probability in those cases.
2) Two students sit next to each other at an examination.
Out of a hundred questions each student gets 82 right, and each
of the mistakes that they make is on the same questions. Do you
believe that the students cheated?
You say to yourself: It would be most unlikely that they
would have made the same mistakes by chance -- and you can
compute how unlikely it would be -- and because it is so unlikely
you therefore are likely to believe that they cheated.
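One crude way to compute how unlikely it would be is sketched
below in Python. It rests on a strong assumption that is not in
the text: that each student's 18 mistakes are equally likely to
fall on any 18 of the 100 questions, independently of the other
student. Real examinations have hard questions that many students
miss, so the true chance of matching mistakes is larger than this
figure suggests.

```python
from math import comb

n_questions = 100
n_wrong = n_questions - 82          # 18 mistakes for each student

# Assumption: every set of 18 questions is equally likely to be the
# set a student gets wrong, independently for the two students.
# Then the second student's set matches the first's with chance:
p_same_mistakes = 1 / comb(n_questions, n_wrong)

print("Number of possible sets of mistakes:", comb(n_questions, n_wrong))
print("Chance of identical mistakes under the assumption:", p_same_mistakes)
```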
3) The court is hearing a murder case. There is no eye-
witness, and the evidence consists of such facts as the height
and weight and age of the person charged, and other
circumstantial evidence. Only one person in 50 million has such
characteristics, and you find such a person. Will you convict
the person, or will you assume that the evidence might have
occurred just by chance? Of course it might have occurred by bad
luck, but the probability is very very small. Will you therefore
conclude that because the chance is so small, it is reasonable to
assume that the person charged committed the crime?
Sometimes the unusual really happens - the court errs by
judging that the wrong person did it, and that person goes to
prison or even is executed. The best we can do is to make the
criterion strict: "Beyond a reasonable doubt". (People ask:
What probability does that criterion represent? But the court
will not provide a numerical answer.)
4) Somebody says to you: I am going to deal out five cards
and it will be a royal flush - ten, jack, queen, king, and ace of
a given suit. The person deals the cards and the royal flush
appears. Do you think the occurrence happens just by chance?
No, you are likely to be very dubious that it happened by chance.
Then you believe there must be some other explanation -- that the
person fixed the cards, for example.
Note: You don't attach the same meaning to any other
permutation, even though it is as rare - unless the person
announced it in advance.
Indeed, even if the person says nothing, you will be
surprised at a royal flush, because this hand has meaning,
whereas another given set of five cards does not.
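The rarity at issue can be computed directly. The Python sketch
below assumes a fair deal from a standard 52-card deck and treats
a hand as a set of five cards without regard to order; it
compares the chance of a royal flush in any of the four suits
with the chance of any one fully specified hand.

```python
from fractions import Fraction
from math import comb

hands = comb(52, 5)                   # number of distinct five-card hands

p_royal_flush = Fraction(4, hands)    # one royal flush per suit
p_specific_hand = Fraction(1, hands)  # any single hand named in advance

print("Five-card hands:", hands)
print("Royal flush:", p_royal_flush, "=", float(p_royal_flush))
print("One named hand:", p_specific_hand, "=", float(p_specific_hand))
```

A named ordinary hand is even rarer than a royal flush, which is
why rarity alone does not account for our surprise.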
Two important points complicate the concept of statistical
significance:
1. With a large enough sample, every treatment or variable
will seem different from every other. Two faces of even a good
die will produce different results in the very long run. Other
statistics help interpret these results - for example, the beta
coefficient or the partial regression coefficient (see Chapter
00).
2. Statistical significance does not imply economic or
social significance. Two faces of a die may be statistically
significant in a huge sample of throws, but a 1/10,000 difference
is too small to make an economic difference in betting.
Statistical significance is only a filter. If it appears, one
should then proceed to decide whether there is substantive
significance.
Interpreting significance is sometimes complex, especially
when the interpretation depends heavily upon your prior
expectations - as it often does. For example, how should a
basketball coach decide whether or not to bench a player for poor
performance after a series of missed shots at the basket?
Consider Coach John Thompson who, after Charles Smith missed
10 of 12 shots in the 1989 Georgetown-Notre Dame NCAA game, took
Smith out of the game for a time (The Washington Post, March 20,
1989, p. C1). The scientific or decision problem is: Should the
coach consider that Smith is not now a 47 percent shooter as he
normally is, and therefore bench him? The statistical question
is: How likely is a shooter with a 47 percent average to produce
10 of 12 misses?
Would Coach Thompson take Smith out of the game after he
missed one shot? Clearly not. Why not? Because one "expects"
Smith to miss a shot half the time, and missing one shot
therefore does not seem unusual.
How about after Smith misses two shots in a row? For the
same reason the coach still would not bench him, because this
event happens "often" -- more specifically, about once in every
sequence of four shots.
How about after 9 misses out of ten shots? Notice the
difference between this case and 9 female calves of ten. In the
case of the calves we expected half females because the
experiment is a single isolated trial. The event considered by
itself has a small enough probability that it seems unexpected
rather than expected. "Unexpected" seems to be closely related
to "happens seldom" or "unusual" in our psychology. And an event
that happens seldom seems to call for explanation, and also seems
to promise that it will yield itself to explanation by some
unusual concatenation of forces. That is, unusual events lead us
to think that they have unusual causes; that is the nub of the
problem. (But on the other hand, one can sometimes benefit by
paying attention, as scientists know when they investigate
outliers.)
In basketball shooting, we expect 47 percent of Smith's
individual shots to be successful, and we also expect that
average for each set of shots. But we also expect some sets of
shots to be far from that average because we observe many sets;
such variation is inevitable. So when we see a single set of 9
misses in ten shots, we are not very surprised.
But how about 29 misses in 30 shots? At some point, one
must start to pay attention. (And of course we would pay more
attention if beforehand, and never at any other time, the player
said, "I can't see the basket today. My eyes are dim".)
So, how should one proceed? Perhaps the same way as with a
coin that keeps coming down heads a very large proportion of the
throws, over a long series of tosses: At some point you examine
it to see if it has two heads. But if your investigation is
negative, in the absence of an indication other than the behavior
in question, you continue to believe that there is no explanation
and you assume that the event is "chance" and should not be acted
upon. In the same way, a coach might ask a player if there is an
explanation for the many misses. But if the player answers "no",
the coach should not bench him. (There are difficulties here
with truth-telling, of course, but let that go for now.)
The key point for the basketball case and other repetitive
situations is not to judge that there is an unusual explanation
from the behavior of a single sample alone, just as with a short
sequence of stock-price changes.
We all need to learn that "irregular" (a good word here)
sequences are less unusual than they seem to the naked intuition.
A streak of 10 out of 12 misses for a 47 percent shooter occurs
about 3 percent of the time. That is, about every 33 shots
Smith takes, he will begin a sequence of 12 shots that will end
with 2 or fewer baskets - perhaps once in every couple of games.
This does not seem "very" unusual, perhaps. And if the coach
treats each such case as unusual, he will be losing some of the
services of a better player than he replaces him with.
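The figure for the streak can be checked with the binomial
formula. The Python sketch below assumes that Smith's shots are
independent, each with a 47 percent chance of success, and looks
at a single block of 12 shots; the function name is hypothetical.

```python
from math import comb

def prob_at_least(n, k_min, p):
    """P(k_min or more occurrences in n independent trials)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

p_miss = 1 - 0.47

# Two misses in a row on two given shots.
p_two_misses = p_miss ** 2

# Ten or more misses in a given set of twelve shots.
p_streak = prob_at_least(12, 10, p_miss)

print("Two misses in a row:", round(p_two_misses, 3))
print("10 or more misses in 12 shots:", round(p_streak, 4))
```

Because the calculation treats one fixed block of 12 shots,
translating it into "once every so many shots" over overlapping
sequences is only a rough guide.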
In brief, how hard one should look for an explanation should
depend on the likelihood of the event. But one should (almost)
assume the absence of an explanation unless one actually finds
it.
Bayesian analysis could be brought to bear upon the matter,
bringing in your prior probabilities based on knowledge that
research has shown that there is no such thing as a "hot hand" in
basketball, together with some sort of cost-benefit error-loss
calculation comparing Smith and the next best available player.
The "data-dredging" issue was discussed in the context of
the doctors' smoking by states in Chapter 00.
ENDNOTE
<1>: This is one of many issues that Peter Bruce first raised,
and whose treatment here reflects back-and-forth discussion
between us.