CHAPTER III-1
THE RESAMPLING METHOD FOR STATISTICAL INFERENCE
INTRODUCTION
This chapter introduces the resampling method in historical
and theoretical perspective, and illustrates the method.
About 1615, Italian gamblers brought to Galileo Galilei a
problem in the game of three dice. The theorists of the day had
figured as equal the chances of getting totals of 9 and 10 (also
11 and 12), because there are the same number of ways (six) of
making those points -- for example, a nine can be 126, 135, 144,
234, 225, and 333. But players had found that in practice 10 is
made more often than 9. How come?
Galileo then invented the device of the "sample space" of
possible outcomes. He colored three dice white, gray, and black,
and systematically listed every possible permutation. The previ-
ous theorists - including Gottfried Leibniz - had instead lumped
together into a single category the various possible ways of
getting (say) a 3, 3, and 4 to make 10. That is, they listed
combinations rather than permutations, and various combinations
contain different numbers of permutations.
Galileo's analysis confirmed the gamblers' empirical re-
sults. Ten does come up more frequently than 9, because there
are 27 permutations that add to 10 whereas there are only 25
permutations that add to 9.
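Galileo's enumeration is easy to reproduce by machine. The following sketch in Python (our illustration, not part of Galileo's or this book's apparatus) lists every ordered outcome of three dice and tallies the totals:

```python
from itertools import product

# List every ordered outcome of three distinguishable dice --
# Galileo's "sample space" -- and tally the two totals of interest.
counts = {9: 0, 10: 0}
for dice in product(range(1, 7), repeat=3):
    total = sum(dice)
    if total in counts:
        counts[total] += 1

print(counts[10], counts[9])  # 27 permutations make 10, only 25 make 9
```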
The use of repeated trials to learn what the gamblers wanted
to know illustrates the power of the resampling method -- which
we can simply call "simulation" or "experimentation" here. And
with sufficient repetitions, one can arrive at as accurate an
answer as desired. Not only is the resampling method adequate,
but in the case of three dice it was a better method than deduc-
tive logic, because it gave the more correct answer. Though the
only logic needed was enumeration of the possibilities, it was
too difficult for the doctors of the day. The powers of a Gali-
leo were necessary to produce the correct logic.
Even after Galileo's achievement, the powers of Blaise
Pascal and Pierre Fermat were needed to correctly analyze with
the multiplication rule another such problem - the chance of at
least one ace in four dice throws. (This problem, presented by
the Chevalier de Méré, is considered the origin of probability
theory.) For lesser mathematical minds, the analysis was too
difficult. Yet ordinary players were able to discern the correct
relative probabilities, even though the differences in probabili-
ties are slight in both the Galileo and Pascal-Fermat problems.
Simulation's effectiveness is its best argument.
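To see how repeated trials could reveal what the multiplication rule proves, compare the exact answer with a simulated one. A brief Python sketch (ours; the random seed is arbitrary, chosen for reproducibility):

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility

# Multiplication rule: P(no ace in one throw) = 5/6, so
# P(at least one ace in four throws) = 1 - (5/6)**4.
exact = 1 - (5 / 6) ** 4          # about .518 -- just over an even bet

# The gamblers' route: repeated trials rather than deduction.
trials = 100_000
hits = sum(
    any(random.randint(1, 6) == 1 for _ in range(4))
    for _ in range(trials)
)
estimate = hits / trials          # converges on the exact answer
```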
One might rejoin that the situation is different after Gali-
leo, Pascal, Fermat and their descendants have invented analytic
methods to handle such problems correctly. Why not use already
existing analytic methods instead of resampling?
The existence of a correct algorithm does not imply that it
will be used appropriately, however. And a wrongly-chosen algo-
rithm is far worse than no algorithm at all -- as the Chevalier's
pocketbook attested. In our own day, decades of experience have
proven that "pluginski" -- the memorization of formulas that one
cannot possibly understand intuitively -- may enable one to
survive examinations, but does not provide usable scientific
tools.
THE DEFINITION OF RESAMPLING AGAIN
Let's again briefly define "resampling". A statistical
procedure models a physical process. A resampling method
simulates the model with easy-to-manipulate symbols. Either the
observed data (in the case of a problem in statistical inference)
or a data-generating mechanism such as a die (in the case of a
problem in probability) is used to produce new hypothetical samples,
the properties of which can then be examined. The resampler
postulates a universe and then examines how it behaves - in the
case of statistics, comparing the outcomes to a criterion that we
choose. More extended definition and discussion will be found in
Chapter III-3.
The same logic applies to problems labeled "probability" and
"statistical inference". The only difference is that in proba-
bility problems the "model" is known in advance -- say, the model
implicit in a deck of poker cards plus a game's rules for dealing
and counting the results -- rather than the model being assumed
to be best estimated by the observed data, as in resampling
statistics. (Every problem in statistical inference contains a
problem in probability at its core.)
Now that resampling has become respectable, some statisti-
cians grumble that it is "just" the Monte Carlo method, or "just"
simulation. But earlier on, the body of resampling methods for
statistics was not set forth under those labels.
It is the overall approach - the propensity to turn first to
resampling methods to handle practical problems - that most
clearly distinguishes resampling from conventional statistics,
and from the earlier use of Monte Carlo methods. (In the nine-
teenth century, "simulation techniques were all tied to the
normal distribution, and all involved generating errors to be
added to a signal", according to historian Stephen Stigler.) In
addition, some resampling methods are new in themselves, the
result of the basic resample-it tendency of the past quarter
century.
RESAMPLING AND STATISTICAL INFERENCE
Chapter II-2 mentioned how John Arbuthnot, doctor to Queen
Anne of England, observed that more boys than girls were born in
London; records showed that male births exceeded female 82 years
in a row. Arbuthnot therefore set forth to test the hypothesis
that a universe with a 50-50 probability of producing males could
result in 82 successive years with preponderantly male births.
Arbuthnot used the multiplication rule of Pascal and Fermat
to calculate that the probability, (1/2)^82, is extremely small.
"From whence it follows, that it is Art, not Chance, that gov-
erns" - that is, "Divine Providence". (His argument is complex
and debatable, as statistical inference often is; the mathematics
is the easy part, especially when resampling methods are used.)
Please notice that Arbuthnot could have considered the
numbers of boys and girls observed in each year, rather than
treating each year as a single observation - an even stronger
test because of the vast amount of information. Arbuthnot
surely did not analyze the data for any or all of the individual
years because the calculus of probability was still in its infan-
cy.
Luckily, the test Arbuthnot made was more than powerful
enough for his purposes. But if instead of 82 years in a row,
only (say) 81 or 61 of the 82 years had shown a preponderance of
males, Arbuthnot would have lacked the tools for a test (though
he knew the binomial and logarithms). Nowadays, one conventional-
ly uses the Gaussian (Normal) approximation to the binomial
distribution to produce the desired probability. But that method
requires acquaintance with a considerable body of statistical
procedure, and utilizes a formula that almost no one knows and
even fewer can explain intuitively. Instead, users simply "plug
in" the data to a table which, because it is an arcane mystery,
invites misuse and erroneous conclusions.
The experimental resampling method of earlier gamblers could
easily have given Arbuthnot a satisfactory answer for (say) 61 of
82 years, however. He had in fact likened the situation to a set
of 82 coins. He could simply have tossed such a set repeatedly,
and found that almost never would as many as 81 or 61 heads
occur. He could then have rested as secure in his conclusion as
with the formulaic assessment of the probability of 82 years in a
row. And because of the intuitive clarity of the experimental
method, one would not be likely to make a misleading error in
such a procedure.
By the grace of the computer, such problems can be handled
more conveniently today. The self-explanatory commands in Figure
III-1-1 suffice, using the language RESAMPLING STATS and pro-
ducing the results shown there.
Figure III-1-1
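The same experiment may also be sketched in Python (our rendering, not the RESAMPLING STATS program of Figure III-1-1): toss 82 simulated coins, count the heads, and repeat many times.

```python
import random

random.seed(42)  # arbitrary seed, for reproducibility

# Toss 82 fair coins (years), count the "male-preponderant" years,
# and see how often 61 or more occur just by chance.
trials = 10_000
extreme = sum(
    sum(random.randint(0, 1) for _ in range(82)) >= 61
    for _ in range(trials)
)
proportion = extreme / trials  # very small -- nearly never
```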
The intellectual advantage of the resampling method is that
though it takes repeated samples from the sample space, it does
not require that one know the size of the sample space or of a
particular subset of it. To estimate the probability of getting
(say) 61 males in 82 births with the binomial formula requires
that one calculate the number of permutations of a total of 82
males and females, and the number of those permutations that
include 61 or more males. In contrast, with a resampling ap-
proach one needs to know only the conditions of producing a
single trial yielding a male or female. This conceptual differ-
ence, which will be discussed at greater length below, is the
reason that, compared to conventional methods, resampling is
likely to have higher "statistical utility" - a compound of
efficiency plus the chance that the ordinary scientist or deci-
sion-maker will use a correct procedure.
VARIETIES OF RESAMPLING METHODS
A resampling test may be constructed for every case of
statistical inference - by definition. Every real-life situation
can be modeled by symbols of some sort, and one may experiment
with this model to obtain resampling trials. A resampling method
should always be appropriate unless there are insufficient data
to perform a useful resampling test, in which case a conventional
test - which makes up for the absence of observations with an
assumed theoretical distribution such as the Normal or Poisson -
may produce more accurate results if the universe from which the
data are selected resembles the chosen theoretical distribution.
Exploration of the properties of resampling tests is an active
field of research at present. Chapter III-2 will discuss the
various types of resampling methods.
For the main tasks in statistical inference - hypothesis
testing and confidence intervals - the appropriate resampling
test often is immediately obvious. For example, if one wishes to
inquire whether baseball hitters exhibit behavior that fits the
notion of a slump, one may simply produce hits and outs with a
random-number generator adjusted to the batting average of a
player, and then compare the number of simulated consecutive
sequences of either hits or outs with the observed numbers for
the player. The procedure is also straightforward for such bino-
mial situations as the Arbuthnot birth-sex case.
Two sorts of procedures are especially well-suited to resam-
pling: 1) A sample of the permutations in Fisher's "exact" test
(confusingly, also called a "randomization" test). This is
appropriate when the size of the universe is properly assumed to
be fixed, as discussed below. 2) The bootstrap procedure. This
is appropriate when the size of the universe is properly assumed
not to be fixed.
Let's compare the permutation and bootstrap procedures in
the context of a case which might be analyzed either way. The
discussion will highlight some of the violent disagreements in
the philosophy of statistics which the use of resampling methods
frequently brings to the surface - one of its great benefits.
In the 1960s I studied the price of liquor in the sixteen
"monopoly" states (where the state government owns the retail
liquor stores) compared to the twenty-six states in which retail
liquor stores are privately owned. (Some states were omitted for
technical reasons. The situation and the price pattern have
changed radically since then.)
These were the representative 1961 prices of a fifth of Sea-
gram 7 Crown whiskey in the two sets of states:
16 monopoly states: $4.65, $4.55, $4.11, $4.15, $4.20,
$4.55, $3.80, $4.00, $4.19, $4.75, $4.74, $4.50, $4.10,
$4.00, $5.05, $4.20
26 private-ownership states: $4.82, $5.29, $4.89,
$4.95, $4.55, $4.90, $5.25, $5.30, $4.29, $4.85, $4.54,
$4.75, $4.85, $4.85, $4.50, $4.75, $4.79, $4.85, $4.79,
$4.95, $4.95, $4.75, $5.20, $5.10, $4.80, $4.29.
The economic question that underlay the investigation -
having both theoretical and policy ramifications - is as follows:
Does state ownership affect prices? The empirical question is
whether the prices in the two sets of states were systematically
different. In statistical terms, we wish to test the hypothesis
that there was a difference between the groups of states related
to their mode of liquor distribution, or whether instead the ob-
served $.49 differential in means might well have occurred by
happenstance. In other words, we want to know whether the two
sub-groups of states differed systematically in their liquor
prices, or whether the observed pattern could well have been
produced by chance variability.
At first I used a resampling permutation test as follows:
Assuming that the entire universe of possible prices consists of
the set of events that were observed, because that is all the
information available about the universe, I wrote each of the
forty-two observed state prices on a separate card. The shuffled
deck simulated a situation in which each state has an equal
chance for each price.
On the "null hypothesis" that the two groups' prices do not
reflect different price-setting mechanisms, but rather differ
only by chance, I then examined how often that simulated universe
stochastically produces groups with results as different as
observed in 1961. I repeatedly dealt groups of 16 and 26 cards,
without replacing the cards, to simulate hypothetical monopoly-
state and private-state samples, each time calculating the dif-
ference in mean prices.
The probability that the benchmark null-hypothesis universe
would produce a difference between groups as large or larger than
observed in 1961 is estimated by how frequently the mean of the
group of randomly-chosen sixteen prices from the simulated state-
ownership universe is less than (or equal to) the mean of the
actual sixteen state-ownership prices. If the simulated differ-
ence between the randomly-chosen groups was frequently equal to
or greater than observed in 1961, one would not conclude that the
observed difference was due to the type of retailing system
because it could well have been due to chance variation.
The computer program in Figure III-1-2, using the language
RESAMPLING STATS, performs the operations described above (MATHE-
MATICA and APL could be used in much the same fashion).
Figure III-1-2
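For concreteness, here is the shuffle-and-deal procedure sketched in Python with the observed prices (our rendering of the logic performed in Figure III-1-2):

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility

monopoly = [4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00,
            4.19, 4.75, 4.74, 4.50, 4.10, 4.00, 5.05, 4.20]
private = [4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29,
           4.85, 4.54, 4.75, 4.85, 4.85, 4.50, 4.75, 4.79, 4.85,
           4.79, 4.95, 4.95, 4.75, 5.20, 5.10, 4.80, 4.29]

observed = sum(private) / 26 - sum(monopoly) / 16   # about $.49
deck = monopoly + private                           # the 42 "cards"

trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(deck)              # shuffle the deck ...
    sim_monopoly = deck[:16]          # ... and deal groups of 16
    sim_private = deck[16:]           # and 26 cards
    diff = sum(sim_private) / 26 - sum(sim_monopoly) / 16
    if diff >= observed:
        extreme += 1
p_value = extreme / trials            # very small
```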
The results shown - not even one "success" in 10,000 trials
- imply a very small probability that two groups with mean prices
as different as were observed would happen by chance if drawn
from the universe of 42 observed prices. So we "reject the null
hypothesis" and instead find persuasive the proposition that the
type of liquor distribution system influences the prices that
consumers pay.
As I shall discuss later, the logical framework of this
resampling version of the permutation test differs greatly from
the formulaic version, which would have required heavy computa-
tion. The standard conventional alternative would be a Student's
t-test, in which the user simply plugs into an unintuitive formu-
la and table. And because of the unequal numbers of cases and
unequal dispersions in the two samples, an appropriate t test is
far from obvious, whereas resampling is not made more difficult
by such realistic complications.
A program to handle the liquor problem with an infinite-
universe bootstrap distribution simply substitutes the random
sampling command GENERATE for the TAKE command in Figure III-1-2.
The results of the new test are indistinguishable from those in
Figure III-1-2.
INFARCTION AND CHOLESTEROL: RESAMPLING VERSUS CONVENTIONAL<1>
Let's now consider one of the simplest numerical examples of
probabilistic-statistical reasoning given toward the front of a
standard book on medical statistics (Kahn and Sempos, 1989).
Using data from the Framingham study, the authors ask: What is
an appropriate "confidence interval" on the observed ratio of
"relative risk" (a measure which is defined below; it is closely
related to the odds ratio) of the development of myocardial
infarction 16 years after the study began, for men ages 35-44
with serum cholesterol above 250, relative to those with serum
cholesterol below 250? The raw data are shown in Table III-1-1
(divided into "high" and "low" cholesterol by Kahn and Sempos).
Table III-1-1
Hypothesis Tests With Measured Data
Consider this classic question about the Framingham serum
cholesterol data: What is the degree of surety that there is a
difference in myocardial infarction rates between the high- and
low-cholesterol groups?
The statistical logic begins by asking: How likely is it that
the two observed groups "really" came from the same "population"
with respect to infarction rates? Operationally, we address this
issue by asking how likely it is that two groups as different in
disease rates as the observed groups would be produced by the
same "statistical universe".
Key step: we assume that the relevant "benchmark" or "null-
hypothesis" population (universe) is the composite of the two
observed groups. That is, if there were no "true" difference in
infarction rates between the two serum-cholesterol groups, and
the observed disease differences occurred just because of
sampling variation, the most reasonable representation of the
population from which they came is the composite of the two
observed groups.
Therefore, we compose a hypothetical "benchmark" universe
containing (135 + 470 =) 605 men at risk, and designate (10 + 21
=) 31 of them as infarction cases. We want to determine how
likely it is that a universe like this one would produce - just
by chance - two groups that differ as much as do the actually
observed groups. That is, how often would random sampling from
this universe produce one sub-sample of 135 men containing a
large enough number of infarctions, and the other sub-sample of
470 men producing few enough infarctions, that the difference in
occurrence rates would be as high as the observed difference of
.029? (10/135 = .074, and 21/470 = .045, and .074 - .045 =
.029).
So far, everything that has been said applies both to the
conventional formulaic method and to the "new statistics"
resampling method. But the logic is seldom explained to the
reader of a piece of research - if indeed the researcher
her/himself grasps what the formula is doing. And if one just
grabs for a formula with a prayer that it is the right one, one
need never analyze the statistical logic of the problem at
hand.
Now we tackle this problem with a method that you would
think of yourself if you began with the following mind-set: How
can I simulate the mechanism whose operation I wish to under-
stand? These steps will do the job:
1. Fill an urn with 605 balls, 31 red (infarction) and the
rest (605 - 31 = 574) green (no infarction).
2. Draw a sample of 135 balls (simulating the high serum-
cholesterol group), one ball at a time, throwing each back
after it is drawn to keep the simulated probability of an
infarction the same throughout the sample; record the number of
reds. Then do the same with another sample of 470 (the low
serum-cholesterol group).
3. Calculate the difference in infarction rates for the two
simulated groups, and compare it to the actual difference of
.029; if the simulated difference is that large, record "Yes" for
this trial; if not, record "No".
4. Repeat steps 2 and 3 until a total of (say) 400 or 1000
trials have been completed. Compute the frequency with which the
simulated groups produce a difference as great as actually
observed. This frequency is an estimate of the probability that
a difference as great as actually observed in Framingham would
occur even if serum cholesterol has no effect upon myocardial
infarction.
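The four steps above can be sketched in Python (our rendering; Figure III-1-4 gives the RESAMPLING STATS version). Drawing each ball with probability 31/605 of being red is the with-replacement urn of step 2:

```python
import random

random.seed(3)  # arbitrary seed, for reproducibility

# Step 1: the benchmark universe -- 31 infarctions among 605 men.
p_infarct = 31 / 605
observed_diff = 10 / 135 - 21 / 470       # about .029

# Steps 2-4: draw both groups with replacement, compare, repeat.
trials = 1_000
yes = 0
for _ in range(trials):
    high = sum(random.random() < p_infarct for _ in range(135))
    low = sum(random.random() < p_infarct for _ in range(470))
    if high / 135 - low / 470 >= observed_diff:
        yes += 1
proportion = yes / trials                 # on the order of 10 percent
```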
The procedure above can be carried out with balls in a
ceramic urn in a few hours. Yet it is natural to seek the added
convenience of the computer to draw the samples. Therefore, we
illustrate in Figure III-1-4 how a simple computer program
handles this problem. We use RESAMPLING STATS but the program can
be executed in other languages as well, though usually with more
complexity and less clarity.
Figure III-1-4
The results of the test using this program may be seen in
the histogram in Figure III-1-4. We find - perhaps surprisingly
- that a difference as large as observed would occur by chance
fully 10 percent of the time. (If we were not guided by the
theoretical expectation that high serum cholesterol produces
heart disease, we might include the 10 percent difference going
in the other direction, giving a 20 percent chance). Even a ten
percent chance is sufficient to strongly call into question the
conclusion that high serum cholesterol is dangerous. At a
minimum, this statistical result should call for more research
before taking any strong action clinically or otherwise.
Where should one look to determine which procedures should
be used to deal with a problem such as set forth above? Unlike
the formulaic approach, the basic source is not a manual which
sets forth a menu of formulas together with sets of rules about
when they are appropriate. Rather, you consult your own
understanding about what it is that is happening in (say) the
Framingham situation, and the question that needs to be answered,
and then you construct a "model" that is as faithful to the facts
as is possible. The urn-sampling described above is such a model
for the case at hand.
To connect up what we have done with the conventional
approach, we apply a z test (conceptually similar to the t test,
but applicable to yes-no data; it is the Normal-distribution
approximation to the large binomial distribution) and we find
that the results are much the same as the resampling result - an
eleven percent probability.
Someone may ask: Why do a resampling test when you can use
a standard device such as a z or t test? The great advantage of
resampling is that it avoids using the wrong method. The
researcher is more likely to arrive at sound conclusions with
resampling because s/he can understand what s/he is doing,
instead of blindly grabbing a formula which may be in error.
The textbook from which the problem is drawn is an excellent
one; the difficulty of its presentation is an inescapable
consequence of the formulaic approach to probability and
statistics. The body of complex algebra and tables that only a
rare expert understands down to the foundations constitutes an
impenetrable wall to understanding. Yet without such
understanding, there can be only rote practice, which leads to
frustration and error.
Confidence Intervals for the Counted Data
So far we have discussed the interpretation of sample data
for testing hypotheses. The devices used for the other main
theme in statistical inference - the estimation of confidence
intervals - are much the same as those used for testing hypothe-
ses. Indeed, the bootstrap method discussed above was originally
devised for estimation of confidence intervals. The bootstrap
method may also be used to calculate the appropriate sample size
for experiments and surveys, another important topic in statis-
tics.
Consider for now just the data for the sub-group of 135
high-cholesterol men. A second classic statistical question is
as follows: How much confidence should we have that if we were
to take a much larger sample than was actually obtained, the sample
mean (that is, the proportion 10/135 = .07) would be in some
close vicinity of the observed sample mean? Let us first carry
out a resampling procedure to answer the question, waiting until
afterwards to discuss the logic of the inference.
1. Construct an urn containing 135 balls - 10 red
(infarction) and 125 green (no infarction) to simulate the
universe as we guess it to be.
2. Mix, choose a ball, record its color, replace it, and
repeat 135 times (to simulate a sample of 135 men).
3. Record the number of red balls among the 135 balls
drawn.
4. Repeat steps 2 and 3 perhaps 1000 times, and observe how
much the number of reds varies from sample to sample. We
arbitrarily denote the boundary lines that include 47.5 percent
of the hypothetical samples on each side of the sample mean as
the 95 percent "confidence limits" around the mean of the actual
population.
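In Python the urn procedure of steps 1-4 might read as follows (our rendering; Figure III-1-5 gives the RESAMPLING STATS version):

```python
import random

random.seed(7)  # arbitrary seed, for reproducibility

# The guessed universe: 10 red (infarction) among 135 balls.
p_red = 10 / 135

# Draw 1000 samples of 135 with replacement; record the red counts.
trials = 1_000
reds = sorted(
    sum(random.random() < p_red for _ in range(135))
    for _ in range(trials)
)

# The boundaries enclosing the middle 95 percent of the counts.
lower = reds[int(0.025 * trials)]
upper = reds[int(0.975 * trials)]
```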
Figure III-1-5 shows how this can be done easily on the
computer, together with the results.
Figure III-1-5
The variation in the histogram in Figure III-1-5 highlights
the fact that a sample containing only 10 cases of infarction is
very small, and the number of observed cases - or the proportion
of cases - necessarily varies greatly from sample to sample.
Perhaps the most important implication of this statistical
analysis, then, is that we badly need to collect additional
data.
This is a classic problem in confidence intervals, found in
all subject fields. For example, at the beginning of the first
chapter of a best-selling book in business statistics, Wonnacott
and Wonnacott use the example of a 1988 presidential poll. The
language used in the cholesterol-infarction example above is
exactly the same as the language used for the Bush-Dukakis poll
except for labels and numbers.
Also typically, the text gives a formula without explaining
it, and says that it is "fully derived" eight chapters later
(Wonnacott and Wonnacott, 1990, p. 5). With resampling, one
never needs such a formula, and never needs to defer the
explanation.
The philosophic logic of confidence intervals is quite deep
and controversial, less obvious than for the hypothesis test.
The key idea is that we can estimate for any given universe the
probability P that a sample's mean will fall within any given
distance D of the universe's mean; we then turn this around and
assume that if we know the sample mean, the probability is P that
the universe mean is within distance D of it. This inversion is
more slippery than it may seem. But the logic is exactly the
same for the formulaic method and for resampling. The only
difference is how one estimates the probabilities - either with a
numerical resampling simulation (as here), or with a formula or
other deductive mathematical device (such as counting and
partitioning all the possibilities, as Galileo did when he
answered a gambler's question about three dice.) And when one
uses the resampling method, the probabilistic calculations are
the least demanding part of the work. One then has mental
capacity available to focus on the crucial part of the job -
framing the original question soundly, choosing a model for the
facts so as to properly resemble the actual situation, and
drawing appropriate inferences from the simulation.
If you have understood the general logic of the procedures
used up until this point, you are in command of all the necessary
conceptual knowledge to construct your own tests to answer any
statistical question. A lot more practice, working on a variety
of problems, obviously would help. But the key elements are
simple: 1) Model the real situation accurately, 2) experiment
with the model, and 3) compare the results of the model with the
observed results.
Confidence Intervals on Relative Risk With Resampling
Now we are ready to calculate - with full understanding -
the confidence intervals on relative risk that the Kahn-Sempos
text sought. Recall that the observed sample of 135 high-
cholesterol men had 10 infarctions (a proportion of .074), and
the sample of 470 low-cholesterol men had 21 infarctions (a
proportion of .045). We estimate the relative risk of high
cholesterol as .074/.045. Let us frame the question this way:
If we were to randomly draw a sample from the universe of high-
cholesterol men that is best estimated from our data (.074
infarctions), and a sample from the universe of low-cholesterol
men (.045 infarctions), and do this again and again, within which
bounds would the relative risk calculated from that simulation
fall (say) 95 percent of the time?
The operation is quite the same as that for a single
confidence interval estimated above except that we do the
operation for both sub-samples at once, and then calculate the
ratio between their results to determine the relative risk. As
before, we would like to know what would happen if we could take
additional samples from the universes that spawned our actual
samples. Lacking the resources to do so, we let those original
samples "stand in" for the universes from which they came,
serving as proxy "substitute universes." It is as if we
replicate each sample element a million times and then take
"bootstrap" samples from this "proxy universe." Paralleling the
real world, we take simulated samples of the same size as our
original samples. (In practice we need not replicate each sample
element a million times but instead achieve the same resampling
effect by sampling with replacement from our original samples --
that way, the chance that a sample element will be drawn remains
the same from draw to draw.) We count the number of infarctions
in each of our resamples, and for the pair of resamples, then
calculate the relative risk measure and keep score of this re-
sult. We repeat with additional pairs of resamples, each time
calculating the relative risk measure, and examine the overall
results.
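A Python sketch of the paired-resample procedure (ours; Figure III-1-6 gives the RESAMPLING STATS version). Sampling with replacement from each observed group is achieved here by drawing each man with his group's observed infarction probability:

```python
import random

random.seed(5)  # arbitrary seed, for reproducibility

# The observed samples serve as proxy universes:
# 10/135 infarctions (high cholesterol), 21/470 (low cholesterol).
trials = 1_000
ratios = []
for _ in range(trials):
    high = sum(random.random() < 10 / 135 for _ in range(135))
    low = sum(random.random() < 21 / 470 for _ in range(470))
    if low > 0:                      # guard against a zero denominator
        ratios.append((high / 135) / (low / 470))

ratios.sort()
# The 95 percent bootstrap limits on the relative risk.
lower = ratios[int(0.025 * len(ratios))]
upper = ratios[int(0.975 * len(ratios))]
```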
We may compare our results in Figure III-1-6 - a confidence
interval extending from 0.69 to 3.4 - to the results given by
Kahn and Sempos, which are 0.79 to 3.5, 0.80 to 3.4, and 0.79 to
3.7 from three different formulas (pp. 62-63); our agreement is
close.
Figure III-1-6
OTHER RESAMPLING TECHNIQUES
We have so far seen examples of three of the most common
resampling methods - binomial, permutation, and bootstrap. These
methods may be extended to handle correlation, regression, and
tests where there are three or more groups. Indeed, resampling
can be used for every other statistic in which one may be inter-
ested - for example, statistics based on absolute deviations
rather than squared deviations. This flexibility is a great
virtue because it frees the statistics user from the limited and
oft-confining battery of textbook methods.
SOME OTHER ILLUSTRATIONS
A Measured-Data Example: Test of a Drug to Prevent Low Birthweight
The Framingham infarction-cholesterol examples worked with
yes-no "count" data. Let us therefore consider some
illustrations of the use of resampling with measured data.
Another leading textbook (Rosner, 1982, p. 257) gives the
example of a test of the hypothesis that drug A prevents low
birthweights. The data for the treatment and control groups are
shown in Table III-1-2. The treatment group averaged .82 pounds
more than the control group. Here is a resampling approach to
the problem:
Table III-1-2
1. If the drug has no effect, our best guess about the
"universe" of birthweights is that it is composed of (say) a
million each of the observed weights, all lumped together. In
other words, in the absence of any other information or
compelling theory, we assume that the combination of our samples
is our best estimate of the universe. Hence let us write each of
the birthweights on a card, and put them into a hat. Drawing
them one by one and then replacing them is the operational equiv-
alent of a very large (but equal) number of each birthweight.
2. Repeatedly draw two samples of 15 birthweights each, and
check how frequently the observed difference is as large as, or
larger than, the actual difference of .82 pounds.
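The two steps can be sketched in Python as a general-purpose routine (ours; Figure III-1-7 gives the RESAMPLING STATS version). Because Table III-1-2 is not reproduced in the text here, the routine is demonstrated on obviously artificial groups, not Rosner's data:

```python
import random

random.seed(9)  # arbitrary seed, for reproducibility

def null_test(group_a, group_b, trials=10_000):
    """How often does drawing two resamples, with replacement, from
    the pooled data yield a mean difference (a minus b) at least as
    large as the observed difference?"""
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = group_a + group_b
    extreme = 0
    for _ in range(trials):
        a = [random.choice(pooled) for _ in range(len(group_a))]
        b = [random.choice(pooled) for _ in range(len(group_b))]
        if sum(a) / len(a) - sum(b) / len(b) >= observed:
            extreme += 1
    return extreme / trials

# Artificial demonstration groups, NOT the weights of Table III-1-2:
# two samples of 15 that differ by a full 2 pounds on average.
p = null_test([8.0] * 15, [6.0] * 15)   # p near zero
```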
We find in Figure III-1-7 that only 1% of the pairs of
hypothetical resamples produced means that differed by as much
as .82. We therefore conclude that the observed difference is
unlikely to have occurred by chance.
Figure III-1-7
Matched-Patients Test of Three Treatments
There have been several recent three-way tests of treatments
for depression: drug versus cognitive therapy versus combined
drug and cognitive therapy. Consider this procedure for a
proposed test in 31 triplets of people who have been matched
within triplet by sex, age, and years of education. The three
treatments are to be chosen randomly within each triplet. Assume
that the outcomes on a series of tests were ranked from best (#1)
to worst (#3) within each triplet, and assume that the combined
drug-and-therapy regime has the highest average rank. How sure
can we be that the observed result would not occur by chance?
In hypothetical Table III-1-3 the average rank for the drug
and therapy regime is 1.74. Is it likely that the regimes do not
"really" differ with respect to effectiveness, and that the drug
and therapy regime came out with the best rank just by the luck
of the draw? We test by asking, "If there is no difference, what
is the probability that the treatment of interest will get an
average rank this good, just by chance?"
Table III-1-3
Figure III-1-8 shows a program for a resampling procedure
that repeatedly produces 31 ranks randomly selected among the
numbers 1, 2 and 3, and then averages the ranks. We can then
observe whether an average of 1.74 is unusually low, and hence
should not be ascribed to chance.
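That procedure can be sketched in Python as well; this is a hedged
equivalent of the program in Figure III-1-8, with an arbitrary seed:

```python
import random

random.seed(1)    # arbitrary seed
trials = 1000
means = []
for _ in range(trials):
    # Under the null hypothesis, the drug/therapy arm is equally
    # likely to rank 1, 2, or 3 within each of the 31 triplets
    ranks = [random.randint(1, 3) for _ in range(31)]
    means.append(sum(ranks) / 31)
count = sum(1 for m in means if m <= 1.74)
print(count / trials)   # the text reports roughly 5%
```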
Figure III-1-8
In 1000 repetitions of the simulation (10,000 would take
just a few moments longer), 5% yielded average ranks as low as
the observed value. This is evidence that something besides
chance might be at work here. (The result is at the borderline
of the traditional 5% "level of significance" (a p-value of .05),
supposedly set arbitrarily by the great biostatistician R.A.
Fisher on the grounds that a 1-in-20 happening is too
coincidental to ignore.) That is, the resampling test suggests
that it would be very unlikely for a given treatment regime to
achieve, just by chance, results as good as are actually
observed.
An interesting feature of the treatment problem is that it
would be hard to find a conventional test that handles this
three-way comparison efficiently. Any such test would certainly
require formulae and tables that only a talented professional
statistician could manage satisfactorily, and even that
statistician is not likely to fully understand those formulaic
procedures.
THE COMPUTER AND RESAMPLING
Some now refer to resampling as "computer-intensive
statistics" (e.g., Noreen, 1986). And others have written that
resampling had to await the easy availability of computers. It
is not the case, however, that the computer is a crucial element
of the resampling method. Resampling operations (including the
bootstrap and permutation procedures) can often be conducted
quite satisfactorily with simpler tools. Indeed, the permutation
test for the liquor prices example above was done by hand with
cards, and a bootstrap test could similarly have been done.
Nevertheless, the inconvenience of doing tests by hand was a
barrier to implementation and adoption. Therefore, in the early
1970s Dan Weidenfeld and I developed a computer language and a
mainframe program that carry out resampling operations
(including permutation tests, the bootstrap, and just about
every other device) more expeditiously than simpler tools such
as coins, dice, and random-number tables (Simon and Weidenfeld,
1974); this is the same language that today is marketed for the
personal computer under the name Resampling Stats. More about
the computer and resampling in the next chapter.
ON THE NATURE OF RESAMPLING TESTS
Resampling is a much simpler intellectual task than the
formulaic method because simulation obviates the need to calcu-
late the number of points in the entire sample space. This
subject is explored in Chapter 00.
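The three-dice problem from the introduction makes the contrast
concrete: enumeration must count the whole 216-point sample space,
while simulation just rolls the dice. A Python sketch (not from the
text; the seed is arbitrary):

```python
import random
from itertools import product

# Galileo's enumeration: all 216 permutations of three dice
perms = list(product(range(1, 7), repeat=3))
exact_ten = sum(1 for p in perms if sum(p) == 10)   # 27 permutations
exact_nine = sum(1 for p in perms if sum(p) == 9)   # 25 permutations

# The gamblers' approach: repeated trials, no sample space needed
random.seed(2)    # arbitrary seed
tens = nines = 0
for _ in range(100_000):
    total = sum(random.randint(1, 6) for _ in range(3))
    if total == 10:
        tens += 1
    elif total == 9:
        nines += 1
print(exact_ten, exact_nine)   # 27 25
print(tens > nines)            # the simulation, too, favors 10 over 9
```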
REPEAT 1000
GENERATE 82 1,2 A Generate randomly 82 1s (males)
or 2s (females)
COUNT A =1 B Count the males
SCORE B Z Keep score of trial results
END
HISTOGRAM Z
COUNT Z >= 61 K
[Histogram of the 1,000 trial results; x-axis: number of
"male-predominant" years (20 to 60); y-axis: frequency]
K = 0
Figure III-3-1
NUMBERS (482 529 489 495 455 490 525 530 429 485 454
475 485 485 450 475 479 485 479 495 495 475 520 510 480 429) A
NUMBERS (465 455 411 415 420 455 380 400 419 475 474
450 410 400 505 420) B
CONCAT A B C Join the two vectors of data
REPEAT 1000 Repeat 1000 simulation trials
SHUFFLE C D Shuffle the 42 state prices
TAKE D 1,26 E Take 26 for the "private" group
TAKE D 27,42 F Take the other 16 for the
"monopoly" group
MEAN E EE Find the mean of the "private" group.
MEAN F FF Mean of the "monopoly" group
SUBTRACT EE FF G Difference in the means
SCORE G Z Keep score of the trials
END
HISTOGRAM Z Graph of simulation results to compare
with the observed result
[Histogram of the 1,000 trial results; x-axis: difference in
average prices in cents (-40 to 40); y-axis: frequency]
(Actual difference = $0.49)
Figure III-3-2
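The liquor-price program above can be translated into Python. A
hedged sketch (the prices are those in the NUMBERS statements; the
seed is arbitrary):

```python
import random

private = [482, 529, 489, 495, 455, 490, 525, 530, 429, 485, 454,
           475, 485, 485, 450, 475, 479, 485, 479, 495, 495, 475,
           520, 510, 480, 429]
monopoly = [465, 455, 411, 415, 420, 455, 380, 400, 419, 475, 474,
            450, 410, 400, 505, 420]
observed = sum(private) / 26 - sum(monopoly) / 16   # about 49 cents

prices = private + monopoly   # the 42 state prices, lumped together
random.seed(3)                # arbitrary seed
count = 0
for _ in range(1000):
    random.shuffle(prices)              # SHUFFLE C D
    e, f = prices[:26], prices[26:]     # TAKE the 26 and the 16
    if sum(e) / 26 - sum(f) / 16 >= observed:
        count += 1
print(count / 1000)   # how often chance alone matches the observed gap
```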
URN 31#1 574#2 men An urn called "men" with 31 "1s"
(=infarctions) and 574 "2s"
(=no infarction)
REPEAT 1000 Do 1000 trials
SAMPLE 135 men high Sample (with replacement!) 135
of the numbers in this urn, give
this group the name "high"
SAMPLE 470 men low Same for a group of 470, call
it "low"
COUNT high =1 a Count infarctions in first group
DIVIDE a 135 aa Express as a proportion
COUNT low =1 b Count infarctions in second
group
DIVIDE b 470 bb Express as a proportion
SUBTRACT aa bb c Find the difference in
infarction rates
SCORE c z Keep score of this difference
END
HISTOGRAM z
COUNT z >=.029 k How often was the resampled
difference >= the observed
difference?
DIVIDE k 1000 kk Convert this result to a
proportion
PRINT kk
[Histogram of the 1,000 trial results; x-axis: difference
between paired resamples, as a proportion with infarction
(-0.1 to 0.1); y-axis: frequency]
kk = 0.102 (the proportion of resample pairs with a
difference >= .029)
Figure III-1-4: Test for Differences in Infarctions
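A hedged Python equivalent of this urn test (using 1s and 0s rather
than 1s and 2s, so that infarctions can be counted with a simple sum;
the seed is arbitrary):

```python
import random

# The null-hypothesis urn: 31 infarctions (1s) among all 605 men
men = [1] * 31 + [0] * 574

random.seed(4)    # arbitrary seed
k = 0
for _ in range(1000):
    high = [random.choice(men) for _ in range(135)]   # with replacement
    low = [random.choice(men) for _ in range(470)]
    diff = sum(high) / 135 - sum(low) / 470
    if diff >= 0.029:    # the observed difference in infarction rates
        k += 1
print(k / 1000)   # the text reports a proportion of about 0.102
```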
URN 10#1 125#0 men An urn (called "men") with
ten "1s" (infarctions)
and 125 "0s" (no infarction)
REPEAT 1000 Do 1000 trials
SAMPLE 135 men a Sample (with replacement) 135
numbers from the urn, put them in
"a"
COUNT a =1 b Count the infarctions
DIVIDE b 135 c Express as a proportion
SCORE c z Keep score of the result
END End the trial, go back and repeat
HISTOGRAM z Produce a histogram of all trial
results
PERCENTILE z (2.5 97.5) k Determine the 2.5th and 97.5th
percentiles of all trial results;
these points enclose 95% of the
results
PRINT k
[Histogram of the 1,000 trial results; x-axis: proportion with
infarction (0 to 0.2); y-axis: frequency]
k = 0.037037 0.11852
(This is the 95% confidence interval, enclosing 95% of the resam-
ple results)
Figure III-1-5: Confidence Interval Around Mean
Difference
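A hedged Python sketch of the same bootstrap confidence interval
(again with 1s and 0s; the seed and the simple percentile indexing
are assumptions of this sketch):

```python
import random

# The observed sample: 10 infarctions among 135 high-cholesterol men
men = [1] * 10 + [0] * 125

random.seed(5)    # arbitrary seed
props = []
for _ in range(1000):
    resample = [random.choice(men) for _ in range(135)]
    props.append(sum(resample) / 135)
props.sort()
# The 2.5th and 97.5th percentiles enclose 95% of the trial results
low, high = props[25], props[974]
print(low, high)   # the text reports roughly .037 to .119
```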
URN 10#1 125#0 high The universe of 135 high cholesterol
men, 10 of whom ("1s") have infarc-
tions
URN 21#1 449#0 low The universe of 470 low cholesterol
men, 21 of whom ("1s") have infarc-
tions
REPEAT 1000 Repeat the steps that follow 1000
times
SAMPLE 135 high high$ Sample 135 (with replacement) from
the high cholesterol universe, and
put them in "high$" [the "$"
suffix just indicates a resampled
counterpart to the actual sample]
SAMPLE 470 low low$ Similarly for 470 from
the low cholesterol universe
COUNT high$ =1 a Count the infarctions in the first
resampled group
DIVIDE a 135 aa Convert to a proportion
COUNT low$ =1 b Count the infarctions in the second
resampled group
DIVIDE b 470 bb Convert to a proportion
DIVIDE aa bb c Divide the proportions to calculate
relative risk
SCORE c z Keep score of this result
END End the trial, go back and repeat
HISTOGRAM z Produce a histogram of trial results
PERCENTILE z (2.5 97.5) k Find the percentiles that
bound 95% of the trial results
PRINT k
[Histogram of the 1,000 trial results; x-axis: relative risk
(0 to 6); y-axis: frequency]
Results (estimated 95% confidence interval):
k = 0.68507 3.3944
Figure III-1-6: Confidence Interval Around Relative Risk
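And a hedged Python sketch of the relative-risk interval (the seed
and the percentile indexing are again assumptions of this sketch):

```python
import random

high_univ = [1] * 10 + [0] * 125   # 135 high-cholesterol men
low_univ = [1] * 21 + [0] * 449    # 470 low-cholesterol men

random.seed(6)    # arbitrary seed
risks = []
for _ in range(1000):
    hi_s = [random.choice(high_univ) for _ in range(135)]
    lo_s = [random.choice(low_univ) for _ in range(470)]
    # Relative risk: ratio of the two resampled infarction proportions
    risks.append((sum(hi_s) / 135) / (sum(lo_s) / 470))
risks.sort()
print(risks[25], risks[974])   # the text reports roughly 0.69 to 3.39
```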
NUMBERS (6.9 7.6 7.3 7.6 6.8 7.2 8.0 5.5 5.8 7.3 8.2 6.9 6.8 5.7
8.6) treat
NUMBERS (6.4 6.7 5.4 8.2 5.3 6.6 5.8 5.7 6.2 7.1 7.0 6.9 5.6 4.2
6.8) control
CONCAT treat control all Combine all birthweight observa-
tions in same vector
REPEAT 1000 Do 1000 simulations
SAMPLE 15 all treat$ Take a resample of 15 from all
birthweights (the $ indicates
a resampling counterpart to a
real sample)
SAMPLE 15 all control$ Take a second, similar resample
MEAN treat$ mt Find the means of the two
resamples
MEAN control$ mc
SUBTRACT mt mc dif Find the difference between the
means of the two resamples
SCORE dif z Keep score of the result
END End the simulation experiment,
go back and repeat
HISTOGRAM z Produce a histogram of the
resample differences
COUNT z >= .82 k How often did resample
differences exceed the observed
difference of .82?
[Histogram of the 1,000 trial results; x-axis: resample
differences in pounds (-1.5 to 1.5); y-axis: frequency]
Result: Only 1.3% of the pairs of resamples produced means that
differed by as much as .82. We can conclude that the observed
difference is unlikely to have occurred by chance.
Figure III-1-7: Test for Birthweight Differences
REPEAT 1000 Do 1000 simulations
GENERATE 31 (1 2 3) ranks Generate 31 numbers, each
number a "1", "2" or "3", to
simulate random assignment of
ranks 1-3 to the drug/
therapy alternative
MEAN ranks rankmean Take the mean of these 31
SCORE rankmean z Keep score of the mean
END End the simulation, go back
and repeat
HISTOGRAM z Produce a histogram of the
rank means
COUNT z <=1.74 k How often is mean rank better
than 1.74, the observed value?
PRINT k
[Histogram of the 1,000 simulated mean ranks; x-axis: mean rank
(1.4 to 2.6); y-axis: frequency]
Figure III-1-8: Test for Improvement by Combined Depression
Therapy
Table III-1-1
Development of Myocardial Infarction in Framingham After 16 Years,
Men Age 35-44, by Level of Serum Cholesterol

Serum cholesterol (mg%)   Developed MI   Did not develop MI   Total
>250                           10               125            135
<=250                          21               449            470

Source: Shurtleff, D., The Framingham Study: An Epidemiologic
Investigation of Cardiovascular Disease, Section 26. Washington,
DC: U.S. Government Printing Office. Cited in Kahn and Sempos
(1989), p. 61, Table 3-8.
Table III-1-2
Birthweights in a Clinical Trial to Test a Drug
for Preventing Low Birthweights
Baby Weight (lb)
Patient Treatment group Control group
1 6.9 6.4
2 7.6 6.7
3 7.3 5.4
4 7.6 8.2
5 6.8 5.3
6 7.2 6.6
7 8.0 5.8
8 5.5 5.7
9 5.8 6.2
10 7.3 7.1
11 8.2 7.0
12 6.9 6.9
13 6.8 5.6
14 5.7 4.2
15 8.6 6.8
Source: Rosner, Table 8.7
Table III-1-3
Observed Rank of Depression Treatments, by Effectiveness
(Hypothetical)
                   Treatment Group
Triplet     Drug     Therapy     Drug/Therapy
1 3 1 2
2 2 3 1
3 1 3 2
. . . .
. . . .
. . . .
31 2 1 3
Avg. rank 2.29 1.98 1.74
ENDNOTES
<1>: The following biostatistical examples are joint work
with Peter C. Bruce.