THE BASIC TECHNIQUES OF RESAMPLING
Julian L. Simon
John Arbuthnot, doctor to Queen Anne of England, began the
publication of formal statistical inference in 1710. He observed
that more boys than girls are born, which he assumed was necessary
for the survival of the species, and he wished to prove that
birth sex is indeed not a 50-50 probability. The records for
London showed that male births exceeded female births 82 years in a row.
Arbuthnot therefore set forth to (in modern language) test the
hypothesis that a universe with a 50-50 probability of producing
males could result in 82 successive years with preponderantly
male births.
This is a canonical problem. You have some observed
"sample" data, and you want to connect them to some specified
"population" from which they may have come. The previous
sentence was purposely worded vaguely because statistical
questions can be stated in many different ways. But in this case
statisticians agree on how to proceed: Specify the universe, and
compare its behavior against the observed sample. If it is
unlikely that a sample as surprising as the observed sample
should come from the specified universe, conclude that the sample
did not come from that universe.
Arbuthnot used the multiplication rule of Pascal and Fermat
to calculate that the probability, (1/2)^82, is extremely small.
"From whence it follows, that it is Art, not Chance, that
governs" - that is, "Divine Providence". (His argument is
complex and debatable, as statistical inference often is; the
mathematics is the easy part, especially when resampling methods
are used.)
Please notice that Arbuthnot could have considered the
numbers of boys and girls observed in each year, rather than
treating each year as a single observation - an even stronger
test because of the vastly greater amount of information.
Arbuthnot surely did not analyze the data for any of the
individual years because the calculus of probability was still
in its infancy.
Luckily, the test Arbuthnot made was more than powerful
enough for his purposes. But if instead of 82 years in a row,
only (say) 81 or 61 of the 82 years had shown a preponderance of
males, Arbuthnot would have lacked the tools for a test (though
he knew the binomial and logarithms). Nowadays, one
conventionally uses the Gaussian (Normal) approximation to the
binomial distribution to produce the desired probability. But
that method requires acquaintance with a considerable body of
statistical procedure, and utilizes a formula that almost no one
knows and even fewer can explain intuitively. Instead, users
simply "plug in" the data to a table which, because it is an
arcane mystery, invites misuse and erroneous conclusions.
The experimental resampling method of earlier gamblers could
easily have given Arbuthnot a satisfactory answer for (say) 61 of
82 years, however. He had in fact likened the situation to a set
of 82 coins. He could simply have tossed such a set repeatedly,
and found that almost never would as many as 81 or 61 heads
occur. He could then have rested as secure in his conclusion as
with the formulaic assessment of the probability of 82 years in a
row. And because of the intuitive clarity of the experimental
method, one would not be likely to make a misleading error in
such a procedure.
By the grace of the computer, such problems can be handled
more conveniently today. The self-explanatory commands in
Illustration 3 suffice, using the language RESAMPLING STATS and
producing the results shown there.
Illustration 3
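Illustration 3 itself is not reproduced in this excerpt. As a
rough sketch of the same coin-tossing test in Python (the
61-of-82 threshold comes from the example in the text; the
10,000-trial count is an arbitrary choice):

```python
import random

def male_preponderant_years(n_years=82, threshold=61, trials=10_000):
    """Estimate the chance that a fair 'coin' (one toss per year)
    shows heads - a male-preponderant year - at least `threshold`
    times in `n_years` tosses."""
    successes = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(n_years))
        if heads >= threshold:
            successes += 1
    return successes / trials

# Nearly always prints 0.0; the true tail probability is on the order of 1e-5.
print(male_preponderant_years())
```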
The intellectual advantage of the resampling method is that
though it takes repeated samples from the sample space, it does
not require that one know the size of the sample space or of a
particular subset of it. To estimate the probability of getting
(say) 61 males in 82 births with the binomial formula requires
that one calculate the number of permutations of a total of 82
males and females, and the number of those permutations that
include 61 or more males. In contrast, with a resampling
approach one needs to know only the conditions of producing a
single trial yielding a male or female. This conceptual
difference, which will be discussed at greater length below, is
the reason that, compared to conventional methods, resampling is
likely to have higher "statistical utility" - a compound of
efficiency plus the chance that the ordinary scientist or
decision-maker will use a correct procedure.
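For contrast, the formulaic route for the 61-of-82 case can be
carried out exactly by counting sequences, which is what the
binomial formula does. A sketch in Python:

```python
from math import comb

n, k = 82, 61
# Of the 2**82 equally likely male/female sequences under the 50-50
# hypothesis, count those containing at least 61 males.
favorable = sum(comb(n, j) for j in range(k, n + 1))
p = favorable / 2 ** n
print(p)
```

Note that this route requires knowing the size of the sample
space (2**82) and of the relevant subset - exactly the knowledge
that the resampling approach dispenses with.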
VARIETIES OF RESAMPLING METHODS
A resampling test may be constructed for every case of
statistical inference - by definition. Every real-life situation
can be modeled by symbols of some sort, and one may experiment
with this model to obtain resampling trials. A resampling method
should always be appropriate unless there are insufficient data
to perform a useful resampling test, in which case a conventional
test - which makes up for the absence of observations with an
assumed theoretical distribution such as the Normal or Poisson -
may produce more accurate results if the universe from which the
data are selected resembles the chosen theoretical distribution.
Exploration of the properties of resampling tests is an active
field of research at present.
For the main tasks in statistical inference - hypothesis
testing and confidence intervals - the appropriate resampling
test often is immediately obvious. For example, if one wishes to
inquire whether baseball hitters exhibit behavior that fits the
notion of a slump, one may simply produce hits and outs with a
random-number generator adjusted to the batting average of a
player, and then compare the number of simulated consecutive
sequences of either hits or outs with the observed numbers for
the player. The procedure is also straightforward for such
binomial situations as the Arbuthnot birth-sex case.
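As a hedged sketch of the slump test just described (the .250
batting average, 500 at-bats, and longest-run statistic are
illustrative assumptions, not figures from the text):

```python
import random

def longest_run(outcomes):
    """Length of the longest consecutive run of identical outcomes."""
    best = cur = 1
    for prev, nxt in zip(outcomes, outcomes[1:]):
        cur = cur + 1 if nxt == prev else 1
        best = max(best, cur)
    return best

def simulate_longest_runs(avg=0.250, at_bats=500, seasons=1_000):
    """Longest hit-or-out run in each of many simulated seasons for a
    constant-skill hitter; compare the distribution with a real
    player's observed longest run."""
    return [longest_run([random.random() < avg for _ in range(at_bats)])
            for _ in range(seasons)]

runs = simulate_longest_runs()
```

If a real player's longest streak falls well outside this
simulated distribution, the constant-skill (no-slump) hypothesis
becomes doubtful.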
Two sorts of procedures are especially well-suited to
resampling: 1) A sample of the permutations in Fisher's "exact"
test (confusingly, also called a "randomization" test). This is
appropriate when the size of the universe is properly assumed to
be fixed, as discussed below. 2) The bootstrap procedure. This
is appropriate when the size of the universe is properly assumed
not to be fixed.
Let's compare the permutation and bootstrap procedures in
the context of a case which might be analyzed either way. The
discussion will highlight some of the violent disagreements in
the philosophy of statistics which the use of resampling methods
frequently brings to the surface - one of its great benefits.
In the 1960s I studied the price of liquor in the sixteen
"monopoly" states (where the state government owns the retail
liquor stores) compared to the twenty-six states in which retail
liquor stores are privately owned. (Some states were omitted for
technical reasons. The situation and the price pattern have
changed radically since then.)
These were the representative 1961 prices of a fifth of
Seagram 7 Crown whiskey in the two sets of states:
16 monopoly states: $4.65, $4.55, $4.11, $4.15,
$4.20, $4.55, $3.80, $4.00, $4.19, $4.75, $4.74,
$4.50, $4.10, $4.00, $5.05, $4.20
26 private-ownership states: $4.82, $5.29,
$4.89, $4.95, $4.55, $4.90, $5.25, $5.30, $4.29,
$4.85, $4.54, $4.75, $4.85, $4.85, $4.50, $4.75,
$4.79, $4.85, $4.79, $4.95, $4.95, $4.75, $5.20,
$5.10, $4.80, $4.29.
The economic question that underlay the investigation -
having both theoretical and policy ramifications - is as
follows: Does state ownership affect prices? The empirical
question is whether the prices in the two sets of states were
systematically different. In statistical terms, we wish to test
the hypothesis that there was a difference between the groups of
states related to their mode of liquor distribution, or whether
instead the observed $.49 differential in mean prices might well
have occurred by happenstance - that is, whether the observed
pattern could well have been produced by chance variability.
At first I used a resampling permutation test as follows:
Assuming that the entire universe of possible prices consists of
the set of events that were observed, because that is all the
information available about the universe, I wrote each of the
forty-two observed state prices on a separate card and shuffled
the deck. The shuffled deck simulated a situation in which each
state has an equal chance of each price.
On the "null hypothesis" that the two groups' prices do not
reflect different price-setting mechanisms, but rather differ
only by chance, I then examined how often that simulated
universe stochastically produces groups with results as
different as observed in 1961. I repeatedly dealt groups of 16
and 26 cards, without replacing the cards, to simulate
hypothetical monopoly-state and private-state samples, each time
calculating the difference in mean prices.
The probability that the benchmark null-hypothesis universe
would produce a difference between groups as large or larger
than observed in 1961 is estimated by how frequently the mean of
the group of randomly-chosen sixteen prices from the simulated
state-ownership universe is less than (or equal to) the mean of
the actual sixteen state-ownership prices. If the simulated
difference between the randomly-chosen groups was frequently
equal to or greater than observed in 1961, one would not
conclude that the observed difference was due to the type of
retailing system because it could well have been due to chance
variation.
The computer program in Illustration 4, using the language
RESAMPLING STATS, performs the operations described above
(MATHEMATICA and APL could be used in much the same fashion).
Illustration 4
The results shown - not even one "success" in 10,000 trials
- imply a very small probability that two groups with mean
prices as different as were observed would happen by chance if
drawn from the universe of 42 observed prices. So we "reject
the null hypothesis" and instead find persuasive the proposition
that the type of liquor distribution system influences the
prices that consumers pay.
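Illustration 4 itself is not reproduced in this excerpt; the
card-shuffling procedure, using the prices listed above, can be
sketched in Python as follows:

```python
import random

monopoly = [4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00,
            4.19, 4.75, 4.74, 4.50, 4.10, 4.00, 5.05, 4.20]
private = [4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29,
           4.85, 4.54, 4.75, 4.85, 4.85, 4.50, 4.75, 4.79, 4.85,
           4.79, 4.95, 4.95, 4.75, 5.20, 5.10, 4.80, 4.29]

observed = sum(private) / 26 - sum(monopoly) / 16   # about $.49
deck = monopoly + private                           # the 42 "cards"

trials, successes = 10_000, 0
for _ in range(trials):
    random.shuffle(deck)                            # deal without replacement
    sim_monopoly, sim_private = deck[:16], deck[16:]
    if sum(sim_private) / 26 - sum(sim_monopoly) / 16 >= observed:
        successes += 1

print(successes / trials)   # estimated probability under the null hypothesis
```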
As I shall discuss later, the logical framework of this
resampling version of the permutation test differs greatly from
the formulaic version, which would have required heavy
computation. The standard conventional alternative would be a
Student's t-test, in which the user simply plugs into an
unintuitive formula and table. And because of the unequal
numbers of cases and unequal dispersions in the two samples, an
appropriate t test is far from obvious, whereas resampling is
not made more difficult by such realistic complications.
Recently I have concluded that a bootstrap-type test has
better theoretical justification than a permutation test in this
case, though the two reach almost identical results with a
sample this large. The following discussion of which is more
appropriate brings out the underlying natures of the two
approaches, and illustrates how resampling raises issues which
tend to be buried amidst the technical complexity of the
formulaic methods, and hence are seldom discussed in print.
Imagine a class of 42 students, 16 men and 26 women, who
come into the room and sit in 42 fixed seats. We measure the
distance of each seat to the lecturer, and assign each a rank.
The women sit in ranks 1-5, 7-20, etc., and the men in ranks 6,
22, 25-26, etc. You ask: Is there a relationship between sex
and ranked distance from the front? Here the permutation
procedure that resamples without replacement - as used above
with the state liquor prices - quite clearly is appropriate.
Now, how about if we work with actual distances from the
front? If there are only 42 seats and they are fixed, the
permutation test and sampling without replacement again is
appropriate. But how about if seats are movable?
Consider the possible situation in which one student can
choose position without reference to others. That is, if the
seats are movable, it is not only imaginable that A would be
sitting where B now is, with B in A's present seat - as was the
case with the fixed chairs - but A could now change distance
from the lecturer while all the others remain as they are.
Sampling with replacement now is appropriate. (To use a
technical term, the cardinal data provide more actual degrees of
freedom - more information - than do the ranks.)
Note that (as with the liquor prices) the seat distances do
not comprise an infinite population. Rather, we are inquiring
whether a) the universe should be considered limited to a given
number of elements, or b) could be considered expandable without
change in the probabilities; the latter is a useful definition
of "sampling with replacement".
As of 1992, the U.S. state liquor systems seem to me to
resemble a non-fixed universe (like non-fixed chairs) even
though the actual number of states is presently fixed. The
question the research asked was whether the liquor system
affects the price of liquor. We can imagine another state being
admitted to the union, or one of the existing states changing
its system, and pondering how the choice of system will affect
the price. And there is no reason to believe that (at least in
the short run) the newly-made choice of system would affect the
other states' pricing; hence it makes sense to sample with
replacement (and use the bootstrap) even though the number of
states clearly is not infinite or greatly expandable.
In short, the presence of interaction - a change in one
entity causing another entity also to change - implies a finite
universe composed of those elements, and use of a permutation
test. Conversely, when one entity can change independently, an
infinite universe and sampling with replacement with a bootstrap
test is indicated.
A program to handle the liquor problem with an infinite-
universe bootstrap distribution simply substitutes the random
sampling command GENERATE for the TAKE command in Illustration
4. The results of the new test are indistinguishable from those
in Illustration 4.
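In Python terms, the bootstrap version replaces the
shuffle-and-deal step with sampling with replacement; everything
else is unchanged. A sketch:

```python
import random

monopoly = [4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00,
            4.19, 4.75, 4.74, 4.50, 4.10, 4.00, 5.05, 4.20]
private = [4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29,
           4.85, 4.54, 4.75, 4.85, 4.85, 4.50, 4.75, 4.79, 4.85,
           4.79, 4.95, 4.95, 4.75, 5.20, 5.10, 4.80, 4.29]

observed = sum(private) / 26 - sum(monopoly) / 16
pooled = monopoly + private

trials, successes = 10_000, 0
for _ in range(trials):
    sim_monopoly = random.choices(pooled, k=16)   # with replacement
    sim_private = random.choices(pooled, k=26)    # the bootstrap step
    if sum(sim_private) / 26 - sum(sim_monopoly) / 16 >= observed:
        successes += 1

print(successes / trials)
```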
Confidence Intervals
So far we have discussed the interpretation of sample data
for testing hypotheses. The devices used for the other main
theme in statistical inference - the estimation of confidence
intervals - are much the same as those used for testing
hypotheses. Indeed, the bootstrap method discussed above was
originally devised for estimation of confidence intervals. The
bootstrap method may also be used to calculate the appropriate
sample size for experiments and surveys, another important topic
in statistics.
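A percentile-interval sketch for the mean monopoly-state price
(the 95% level and 10,000 resamples are illustrative choices,
not from the text):

```python
import random

monopoly = [4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00,
            4.19, 4.75, 4.74, 4.50, 4.10, 4.00, 5.05, 4.20]

means = []
for _ in range(10_000):
    # Resample the data with replacement and record the mean each time.
    resample = random.choices(monopoly, k=len(monopoly))
    means.append(sum(resample) / len(resample))

means.sort()
low, high = means[250], means[9750]   # 2.5th and 97.5th percentiles
print(f"mean ${sum(monopoly) / 16:.2f}, 95% interval ${low:.2f} to ${high:.2f}")
```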
OTHER RESAMPLING TECHNIQUES
We have so far seen examples of three of the most common
resampling methods - binomial, permutation, and bootstrap.
These methods may be extended to handle correlation, regression,
and tests where there are three or more groups. Indeed,
resampling can be used for every other statistic in which one
may be interested - for example, statistics based on absolute
deviations rather than squared deviations. This flexibility is
a great virtue because it frees the statistics user from the
limited and oft-confining battery of textbook methods.
ON THE NATURE OF RESAMPLING TESTS
As will be discussed at more length in Chapter 00,
resampling is a much simpler intellectual task than the
formulaic method because simulation obviates the need to
calculate the number of points in the entire sample space. In
all but the most elementary problems where simple permutations
and combinations suffice, the calculations require advanced
training and delicate judgment.
Resampling avoids the complex abstraction of sample-space
calculations by substituting the particular information about
how elements in the sample are generated randomly in a specific
event, as learned from the actual circumstances; the analytic
method does not use this information. In the case of the
gamblers prior to Galileo, resampling used the (assumed) facts
that three fair dice are thrown, each outcome being equally
likely, and the gamblers took advantage of experience with many
such events performed one at a time; in contrast, Galileo made
no use of the actual stochastic element of the situation, and
gained no information from a sample of such trials, but rather
enumerated all possible sequences by exhaustive computation.
The analytic method for obtaining solutions - using
permutation and combination formulas, for example - is not
theoretically superior to resampling. Resampling is not "just"
a stochastic-simulation approximation to the formulaic method.
It is a quite different route to the same endpoint, using
different intellectual processes and utilizing different sorts
of inputs; both resampling and formulaic calculation are
shortcuts to estimation of the sample space and its partitions.
The much lesser degree of intellectual difficulty is the
source of the central advantage of resampling. It improves the
probability that the user will arrive at a sound solution to a
problem - the ultimate criterion for all except for pure
mathematicians.
A common objection is that resampling is not "exact"
because the results are "only" a sample. Ironically, the basis
of all statistics is sample data drawn from actual populations.
Statisticians have only recently managed to win most of their
battles against those bureaucrats and social scientists who, out
of ignorance of statistics, believed that only a complete census
of a country's population, or examination of every volume in a
library, could give satisfactory information about unemployment
rates or book sizes. Indeed, samples are sometimes even more
accurate than censuses. Yet many of those same statisticians
have been skittish about simulated samples of data points taken
from the sample space - drawn far more randomly than the data
themselves, even at best. They tend to want a complete "census"
of the sample space, even when sampling is more likely to arrive
at a correct answer because it is intellectually simpler (as
with the gamblers and Galileo).
If there is legitimate concern about whether there are
enough repetitions in a resampling procedure, the matter can be
handled in exactly the same fashion as sample size is handled
with respect to the actual data. One may compute the amount of
error associated with various numbers of repetitions. And at
very low cost of computer time this error may be reduced until
it is vanishingly small compared with the sampling error
associated with the actual sampling process. (Research on how
to do this precisely is needed, however.)
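Under the usual binomial approximation, this error shrinks as
the square root of the number of repetitions. A sketch (the
probability value 0.05 is an arbitrary illustration):

```python
from math import sqrt

def repetition_error(p, trials):
    """Approximate standard error of an estimated probability p
    based on `trials` resampling repetitions."""
    return sqrt(p * (1 - p) / trials)

# Error at several repetition counts: each tenfold increase in
# trials cuts the error by a factor of about 3.
for trials in (100, 1_000, 10_000, 100_000):
    print(trials, round(repetition_error(0.05, trials), 4))
```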