CHAPTER I-2
STATISTICAL INFERENCE AND RANDOM SAMPLING
Continuity and sameness is the first fundamental concept in
inference in general, as discussed in Chapter I-1. Random
sampling is the second of the great concepts in inference, and it
distinguishes probabilistic statistical inference from non-
statistical inference as well as from non-probabilistic inference
based on statistical data. When the data of interest are not the
result of random sampling, a sample drawn at random is the ideal
to which the actual sample is compared. And the properties of a
randomly-drawn sample are utilized on the assumption that the
actual sample is sufficiently close to the ideal.<1>
The usual goal of a statistical inference is a decision
about which of two or more hypotheses one will thereafter choose
to believe and act upon. The strategy is to consider the
behavior of a given universe in terms of the samples it is likely
to produce, and if the observed sample is not a likely outcome we
then proceed as if the sample did not in fact come from that
universe. (The previous sentence is a restatement in somewhat
different form of the core of statistical analysis.)
At a more technical level now: Probably the most important
task of statistical inference is to determine the existence (or
extent) of sameness when intuition alone does not provide a
satisfactory answer. Two common cases are a) the extent of
overlap between two distributions, and b) the probability that a
sample should be said to be the same as a universe in the sense
of having been drawn from it. The statistical inference may be
thought of as an operational specification that makes more
precise a previously-vague notion about sameness.
Let's begin the discussion with a simple though unrealistic
situation. Your friend Arista a) looks into a cardboard carton,
b) reaches in, c) pulls out her hand, and d) shows you a green
ball. What might you reasonably infer?
You might at least be fairly sure that the green ball came
from the carton, though you recognize that Arista might have had
it concealed in her hand when she reached into the carton. But
there is not much more you might reasonably conclude at this
point except that there was at least one green ball in the carton
to start with. There could be no more balls; there could be many
green balls and no others; there could be a thousand red balls
and just one green ball; and there could be one green ball, a
hundred balls of different colors, and two pounds of mud - given
that she looked in first, it is not improbable that she picked
out the only green ball among other material of different sorts.
There is not much you could say with confidence about the
likelihood of yourself reaching into the same carton with your
eyes closed and pulling out a single green ball. To use other
language (which some philosophers might say is not appropriate
here because the situation is too specific), there is little
basis for induction about the contents of the box. Nor is the
situation very different if your friend three times in a row
reaches in and then hands you a green ball each time.
So far we have put our question rather vaguely. Let us
frame a more precise inquiry: What do we predict about the next
item(s) we might draw from the carton? If we assume - based on
who-knows-what information or notions - that another ball will
emerge, we could simply use the principle of sameness and (until
we see a ball of another color) predict that the next ball will
be green, whether one or three or 100 balls is (are) drawn.
But now what if Arista pulls out nine green balls and
one red ball? The principle of sameness cannot be applied as
simply as before. Based on the last previous ball, the next one
will be red. But taking into account all the balls we have seen,
the next will "probably" be green. We have no solid basis on
which to go further. There cannot be any "solution" to the
"problem" of reaching a general conclusion on the basis of these
specific pieces of evidence.
Now consider what you might conclude if you were told that
a single green ball had been drawn with a random sampling
procedure from a box containing nothing but balls. Knowledge
that the sample was drawn randomly from a given universe is
grounds for belief that one knows much more than if a sample were
not drawn randomly. First, you would be sure - if you had
reasonable basis to believe that the sampling really was random,
which is not easy to guarantee - that the ball came from the box.
Second, you would guess that the proportion of green balls is not
very small, because if there are only a few green balls and many
other-colored balls, it would be unusual - that is, the event
would have a low probability - to draw a green ball. Not
impossible, but unlikely. And we can compute the likelihood of
drawing a green ball - or any other combination of colors - for
different assumed compositions within the box. So the knowledge
that the sampling process is random greatly increases our ability
- or our confidence in our ability - to infer the contents of the
box.
Let us note well the strategy of the previous paragraph:
Ask about the probability that one or more various possible
contents of the box (the "universe") will produce the observed
sample, on the assumption that the sample was drawn randomly.
This is the central strategy of all statistical inference, though
I do not find it so stated elsewhere. We shall come back to this
idea shortly.
There are several kinds of questions one might ask about the
contents of the box. One general category includes questions
about our best guesses of the box's contents - that is, questions
of estimation; another category includes questions about our
surety of that description, and our surety that the contents are
similar to or different from the contents of other boxes. The
estimation questions can be subtle and unexpected (Savage,
1954/1972, Chapter 15), but do not cause major controversy about
the foundations of statistics. Hence I shall merely mention that
the method of moments and the method of maximum likelihood serve
most of our needs, and often agree in their conclusions;
furthermore, we often know when the former may be inappropriate.
So we can quickly move on to questions about the extent of surety
in our estimations.
Consider your reaction if the sampling produces 10 green
balls in a row, or 9 out of 10. If you had no other information
(a very important assumption that we will leave aside for now),
your best guess would be that the box contains all green balls,
or a proportion of 9 of 10, in the two cases respectively. This
estimation process seems natural enough.
You would be surprised if someone told you that instead of
the box containing the proportion in the sample, it contained
just half green balls. How surprised? Intuitively, the extent
of your surprise would depend on the likelihood that a half-green
"universe" would produce 10 or 9 green balls out of 10. This
surprise is a key element in the logic of the hypothesis-testing
branch of statistical inference.
We learn more about the likely contents of the box by asking
about the probability that various specific populations of balls
within the box would produce the particular sample that we
received. That is, we can ask how likely a collection of 25
percent green balls is to produce (say) 9 of 10 greens, and how
likely collections of 50 percent green, 75 percent green, 90
percent green (and any other collections of interest) are to
produce the observed sample. That is, we ask about the
consistency between any particular hypothesized collection within
the box and the sample we observe. And it is reasonable to
believe that those universes which have greater consistency with
the observed sample - that is, those universes that are more
likely to produce the observed sample - are more likely to be in
the box than other universes.
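The comparison just described can be sketched in a few lines of Python. This is only an illustration, assuming draws with replacement (or a box large enough that replacement does not matter), so that the binomial formula applies; the function name prob_of_sample is a label chosen here, not anything from the text.

```python
from math import comb

def prob_of_sample(n_green, n_draws, p_green):
    """Binomial probability that a universe with proportion p_green of
    green balls yields exactly n_green greens in n_draws random draws.
    Assumes independent draws (with replacement, or a very large box)."""
    return comb(n_draws, n_green) * p_green**n_green * (1 - p_green)**(n_draws - n_green)

# Hypothesized universes named in the text: 25, 50, 75 and 90 percent green.
universes = [0.25, 0.50, 0.75, 0.90]
likelihoods = {p: prob_of_sample(9, 10, p) for p in universes}

for p, like in likelihoods.items():
    print(f"universe {p:.0%} green -> P(9 of 10 green) = {like:.4f}")
```

The universes nearer to 90 percent green give the observed sample a much higher probability, which is the sense of "consistency" used above.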
What we have just done (to repeat, as I shall repeat many
times) is the basic strategy of statistical investigation. If we
observe 9 of 10 green balls, we then determine that universes
with (say) 9/10 and 10/10 green balls are more consistent with
the observed evidence than are universes of 0/10 and 1/10 green
balls. So by this process of considering specific universes that
the box might contain, we make possible more specific inferences
about the box's contents based on the sample evidence than we
could without this process.
Please notice the role of the concept of probability and the
actual assessment of probabilities here: By one technical means
or another (either resampling or formulas), we assess the
probabilities that a particular universe will produce the
observed sample, and other samples as well.
It is of the highest importance to recognize that without
additional knowledge (or assumption) one cannot make any
statements about the probability of the sample having come from
any particular universe, on the basis of the sample evidence.
(Better read that last sentence again.) We can only speak about
the probability that a particular universe will produce (in
contrast to did produce) the observed sample, a very different
matter. This issue will arise again very sharply in the context
of confidence intervals.
Let us generalize the steps in statistical inference:
1. Frame the original question as: What is the chance of
getting the observed sample s from population S? That is, what
is the probability of (If s then S)?
2. Proceed to this question: What kinds of samples does the
postulated<2> universe S produce, and with what probability? That
is, what is the probability of this particular s coming from S?
That is, what is p(s|S)?
3. Actually investigate the behavior of S with respect to s
and other samples. One can do this in two ways:
a. One can use the calculus of probability, perhaps
resorting to Monte Carlo methods if an appropriate formula does
not exist. Or,
b. One can use resampling (in the larger sense); the
domain of resampling is meant here to equal all Monte Carlo
experimentation except for the use of Monte Carlo methods for i)
approximations, ii) investigation of complex functions in
statistics and other theoretical mathematics, and iii) uses
elsewhere in science. Resampling in its more restricted sense
includes i) the bootstrap, ii) permutation tests, and iii) other
non-parametric simulation methods of statistics.
4. Interpret the probabilities that result from
step 3 in terms of i) acceptance or rejection of hypotheses, ii)
surety of conclusions, or iii) inputs to decision theory.
Here is the short definition of statistical inference: The
selection of a probabilistic model that might resemble the
process you wish to investigate, the investigation of that
model's behavior, and the interpretation of the results.
We will get even more specific about the procedure when we
discuss the canonical procedures for hypothesis testing and for
the finding of confidence intervals in the chapters on those
subjects.
The discussion so far has been in the spirit of what is
known as hypothesis testing. The result of a hypothesis test is
a decision about whether or not one believes that the sample is
likely to have come from the "benchmark [postulated] universe" S.
The logic is that if the probability of such a sample coming from
that universe is low, we will then choose to believe the
alternative - to wit, that the sample came from the universe that
resembles the sample. The underlying idea is that if an event
would be very surprising if it really happened - as it would be
very surprising if the dog had really eaten the homework - we are
inclined not to believe in that possibility. (This logic will be
explored further in Chapter 00 on hypothesis testing.)
We have so far assumed that our only relevant knowledge is
the sample. And though we almost never lack some additional
information, this can be a sensible way to proceed when we wish
to suppress any other information or speculation. This
suppression is controversial; those known as Bayesians or
subjectivists want us to take into account all the information we
have. But even they would not dispute suppressing information in
certain cases - such as a teacher who does not want to know
students' SAT scores because s/he might want to avoid the
possibility of being unconsciously affected by those scores, or
an employer who wants not to know the potential employee's ethnic
or racial background even though it might improve the hiring
process, or a sports coach who refuses to pick the starting
team each year until the players have competed for the positions.
If the Bayesians will admit the reasonability of suppressing
information in at least some situations, it will be a major step
in accommodation and in bringing all views into greater harmony.
(More about this topic in Chapter 00.)
Now consider a variant on the green-ball situation discussed
above. Assume that you are told that there is a (say) equal
probability of the sample of nine green and one red balls being
drawn from one of two specified universes - for example, two urns
of balls, one with 50 percent green balls and the other with 80
percent green balls. On the basis of your sample you can then
say how probable it is that the sample came from one or the
other. You proceed by computing the probabilities (often called
the likelihoods in this situation) that each of those two
universes would individually produce the observed sample -
probabilities that you could arrive at with resampling, with
Pascal's Triangle, with a table of binomial probabilities, with
the Normal approximation and the Z distribution, or with yet
other devices. Those probabilities are about .01 for the
50-percent universe and .27 for the 80-percent universe, and the
ratio of the two is between .03 and .04. That is, fair betting
odds are about 1 to 27 that the sample came from the 50-percent
universe rather than the 80-percent one.<3>
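The two likelihoods and their ratio can be checked directly with the binomial formula. The following Python sketch is one way to do so; the names binom_pmf, like_50 and like_80 are labels invented for this illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent draws."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability that each universe produces nine green and one red in ten draws.
like_50 = binom_pmf(9, 10, 0.5)   # universe with 50 percent green
like_80 = binom_pmf(9, 10, 0.8)   # universe with 80 percent green

ratio = like_50 / like_80              # the ratio quoted in the text
odds_against_50 = like_80 / like_50    # "1 to about 27"

print(f"P(sample | 50% green) = {like_50:.4f}")
print(f"P(sample | 80% green) = {like_80:.4f}")
print(f"ratio = {ratio:.4f}, odds about 1 to {odds_against_50:.1f}")
```

Because the two universes were stated to be equally probable beforehand, the ratio of the likelihoods is also the ratio of the fair betting odds.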
Actual situations that fit this Neyman-Pearson model are not
frequently found. Let us consider a genetics problem on this
model. Plant A produces 3/4 black seeds and 1/4 reds; plant B
produces all reds. You get a red seed. Which plant would you
guess produced it? You surely would guess plant B. Now, how
about 9 reds and a black, from Plants A and C, the latter
producing 50 percent reds on average?
To put the question more precisely: What betting odds would
you give that the one red seed came from plant B? Let us reason
this way: If each plant is equally likely to be the source and
you do this again and again, 4 of 5 of the red seeds you see
will come from B. Therefore, reasonable (or
"fair") odds are 4 to 1, because this is in accord with the
ratios with which red seeds are produced by the two plants - 4/4
to 1/4.
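The "again and again" reasoning lends itself to a small simulation. The Python sketch below assumes, as the argument implicitly does, that each seed is equally likely to be taken from plant A or plant B; the function name and the number of trials are arbitrary choices made for this illustration.

```python
import random

def fraction_of_reds_from_b(n_trials=100_000, seed=0):
    """Simulate picking a plant at random (assumed 50/50), taking one seed,
    and return the share of all red seeds that came from plant B."""
    rng = random.Random(seed)
    reds_from_a = 0
    reds_from_b = 0
    for _ in range(n_trials):
        plant = rng.choice("AB")
        if plant == "B":
            reds_from_b += 1          # plant B produces only red seeds
        elif rng.random() < 0.25:
            reds_from_a += 1          # plant A produces 1/4 red seeds
    return reds_from_b / (reds_from_a + reds_from_b)

frac = fraction_of_reds_from_b()
print(f"share of red seeds that came from B: {frac:.3f}")
```

The share should settle near 4/5, which corresponds to odds of 4 to 1.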
How about the sample of 9 reds and a black, and plants A and
C? It would make sense that the appropriate odds would be
derived from the probabilities of the two plants producing that
particular sample, probabilities which can be computed in the
same way as those for the two urns above.
Now let us move to a bit more complex problem: Consider two
urns - urn G with 2 red and 1 black balls, and urn H with 100 red
and 100 black balls. Someone flips a coin to decide which urn
will be drawn from, reaches into that urn, and chooses two balls
without replacing the first one before drawing the second. Both
are red. What are the odds that the sample came from urn G?
Clearly, the answer should derive from the probabilities that the
two urns would produce the observed sample.<4>
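As a sketch of how those probabilities might be worked out, the Python below counts the ways of drawing two red balls without replacement from each urn. The helper name prob_all_red is hypothetical, and exact fractions are used only to keep the arithmetic transparent.

```python
from fractions import Fraction
from math import comb

def prob_all_red(n_red, n_black, n_drawn):
    """Probability that n_drawn balls taken without replacement are all red."""
    return Fraction(comb(n_red, n_drawn), comb(n_red + n_black, n_drawn))

p_g = prob_all_red(2, 1, 2)        # urn G: 2 red, 1 black
p_h = prob_all_red(100, 100, 2)    # urn H: 100 red, 100 black

# The coin flip gives each urn a chance of 1/2, so those chances cancel
# and the odds are simply the ratio of the two probabilities.
odds_g_to_h = p_g / p_h
posterior_g = p_g / (p_g + p_h)

print(f"P(two reds | G) = {p_g}, P(two reds | H) = {p_h}")
print(f"odds G to H = {odds_g_to_h} (about {float(odds_g_to_h):.2f} to 1)")
print(f"probability the sample came from G = {float(posterior_g):.3f}")
```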
Let's restate the central issue. One can assess the
probability that a particular plant which produces on average 1
red and 3 black seeds will produce one red seed, or 5 reds among
a sample of 10. But without further assumptions - such as the
assumption above that the possibilities are limited to two
specific universes - one cannot say how likely a given red seed
is to have come from a given plant, even if we know that that
plant produces only reds. (For example, it may have come from
other plants producing only red seeds.)
When we limit the possibilities to two universes (or to a
larger set of specified universes) we are able to put a
probability on one hypothesis or another. But to repeat, in many
or most cases, one cannot reasonably assume it is one or the
other. And then we cannot state any odds that the sample came
from a particular universe. This is a very difficult point to
grasp, experience shows, but a crucial one. (It is the sort of
subtle issue that makes statistics so difficult.)
The additional assumptions necessary to talk about the
probability that the red seed came from a given plant are the
stuff of statistical inference. And they must be combined with
such "objective" probabilistic assessments as the likelihood that
a 1-red-3-black plant will produce one red, or 5 reds of 10.
Now let us move one step further. Instead of stating as a
fact under our control that there is a .5 chance of the sample
being drawn from each of the two urns in the problem above, let
us assume that we do not know the probability of each urn being
picked, but instead we estimate a probability of .5 for each urn,
based on a variety of other information, all of which is uncertain.
But though the facts are now different, the most reasonable
estimate of the odds that the observed sample was drawn from one
or the other urn will still be the same - because in both cases
we were working with a "prior probability" of .5. (The term
"prior probability" is Bayesian.) And when we view the situation
this way, the Neyman-Pearson model may be seen perfectly well in
a Bayesian framework.
Now let us go a step further by allowing the universes from
which the sample may have come to have different assumed
probabilities as well as different compositions. That is, we now
consider prior probabilities other than .5.
It was the contribution of Thomas Bayes that he showed how
to formally incorporate into a computation the "prior"
information (which we may choose to call speculation or belief)
about the probabilities of drawing from the urns so as to derive
a "posterior" probability. But in some or many cases, it is not
possible to specify anything further about the "prior
distribution" - not even to assume that all possibilities over a
given range are of equal probability - and in such a case, you
cannot make any reasonable statement about the probability of one
or another population based on the sample alone. (People known
as "strict Bayesians" say that it is always possible to make
meaningful statements about the prior distributions. Whether one
can or cannot do so in a particular case seems to me an issue of
judgment, however.)
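When a prior can be specified, the computation Bayes described is short. The Python sketch below reuses the two urns of the earlier example (50 percent and 80 percent green, sample of nine green and one red); the prior of .9 for the 50-percent urn is a purely hypothetical figure chosen to show the effect of moving away from .5.

```python
from math import comb

def posterior(priors, likelihoods):
    """Bayes' rule for a finite set of universes: weight each likelihood
    by its prior and rescale so the results sum to one."""
    joint = [pr * lk for pr, lk in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# Likelihood of nine green and one red from each urn.
like_50 = comb(10, 9) * 0.5**9 * 0.5
like_80 = comb(10, 9) * 0.8**9 * 0.2

equal = posterior([0.5, 0.5], [like_50, like_80])
skewed = posterior([0.9, 0.1], [like_50, like_80])   # hypothetical prior

print(f"equal priors:  P(50% urn | sample) = {equal[0]:.3f}")
print(f"prior .9 / .1: P(50% urn | sample) = {skewed[0]:.3f}")
```

The same sample evidence leads to quite different posterior probabilities under the two priors, which is why the choice of prior matters so much.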
How do we decide which universe(s) to investigate for the
likelihood of producing the observed sample, as well as producing
samples that are even less likely, in the sense of being more
surprising? That judgment depends upon the purpose of your
analysis, upon your point of view of how statistics ought to be
done, and upon some other factors. This decision is discussed in
Section 00.
It should be noted that the logic described so far applies
in exactly the same fashion whether we do our work estimating
probabilities with the resampling method or with conventional
methods. We can figure the probability of nine or more green
chips from a universe of (say) p = .7 with either approach.
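To illustrate that the two approaches address the same quantity, here is a rough Python sketch that computes the probability of nine or more greens in ten draws from a p = .7 universe both by formula and by simulation. The function names, the number of trials, and the random seed are arbitrary choices for this example.

```python
import random
from math import comb

def exact_tail(k_min, n, p):
    """Binomial probability of k_min or more successes in n draws."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def resampled_tail(k_min, n, p, n_trials=20_000, seed=1):
    """Estimate the same probability by drawing many simulated samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        greens = sum(rng.random() < p for _ in range(n))
        if greens >= k_min:
            hits += 1
    return hits / n_trials

exact = exact_tail(9, 10, 0.7)
estimate = resampled_tail(9, 10, 0.7)

print(f"formula:    {exact:.4f}")
print(f"resampling: {estimate:.4f}")
```

The simulated figure will differ slightly from run to run (or seed to seed), but it should stay close to the formula's answer of about .15.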
So far we have discussed the comparison of various
hypotheses and possible universes. We must also mention where
the consideration of the reliability of estimates comes in. This
leads to the concept of confidence limits, which will be
discussed in Chapter 00.
Samples Whose Observations May Have More Than Two Values
So far we have discussed samples and universes that we can
characterize as proportions of elements which can have only one
of two characteristics - green or other, in this case, which is
equivalent to "1" or "0". This expositional choice has been
solely for clarity. All the ideas discussed above pertain just
as well to samples whose observations may have more than two
values, and that may be either discrete or continuous.
SUMMARY AND CONCLUSIONS
A statistical question asks about the probabilities of
possible generating universes in light of the evidence of a
sample. In every case, the statistical answer comes from
considering the behavior of particular specified universes in
relation to the sample evidence and to the behavior of other
possible universes. That is, a statistical problem is an
exercise in postulating universes of interest and interpreting
the probabilistic distributions of results of those universes.
The preceding sentence is the key operational idea in statistical
inference, though I do not seem to find a statement like this one
in the literature.
Different sorts of realistic contexts call for different
ways of framing the inquiry. For each of the established models
there are types of problems that that model fits better than do
the other models, and other types of problems for which the model
is quite inappropriate. Limiting the domain of application in
this fashion, together with using the operational definition of
probability discussed in Chapter 00, removes the apparent
conflicts between the Fisherian, Neyman-Pearson, and Bayesian
models of statistical inference.
Fundamental wisdom in statistics, as in all other contexts,
is to carry and use a large tool kit rather than applying
only a hammer, screwdriver, or wrench no matter what the problem
at hand is. (Philosopher Abraham Kaplan once stated Kaplan's Law
of scientific method: Give a small boy a hammer and there is
nothing that he will encounter that does not require pounding.)
Studying the text of a poem statistically to infer whether
Shakespeare or Bacon is the more likely author is quite different
from inferring whether bioengineer Smythe can produce an increase
in the proportion of calves, and both are different from
decisions about whether to remove a basketball player from the
game or choose to produce a new product.
Some key points: 1) In statistical inference as in all
sound thinking, one's purpose is central. All judgments should
be made relative to that purpose, and in light of costs and
benefits. (This is the spirit of the Neyman-Pearson approach.)
2) One cannot avoid making judgments; the process of statistical
inference cannot ever be perfectly routinized or objectified.
Even in science, fitting a model to experience requires judgment.
3) The best ways to infer are different in different situations -
economics, psychology, history, business, medicine, engineering,
physics, and so on. 4) Different tools must be used when the
situations call for them - sequential vs. fixed sampling, Neyman-
Pearson vs. Fisher, and so on. 5) In statistical inference it is
wise not to argue about the proper conclusion when the data and
procedures are ambiguous. Instead, whenever doing so is
possible, one should go back and get more data, hence lessening
the importance of the efficiency of statistical tests. In some
cases one cannot easily get more data, or even conduct an
experiment, as in biostatistics with cancer patients. And with
respect to the past one cannot produce more historical data. But
one can gather more and different kinds of data, e.g. the history
of research on smoking and lung cancer.
ENDNOTES
<1>: In the course of editing the first two editions of my
text on research methods, my friend the late Hanan Selvin never
ceased to brace me on writing about a "randomly drawn sample"
rather than a random sample, because randomness refers to the
process rather than to the outcome. I still slip occasionally
into the lazy term, however. When I do so, please note that it
is a mistake.
<2>: The postulated universe S bears some likeness to the
Kantian-Einsteinian model created by the researcher against
which to test the observed data. But instead of deriving from
theory or insight or hunch or whatever, in statistical
inference the model derives from the sample (plus perhaps a
Bayesian prior distribution, about which more shortly).
Another difference from the original "scientific" model is
that the postulated universe S has no causal connection to the
sample except through the process of sampling.
Statistical inference resembles the scientific model in that
it is assumed not to be a perfect picture of nature. But unlike
a scientific model, in the case of a finite universe we assume
that larger and larger samples can approach the actual universe.
<3>: Using RESAMPLING STATS, a program to find the
probabilities is as follows. Ask: What is the probability of
drawing a sample of nine green and one red ball from a) a 50/50
universe, and b) a universe that is 80% green, 20% red?
REPEAT 15000
GENERATE 10 1,2 a        Let 1 = red, 2 = green
COUNT a =2 b             Count the greens in the sample
SCORE b z-one
END
COUNT z-one =9 k-one     Trials with exactly nine greens
DIVIDE k-one 15000 kk-one
REPEAT 15000
GENERATE 10 1,10 a
COUNT a <=8 b            Let 1-8 = green, 9-10 = red
SCORE b z-two
END
COUNT z-two =9 k-two
DIVIDE k-two 15000 kk-two
DIVIDE kk-one kk-two k   Ratio of the two probabilities
PRINT kk-one kk-two k
kk-one = 0.0092
kk-two = 0.27247
k = 0.033766
[source: program redball.sta]
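For readers without RESAMPLING STATS, the same experiment might be sketched in Python roughly as follows. This is an approximate counterpart rather than a translation of the program file; the function name, the seeds, and the use of Python's random module are assumptions of this sketch.

```python
import random

def share_with_nine_green(p_green, n_trials=15_000, n_draws=10, seed=0):
    """Fraction of simulated ten-ball samples with exactly nine greens."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        greens = sum(rng.random() < p_green for _ in range(n_draws))
        if greens == 9:
            hits += 1
    return hits / n_trials

kk_one = share_with_nine_green(0.5)           # 50/50 universe
kk_two = share_with_nine_green(0.8, seed=1)   # 80% green universe
k = kk_one / kk_two                           # ratio of the two estimates

print(kk_one, kk_two, k)
```

As with the original program, the printed figures vary a little with the random draws but should fall near .01, .27, and a ratio between .03 and .04.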
<4>: Just for fun, how about if the first ball drawn is
thrown back after examining? What are the appropriate odds now?