CHAPTER II-4
CONFIDENCE INTERVALS II: PROCEDURE WITH EXAMPLES
Here is a checklist for the canonical procedure for
confidence intervals. It follows much the same logic as
presented for testing hypotheses in an earlier chapter. We shall
begin with the binomial example of a political poll, and then
present the "continuous" multi-valued example of tree heights.
The Accuracy of Political Polls
Consider the reliability of a randomly-selected 1988
presidential election poll, showing 840 intended votes for Bush
and 660 intended votes for Dukakis out of 1500 (Wonnacott and
Wonnacott, 1990, p. 5). Let us work through the logic of this
example.
What is the question? Stated technically, what are the 95%
confidence limits for the proportion of Bush supporters in the
population? (The proportion is the mean of a binomial population
or sample, of course.) More broadly, within which bounds could
one confidently believe that the population proportion was likely
to lie? At this stage of the work, we must have already
translated the conceptual question (in this case, a decision-
making question from the point of view of the candidates) into a
statistical question. (See Chapter II-1 on translating questions
into statistical form.)
What is the purpose to be served by answering this question?
There is no sharp and clear answer in this case. The goal could
be to satisfy public curiosity, or strategy planning for a
candidate (though a national proportion is not as helpful for
planning strategy as state data would be).
Is this a "probability" or a "statistics"
question? The latter; we wish to infer from sample to population
rather than the converse.
Given that this is a statistics question: What is the form
of the statistics question - confidence limits or hypothesis
testing? Confidence limits.
Given that the question is about confidence limits: What is
the description of the sample that has been observed? a) The raw
sample data - the observed numbers of interviewees are 840 for
Bush and 660 for Dukakis - constitutes the best description of
the universe. The statistics of the sample are the given
proportions - 56 percent for Bush, 44 percent for Dukakis.
Which universe? Assuming that the observed sample is
representative of the universe from which it is drawn, what is
your best guess about the properties of the universe about whose
parameter you wish to make statements? The best guess is that
the population proportion is the sample proportion - that is, the
population contains 56 percent Bush votes, 44 percent Dukakis
votes. Possibilities for Bayesian analysis? Not in this case,
unless you believe that the sample was biased somehow.
Which parameter(s) do you wish to make statements about?
Mean, median, standard deviation, range, interquartile range,
other? We wish to estimate the proportion in favor of Bush (or
Dukakis).
Which symbols for the observed entities? Perhaps 56 green
and 44 yellow balls, if an urn is used, or "0" and "1" if the
computer is used.
Discrete or continuous distribution? In principle,
discrete. (All distributions must be discrete in practice.)
What values or ranges of values? 0-1.
Finite or infinite? Infinite - the sample is small relative
to the population.
If the universe is what you guess it to be, what variation
among which samples do you wish to estimate? Samples the same
size as the observed poll.
Here one may continue either with resampling or with the
conventional method. Everything done up to now would be the same
whether continuing with resampling or with a standard parametric
test.
Conventional Calculational Methods
Estimating the Distribution of Differences Between Sample
and Population Means With the Normal Distribution. In the
conventional approach, one could in principle work from first
principles with lists and sample space, but that would surely be
too cumbersome. One could work with binomial proportions, but
this problem has too big a sample for tree-drawing and quincunx
techniques; even the ordinary textbook table of binomial
coefficients is too small for this job. Calculating binomial
coefficients also is a big job. So instead one would use the
Normal approximation to the binomial formula.
(Note to the non-statistician: The distribution of means
that we manipulate has the Normal shape because of the operation
of the Central Limit Theorem. Sums and averages, when the sample
is reasonably large, take on this shape even if the underlying
distribution is not Normal. This is a truly astonishing property
of randomly-drawn samples - the distribution of their means
quickly comes to resemble a "Normal" distribution, no matter the
shape of the underlying distribution. We then standardize it
with the standard deviation or some other device so that we can state
the probability distribution of the sampling error of the mean
for any sample of reasonable size.)
(The exercise of creating the Normal shape empirically is
simply a generalization of particular cases such as we will later
create here for the poll by resampling simulation. One can also
go one step further and use the formula of de Moivre-Laplace-
Gauss to describe the empirical distributions and to substitute
for them. Looking ahead now, the difference between resampling and
the conventional approach can be said to be that in the
conventional approach we simply plot the Gaussian distribution
very carefully, and use a formula instead of the empirical
histograms, afterwards putting the results in a standardized
table so that we can read them quickly without having to re-
create the curve each time we use it. More about the nature of
the Normal distribution may be found in Chapter 00 [Statphil].)
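The tendency described above can be seen directly in a few lines of simulation. The sketch below is merely illustrative (the right-skewed exponential universe is an arbitrary choice, not one from this chapter): the means of even modest-sized samples from a decidedly non-Normal universe cluster symmetrically around the universe mean.

```python
import random
import statistics

random.seed(1)

# A universe with a decidedly non-Normal (right-skewed) shape.
def universe_draw():
    return random.expovariate(1.0)  # mean 1.0, standard deviation 1.0

# The distribution of means of 2000 samples of size 100.
means = [statistics.mean(universe_draw() for _ in range(100))
         for _ in range(2000)]

# The means cluster tightly around the universe mean of 1.0, with a
# spread near 1/sqrt(100) = .1, as the Central Limit Theorem predicts.
print(round(statistics.mean(means), 2))
print(round(statistics.stdev(means), 2))
```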
All the work done above uses the information specified
previously - the sample size of 1500, the drawing with
replacement, the observed proportion as the criterion.
Confidence Intervals Empirically - With Resampling
Estimating the Distribution of Differences Between Sample
and Population Means By Resampling
What procedure to produce entities? Random selection from
urn or computer.
Simple (single step) or complex (multiple "if" drawings)?
Simple.
What procedure to produce re-samples? That is, with or
without replacement? With replacement.
Number of observations in actual sample, and hence
number of drawings in each resample? 1500.
What to record as result of each re-sample drawing? Mean,
median, or whatever of re-sample? The proportion is what we
seek.
Stating the distribution of results: The distribution of
proportions for the trial samples.
Choice of confidence bounds: 95%, two tails (choice made by
the textbook that posed the problem).
Computation of probabilities within chosen bounds: Read the
probabilistic result from the histogram of results.
Because the theory of confidence intervals is so abstract
(even with the resampling method of computation), let us now walk
through this resampling demonstration slowly, using the
conventional Approach 1 described previously. We first produce a
sample, and then see how the process works in reverse to estimate
the reliability of the sample, using the Bush-Dukakis poll as an
example. The computer program and output may be found in Chapter
00 [Howteach].
Step 1: Draw a sample of 1500 voters from a universe that,
based on the observed sample, is 56 percent for Bush, 44 percent
for Dukakis. The first such sample produced by the computer
happens to be 53 percent for Bush; it might have been 58 percent,
or 55 percent, or very rarely, 49 percent for Bush.
Step 2: Repeat step 1 perhaps 400 or 1000 times.
Step 3: Estimate the distribution of means (proportions) of
samples of size 1500 drawn from this 56-44 percent Bush-Dukakis
universe; the resampling result is shown in Figure II-4-1
Figure II-4-1
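Steps 1-3 can be sketched in a few lines of code. This is a minimal illustration, not the chapter's own program (which appears in Chapter 00); the trial count of 1000 is an arbitrary choice.

```python
import random

random.seed(1)

N = 1500        # size of the observed poll
P_BUSH = 0.56   # universe proportion, taken from the observed sample

# Steps 1 and 2: draw 1000 samples of 1500 voters each from a universe
# that is 56 percent for Bush, recording each sample proportion.
proportions = []
for _ in range(1000):
    bush_votes = sum(1 for _ in range(N) if random.random() < P_BUSH)
    proportions.append(bush_votes / N)

# Step 3: describe the distribution of sample proportions by its
# central 95 percent.
proportions.sort()
low, high = proportions[25], proportions[974]
print(low, high)  # roughly .535 to .585
```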
Step 4: In a fashion similar to what was done in steps 1-
3, now compute the 95 percent confidence intervals for some other
postulated universe mean - say 53% for Bush, 47% for Dukakis.
This step produces a confidence interval that is not centered on
the sample mean and the estimated universe mean, and hence it
shows the independence of our procedure from that magnitude.
And we now compare the breadth of the estimated confidence
intervals for the 5 and 95 percentiles generated with the 53-47
percent universe against the corresponding distribution of sample
means generated by the "true" Bush-Dukakis population of 56
percent - 44 percent. If the procedure works well, the results
of the two procedures should be similar.
Now we interpret the results using this first approach. The
histogram shows the probability that the difference between the
sample mean and the population mean - the error in the sample
result - will be (say) 4 percentage points too low. It follows
that about 47.5 percent (half of 95 percent) of the time, a
sample like this one will be between the population mean and 4
percent too low. We do not know the actual population mean. But
for any observed sample like this one, we can say that there is a
47.5 percent chance that the distance between it and the mean of the
population that generated it is minus four percent or less.
Now a crucial step: We turn around the statement just
above, and say that there is a 47.5 percent chance that the
population mean is less than four percentage points higher than
the mean of a sample drawn like this one, but at or above the
sample mean. (And we do the same for the other side of the
sample mean.)
So to recapitulate: We observe a sample and its mean. We
estimate the error by experimenting with one or more universes in
that neighborhood, and we then give the probability that the
population mean is within that margin of error from the sample
mean.
We can also use Approach 2, which is computationally simply
a short-circuiting of Approach 1 (though the interpretations
differ), as follows:
Step 1: As above.
Step 2: With a hypothetical distribution that is 56 percent
for Bush (the sample estimate), and in a non-binomial case with
the dispersion estimated from the sample, generate perhaps 400
samples of size 1500.
Step 3: Find the 95th percentile of the samples in Step 2.
Step 4: Centered at that 95th percentile, generate a
distribution of samples of size 1500 with the population
dispersion assumed the same as in step 2.
Step 5: Find the boundary which includes 95 percent of the
samples. If this boundary falls at the sample mean, then the
point at which this distribution is centered is indeed the 95
percent confidence bound (as it must be, as long as the
dispersion used in all of the universes is the same; they are
just set off from each other algebraically).
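Steps 2-5 of Approach 2 can be sketched as follows. This is only an illustrative sketch under assumed settings (1000 trials; the 97.5th percentile is used for the upper end of a two-tailed 95 percent interval): the lower 2.5 percent cut-off of the boundary universe should fall close to the observed sample proportion.

```python
import random

random.seed(1)

N = 1500          # poll size
SAMPLE_P = 0.56   # observed sample proportion for Bush
TRIALS = 1000

def sample_proportions(p):
    """Sorted proportions from TRIALS samples of size N, universe p."""
    results = []
    for _ in range(TRIALS):
        hits = sum(1 for _ in range(N) if random.random() < p)
        results.append(hits / N)
    return sorted(results)

# Step 2: samples from the universe suggested by the sample itself.
from_sample = sample_proportions(SAMPLE_P)

# Step 3: the upper percentile of those samples (97.5th, for the
# upper end of a two-tailed 95 percent interval).
upper = from_sample[int(TRIALS * 0.975)]

# Step 4: a distribution of samples centered at that boundary point,
# with the same dispersion mechanism as in step 2.
from_boundary = sample_proportions(upper)

# Step 5: the lower 2.5 percent cut-off of the boundary universe
# should fall near the observed sample proportion of .56.
cutoff = from_boundary[int(TRIALS * 0.025)]
print(upper, cutoff)
```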
Approach 2 for Counted Data: the Bush-Dukakis Poll
Let's implement Approach 2 for counted data, using for
comparison the Bush-Dukakis poll data discussed earlier in the
context of Approach 1.
We seek to state, for universes that we select on the basis
that their results will interest us, the probability that they
(or it, for a particular universe) would produce a sample as far
or farther away from the mean of the universe in question as the
mean of the observed sample - 56 percent for Bush. The most
interesting universe is that which produces such a sample only
about 5 percent of the time, simply because of the correspondence
of this value to a conventional break-point in statistical
inference. So we could experiment with various universes by
trial and error to find this universe.
We can learn from our previous simulations of the Bush-
Dukakis poll in Approach 1 that about 95 percent of the samples
fall within .025 on either side of the sample mean (which we had
been implicitly assuming is the location of the population mean).
If we assume (and there seems no reason not to) that the
dispersions of the universes we experiment with are the same, we
will find (by symmetry) that the universe we seek is centered on
those points .025 away from .56, or .535 and .585.
From the standpoint of Approach 2, then, the conventional
sample formula that is centered at the mean can be considered a
shortcut to estimating the boundary distributions. We say that
the boundary is at the point that centers a distribution which
has only a (say) 2.5 percent chance of producing the observed
sample; it is that distribution which is the subject of the
discussion - that is, one of the distributions at the endpoints
of the vertical line in Figure II-3-1 - and not the distribution
which is centered at mu = xbar. [1]
The results of these simulations are shown in Figure II-4-2.
Figure II-4-2
About these distributions centered at .535 and .585 - or more
importantly for understanding an election situation, the universe
centered at .535 - one can say: Even if the "true" value is as
low as 53.5 percent for Bush, there is only a 2 1/2 percent
chance that a sample as high as 56 percent pro-Bush would be
observed. (The values of a 2 1/2 percent probability and a 2 1/2
percent difference between 56 percent and 53.5 percent are
seemingly related arithmetically only by chance in this case.)
It would be even more revealing in an election situation to make
a similar statement about the universe located at 50-50, but this
would bring us almost entirely within the intellectual ambit of
hypothesis testing.
The demonstrations above using both Approaches 1 and 2 shed
light on the logic of interpretation of confidence intervals. We
have no basis in the work so far to say that there is a 95
percent chance that the confidence interval computed from a
particular sample captures the universe mean, or to make any
other such statement about the universe mean. Even so, unless
you have reason to believe that the probabilities of some
universe means are very different from others in the neighborhood
of the sample mean - which would seem to be a safe assumption in
the case with the presidential poll - then it would seem
reasonable to make betting odds that there is a 95 percent chance
that the confidence interval computed from a particular sample
captures the universe mean. If so, there would seem nothing
objectionable in this "naive" interpretation for a particular
sample.
Samples Whose Observations May Have More Than Two Values
So far we have discussed samples and universes that we can
characterize as proportions of elements which can have only one
of two characteristics - green or red, 1 or 0. Now let us
consider observations that can be characterized by a wider
variety of numbers; these cases are both simpler and more complex
than proportional universes. These are problems with
"continuous" (really multi-valued) data instead of the two-value
election poll problem above. The binomial case has a deceptively
easy appearance; in many ways the present problem is easier to do
than most. (Incidentally, in contrast to the Bush-Dukakis poll
example above, the 1992 U. S. presidential election was not
binomial but trinomial, and therefore it is a much more difficult
problem to deal with.)
A collection that contains only two sorts of elements (say,
green and red chips) can be characterized by just the proportion
(and the total number of elements). But a collection of (say)
prices of farms sold in province Z in year t would be
characterized by the numbers sold at each of many prices (and the
total number of sales). In the latter case, we notice at least
two characteristics: a) some sort of average, and b) the extent
to which the elements are spread out (and there may be yet other
characteristics that interest us). The inferences that we make
about the dispersion of such a collection are another important
part of statistical inference, interesting both for the
information in itself and for the light it throws on the
certainty of our other inferences.
Consider, for instance, that we have just the one sale price
of 13Q. We could estimate that the distribution is centered
around 13Q, but we have no idea whether all the prices are 13Q, or
whether the other prices tend to be far from 13Q. What if there
are only two sale prices - 13Q and 15Q, but we have no other
information, not even the meaning of a Q unit? What can we
reasonably say about the distribution, given that we have been
assured that the two observations are a representative sample of
prices?
We might immediately guess that half of the population is
within, and half outside of, the range from 13Q to 15Q. But what shape should
we guess for the distribution? Should it be horizontal? Shaped
like a Normal curve? Skewed to the right? Here we have no
recourse but to use some additional experience and perhaps
theory.
If we have some additional observations - say 10 more - we
could estimate the dispersion of the population, perhaps
calculating a standard deviation. That would give some guidance
even without assuming a shape for the distribution.
If we had some reason to assume that the distribution is
shaped Normally - say, if it arose from observations of a planet,
and the scatter could be assumed to be due to "error" - we could
immediately do the sort of inference that led to the Normal
distribution two centuries ago. If one of the observations is
quite far from the others - an apparent "outlier" - we could
calculate its probability if it is part of the same distribution,
using the standard deviation or other measure of the
distribution's dispersion. This would throw some light on
whether it probably was generated by the same universe as were
the other observations.
Approach 1 for Measured Data Example: Estimating Tree Diameters
What is the question? A horticulturist is experimenting
with a new type of tree. She plants 20 of them on a plot of
land, and measures their trunk diameter after two years. She
wants to establish a 90% confidence interval for the population
average trunk diameter. For the data given below, calculate the
mean of the sample and calculate (or describe a simulation
procedure for calculating) a 90% confidence interval around the
mean. Here are the 20 diameters (in no particular order):
8.5 7.6 9.3 5.5 11.4 6.9 6.5 12.9 8.7 4.8
4.2 8.1 6.5 5.8 6.7 2.4 11.1 7.1 8.8 7.2
What is the purpose to be served by answering the question?
Either Research & Development, or pure science.
Is this a "probability" or a "statistics" question?
Statistics.
What is the form of the statistics question? Confidence
limits.
What is the description of the sample that has been
observed? The raw data as shown above.
Statistics of the sample? Mean of the tree data.
Which universe? Assuming that the observed sample is
representative of the universe from which it is drawn, what is
your best guess about the properties of the universe whose
parameter you wish to make statements about? Answer: That the
universe is like the sample above, containing the numbers
8.5...7.2 - the population of trees that will grow with this new
type, as best estimated by the observations in the sample. (Are
there possibilities for Bayesian analysis?) No Bayesian prior
information will be included.
Which parameter do you wish to make statements about? The
mean.
Which symbols for the observed entities? Cards or computer
entries with numbers 8.5...7.2, representing a universe of infinite size.
If the universe is as guessed at, the variation among which
samples do you wish to estimate? Samples of size 20.
Here one may continue with the conventional method. Everything
up to now is the same whether continuing with resampling or with
a standard parametric test. The information listed above is the
basis for a conventional test.
Use perhaps a t test. Calculate the standard deviation, and
apply it to the t distribution (showing the Normal first). Read
the number of degrees of freedom from the sample size above.
Show the formula for mu +- 2 s.d.
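The conventional calculation just sketched can be written out as follows. This is only an illustrative sketch, not the chapter's own program; the t multiplier of 1.729 (90 percent, two-tailed, 19 degrees of freedom) is taken from a standard t table.

```python
import math
import statistics

diameters = [8.5, 7.6, 9.3, 5.5, 11.4, 6.9, 6.5, 12.9, 8.7, 4.8,
             4.2, 8.1, 6.5, 5.8, 6.7, 2.4, 11.1, 7.1, 8.8, 7.2]

n = len(diameters)
mean = statistics.mean(diameters)
sd = statistics.stdev(diameters)   # sample standard deviation
se = sd / math.sqrt(n)             # standard error of the mean

T_90 = 1.729  # t value for 90 percent, two-tailed, 19 degrees of freedom
low, high = mean - T_90 * se, mean + T_90 * se
print(round(mean, 2), round(low, 2), round(high, 2))
```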
Continuing with resampling
What procedure will be used to produce the trial entities?
Random selection. Simple (single step), not complex (multiple
"if" sample drawings).
What procedure to produce re-samples? With replacement.
Number of drawings? 20 trees.
What to record as result of re-sample drawing? The mean.
How to state the distribution of results? See histogram.
Choice of confidence bounds: 90%, two-tailed.
Computation of probabilities within chosen bounds: Read from
the histogram.
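The resampling checklist above can be carried out in a few lines of code; this is an illustrative sketch (1000 resamples is an arbitrary choice).

```python
import random
import statistics

random.seed(1)

diameters = [8.5, 7.6, 9.3, 5.5, 11.4, 6.9, 6.5, 12.9, 8.7, 4.8,
             4.2, 8.1, 6.5, 5.8, 6.7, 2.4, 11.1, 7.1, 8.8, 7.2]

# Draw 1000 resamples of 20 trees each, with replacement, recording
# the mean of each resample.
means = []
for _ in range(1000):
    resample = [random.choice(diameters) for _ in range(20)]
    means.append(statistics.mean(resample))

# 90 percent bounds, two-tailed: cut 5 percent from each end of the
# distribution of resample means.
means.sort()
print(means[50], means[949])  # roughly 6.6 to 8.4
```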
Approach 2 for Measured Data: The Diameters of Trees
To implement Approach 2 for measured data, one may proceed
exactly as with Approach 1 above except that the output of the
simulation with the sample mean as midpoint will be used for
guidance about where to locate trial universes for Approach 2.
Working from the histogram in Figure II-3-?, we try universes
located at 53.8 and 58.2. The results are shown in Figure II-4-
3.
Figure II-4-3
Interpretation of Approach 2
Now to interpret the results of the second approach:
Assuming that the sample is not drawn in a biased fashion (such
as the wind blowing all the apples in the same direction), and
assuming that the population has the same dispersion as the
sample, we can say that distributions centered at the 95 percent
confidence points (each of them including a tail with 2.5 percent
of the area), or even further away from the sample mean, will
produce the observed sample only 5 percent of the time or less.
The result of the second approach is more in the spirit of a
hypothesis test than of the usual interpretation of confidence
intervals. Another statement of the result of the second
approach is: We postulate a given universe - say, a universe at
the two-tailed 95 percent boundary line. We then say: The
probability that the observed sample would be produced by a
universe with a mean as far (or further) from the observed
sample's mean as the universe under investigation is only 2.5
percent. This is similar to the prob-value interpretation of a
hypothesis-test framework. It is not a direct statement about
the location of the mean of the universe from which the sample
has been drawn. But it is certainly reasonable to derive a
betting-odds interpretation of the statement just above, to wit:
the chances are 2 1/2 in 100 (or, the odds are 2 1/2 to 97 1/2)
that a population located here would generate a sample with a
mean as far away as the observed sample. And it would seem
legitimate to proceed to the further betting-odds statement that
(assuming we have no additional information) the odds are 97 1/2
to 2 1/2 that the mean of the universe that generated this sample
is no farther away from the sample mean than the mean of the
boundary universe under discussion. About this statement there
is nothing slippery, and its meaning should not be controversial.
Here again the tactic for interpreting the statistical
procedure is to restate the facts of the behavior of the universe
that we are manipulating and examining at that moment. We use a
heuristic device to find a particular distribution - the one that
is at (say) the 97 1/2 - 2 1/2 percent boundary - and simply
state explicitly what the distribution tells us implicitly: The
probability of this distribution generating the observed sample
(or a sample even further removed) is 2 1/2 percent. We could go
on to say (if it were of interest to us at the moment) that
because the probability of this universe generating the observed
sample is as low as it is, we "reject" the "hypothesis" that the
sample came from a universe this far away or further. Or in
other words, we could say that because we would be very surprised
if the sample were to have come from this universe, we instead
believe that another hypothesis is true. The "other" hypothesis
often is that the universe that generated the sample has a mean
located at the sample mean or closer to it than the boundary
universe.
The behavior of the universe at the 97 1/2 - 2 1/2 percent
boundary line can also be interpreted in terms of our
"confidence" about the location of the mean of the universe that
generated the observed sample. We can say: At this boundary
point lies the end of the region within which we would bet 97 1/2
to 2 1/2 that the mean of the universe that generated this sample
lies to the (say) right of it.
As noted in the preview to this chapter, we do not learn
about the reliability of sample estimates of the population mean
(and other parameters) by logical inference from any one
particular sample to any one particular universe, because in
principle this cannot be done. Instead, in this second approach
we investigate the behavior of various universes at the
borderline of the neighborhood of the sample, the characteristics
of those universes being chosen on the basis of their
resemblances to the sample. We seek, for example, to find the
universes that would produce samples with the mean of the
observed sample less than (say) 5 percent of the time. In this
way the estimation of confidence intervals is like all other
statistical inference: One investigates the probabilistic
behavior of hypothesized universes, the hypotheses being
implicitly suggested by the sample evidence but not logically
implied by that evidence.
Approaches 1 and 2 may (if one chooses) be seen as identical
conceptually as well as (in many cases) computationally. But as
I see it, the interpretation of them is rather different, and
distinguishing them helps one's intuitive understanding.
THE PROBLEM OF UNCERTAINTY ABOUT THE DISPERSION
The inescapable difficulty of estimating the amount of
dispersion in the population has greatly exercised statisticians
over the years. Hence I must try to clarify the matter. Yet in
practice this issue turns out not to be the likely source of much
error even if one is somewhat wrong about the extent of
dispersion, and therefore we should not let it be a stumbling
block in the way of our producing estimates of the accuracy of
samples in estimating population parameters.
Student's t test was designed to get around the problem of
the lack of knowledge of the population dispersion. But Wallis
and Roberts wrote about the t test: "[F]ar-reaching as have been
the consequences of the t distribution for technical statistics,
in elementary applications it does not differ enough from the
normal distribution... to justify giving beginners this added
complexity" (1956, p. x). "Although Student's t and the F ratio
are explained... the student ... is advised not ordinarily to use
them himself but to use the shortcut methods... These, being non-
parametric and involving simpler computations, are more nearly
foolproof in the hands of the beginner - and, ordinarily, only a
little less powerful" (p. xi).<1>
If we knew the population parameter - the proportion, in the
case we will discuss - we could easily determine how inaccurate
the sample proportion is likely to be. If, for example, we
wanted to know about the likely inaccuracy of the proportion of a
sample of 100 voters drawn from a population of a million that is
60% Democratic, we could simply simulate drawing (say) 200
samples of 100 voters from such a universe, and examine the
average inaccuracy of the 200 sample proportions.
But in fact we do not know the characteristics of the actual
universe. Rather, the nature of the actual universe is what we
seek to learn about. Of course, if the amount of variation among
samples were the same no matter what the Republican-Democrat
proportions in the universe, the issue would still be simple,
because we could then estimate the average inaccuracy of the
sample proportion for any universe and then assume that it would
hold for our universe. But it is reasonable to suppose that the
amount of variation among samples will be different for different
Democrat-Republican proportions in the universe.
Let us first see why the amount of variation among samples
drawn from a given universe is different with different relative
proportions of the events in the universe. Consider a universe
of 999,999 Democrats and one Republican. Most samples of 100
taken from this universe will contain 100 Democrats. A few (and
only a very very few) samples will contain 99 Democrats and one
Republican. So the biggest possible difference between the
sample proportion and the population proportion (99.9999%) is
less than one percent (for the very few samples of 99%
Democrats). And most of the time the difference will only be the
tiny difference between a sample of 100 Democrats (sample
proportion = 100%), and the population proportion of 99.9999%.
Compare the above to the possible difference between a
sample of 100 from a universe of half a million Republicans and
half a million Democrats. At worst a sample could be off by as
much as 50% (if it got zero Republicans or zero Democrats), and
at best it is unlikely to get exactly 50 of each. So it will
almost always be off by 1% or more.
It seems, therefore, intuitively reasonable (and in fact it
is true) that the likely difference between a sample proportion
and the population proportion is greatest with a 50%-50%
universe, least with a 0%-100% universe, and somewhere in between
for probabilities between 50% and the endpoints, in the fashion
of Figure II-4-4.
Figure II-4-4
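The pattern of Figure II-4-4 follows from the standard error of a sample proportion, the square root of p(1-p)/n, which is largest at p = .5 and shrinks toward zero at the endpoints. A quick check (sample size 100 is an arbitrary choice):

```python
import math

n = 100  # sample size

def standard_error(p):
    """Standard deviation of the sample proportion for universe proportion p."""
    return math.sqrt(p * (1 - p) / n)

# The spread is greatest at .5 and nearly zero at .999999.
for p in (0.999999, 0.9, 0.7, 0.5):
    print(p, round(standard_error(p), 4))
```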
Though one commonly estimates the variation of sample means
(sample sizes the same as the observed sample) for proportions in
the neighborhood of the estimated population mean - which implies
a population dispersion (s.d.) appropriate for that
neighborhood - one could also use a more "conservative" estimate
of dispersion; Mosteller et al. (1970) suggest that if you work
with the largest possible amount of variation (for example, the
value at .5 in the case of a problem involving a proportion), you
ensure that you cannot obtain too small a confidence interval by
underestimating the variation. (Here again we see the role of
judgment, as discussed in Chapter 00.)
Perhaps it will help to clarify the issue of estimating
dispersion if we consider this: an estimate for a second sample
based on a) the population will be more accurate than one based
on b) the first sample, because of the sampling variation in the
first sample that affects the latter estimate. But we cannot
estimate that sampling variation without knowing more about the
population.
ARGUMENTS ABOUT INTERPRETATION OF CONFIDENCE INTERVALS
Discussions of confidence intervals often assert that one
cannot make a probability statement about where the population
mean may be, but one can make statements about the probability
that a set of samples may bound it. For example:
... Although on average X-bar is on target, the
specific sample mean X-bar that we happen to observe is
almost certain to be a bit high or a bit low.
Accordingly, if we want to be reasonably confident that
our inference is correct, we cannot claim that mu is
precisely equal to the observed X-bar. Instead, we
must construct an interval estimate or confidence
interval of the form:
mu = X-bar + sampling error
The crucial question is: How wide must this allowance for
sampling error be? The answer, of course, will depend on
how much X-bar fluctuates...
Constructing 95% confidence intervals is like pitching
horseshoes. In each case there is a fixed target, either
the population mu or the stake. We are trying to bracket it
with some chancy device, either the random interval or the
horseshoe. This analogy is illustrated in Figure 8-3.
There are two important ways, however, that confidence
intervals differ from pitching horseshoes. First, only
one confidence interval is customarily constructed.
Second, the target mu is not visible like a horseshoe
stake. Thus, whereas the horseshoe player always knows
the score (and specifically, whether or not the last
toss bracketed the stake), the statistician does not.
He continues to "throw in the dark," without knowing
whether or not a specific interval estimate has
bracketed mu. All he has to go on is the statistical
theory that assures him that, in the long run, he will
succeed 95% of the time. (Wonnacott and Wonnacott,
1990, p. 258).
This criticism does not seem to me to fit Approach 1 above. The
criticism apparently stems from objections by the frequentists.
But if one takes the operational-definition point of view (see
Chapter 00), and if we agree that our interest is upcoming events
and probably decision-making, then we obviously are interested in
putting betting odds on the location of the population mean (and
subsequent samples). Only a probability statement, not a statement
about the process, will help us with that.
Notice that in the earlier discussion it was never necessary
to use the notion of the "true" population mean that such writers
as Wonnacott and Wonnacott employ (see their appendix). As
discussed in Chapter 00, the notion of a "true parameter" tends
to confuse the issue, and is out of keeping with Einstein's device of
the operational definition. Rather than having in mind some
"true" value, we should instead ask: "What will happen if I...",
or "...if I again..."
Bayesians, too, complain of the process point of view.
Savage writes that the process
...is a sort of fiction; for it will be found that
whenever its advocates talk of making assertions that
have high probability, whether in connection with
testing or estimation, they do not actually make such
assertions themselves, but endlessly pass the buck,
saying in effect, "This assertion has arisen according
to a system that will seldom lead you to make false
assertions, if you adopt it. As for myself, I assert
nothing but the properties of the system." (1972, pp.
260-261)
Lee writes at greater length:
[T]he statement that a 95% confidence interval for an
unknown parameter ran from -2 to +2 sounded as if the
parameter lay in that interval with 95% probability and
yet I was warned that all I could say was that if I
carried out similar procedures time after time then the
unknown parameters would lie in the confidence
intervals I constructed 95% of the time.
Subsequently, I discovered that the whole theory
had been worked out in very considerable detail in such
books as Lehmann (1959, 1986). But attempts such as
those that Lehmann describes to put everything on a
firm foundation raised even more questions. (Lee,
1989, p. vii)
NOTES ON THE USE OF CONFIDENCE INTERVALS
1. Confidence intervals are used more frequently in the
physical sciences - indeed, the concept was developed for use in
astronomy - than in biostatistics and the social sciences; in the
latter fields measurement is less often the main problem, and
distinguishing between hypotheses is often the difficulty.
2. Some statisticians suggest that one can do hypothesis
tests with the confidence-interval concept. But that seems to me
equivalent to suggesting that one can get from New York to
Chicago by flying first to Los Angeles. Additionally, the logic
of hypothesis tests is much clearer than the logic of confidence
intervals, and it corresponds much more closely to our
intuitions.
3. Discussions of confidence intervals sometimes assert
that one cannot make a probability statement about where the
population mean may be, yet can make statements about the
probability that a particular set of samples may bound that
mean.
If one takes the operational-definition point of view (see
discussion of that concept in connection with the concept of
probability), and we agree that our interest is upcoming events
and probably decision-making, then we obviously are interested in
putting betting odds on the location of the population mean (and
subsequent samples). And only a probability statement, not a
statement about the process, will help us with that.
Moving progressively farther away from the sample mean, we
can find a universe that has only some (any) specified small
probability of producing a sample like the one observed. One can
say, I suppose, that this point represents a "limit" or
"boundary," and that the interval between it and the sample mean
may be called a confidence interval.
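This search for a boundary universe can be sketched in a few lines of Python (a minimal illustration only - the function name and step size are my own choices, and I have scaled the chapter's poll down to 84 of 150, rather than 840 of 1500, so the simulation runs quickly; the logic is identical):

```python
import random

def chance_of_sample(p, n, observed, trials=1000):
    """Monte Carlo estimate of the chance that a population with
    proportion p produces a sample count as low as the one observed."""
    hits = sum(1 for _ in range(trials)
               if sum(1 for _ in range(n) if random.random() < p) <= observed)
    return hits / trials

# Scaled-down poll figures (84 of 150, standing in for 840 of 1500).
# Move the candidate universe progressively away from the sample
# proportion until it has less than a 5 percent chance of producing
# a sample like the one observed.
p = 84 / 150
while chance_of_sample(p, 150, 84) >= 0.05:
    p += 0.01
print(round(p, 2))  # the upper "limit" or "boundary"
```

The loop stops at roughly p = 0.63 or 0.64; stepping downward from the sample proportion in the same way would locate the lower boundary.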
SUMMARY
Let's summarize what one can and cannot assert about
confidence intervals:
1. One can always state the probability that a given
population S will produce a given sample s (or more precisely, a
sample with a given mean xbar, or some other statistic). This is a
straightforward deduction which can be performed either
theoretically with formal probability theory or with a Monte
Carlo resampling technique. Indeed, such statements are the core
of all statistics problems; all the rest of statistics is
interpretation.
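This straightforward deduction can be made concrete with a short Monte Carlo sketch (the function name, the trial count, and the choice of Python are mine), using the Bush-Dukakis poll figures from earlier in the chapter:

```python
import random

def prob_sample_from_population(pop_proportion, sample_size,
                                observed_count, trials=2000):
    """Monte Carlo estimate of the probability that a binomial
    population with the given proportion of successes produces a
    sample whose success count is at least as large as the observed."""
    hits = 0
    for _ in range(trials):
        count = sum(1 for _ in range(sample_size)
                    if random.random() < pop_proportion)
        if count >= observed_count:
            hits += 1
    return hits / trials

# The poll figures from earlier in the chapter: 840 Bush voters of 1500.
# A 50-50 population essentially never produces a sample this lopsided.
print(prob_sample_from_population(0.50, 1500, 840))
```

The same deduction could of course be performed theoretically with the binomial formula; the resampling version simply makes the logic visible.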
2. Derived from (1) above, one can state the relative
probabilities - that is, the ratio of the probabilities - of two
given S's producing a given s.
3. One cannot ever estimate the probability that a
particular sample came from any particular population - or even
put probabilistic bounds (confidence limits) around its mean - on
the basis of sample evidence alone. This is the issue of
induction that mathematicians and philosophers have been
struggling with for more than two centuries, and undoubtedly
before that, too. Even if one knows the mean of a population
that would produce the observed sample (or a sample even further
away) only (say) 5 percent of the time, one cannot say anything
about the probability that that particular population produced
the observed sample based on only the sample evidence. The
probability of any given population depends on probabilities of
other populations.
To see that this is so, postulate that we have been told
that a given sample of green and red balls was produced by either
one of two universes - A with a proportion of X green balls, and
B with a proportion of Y green balls - and it is equally likely
which urn the sample was drawn from. Assume we are able to state
(using Bayesian reasoning) that it is twice as likely that the
sample came from urn A as from urn B. If we take urn A as our
reference, it is clear that if the alternative urn B had had some
proportion of green balls other than Y, our conclusion would have
been something other than "twice as likely."
some other assumption about the alternatives to any stated
population, no meaningful probability statement could be made
about the probability that a sample came from that universe.
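The urn argument can be checked with hypothetical numbers - say, 7 green balls in a sample of 10, with urn A at 60 percent green; the figures and function names are mine, chosen only for illustration. With equal priors on the two urns, the posterior odds on A are just the likelihood ratio, and they change whenever the assumed proportion for urn B changes:

```python
from math import comb

def likelihood(p, n, k):
    """Probability that an urn with proportion p of green balls
    yields k green balls in a sample of n (binomial probability)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def posterior_odds_A(p_a, p_b, n, k):
    """With equal prior probability on the two urns, the posterior
    odds in favor of urn A equal the likelihood ratio."""
    return likelihood(p_a, n, k) / likelihood(p_b, n, k)

# Hypothetical sample: 7 green balls out of 10; urn A is 60% green.
print(posterior_odds_A(0.6, 0.4, 10, 7))  # B at 40%: exactly 1.5**4 = 5.0625
print(posterior_odds_A(0.6, 0.5, 10, 7))  # B at 50%: a different answer
```

Changing only the alternative urn's proportion changes the odds on A, which is the point of the argument: no probability statement about A is possible without some assumption about the alternatives.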
Here again I repeat the crucial distinction between
discussing the probability that a sample could come from a given
universe, and the probability that a sample came from a given
universe. The former is straightforward, as in (1) above; the
latter cannot be stated meaningfully without additional
assumptions. Not distinguishing between these two statements may
be at the heart of most muddles about the fundamentals of
statistics.
With the first approach described in this chapter, we can
sensibly say something about the probability that the mean of the
population that produced a particular sample is within some
distance of the sample mean, or that a particular population has
only an X percent chance of producing a sample like this one.
Those statements are entirely different from speaking about the
probability that the sample came from a given population.
With the second approach described in this chapter, one can
say that the confidence interval includes all the means of
populations that have a greater than 5 percent chance of
producing the observed sample. This crucial statement may be
cumbersome, but it is logically airtight. On the other hand,
this does not imply - so far as I can now see - anything about
the mean of the population from which this sample actually came -
or more precisely, the population that produced this sample.
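Cumbersome in words, this statement is short in code. Here is a minimal sketch (function names mine; the chapter's poll scaled down to 84 of 150 so the simulation runs quickly) that keeps every candidate population proportion with more than a 5 percent two-tailed chance of producing the observed sample:

```python
import random

def tail_probs(p, n, observed, trials=1000):
    """Monte Carlo estimates of the chance that a population with
    proportion p yields a sample count at least, and at most, as
    large as the observed count."""
    counts = [sum(1 for _ in range(n) if random.random() < p)
              for _ in range(trials)]
    upper = sum(1 for c in counts if c >= observed) / trials
    lower = sum(1 for c in counts if c <= observed) / trials
    return lower, upper

def interval_of_plausible_means(candidates, n, observed):
    """Keep every candidate proportion that has more than a 5 percent
    (two-tailed) chance of producing a sample like the observed one."""
    inside = []
    for p in candidates:
        lower, upper = tail_probs(p, n, observed)
        if min(lower, upper) > 0.025:
            inside.append(p)
    return inside

# Scaled-down poll (84 of 150, standing in for 840 of 1500).
candidates = [round(0.40 + 0.02 * i, 2) for i in range(17)]
inside = interval_of_plausible_means(candidates, 150, 84)
print(inside)
```

The list printed is exactly the set described in the text: the means of all candidate populations that have a greater than 5 percent chance of producing the observed sample.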
The oft-denounced statement that the confidence interval
includes the population mean, or that the population mean lies
within those bounds, with probability of (say) 95 percent is
loose but not too bad if we include implicit assumptions about
non-bias and about the dispersion of the population and the
sample. Or, as some would prefer, this procedure will lead to
those points bracketing the population mean 95 percent of the
time you do this sort of thing. Such statements probably are not
very inaccurate, given that the world around us is well-behaved
in such respects most of the time (see Chapter I-1). And such
statements should be generally acceptable. But they are not
logically implied. Nor can any of this be proven empirically in
any way, so far as I know. (It might be tested on assumptions of
equality of dispersion along the continuum, and assuming a
continuum of some sort. But this may not be a profitable avenue
of thought.)
ENDNOTES
**FOOTNOTES**
[1]: When working with proportions, the conventional method
must obtain these points from prepared ellipses and binomial
tables, not from the sort of geometric trick used in the previous
paragraphs. Hence showing the distribution centered at xbar =
mu, as in the conventional approach, is quite misleading.
There seems to me to be no basis for this, either.
After all, a single sample may be regarded as n samples of size
one. Why should one be able to draw different sorts of
conclusions from a set of samples of size one than from the
evidence in all those samples aggregated into a single large
sample? The principle is the same.
**ENDNOTES**
<1>: They go on to say, "Techniques and details, beyond a
comparatively small range of fairly basic methods, are likely to
do more harm than good in the hands of beginners...The great
ideas...are lost...nonparametric [methods] involving simpler
computations, are more nearly foolproof in the hands of the
beginner" (1956, viii, xi). Their stance is very much in
contrast to that of Fisher, who wrote somewhere about the t test
as a "revolution."