CHAPTER II-2

TESTING HYPOTHESES

The concept of sameness was said in Chapter 00 to be the basic idea in all knowledge-getting, and the assumption of sameness is the most valuable heuristic we have for the process of inference. It follows that it is of great importance to be able to say with some confidence whether two collections of things are or are not the same for one's current purposes.

In most situations in which we find ourselves, it is quite clear without refined inquiry whether or not it is sensible to treat the collections or items as the same. When you see bags of potato chips on the shelf of a supermarket, you don't bother to examine which bag has the more or better-shaped chips, though some discriminating purchasers examine individual pieces of fresh fruit at the greengrocer; any possible difference between the potato-chip bags is too small for you to be concerned with, you are sure. But when you see a group of Japanese tourists born in the 1930s or earlier, you are sure that their average height is far less than that of the next group of Japanese tourists, born in the 1960s. And you do not need to collect data or consult previous studies to know that ingested alcohol has a powerful effect on the human constitution and behavior.

From time to time, however, situations arise in which we are in doubt about whether two collections or items should be considered the same or different (the heights of two groups of tourists, say), or whether one "outlier" should be considered to come from the same collection as the other observations (of a planet, say, one of the earliest problems in statistical inference). To help us make a determination, we call upon one of the main procedures in statistical inference - the testing of hypotheses. The logic of hypothesis testing is the subject of this chapter. This logic is relatively uncomplicated and uncontroversial compared to the logic of confidence intervals, discussed in Chapter 00.
Yet it, too, cannot be done as a matter of routine; it requires judgment.

The first published formal test of a hypothesis was made by John Arbuthnot, physician to Queen Anne of England, whose 1710 test makes him the father of formal statistical inference. He observed that more boys than girls are born, which he assumed is necessary for the survival of the species, and he wished to prove that birth sex is indeed not a 50-50 probability. The records for London showed that male births exceeded female births 82 years in a row. Arbuthnot therefore set forth to (in modern language) test the hypothesis that a universe with a 50-50 probability of producing males could result in 82 successive years with preponderantly male births.

This is a canonical problem. You have some observed "sample" data, and you want to connect them to some specified "population" from which they may have come. The previous sentence was purposely worded vaguely, because statistical questions can be stated in many different ways. But in this case statisticians agree on how to proceed: Specify the universe, and compare its behavior against the observed sample. If it is unlikely that a sample as surprising as the observed one should come from the specified universe, conclude that the sample did not come from that universe. Chapter III-1 describes how Arbuthnot went about it.

The practical business of carrying out a statistical inference begins with the translation of a general question into a scientific question, and thence into a question amenable to statistical treatment; this process of translation, common to all scientific inference, is discussed in Chapter 00. The general procedure for the probabilistic manipulation carried out in the context of a statistical inference, which pertains to all statistical inference and not just to testing hypotheses, is set forth below. (The subject was introduced in Chapter I-1.)
The overall procedure for a statistical inference, from the translation of the question to the conclusion, can be framed as a long series of questions and answers about the nature of the universe(s) and sample(s), the probabilistic manipulation, and then the interpretation. The canonical series of these questions and answers for testing hypotheses is presented in this chapter; the series of questions for finding confidence intervals is in Chapter 00.

THE STEPS IN STATISTICAL INFERENCE

These are the steps in conducting statistical inference:

1. Frame a question in the form: What is the chance of getting the observed sample s from some specified population S? The postulated universe S bears some likeness to the model created by the researcher against which to test the observed data. But instead of deriving from theory, insight, hunch, or whatever, in inference the model derives from the sample (plus perhaps a Bayesian prior). Another difference from the "scientific" model is that the postulated universe S has no causal connection to sample s except the process of (random?) sampling. Universe S is like a scientific model in that it is assumed not to be a perfect picture of nature. But unlike a scientific model, in the case of a finite universe we assume that larger and larger samples can approach the actual universe.

2. Reframe the question in the form: What kinds of samples does population S produce, and with which probabilities? That is, what is the probability of the observed sample s given that the population is S? Or, what is p(s|S)?

3. Actually investigate the behavior of S with respect to s and other samples. This can be done in two ways:

a. Use the calculus of probability ("math"), perhaps resorting to the Monte Carlo method if an appropriate formula does not exist. Or

b.
Resampling (in the larger sense), which equals the Monte Carlo method minus its use for approximations, for the investigation of complex functions in statistics and other theoretical mathematics, and for non-resampling uses elsewhere in science. Resampling in the more restricted sense includes the bootstrap, permutation tests, and other non-parametric methods. More about the resampling procedure follows in the paragraphs to come, and then in later chapters of the book.

4. Interpret the probabilities that result from step 3 in terms of acceptance or rejection of hypotheses, surety of conclusions, and inputs to decision theory.

The following short definition of statistical inference summarizes the previous four steps: statistical inference is the selection of a probabilistic model to resemble the process you wish to investigate, the investigation of that model's behavior, and the interpretation of the results.

STEPS IN ESTIMATION OF STATISTICAL PROBABILITIES BY RESAMPLING

Stating the steps to be followed in a procedure is an operational definition of the procedure. My belief in the clarifying power of this device is embodied in the several sets of steps given in this chapter for the various aspects of statistical inference. This section sets forth the steps for the computation of the probabilities when the inference will be done with resampling. More detail may be found in the rest of this chapter, and in Chapter 00.

Let us define resampling in a fashion that will include not only problems in inferential statistics but also problems in probability, as follows: Using the entire set of data you have in hand, or using the given data-generating mechanism (such as a die) that is a model of the process you wish to understand, produce new samples of simulated data, and examine the results of those samples. In some cases, it may also be appropriate to amplify this procedure with additional assumptions.
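This definition can be made concrete with a minimal sketch in Python (the data values here are hypothetical, invented only to illustrate the mechanics):

```python
import random

# Hypothetical observed data: ten measurements in hand.
observed = [12, 7, 9, 15, 11, 8, 14, 10, 9, 13]

random.seed(1)

def resample_means(data, trials=10000):
    """Produce new samples of simulated data by drawing, with
    replacement, from the entire set of data in hand; record the
    mean of each simulated sample."""
    means = []
    for _ in range(trials):
        simulated = [random.choice(data) for _ in data]
        means.append(sum(simulated) / len(simulated))
    return means

means = resample_means(observed)

# Examine the results of those samples, e.g. how much the mean varies.
print(min(means), max(means))
```

The same skeleton serves for problems in pure probability; there the "data" in hand would instead be the known data-generating mechanism, such as the faces of a die.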
Problems in pure probability may at first seem different in nature from problems in statistical inference. But the same logic as stated in this definition applies to both varieties of problems. The difference is that in probability problems the "model" is known in advance -- say, the model implicit in a deck of poker cards plus a game's rules for dealing and counting the results -- rather than the model being assumed to be best estimated by the observed data, as in resampling statistics.

The following general procedure describes what we are doing when we estimate a probability using resampling operations:

Step A. Construct a simulated "universe" of cards or dice or some other randomizing mechanism whose composition is similar to the universe whose behavior we wish to describe and investigate. The term "universe" refers to the system that is relevant for a single simple event. For example: A coin with two sides, or two sets of random numbers "1-105" and "106-205", simulates the system that produces a single male or female birth, when we are estimating the probability of three girls in the first four children, or of nine female calves in ten births (the problem to be treated below). Notice that in this universe the probability of a female remains the same from trial event to trial event -- that is, the trials are independent -- demonstrating a universe from which we sample with replacement.

Step(s) B. Specify the procedure that produces a pseudo-sample which simulates the real-life sample in which we are interested. That is, specify the procedural rules by which the sample is drawn from the simulated universe. These rules must correspond to the behavior of the real universe in which you are interested. To put it another way, the simulation procedure must produce simple experimental events with the same probabilities as the simple events have in the real world.
For example: In the case of three daughters in four children, or nine female calves in ten births, you can draw a card and then replace it if you are using a deck of red and black cards. Or if you are using a random-numbers table, the random numbers automatically simulate replacement. Just as the chances of having a female or a male do not change depending on the sex of the preceding birth, so we want to ensure through replacement that the chances do not change each time we choose from the deck of cards. Recording the outcome of the sampling must be indicated as part of this step, e.g. "record `yes' if female, `no' if male."

Step(s) C. If several simple events must be combined into a composite event, and if the composite event was not described in the procedure in step B, describe it now. For example: For the number of females in a sample of births, the procedure for each simple event of a single birth was described in step B. Now we must specify repeating the simple event four times, and counting whether the outcome is or is not three girls in the four childbirths, or nine females in the ten calves. Recording of "three or more girls" or "two or fewer girls", and of "9 or more females" or "8 or fewer", is part of this step. This record indicates the results of all the trials and is the basis for a tabulation of the final result.

Step(s) D. Calculate the probability of interest from the tabulation of outcomes of the resampling trials. For example: the proportions of a) "yes" and "no", and b) "9 or more" and "8 or fewer", estimate the likelihood we set out to find.

An Example [from Hodges and Lehmann, 1970]: Female calves are more valuable than male calves. A bio-engineer claims to have a method that can cause more females to be born. He tests the procedure on ten of your pregnant cows, and the result is nine females. Should you believe that his method has some effect? That is, what is the probability of a result this surprising, or more so, occurring by chance?
The actual computation of the probability may be done with several formulaic or sample-space methods, and with several resampling methods. I will first show a resampling method and then several conventional methods. The following material, which allows one to compare resampling and conventional methods, is more germane to the explication of resampling in Chapters 00 and 00 than to the theory of hypothesis testing discussed in this chapter, but it is more expedient to present it here.

Computation of Probabilities with Resampling

We can do the problem by hand as follows:

1. Constitute an urn with either one blue and one pink ball, or 105 blue and 100 pink balls.

2. Draw ten balls with replacement, count the pinks, and record.

3. Repeat step (2) say 400 times.

4. Calculate the proportion of results with 9 or 10 pinks.

Or, we can take advantage of the speed and efficiency of the computer, as follows (also in ycha071):

REPEAT 15000
    GENERATE 10 1,2 A
    COUNT A =1 B
    SCORE B Z
END
HISTOGRAM Z
COUNT Z >=9 K
DIVIDE K 15000 KK
PRINT KK

[The HISTOGRAM command draws a bar chart of Z, the number of females in each trial of ten simulated births; the frequency table below summarizes the same output.]

Vector no. 1: Z

    Bin                     Cum
  Center     Freq     Pct    Pct
     0         22      0.1    0.1
     1        163      1.1    1.2
     2        650      4.3    5.6
     3       1801     12.0   17.6
     4       3075     20.5   38.1
     5       3717     24.8   62.9
     6       3035     20.2   83.1
     7       1739     11.6   94.7
     8        636      4.2   98.9
     9        145      1.0   99.9
    10         17      0.1  100.0

Note: Each bin covers all values within 0.1 of its center.
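The same procedure can also be sketched in Python, as a check on the program above (the program's own printed result, KK, appears just below):

```python
import random

random.seed(0)

TRIALS = 15000
count_9_or_more = 0
for _ in range(TRIALS):
    # One trial: ten births, each female with probability 1/2.
    births = [random.randint(1, 2) for _ in range(10)]
    if births.count(1) >= 9:
        count_9_or_more += 1

kk = count_9_or_more / TRIALS
print(kk)  # close to the exact value 11/1024, about 0.011
```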
KK = 0.0108

CONVENTIONAL METHODS

Sample Space and First Principles

Assume for a moment that our problem is a smaller and therefore much easier one - the probability of getting two females in two calves if the probability of a female is .5. One could then map out the sample space, and find the proportion of points that correspond to a "success". We list all four possible combinations - FF, FM, MF, MM. Then we look at the ratio of the number of combinations that have two females to the total, which is 1/4. We may then interpret this probability.

We might also use this method for (say) five female calves in a row. We can make a list such as FFFFF, MFFFF, MMFFF, MMMFF ... MFMFM ... MMMMM. There will be 2*2*2*2*2 = 32 possibilities, and 64 and 128 possibilities for six and seven calves respectively. But by the time we get to ten calves, this method becomes very troublesome.

Sample Space Calculations

For two females in a row, we could use the well-known, and very simple, multiplication rule; we could do so even for ten females in a row. But calculating the probability of nine females in ten is a bit more complex.

Pascal's Triangle

One can use Pascal's Triangle to obtain the binomial coefficients for p = .5 and a sample size of 10, focusing on those for 9 or 10 successes. Then calculate the proportion of the total cases with 9 or 10 "successes" in one direction, to find the proportion of cases that pass beyond the criterion of 9 females. The method of Pascal's Triangle requires a more complete understanding of the probabilistic system than does the resampling simulation described above, because Pascal's Triangle requires that one understand the entire structure; simulation requires only that you follow the rules of the model.

The Quincunx

The quincunx is more a simulation method than a theoretical one, but it may be considered "conventional". Hence I include it here for completeness.
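Returning for a moment to the sample-space method: the enumeration can be sketched in a few lines of Python, and the counts show why listing by hand becomes troublesome by ten calves:

```python
from itertools import product

def count_all_female(n):
    """Enumerate every equally likely birth sequence of length n
    and count those that are all female."""
    outcomes = list(product("FM", repeat=n))  # FF, FM, MF, MM for n = 2
    successes = sum(1 for o in outcomes if all(c == "F" for c in o))
    return successes, len(outcomes)

print(count_all_female(2))   # (1, 4): probability 1/4, as found above
print(count_all_female(5))   # (1, 32): 32 sequences to list by hand
print(count_all_female(10))  # (1, 1024): far too many to list
```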
Table of Binomial Coefficients

The Pascal Triangle becomes cumbersome or impractical with large numbers - say, 17 females in 20 births - or with probabilities other than .5. One might produce the binomial coefficients by algebraic multiplication, but that, too, becomes tedious even with small sample sizes. One can also use the pre-computed table of binomial coefficients found in any standard text. But the probabilities for n = 10 and 9 or 10 females are too small to be shown in such tables.

Binomial Formula

For larger sample sizes, one can use the binomial formula. The binomial formula gives no deeper understanding of the statistical structure than does the Triangle, but it does yield a deeper understanding of the pure mathematics. With very large numbers, even the binomial formula is cumbersome.

The Normal Approximation

When the sample size becomes too large for any of the above methods, one can then use the Normal approximation, which yields results close to the binomial (as seen very nicely in the output of the quincunx). But to employ the Normal distribution one requires an estimate of the standard deviation, which can be derived either by formula or by resampling. (See a more extended parallel discussion in Chapter 00 on confidence intervals for an election poll.)

The desired probability can be obtained from the Z formula and a standard table for the Normal distribution found in every elementary text. The Z table can be made less mysterious if one generates it with simulation, or with graph paper or Archimedes' method, using as raw material (say) five "continuous" (that is, non-binomial) distributions, many of which are skewed:

1) Draw samples of (say) 50 or 100.

2) Plot the means to see that the Normal shape is the outcome.

3) Then standardize with the standard deviation by marking the standard deviations onto the histograms.
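For the calves problem, the binomial formula and the Normal approximation can both be sketched in a few lines of Python; math.comb supplies the binomial coefficients that the Triangle or a printed table would give:

```python
import math

def binomial_tail(n, k, p):
    """Exact probability of k or more successes in n trials,
    each with success probability p."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k, n + 1))

# Exact binomial probability of 9 or 10 females in 10 births.
exact = binomial_tail(10, 9, 0.5)  # 11/1024, about 0.0107

# Normal approximation with a continuity correction of one half:
# mean np, standard deviation sqrt(npq).
n, p = 10, 0.5
mu = n * p
sigma = math.sqrt(n * p * (1 - p))
z = (8.5 - mu) / sigma
approx = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail Normal probability

print(exact, approx)  # both near 0.01
```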
The aim of the above exercise, and the heart of the conventional parametric method, is to compare the sample result - the mean - to a standardized plot of the means of samples drawn from the universe of interest, to see how likely it is that that universe produces means deviating as much from the universe mean as does our observed sample mean. The steps are:

1. Establish the Normal shape - from the exercise above, or from the quincunx or Pascal's Triangle or the binomial formula or the formula for the Normal approximation or some other device.

2. Standardize that shape in standard deviations.

3. Compute the Z score for the sample mean - that is, its deviation from the universe mean in standard deviations.

4. Examine the Normal distribution (or really, tables computed from graph-paper exercises such as the above) to find the likelihood of a mean deviating that far by chance.

This is the canon of the procedure for most parametric work in statistics. For some small samples, accuracy is improved by using the t distribution, a matter discussed in Chapter 00.

CHOICE OF THE BENCHMARK UNIVERSE<1>

In the example of the ten calves, the choice of a benchmark universe - a universe that (on average) produces equal proportions of males and females - seems rather straightforward and even automatic, requiring no difficult judgments. But in other cases the process requires more judgments to be made.

Let's consider another case where the choice of a benchmark universe requires no difficult judgments. Assume the U.S. Department of Labor's Bureau of Labor Statistics takes a very large sample - say, 20,000 persons - and finds a 10 percent unemployment rate. At some later time another but smaller sample is drawn - 2,000 persons - showing an 11 percent unemployment rate. Should the BLS conclude that unemployment has risen, or is there a large chance that the difference between 10 percent and 11 percent is due to sample variability?
In this case, it makes rather obvious sense to ask how often a sample of 2,000 drawn from a universe of 10 percent unemployment (ignoring the variability in the larger sample) will be as different as 11 percent just because of sample variability. This problem differs from that of the calves only in the proportions and the sizes of the samples.

Let's change the facts and assume that a very large sample had not been drawn, and only a sample of 2,000 had been taken, indicating 11 percent unemployment. A policy-maker asks the likelihood that unemployment is above ten percent. It would still seem rather straightforward to ask how often a universe of 10 percent unemployment would produce a sample of 2,000 with a proportion of 11 percent unemployed.

Still another problem where the choice of benchmark hypothesis is relatively straightforward: Say that the BLS takes two samples of 2,000 persons a month apart, and asks whether there is a difference in the results. Pooling the two samples and examining how often two samples drawn from the pooled universe are as different as those observed seems the obvious procedure.

One of the reasons that the above cases - especially the two-sample case - seem so clearcut is that the variance of the benchmark hypothesis is not an issue, being implied by the fact that the samples deal with proportions. If the data were continuous, however, this issue would quickly arise. Consider, for example, that the BLS might take the same sorts of samples and ask unemployed persons the lengths of time they had been unemployed. Comparing a small sample to a very large one would be easy to decide about. And even comparing two small samples might be straightforward - simply pooling them as is. But what if you have a single sample of 2,000 with data on lengths of unemployment spells and a mean of 30 days, and you are asked the probability that it comes from a universe with a mean of 25 days?
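The first of these questions - how often a sample of 2,000 drawn from a universe of 10 percent unemployment shows 11 percent or more unemployed - can be sketched in the same resampling style (the number of trials is arbitrary):

```python
import random

random.seed(2)

TRIALS = 5000
SAMPLE = 2000

hits = 0
for _ in range(TRIALS):
    # Draw 2,000 persons from a universe with 10 percent unemployment.
    unemployed = sum(1 for _ in range(SAMPLE) if random.random() < 0.10)
    # Count samples showing 11 percent or more unemployed.
    if unemployed / SAMPLE >= 0.11:
        hits += 1

rate = hits / TRIALS
print(rate)  # in the neighborhood of 7 percent of samples
```

A result in that neighborhood says the one-point rise is not very strong evidence by itself that unemployment has truly risen.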
Now there arises the question of how much variability to assume for that benchmark universe. Should it be the variability observed in the sample? That is probably an overestimate, because a universe with a smaller mean would probably have a smaller variance, too. So some judgment is required; there cannot be an automatic "objective" process here, whether one proceeds with the conventional or the resampling method. The example of the comparison of liquor retailing systems in Chapter 00 provides more material on this subject.

THE CONCEPT OF STATISTICAL SIGNIFICANCE IN TESTING HYPOTHESES

Hypothesis tests using the concept of significance have been misused almost since their origin; the flaws were pointed out early on by my friend and editor, Hanan Selvin, and since then have been discussed often and so well that no discussion is needed here. This section offers only an interpretation of the meaning of "significant" in connection with the logic of significance tests.

1. Consider the nine-year-old who tells the teacher that the dog ate the homework. Why does the teacher not accept the excuse? Clearly it is because the event would be too "unusual". But why do we think that way?

Let's speculate that you survey a million adults, and only three report that they have ever heard of a real case where a dog ate somebody's homework. You are a teacher, and a student comes in without homework and says that a dog ate the homework. It could have happened -- your survey reported that it really has happened in three lifetimes out of a million. But it does not happen very often. Therefore, you probably conclude that because the event is so unlikely, something else must have happened -- for example, that the student did not do the homework. The logic is that if something seems very unlikely, it would greatly surprise us if it actually happened, and therefore we assume that there must be a better explanation.
This is why we look askance at unlikely coincidences when they are to someone's benefit. This was the logic of John Arbuthnot's hypothesis test about the ratio of births by sex - the first published hypothesis test (see Chapter 00) - though his extension of the logic to God's design goes beyond the standard modern framework. It is also the implicit logic in the research on puerperal fever, cholera, and beri-beri, the data for which were shown in Chapter 00, though no explicit mention was made of probability in those cases.

2. Two students sit next to each other at an examination. Out of a hundred questions each student gets 82 right, and they make their mistakes on the same questions. Do you believe that the students cheated? You say to yourself: It would be most unlikely that they would have made the same mistakes by chance -- and you can compute how unlikely that would be -- and because it is so unlikely you are therefore likely to believe that they cheated.

3. The court is hearing a murder case. There is no eyewitness, and the evidence consists of such facts as the height and weight and age of the person charged, and other circumstantial evidence. Only one person in 50 million has such characteristics, and you find such a person. Will you convict the person, or will you assume that the evidence might have occurred just by chance? Of course it might have occurred by bad luck, but the probability is very, very small. Will you therefore conclude that because the chance is so small, it is reasonable to assume that the person charged committed the crime?

Sometimes the unusual really happens - the court errs by judging that the wrong person did it, and that person goes to prison or even is executed. The best we can do is to make the criterion strict: "beyond a reasonable doubt". (People ask: What probability does that criterion represent? But the court will not provide a numerical answer.)
4. Somebody says to you: I am going to deal out five cards and it will be a royal flush - ten, jack, queen, king, and ace of a single suit. The person deals the cards and the royal flush appears. Do you think the occurrence happened just by chance? No, you are likely to be very dubious that it happened by chance. You therefore believe there must be some other explanation -- that the person fixed the cards, for example. Note: You don't attach the same meaning to any other permutation, even though any given one is just as rare - unless the person announced it in advance. Indeed, even if the person says nothing, you will be surprised at a royal flush, because this hand has meaning, whereas another given set of five cards does not.

Two important points complicate the concept of statistical significance:

1. With a large enough sample, every treatment or variable will seem different from every other. Two faces of even a good die will produce different results in the very long run. Other statistics help interpret such results - for example, the beta coefficient or the partial regression coefficient (see Chapter 00).

2. Statistical significance does not imply economic or social significance. Two faces of a die may be statistically significantly different in a huge sample of throws, but a 1/10,000 difference is too small to make an economic difference in betting. Statistical significance is only a filter. If it appears, one should then proceed to decide whether there is substantive significance.

Interpreting significance is sometimes complex, especially when the interpretation depends heavily upon your prior expectations - as it often does. For example, how should a basketball coach decide whether or not to bench a player for poor performance after a series of missed shots at the basket? Consider Coach John Thompson who, after Charles Smith missed 10 of 12 shots in the 1989 Georgetown-Notre Dame NCAA game, took Smith out of the game for a time (The Washington Post, March 20, 1989, p. C1).
The scientific or decision problem is: Should the coach consider that Smith is not now the 47 percent shooter he normally is, and therefore bench him? The statistical question is: How likely is a 47 percent shooter to produce 10 misses in 12 shots?

Would Coach Thompson take Smith out of the game after he missed one shot? Clearly not. Why not? Because one "expects" Smith to miss a shot roughly half the time, and missing one shot therefore does not seem unusual. How about after Smith misses two shots in a row? For the same reason the coach still would not bench him, because this event happens "often" -- more specifically, about once in every four sequences of two shots. How about after 9 misses out of ten shots?

Notice the difference between this case and the 9 female calves out of ten. In the case of the calves we expected half females, and the experiment is a single isolated trial. The event considered by itself has a small enough probability that it seems unexpected rather than expected. "Unexpected" seems to be closely related to "happens seldom" or "unusual" in our psychology. And an event that happens seldom seems to call for explanation, and also seems to promise that it will yield itself to explanation by some unusual concatenation of forces. That is, unusual events lead us to think that they have unusual causes; that is the nub of the problem. (On the other hand, one can sometimes benefit by paying attention, as scientists know when they investigate outliers.)

In basketball shooting, we expect 47 percent of Smith's individual shots to be successful, and we also expect that average for each set of shots. But because we observe many sets of shots, we also expect some sets to be far from that average; such variation is inevitable. So when we see a single set of 9 misses in ten shots, we are not very surprised. But how about 29 misses in 30 shots? At some point, one must start to pay attention.
(And of course we would pay more attention if beforehand, and never at any other time, the player had said, "I can't see the basket today. My eyes are dim.")

So how should one proceed? Perhaps the same way as with a coin that keeps coming down heads a very large proportion of the time over a long series of tosses: At some point you examine it to see if it has two heads. But if your investigation is negative, then in the absence of any indication other than the behavior in question, you continue to believe that there is no special explanation, and you assume that the event is "chance" and should not be acted upon. In the same way, a coach might ask a player if there is an explanation for the many misses. But if the player answers "no", the coach should not bench him. (There are difficulties here with truth-telling, of course, but let that go for now.)

The key point for the basketball case and other repetitive situations is not to judge that there is an unusual explanation from the behavior of a single sample alone, just as with a short sequence of stock-price changes. We all need to learn that "irregular" (a good word here) sequences are less unusual than they seem to the naked intuition. A streak of 10 misses in 12 shots for a 47 percent shooter occurs about 3 percent of the time. That is, about every 33 shots Smith takes, he will begin a sequence of 12 shots that will end with 2 or fewer baskets - perhaps once in every couple of games. This does not seem "very" unusual, perhaps. And if the coach treats each such case as unusual, he will lose the services of a player who is better than his replacement.

In brief, how hard one should look for an explanation should depend on the likelihood of the event. But one should (almost always) assume the absence of an explanation unless one actually finds it.
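The 3 percent figure quoted above can be checked exactly with a short computation (no simulation is needed here):

```python
import math

def miss_streak_prob(n, k, p_miss):
    """Probability of k or more misses in n shots, for a shooter
    who misses each shot independently with probability p_miss."""
    return sum(math.comb(n, j) * p_miss**j * (1 - p_miss)**(n - j)
               for j in range(k, n + 1))

# Smith makes 47 percent of his shots, so he misses with probability 0.53.
p = miss_streak_prob(12, 10, 0.53)
print(round(p, 3))  # 0.031: roughly one 12-shot window in 33
```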
Bayesian analysis could be brought to bear on the matter, bringing in your prior probabilities - based, say, on research showing that there is no such thing as a "hot hand" in basketball - together with some sort of cost-benefit, error-loss calculation comparing Smith with the next best available player. The "data-dredging" issue was discussed, in the context of the doctors' smoking by states, in Chapter 00.

ENDNOTES

<1>: This is one of many issues that Peter Bruce first raised, and whose treatment here reflects back-and-forth discussion between us.