"Hypothesis Testing"
An Article for the International Encyclopedia of the Social and Behavioral Sciences

George Casella*
Cornell University

Roger L. Berger
North Carolina State University

September 23, 1999

*Supported by National Science Foundation Grant DMS-9971586. Email: gc15@cornell.edu. This is technical report BU-1454-M in the Department of Biometrics, Cornell University, Ithaca, NY 14853.

Introduction

A hypothesis is a statement about a population parameter, and the two complementary hypotheses in a hypothesis testing problem are called the null hypothesis and the alternative hypothesis. They are denoted by $H_0$ and $H_1$, respectively. If $\theta$ denotes a population parameter, the general format of the null and alternative hypotheses is $H_0: \theta \in \Theta_0$ and $H_1: \theta \in \Theta_0^c$, where $\Theta_0$ is some subset of the parameter space and $\Theta_0^c$ is its complement.

Typically, a hypothesis test is specified in terms of a test statistic $W(X_1, \ldots, X_n) = W(\mathbf{X})$, a function of the sample. For example, a test might specify that $H_0$ is to be rejected if $\bar{X}$, the sample mean, is greater than 3. In this case $W(\mathbf{X}) = \bar{X}$ is the test statistic and the rejection region is $\{(x_1, \ldots, x_n): \bar{x} > 3\}$.

Constructing Tests

There are many methods of deriving test statistics for a hypothesis test, a few of which follow:

1. Likelihood Ratio Tests

The likelihood ratio method of hypothesis testing is related to the maximum likelihood estimators discussed in the article on point and interval estimation. Given a likelihood function $L(\theta|x)$, the likelihood ratio test statistic for testing $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_0^c$ is

$$\lambda(x) = \frac{\sup_{\Theta_0} L(\theta|x)}{\sup_{\Theta} L(\theta|x)}. \tag{1}$$

A likelihood ratio test (LRT) is any test that has a rejection region of the form $\{x: \lambda(x) \le c\}$, where $c$ is any number satisfying $0 \le c \le 1$. If we interpret the likelihood function as measuring how likely the values of $\theta$ are, then we see that the LRT is comparing the plausibility of the $\theta$ values in the null hypothesis to those in the alternative.
Small values of the LRT statistic are interpreted as being evidence against $H_0$ and lead to rejection of $H_0$. If the null hypothesis consists of a single value $\theta_0$, and the alternative is everything else, then the LRT statistic is simply $\lambda = L(\theta_0|x)/L(\hat{\theta}|x)$, where $\hat{\theta}$ is the MLE of $\theta$.

Example Let $X_1, \ldots, X_n$ be a random sample from a $N(\theta, 1)$ population. The LRT statistic for testing $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$ is

$$\lambda(x) = \exp\left[-n(\bar{x} - \theta_0)^2/2\right].$$

If $T(\mathbf{X})$ is a sufficient statistic for $\theta$ then, as with maximum likelihood estimators, the LRT statistic is a function of $T$. That is, $\lambda(x)$ depends on $x$ only through $T(x)$.

2. Bayesian Tests

The Bayesian paradigm prescribes that the sample information be combined with the prior information using Bayes' Theorem to obtain the posterior distribution $\pi(\theta|x)$. All inferences about $\theta$ are now based on the posterior distribution. In a hypothesis testing problem, the posterior distribution may be used to calculate the probabilities that $H_0$ and $H_1$ are true. One way a Bayesian hypothesis tester may choose to use the posterior distribution is to decide to accept $H_0$ as true if

$$\frac{P(\theta \in \Theta_0|\mathbf{X})}{P(\theta \in \Theta_0^c|\mathbf{X})} \ge c$$

for some constant $c$, and to reject $H_0$ otherwise. Equivalently, we can reject $H_0$ if $P(\theta \in \Theta_0^c|\mathbf{X})$ is greater than a specified number.

Example Let $X_1, \ldots, X_n$ be iid $N(\theta, \sigma^2)$ and let the prior distribution on $\theta$ be $N(\mu, \tau^2)$, where $\sigma^2$, $\mu$, and $\tau^2$ are known. Consider testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$, where we decide to accept $H_0$ if $P(\theta \in \Theta_0|\mathbf{X}) \ge P(\theta \in \Theta_0^c|\mathbf{X})$. The posterior distribution is normal with mean $(n\tau^2\bar{x} + \sigma^2\mu)/(n\tau^2 + \sigma^2)$, so this rule accepts $H_0$ exactly when the posterior mean is at most $\theta_0$. After some calculation, we find that $H_0$ will be accepted as true if

$$\bar{x} \le \theta_0 + \frac{\sigma^2(\theta_0 - \mu)}{n\tau^2}.$$

3. Union-Intersection and Intersection-Union Tests

In some situations, tests for complicated null hypotheses can be developed from tests for simpler null hypotheses. The union-intersection method of test construction might be useful when the null hypothesis is conveniently expressed as an intersection, say $H_0: \theta \in \bigcap_{\gamma \in \Gamma} \Theta_\gamma$, where $\Gamma$ is an arbitrary index set.
If tests are available for each of the problems of testing $H_{0\gamma}: \theta \in \Theta_\gamma$ versus $H_{1\gamma}: \theta \in \Theta_\gamma^c$, where the rejection region for the test of $H_{0\gamma}$ is $\{x: T_\gamma(x) \in R_\gamma\}$, then the rejection region for the union-intersection test is

$$\bigcup_{\gamma \in \Gamma} \{x: T_\gamma(x) \in R_\gamma\}.$$

The rationale is that if any one of the hypotheses $H_{0\gamma}$ is rejected, then $H_0$ must also be rejected.

A complementary method, the intersection-union method, may be useful if the null hypothesis is conveniently expressed as a union. Suppose we wish to test the null hypothesis $H_0: \theta \in \bigcup_{\gamma \in \Gamma} \Theta_\gamma$, and $\{x: T_\gamma(x) \in R_\gamma\}$ is the rejection region for a test of $H_{0\gamma}: \theta \in \Theta_\gamma$ versus $H_{1\gamma}: \theta \in \Theta_\gamma^c$. Then the rejection region for the intersection-union test of $H_0$ versus $H_1$ is

$$\bigcap_{\gamma \in \Gamma} \{x: T_\gamma(x) \in R_\gamma\}. \tag{2}$$

$H_0$ is false if and only if all of the $H_{0\gamma}$ are false, so $H_0$ can be rejected if and only if each of the individual hypotheses $H_{0\gamma}$ can be rejected.

Example The topic of acceptance sampling provides an extremely useful application of an intersection-union test (see Berger 1982). In assessing the quality of upholstery fabric, standards dictate that parameters relating to strength and flammability must satisfy $\theta_1 > 50$ pounds and $\theta_2 > .95$, respectively. This results in the hypothesis test

$$H_0: \{\theta_1 \le 50 \text{ or } \theta_2 \le .95\} \quad \text{versus} \quad H_1: \{\theta_1 > 50 \text{ and } \theta_2 > .95\},$$

where a batch of material is acceptable only if $H_1$ is accepted. If $X_1, \ldots, X_n$ are iid $N(\theta_1, \sigma^2)$ and $Y_1, \ldots, Y_m$ are iid Bernoulli($\theta_2$), where $Y_i = 1$ if the $i$th sample passes the flammability test, the rejection region for the intersection-union test is given by

$$\left\{(x, y): \frac{\bar{x} - 50}{s/\sqrt{n}} > t \text{ and } \sum_{i=1}^{m} y_i > b\right\},$$

where $s$ denotes the sample standard deviation of the $x_i$ and $t$ and $b$ are cutoff constants. Thus the intersection-union test decides the product is acceptable, that is, $H_1$ is true, if and only if it decides that each of the individual parameters meets its standard.

There are many other methods available for constructing hypothesis tests: methods based on invariance, pivots, robust or large sample arguments, to name a few.
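The intersection-union decision rule in the upholstery example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the strength condition is read as the studentized mean $(\bar{x} - 50)/(s/\sqrt{n}) > t$, the function name `iut_accepts_batch` is hypothetical, and the cutoffs `t` and `b` are passed in by the caller because the text does not say how they are chosen.

```python
import math
import statistics


def iut_accepts_batch(x, y, t, b, strength_standard=50.0):
    """Return True when the intersection-union test rejects H0.

    x: strength measurements (needs at least two values with nonzero spread)
    y: 0/1 flammability outcomes, 1 meaning the sample passed
    t, b: cutoff constants supplied by the caller (assumed, not derived here)
    """
    n = len(x)
    xbar = statistics.mean(x)
    s = statistics.stdev(x)  # sample standard deviation, n - 1 denominator
    strength_rejects = (xbar - strength_standard) / (s / math.sqrt(n)) > t
    flammability_rejects = sum(y) > b
    # H0 is a union, so it is rejected only if every component test rejects.
    return strength_rejects and flammability_rejects
```

The final `and` is the whole point of the construction: a batch is declared acceptable only when both the strength test and the flammability test individually reject their own null hypotheses.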
For more on hypothesis testing see Lehmann (1986).

Evaluating Tests

A hypothesis test of $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_0^c$ might make one of two types of errors. If $\theta \in \Theta_0$ but the hypothesis test incorrectly decides to reject $H_0$, then the test has made a Type I Error. If, on the other hand, $\theta \in \Theta_0^c$ but the test decides to accept $H_0$, a Type II Error has been made. If $R$ denotes the rejection region for a test, the power function is

$$P_\theta(\mathbf{X} \in R) = \begin{cases} \text{probability of a Type I Error,} & \text{if } \theta \in \Theta_0, \\ 1 - \text{the probability of a Type II Error,} & \text{if } \theta \in \Theta_0^c. \end{cases}$$

A good test has power function near one for most $\theta \in \Theta_0^c$ and near zero for most $\theta \in \Theta_0$.

Example Let $X_1, \ldots, X_n$ be a random sample from a $N(\theta, \sigma^2)$ population, $\sigma^2$ known. The likelihood ratio test of $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$ rejects $H_0$ if $(\bar{X} - \theta_0)/(\sigma/\sqrt{n}) > c$ and has power function

$$P_\theta(\mathbf{X} \in R) = P\left(Z > c + \frac{\theta_0 - \theta}{\sigma/\sqrt{n}}\right),$$

where $Z$ is a standard normal random variable.

After a hypothesis test is done, the conclusions must be reported in some statistically meaningful way. One method of reporting the results of a hypothesis test is to report the size of the test used, $\alpha = \sup_{\theta \in \Theta_0} P_\theta(\mathbf{X} \in R)$, and the decision to reject $H_0$ or accept $H_0$. The size of the test carries important information. If $\alpha$ is small, the decision to reject $H_0$ is fairly convincing, but if $\alpha$ is large, the decision to reject $H_0$ is not very convincing because the test has a large probability of incorrectly making that decision.

Another way of reporting the results of a hypothesis test, one that is data-dependent, is to report the p-value. Typically, not one but an entire class of tests is constructed, a different test being defined for each value of $\alpha$. The p-value for the sample point $x$ is the smallest value of $\alpha$ for which this sample point will lead to rejection of $H_0$.
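For the normal-mean example above, both the power function and the p-value reduce to standard normal tail probabilities, which can be computed with the Python standard library. The sketch below is illustrative only: the function names are hypothetical, and the p-value assumes the class of tests is indexed by size $\alpha$, with the cutoff $c$ chosen so that $P(Z > c) = \alpha$.

```python
import math


def normal_sf(z):
    """Upper-tail probability P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def power(theta, theta0, sigma, n, c):
    """Power function of the test rejecting when (xbar - theta0)/(sigma/sqrt(n)) > c."""
    return normal_sf(c + (theta0 - theta) / (sigma / math.sqrt(n)))


def p_value(xbar, theta0, sigma, n):
    """Smallest size alpha at which the observed sample mean leads to rejection.

    Assumes the size-alpha test uses the cutoff c with P(Z > c) = alpha.
    """
    z_observed = (xbar - theta0) / (sigma / math.sqrt(n))
    return normal_sf(z_observed)
```

With `c = 1.645`, `power(theta0, theta0, sigma, n, c)` is approximately 0.05, which is the size of the test, and the power increases toward one as `theta` moves above `theta0`.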
Because rejection of $H_0$ using a test with small size is more convincing evidence that $H_1$ is true than rejection of $H_0$ with a test with large size, the interpretation of p-values goes in the same way. The smaller the p-value, the stronger the sample evidence that $H_1$ is true.

Many other types of evaluations of tests can be done. The theory of most powerful tests shows how to construct best tests under a variety of conditions (see Lehmann 1986 or Casella and Berger 1990, Chapter 8). Hypothesis tests can also be evaluated using risk functions, as in Hwang et al. (1992).

Asymptotics

For the LRT statistic (1), the following general theorem allows us to construct a large sample test.

Theorem 1 Let $X_1, \ldots, X_n$ be a random sample from a pdf or pmf $f(x|\theta)$. Under some regularity conditions$^1$ on the model $f(x|\theta)$, if $\theta \in \Theta_0$ then the distribution of the statistic $-2\log\lambda(\mathbf{X})$ converges to a chi squared distribution as the sample size $n \to \infty$. The degrees of freedom of the limiting distribution is the difference between the number of free parameters specified by $\theta \in \Theta_0$ and the number of free parameters specified by $\theta \in \Theta$.

$^1$The "regularity conditions" are mainly concerned with the existence and behavior of the derivatives (with respect to the parameter) of the likelihood function, and the support of the distribution (it cannot depend on the parameter). See Lehmann (1986, Section 8.8) for precise conditions.

Rejection of $H_0: \theta \in \Theta_0$ for small values of $\lambda(\mathbf{X})$ is equivalent to rejection for large values of $-2\log\lambda(\mathbf{X})$. Thus, $H_0$ is rejected if and only if

$$-2\log\lambda(\mathbf{X}) \ge \chi^2_{\nu,\alpha},$$

where $\nu$ is the degrees of freedom specified in Theorem 1.

Another large-sample test construction is based on asymptotic normality of a point estimator. Suppose we wish to test a hypothesis about a real-valued parameter $\theta$, and $W_n = W(X_1, \ldots, X_n)$ is a point estimator of $\theta$, based on a sample of size $n$, that satisfies

$$\frac{W_n - \theta}{\sigma_n} \to Z,$$

where $\sigma_n^2$ is the variance of $W_n$ and $Z$ is a standard normal random variable. We now have the basis for an approximate test. For example, we could reject $H_0: \theta \le \theta_0$ at level .05 if $(W_n - \theta_0)/\sigma_n > 1.645$.

In some instances, $\sigma_n$ also depends on unknown parameters. In such a case, we look for an estimate $S_n$ of $\sigma_n$ with the property that $\sigma_n/S_n$ converges in probability to one. Then, using Slutsky's Theorem (see Casella and Berger 1990, Section 5.3), we can deduce that $(W_n - \theta)/S_n$ also converges in distribution to a standard normal distribution. A large-sample test may be based on this fact. Whether $\sigma_n$ is estimated assuming $\theta = \theta_0$, or not, can lead to score and Wald tests, respectively.

References

1. Berger, R. L. (1982). Multiparameter Hypothesis Testing and Acceptance Sampling. Technometrics 24, 295-300.
2. Casella, G. and Berger, R. L. (1990). Statistical Inference. Pacific Grove, CA: Wadsworth/Brooks Cole.
3. Hwang, J. T., Casella, G., Robert, C., Wells, M. T. and Farrell, R. H. (1992). Estimation of accuracy in testing. Ann. Statist. 20, 490-509.
4. Lehmann, E. L. (1986). Testing Statistical Hypotheses, Second Edition. New York: Springer-Verlag.