Analysis of Variance F Test (ANOVA)
The analysis of variance F test, or ANOVA, is used to compare the means of I populations, where the following hypotheses are tested:
H₀: μ₁ = μ₂ = ... = μI vs. Hₐ: not all population means are equal
Note that the alternative hypothesis is not Hₐ: μ₁ ≠ μ₂ ≠ ... ≠ μI. The population means do not all need to differ from one another for the null hypothesis to be false; it is enough for at least one population mean to differ from the others.
There are many possibilities for the null hypothesis to be false. Given four population means from four different populations, perhaps μ₁ = μ₃ = μ₄, yet μ₂ differs. Perhaps μ₂ = μ₃, yet μ₁ and μ₄ differ. Perhaps none of the means are equal. Therefore, the alternative hypothesis is no longer one-sided or two-sided, but "many-sided."
To understand the idea of ANOVA, consider the following two graphs:
The sample means in (A) are the same as those in (B), yet the variability of the samples in (B) is smaller than the variability of the samples in (A).
If another sample were taken from each population in (A), the new sample means could differ noticeably from the original sample means because of the large variability in (A). The observed difference in sample means could therefore be due to chance variation alone, and the researcher may not reject the null hypothesis that all four population means are equal.
However, with (B), the variability is lower, so if another sample were taken from each population in (B), the new sample means would be unlikely to differ much from the original sample means in (B). It would be unlikely for the new sample mean for Population 1 to be as low as the new sample mean for Population 4.
Therefore, the difference in sample means in (B) does not appear to be due to chance variation alone; it likely reflects a true difference in population means. Even though the sample means are the same in (A) and (B), the researcher would likely not reject the null hypothesis for (A) but would reject the null hypothesis for (B) because of (B)'s lower variability.
Although the above explanation gives an idea of ANOVA, it ignores the importance of sample size.
Assumptions for ANOVA
The following three assumptions are made when conducting inference procedures using ANOVA:
1. The variable of interest follows a normal distribution in each population of interest.
2. The population standard deviations are assumed to be equal, so there is a common standard deviation σ.
3. Each of the I samples is a simple random sample from its respective population. In the case of an experiment, the treatments are randomly assigned to the individuals.
Analysis of Variance
Analysis of variance compares the variation between the sample means to the variation within the individual samples.
If all of the population means are equal, then all of the observations come from the same normal distribution. It is expected that both the variation between the sample means and the variation within the individual samples represent only the inherent variability of the variable, which is the common standard deviation σ. Therefore, there should not be much difference between these two variations. However, if the means differ a lot, then the variation between the sample means is much larger than the variation within the individual samples.
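The comparison of the two kinds of variation can be sketched numerically. The data below are hypothetical (they are not the data behind graphs (A) and (B)); they are built so that both scenarios share the same four sample means but differ in spread. With equal group sizes, the between-group variation is taken as the group size times the variance of the sample means, and the within-group variation as the average of the sample variances.

```python
from statistics import mean, variance

# Hypothetical data: four groups with the same sample means in both
# scenarios, but a large spread in scenario A and a small spread in B.
group_means = [10, 12, 15, 11]
offsets = [-2, -1, 0, 1, 2]  # deviations around each group mean (sum to 0)


def make_groups(scale):
    """Build one sample per group by scaling the offsets around its mean."""
    return [[m + scale * d for d in offsets] for m in group_means]


def between_within_ratio(groups):
    """Ratio of between-group to within-group variation (equal group sizes)."""
    n = len(groups[0])
    between = n * variance([mean(g) for g in groups])
    within = mean([variance(g) for g in groups])
    return between / within


ratio_a = between_within_ratio(make_groups(4))    # large spread, like (A)
ratio_b = between_within_ratio(make_groups(0.5))  # small spread, like (B)
print(round(ratio_a, 2), round(ratio_b, 2))
```

The ratio is far larger in the low-variability scenario even though the sample means are identical, which mirrors the reasoning about (A) and (B).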
Test Statistic F and F Distributions
The ANOVA test statistic F for testing the equality of several population means has the following form:
F = (variation between the sample means) / (variation within the individual samples)
Like the t distributions, there are many F distributions. While each t distribution depends on a single degrees of freedom value, each F distribution depends on two: one for the numerator of F and one for the denominator of F.
The F distributions are all skewed to the right and never negative, since the F test statistic is a ratio between two variations, and variations are never negative.
The higher the value of the F test statistic, the larger the variation between the sample means is relative to the variation within the samples, which gives more evidence against the null hypothesis.
Example
To find the upper 0.05 critical value for the F4,19 distribution, where 4 is the numerator degrees of freedom (df₁) and 19 is the denominator degrees of freedom (df₂), locate df₁ = 4 and df₂ = 19 on the F* table. The entry in the row for 0.05 is the critical value F* = 2.90.
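A table lookup like this can be cross-checked numerically. The following is only a rough sketch using the Python standard library: it assumes both degrees of freedom are at least 2, approximates the upper-tail probability of an F distribution with Simpson's rule (through the regularized incomplete beta function), and then searches for the critical value by bisection. The function names are illustrative; in practice a statistical library would normally be used instead.

```python
import math


def f_upper_tail(f_value, df1, df2, steps=2000):
    """Approximate P(F > f_value) for an F(df1, df2) distribution.

    Uses P(F <= f) = I_z(df1/2, df2/2) with z = df1*f / (df1*f + df2),
    integrating the beta density with Simpson's rule.
    Assumes df1 >= 2 and df2 >= 2 so the integrand stays bounded.
    """
    a, b = df1 / 2, df2 / 2
    z = df1 * f_value / (df1 * f_value + df2)
    beta = math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))
    h = z / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * t ** (a - 1) * (1 - t) ** (b - 1)
    return 1.0 - total * h / (3 * beta)


def f_critical(alpha, df1, df2):
    """Find F* with P(F > F*) = alpha by bisection on [0, 1000]."""
    lo, hi = 0.0, 1000.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f_upper_tail(mid, df1, df2) > alpha:
            lo = mid  # tail still too large: critical value is higher
        else:
            hi = mid
    return (lo + hi) / 2


print(round(f_critical(0.05, 4, 19), 2))
```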
Suppose there is a simple random sample of size nᵢ from each of I different normally distributed populations with a common standard deviation σ. Then the total sample size is the following:
N = n₁ + n₂ + ... + nI
For each of the I populations, the sample mean x̄ᵢ and the sample standard deviation sᵢ are calculated from that population's sample.
The researcher would like to test the following:
H₀: μ₁ = μ₂ = ... = μI vs. Hₐ: not all population means are equal
The overall mean x̿ for all samples is then the following:
x̿ = Σnᵢx̄ᵢ/N = (n₁x̄₁ + n₂x̄₂ + ... + nIx̄I)/N
Example
A researcher would like to compare the effectiveness of three different weight loss programs. Twenty overweight adults volunteer to participate in an experiment. Seven of the adults are randomly assigned to Program 1, seven are randomly assigned to Program 2, and six are randomly assigned to Program 3. The weight loss in pounds of each subject is measured at the end of three months. The researcher conducts a formal hypothesis test using a 5% level of significance:
H₀: μ₁ = μ₂ = μ₃ vs. Hₐ: not all of the population means are equal
Therefore, there are three populations. Population 1 is the overweight adults who follow Diet Program 1 for three months, and the first sample (n₁ = 7) is drawn from it. Population 2 is the overweight adults who follow Diet Program 2 for three months, and the second sample (n₂ = 7) is drawn from it. Population 3 is the overweight adults who follow Diet Program 3 for three months, and the last sample (n₃ = 6) is drawn from it.
The following is the data obtained after three months:
Using a program, box-plots of each sample are examined:
It is difficult to assess the normality of the distributions from box-plots based on small sample sizes. For the hypothesis test to be accurate, it must be assumed the weight losses follow a normal distribution for each of the three programs. According to the box-plots, it does not appear the three population means for the three programs are equal, but the hypothesis test will determine if the observed difference in these sample means is significant.
From the data, the following is calculated:
n₁ = 7, x̄₁ = 9.14, s₁ = 2.91
n₂ = 7, x̄₂ = 5.57, s₂ = 2.51
n₃ = 6, x̄₃ = 11.17, s₃ = 2.86
There are a total of I = 3 groups, and so the total sample size is N = 7 + 7 + 6 = 20. The overall mean x̿ is calculated:
Σnᵢx̄ᵢ = (7)(9.14) + (7)(5.57) + (6)(11.17) = 169.99
x̿ = 169.99/N = 169.99/20 ≈ 8.50
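The overall mean is a weighted average of the sample means, with the sample sizes as weights. A minimal sketch of that arithmetic for the weight loss example follows; the variable names are illustrative, and the simple (unweighted) average is computed only to show that it differs when the sample sizes are unequal.

```python
# Summary statistics from the weight loss example.
sizes = [7, 7, 6]
means = [9.14, 5.57, 11.17]

total_n = sum(sizes)

# Overall mean: each sample mean is weighted by its sample size.
overall_mean = sum(n * m for n, m in zip(sizes, means)) / total_n

# Unweighted average of the sample means, for comparison only.
simple_average = sum(means) / len(means)

print(total_n, round(overall_mean, 2), round(simple_average, 2))
```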
Numerator and Denominator of the F Test Statistic
The numerator of the F test statistic measures the variation between the sample means. This variation is measured using the sum of squares for groups, denoted SSG:
SSG = Σnᵢ(x̄ᵢ – x̿)²
Then the mean square for groups, or MSG, is the following:
MSG = SSG/(I – 1)
MSG gives a measure of how far apart the I sample means are from each other, and so MSG is the numerator of the F test statistic with I – 1 degrees of freedom.
Example
For the previous example,
SSG = (7)(9.14 – 8.50)² + (7)(5.57 – 8.50)² + (6)(11.17 – 8.50)² ≈ 105.73
MSG = SSG/(I – 1) = 105.73/2 ≈ 52.87
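The same calculation can be sketched as a small function that works from the sample sizes and sample means. The function name is illustrative. The overall mean is not rounded inside the function, so the results may differ from the hand calculation in the later decimal places.

```python
def mean_square_groups(sizes, means):
    """Return (SSG, MSG) computed from sample sizes and sample means."""
    total_n = sum(sizes)
    overall_mean = sum(n * m for n, m in zip(sizes, means)) / total_n
    # SSG: squared distance of each sample mean from the overall mean,
    # weighted by the sample size.
    ssg = sum(n * (m - overall_mean) ** 2 for n, m in zip(sizes, means))
    # MSG: SSG divided by its degrees of freedom, I - 1.
    msg = ssg / (len(sizes) - 1)
    return ssg, msg


ssg, msg = mean_square_groups([7, 7, 6], [9.14, 5.57, 11.17])
print(round(ssg, 2), round(msg, 2))
```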
The denominator of the F test statistic measures the variation within the individual samples. This variation is measured using the sum of squares for error, denoted SSE:
SSE = Σ(nᵢ – 1)sᵢ²
Then the mean square for error, or MSE, is the following:
MSE = SSE/(N – I)
MSE is the denominator of the F test statistic with N – I degrees of freedom.
Example
For the previous example,
SSE = (7 – 1)(2.91)² + (7 – 1)(2.51)² + (6 – 1)(2.86)² ≈ 129.51
MSE = SSE/(N – I) = 129.51/17 ≈ 7.62
Notice that MSE extends the pooled sample variance of the two-sample case to I samples. Therefore, MSE serves as the pooled sample variance sp², and a good estimate for the common population standard deviation σ is sp = √(MSE).
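As a sketch of the within-sample calculation, the function below computes SSE, MSE, and the pooled standard deviation from the sample sizes and sample standard deviations of the weight loss example. The function name is illustrative.

```python
import math


def mean_square_error(sizes, sds):
    """Return (SSE, MSE, pooled SD) from sample sizes and sample SDs."""
    # SSE: each sample variance weighted by its degrees of freedom.
    sse = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    # MSE: SSE divided by N - I; this is the pooled sample variance.
    mse = sse / (sum(sizes) - len(sizes))
    return sse, mse, math.sqrt(mse)


sse, mse, pooled_sd = mean_square_error([7, 7, 6], [2.91, 2.51, 2.86])
print(round(sse, 2), round(mse, 2), round(pooled_sd, 2))
```

The pooled standard deviation, about 2.76, falls between the smallest and largest sample standard deviations, as expected for a pooled estimate.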
The F test statistic is then formed as follows:
F = MSG/MSE
If the null hypothesis is true, so that all of the population means are equal, then this test statistic follows an F distribution with I – 1 numerator degrees of freedom and N – I denominator degrees of freedom.
Example
For the previous example, F = MSG/MSE = 52.87/7.62 ≈ 6.94.
The null hypothesis is rejected if the p-value is less than or equal to the level of significance α, or if F > F*I-1,N-I,α. The p-value is equal to P(Fdf₁,df₂ > F).
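Example
For the previous example, the p-value for the hypothesis test is P(F2,17 > 6.94), which lies between 0.001 and 0.01 because 6.11 < F = 6.94 < 10.66, where 6.11 and 10.66 are the upper 0.01 and 0.001 critical values of the F2,17 distribution.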
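A table only brackets the p-value. When the numerator has exactly 2 degrees of freedom, as in this example, the upper-tail probability of the F distribution has a simple closed form, P(F > f) = (1 + 2f/df₂)^(–df₂/2), which allows a quick check. The sketch below relies on that special case only; it does not apply to other numerator degrees of freedom, and the function name is illustrative.

```python
def f_upper_tail_df1_is_2(f_value, df2):
    """P(F > f_value) for an F distribution with 2 numerator df.

    Valid only for 2 numerator degrees of freedom, where the tail
    probability reduces to (1 + 2*f/df2) ** (-df2/2).
    """
    return (1 + 2 * f_value / df2) ** (-df2 / 2)


p_value = f_upper_tail_df1_is_2(6.94, 17)
print(round(p_value, 4))

# The table's bracketing values give tail areas close to 0.01 and 0.001.
print(round(f_upper_tail_df1_is_2(6.11, 17), 3),
      round(f_upper_tail_df1_is_2(10.66, 17), 3))
```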
Interpretations and Conclusion
For ANOVA, the interpretation of the p-value and the conclusion are written using the following templates.
For the p-value, the interpretation is: "If the mean (context of the mean) for all (I + context of the populations) is equal, then the probability of observing a value of the test statistic F at least as (high/low/extreme) as (F) would be (p-value)."
For the conclusion: "Since p-value (≤/>) α = (α), the null hypothesis is (rejected/not rejected). At a 100(α)% level of significance, there is (sufficient/insufficient) evidence to conclude that at least one of the population means differs from the other means."
Example
For the previous example, the p-value is interpreted: If the mean weight loss for all 3 diet programs is equal, then the probability of observing a value of the test statistic F at least as high as 6.94 would be between 0.001 and 0.01.
The conclusion is then: Since p-value < α = 0.05, the null hypothesis is rejected. At a 5% level of significance, there is sufficient evidence to conclude that at least one of the population means differs from the other means.
Example
For the previous example, the critical value method can also be used: The null hypothesis is rejected if F > F*2,17,0.05. Using the F* table, F* = 3.59. Since F = 6.94 > F*, the null hypothesis is rejected.
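ANOVA Table
The calculations for an analysis of variance F test can be summarized in an ANOVA table:
Source | Degrees of freedom | Sum of squares | Mean square | F
Groups | I – 1 | SSG | MSG = SSG/(I – 1) | MSG/MSE
Error | N – I | SSE | MSE = SSE/(N – I) |
Total | N – 1 | SST | |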
The total sum of squares, SSG + SSE, denoted as SST, measures the total variation in the data.
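As an illustration, the sketch below assembles the rows of an ANOVA table for the weight loss example from the rounded values SSG = 105.73 and SSE = 129.51. The row layout (Groups, Error, Total) and the function names are assumptions made for this sketch; the total degrees of freedom are the sum of the two component degrees of freedom, N – 1.

```python
def anova_table(ssg, sse, n_groups, total_n):
    """Return ANOVA rows as (source, df, sum of squares, mean square, F)."""
    df_groups = n_groups - 1
    df_error = total_n - n_groups
    msg = ssg / df_groups
    mse = sse / df_error
    return [
        ("Groups", df_groups, ssg, msg, msg / mse),
        ("Error", df_error, sse, mse, None),
        ("Total", df_groups + df_error, ssg + sse, None, None),
    ]


def format_table(rows):
    """Render the rows as aligned plain text."""
    lines = [f"{'Source':<8}{'df':>4}{'SS':>10}{'MS':>10}{'F':>8}"]
    for source, df, ss, ms, f_stat in rows:
        ms_text = f"{ms:.2f}" if ms is not None else ""
        f_text = f"{f_stat:.2f}" if f_stat is not None else ""
        lines.append(f"{source:<8}{df:>4}{ss:>10.2f}{ms_text:>10}{f_text:>8}")
    return "\n".join(lines)


rows = anova_table(105.73, 129.51, n_groups=3, total_n=20)
print(format_table(rows))
```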
Equality of Population Standard Deviations
Recall that it must be assumed that all I populations are normally distributed with a common standard deviation σ. Before a hypothesis test is conducted, a researcher should always examine the samples to assess whether these assumptions appear to be valid.
The researcher can get a good idea of what conclusion to expect from the hypothesis test by examining side-by-side box-plots for the I samples. The box-plots should have similar spreads, which helps verify the assumption of a common population standard deviation.
The assumption of normality is more difficult to verify with small samples, but normality can reasonably be assumed if the box-plots are at least approximately symmetric with no outliers.
To check for the equality of the population standard deviations, the same rule as in the two-sample case is used:
max(sᵢ)/min(sᵢ)
where max(sᵢ) is the largest sample standard deviation out of all samples and min(sᵢ) is the smallest sample standard deviation out of all samples.
Just like with two-sample cases, if the quotient is less than or equal to two, then it can be assumed that the population standard deviations are equal and the F test can be used.
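This rule of thumb is easy to sketch as a function. The name and the `threshold` parameter are illustrative; the check is applied here to the sample standard deviations from the weight loss example.

```python
def equal_sd_plausible(sample_sds, threshold=2.0):
    """Return (ratio, ok) for the rule max(s_i) / min(s_i) <= threshold."""
    ratio = max(sample_sds) / min(sample_sds)
    return ratio, ratio <= threshold


ratio, ok = equal_sd_plausible([2.91, 2.51, 2.86])
print(round(ratio, 2), ok)
```

A ratio of about 1.16 is well under 2, so the equal standard deviation assumption looks reasonable for that example.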
Steps of Conducting the F Test
1. State the level of significance.
2. State the hypotheses. H₀: μ₁ = μ₂ = ... = μI vs. Hₐ: not all of the population means are equal
3. Determine whether the F test can be used.
4. Determine N, I, and the overall mean x̿.
5. Calculate SSG and SSE.
6. Calculate MSG and MSE.
7. Find the test statistic.
8. Construct an ANOVA table.
9. Write the decision rule.
10. Find the p-value, critical value, or confidence interval.
11. Interpret the p-value (optional).
12. Write the conclusion.
Example
Crown Plaza Hotel and Resorts offers special weekend rates at hotels and resorts across the U.S. Samples of properties from four regions of the country are selected. The room rate in dollars for each of the hotels is shown below. A researcher would like to test whether the true mean room rate differs for any of the four regions.
Then Population 1 is the Crown Plaza hotels in the Midwest, Population 2 is the Crown Plaza hotels in the Northeast, Population 3 is the Crown Plaza hotels in the South, and Population 4 is the Crown Plaza hotels in the West. Let the level of significance α = 0.05.
H₀: μ₁ = μ₂ = μ₃ = μ₄
Hₐ: not all of the population means are equal
The following are the side-by-side box-plots for each sample:
The distributions are all approximately symmetric, and so the assumption of normality appears to be reasonable. It also appears that the Northeast region may have a higher mean than the other regions. The hypothesis test will determine if the observed differences in means are significant.
From the data, the following is calculated:
n₁ = 5, x̄₁ = 141.80, s₁ = 38.25
n₂ = 7, x̄₂ = 192.14, s₂ = 34.67
n₃ = 9, x̄₃ = 166.22, s₃ = 41.63
n₄ = 6, x̄₄ = 153.67, s₄ = 23.08
Determining whether the F test can be used:
max(sᵢ)/min(sᵢ) = 41.63/23.08 ≈ 1.80 < 2
Therefore, since it can be assumed the population standard deviations are equal, the F test can be used.
Since there are I = 4 populations, the total sample size is N = 5 + 7 + 9 + 6 = 27.
Calculating the overall mean:
Σnᵢx̄ᵢ = (5)(141.80) + (7)(192.14) + (9)(166.22) + (6)(153.67) = 4471.98
x̿ = 4471.98/N = 4471.98/27 ≈ 165.63
Calculating the sum of squares for groups and error:
SSG = Σnᵢ(x̄ᵢ - x̿)² = (5)(141.80 – 165.63)² + (7)(192.14 – 165.63)² + (9)(166.22 – 165.63)² + (6)(153.67 – 165.63)² ≈ 8620.19
SSE = Σ(nᵢ – 1)sᵢ² = (5 – 1)(38.25)² + (7 – 1)(34.67)² + (9 – 1)(41.63)² + (6 – 1)(23.08)² ≈ 29592.19
Calculating the mean square for groups and error:
MSG = SSG/(I – 1) = 8620.19/3 ≈ 2873.40
MSE = SSE/(N – I) = 29592.19/23 ≈ 1286.62
Calculating the F test statistic:
F = MSG/MSE = 2873.40/1286.62 ≈ 2.23
The following is the corresponding ANOVA table:
The null hypothesis is rejected if p-value ≤ α = 0.05.
p-value = P(F3,23 > 2.23) > 0.10, since F = 2.23 is less than 2.34, the upper 0.10 critical value of the F3,23 distribution.
Since p-value > α = 0.05, the null hypothesis is not rejected. At a 5% level of significance, there is insufficient evidence to conclude that at least one of the population means differs from the other means.
Example
For the previous example, the critical value method can also be used: The null hypothesis is rejected if F > F*3,23,0.05, where F* = 3.03. Since F = 2.23 < F*, the null hypothesis is not rejected.
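The whole calculation for the hotel example can be sketched end to end from the summary statistics. This is a rough standard-library sketch, not a definitive implementation: the tail probability is approximated with Simpson's rule (assuming both degrees of freedom are at least 2), and the function names are illustrative.

```python
import math


def f_upper_tail(f_value, df1, df2, steps=2000):
    """Approximate P(F > f_value) for F(df1, df2); assumes df1, df2 >= 2."""
    a, b = df1 / 2, df2 / 2
    z = df1 * f_value / (df1 * f_value + df2)
    beta = math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))
    h = z / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * t ** (a - 1) * (1 - t) ** (b - 1)
    return 1.0 - total * h / (3 * beta)


def anova_from_summary(sizes, means, sds):
    """Return (F, p-value) from sample sizes, sample means, and sample SDs."""
    n_total, n_groups = sum(sizes), len(sizes)
    overall = sum(n * m for n, m in zip(sizes, means)) / n_total
    ssg = sum(n * (m - overall) ** 2 for n, m in zip(sizes, means))
    sse = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    msg = ssg / (n_groups - 1)
    mse = sse / (n_total - n_groups)
    f_stat = msg / mse
    return f_stat, f_upper_tail(f_stat, n_groups - 1, n_total - n_groups)


f_stat, p_value = anova_from_summary(
    sizes=[5, 7, 9, 6],
    means=[141.80, 192.14, 166.22, 153.67],
    sds=[38.25, 34.67, 41.63, 23.08],
)
print(round(f_stat, 2), p_value > 0.05)
```

The approximate p-value is above 0.10, in agreement with the table-based bracket, so the null hypothesis is not rejected at the 5% level.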
Example
For the previous example, suppose the researcher instead conducted a two-sample test on the two sample means that are furthest apart, which belong to Population 1 and Population 2. The reasoning would be that if there is a difference between these two means, then at least one of the four means differs.
The researcher conducts the following test at a 5% level of significance:
H₀: μ₁ = μ₂
Hₐ: μ₁ ≠ μ₂
The researcher gets a test statistic of t = 2.38 and a p-value of 0.0387, and so rejects this null hypothesis. However, this result cannot be used to draw an overall conclusion about the four population means: the ANOVA F test in the previous example found insufficient evidence, at a 5% level of significance, that any of the population means differ.
The above example shows why it is incorrect to look at the data, pick the two sample means that are furthest apart, and conduct a two-sample t test on them. Even when all the population means are equal, it is quite likely that some pair of sample means will differ by a large amount. The more samples that are compared, the more likely it is that the two most extreme sample means are far apart by chance alone.
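A small simulation can make this concrete. The setup below is entirely hypothetical: four samples of size 6 are drawn from the same standard normal population, so the null hypothesis is true by construction. In each trial, the two samples with the most extreme means are compared with a pooled two-sample t test at the 5% level, using the table value t* = 2.228 for 10 degrees of freedom. The seed and number of trials are arbitrary choices.

```python
import random
from statistics import fmean, variance

N_GROUPS = 4
N_PER_GROUP = 6
T_STAR = 2.228  # upper 0.025 critical value of t with 2*(6 - 1) = 10 df


def extreme_pair_rejection_rate(trials=2000, seed=1):
    """Share of trials in which the most extreme pair 'differs' at 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        samples = [
            [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
            for _ in range(N_GROUPS)
        ]
        lowest = min(samples, key=fmean)
        highest = max(samples, key=fmean)
        # Pooled variance of the two chosen samples (equal sample sizes).
        pooled_var = (variance(lowest) + variance(highest)) / 2
        std_error = (pooled_var * 2 / N_PER_GROUP) ** 0.5
        t_stat = (fmean(highest) - fmean(lowest)) / std_error
        # t_stat is never negative here, so this is the two-sided test.
        if t_stat > T_STAR:
            rejections += 1
    return rejections / trials


rate = extreme_pair_rejection_rate()
print(rate)
```

If the test were valid, the rejection rate would be near 0.05; selecting the most extreme pair after seeing the data pushes it well above that.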
Example
A provincial education minister wants to determine whether there is any difference in the mean GPAs of students attending each of the province's three large universities. The GPAs of a sample of students from each of the universities are shown below:
Then Population 1 is the students in University A, Population 2 is the students in University B, and Population 3 is the students in University C. The following are the corresponding side-by-side box-plots:
It appears there may be a difference between the mean GPAs for the students from the three universities. Additionally, since the distributions are approximately symmetric, normality can reasonably be assumed.
From the data, the following is calculated:
n₁ = 5, x̄₁ = 2.91, s₁ = 0.344
n₂ = 6, x̄₂ = 3.74, s₂ = 0.471
n₃ = 4, x̄₃ = 3.31, s₃ = 0.596
Determining whether the F test can be used:
max(sᵢ)/min(sᵢ) = 0.596/0.344 ≈ 1.73 < 2
Therefore, since it can be assumed the population standard deviations are equal, the F test can be used.
Let the level of significance α = 0.10.
H₀: μ₁ = μ₂ = μ₃ vs. Hₐ: not all of the population means are equal
Since there are I = 3 populations, the total sample size is N = 5 + 6 + 4 = 15.
Calculating the overall mean:
Σnᵢx̄ᵢ = (5)(2.91) + (6)(3.74) + (4)(3.31) = 50.23
x̿ = 50.23/N = 50.23/15 ≈ 3.35
Calculating the sum of squares for groups and error:
SSG = Σnᵢ(x̄ᵢ - x̿)² = (5)(2.91 – 3.35)² + (6)(3.74 – 3.35)² + (4)(3.31 – 3.35)² ≈ 1.89
SSE = Σ(nᵢ – 1)sᵢ² = (5 – 1)(0.344)² + (6 – 1)(0.471)² + (4 – 1)(0.596)² ≈ 2.65
Calculating the mean square for groups and error:
MSG = SSG/(I – 1) = 1.89/2 ≈ 0.95
MSE = SSE/(N – I) = 2.65/12 ≈ 0.22
Calculating the F test statistic:
F = MSG/MSE = 0.95/0.22 ≈ 4.32
The following is the corresponding ANOVA table:
The null hypothesis is rejected if p-value ≤ α = 0.10.
p-value = P(F2,12 > 4.32), which lies between 0.025 and 0.05 (since 3.89 < 4.32 < 5.10).
Since p-value < α = 0.10, the null hypothesis is rejected. At a 10% level of significance, there is sufficient evidence to conclude that at least one of the population mean GPAs differs from the other means.
Example
For the previous example, the critical value method can also be used: The null hypothesis is rejected if F > F*2,12,0.10, where F* = 2.81. Since F = 4.32 > F*, the null hypothesis is rejected.
Confidence Intervals and ANOVA
A 100(1 – α)% confidence interval for one of the population means μᵢ is constructed as follows:
x̄ᵢ ± t*√(sp²/nᵢ)
where sp² = MSE and t* is the upper α/2 critical value with N – I degrees of freedom.
Notice that this confidence interval uses a pooled estimate for the population standard deviation instead of the individual sample standard deviation. This is because one of the assumptions of ANOVA is that all populations have the same population standard deviation σ. To estimate this standard deviation, the pooled estimate is used, which is based on all of the samples rather than just the sample of interest.
Therefore, a better estimate of σ is obtained when all N observations are used to obtain sp rather than when just the nᵢ observations are used to obtain sᵢ.
MSE is the pooled estimate of the common variance σ² and √(MSE) is the pooled estimate of the common standard deviation σ.
Example
For the previous example, a 95% confidence interval is constructed for University A's students' true mean GPA:
x̄₁ = 2.91
Since α = 0.05, α/2 = 0.025.
t* = 2.179 with 12 degrees of freedom.
sp² = MSE = 0.22
(2.91 – (2.179)√(0.22/5), 2.91 + (2.179)√(0.22/5)) ≈ (2.45, 3.37)
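The interval arithmetic can be sketched as a short function. The critical value t* is passed in as a table value rather than computed, and the function name is illustrative.

```python
import math


def mean_confidence_interval(sample_mean, n, mse, t_star):
    """Confidence interval for one population mean using the pooled MSE."""
    margin = t_star * math.sqrt(mse / n)
    return sample_mean - margin, sample_mean + margin


# University A: sample mean 2.91, n = 5, MSE = 0.22, t* = 2.179 (12 df).
low, high = mean_confidence_interval(2.91, 5, 0.22, 2.179)
print(round(low, 2), round(high, 2))
```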
Confidence Intervals for Two Samples of I Populations
A confidence interval for the difference between any two population means μᵢ – μj can be constructed as follows:
(x̄ᵢ – x̄j) ± t*√(sp²/nᵢ + sp²/nj)
where t* is the upper α/2 critical value with N – I degrees of freedom and sp² = MSE.
Example
For the previous example, a 90% confidence interval is constructed for the difference in mean GPA for University B and University C:
x̄₂ - x̄₃ = 3.74 – 3.31 = 0.43
Since α = 0.10, α/2 = 0.05.
t* = 1.782 with 12 degrees of freedom.
sp² = MSE = 0.22
sp²/n₂ + sp²/n₃ = 0.22/6 + 0.22/4 ≈ 0.092
√(0.092) ≈ 0.30
(0.43 – (1.782)(0.30), 0.43 + (1.782)(0.30)) ≈ (-0.10, 0.96)
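The same interval can be sketched in code, again with t* supplied from a table and an illustrative function name. Because the code does not round the standard error to 0.30 before multiplying, it gives roughly (-0.11, 0.97), which differs slightly from the hand calculation in the second decimal place.

```python
import math


def difference_confidence_interval(mean_i, mean_j, n_i, n_j, mse, t_star):
    """Confidence interval for mu_i - mu_j using the pooled MSE."""
    std_error = math.sqrt(mse / n_i + mse / n_j)
    margin = t_star * std_error
    diff = mean_i - mean_j
    return diff - margin, diff + margin


# University B vs. University C: MSE = 0.22, t* = 1.782 (12 df).
low, high = difference_confidence_interval(3.74, 3.31, 6, 4, 0.22, 1.782)
print(round(low, 2), round(high, 2))
```

Either way, the interval contains 0, so a difference of zero between the two mean GPAs is not ruled out at this confidence level.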
Analysis of Variance vs. Pooled Two-Sample t Test
Since ANOVA can be used for comparing two or more population means, it can be used as an alternative to the pooled two-sample t test when only two populations need to be compared and the population standard deviations are equal.
The following are three relationships between ANOVA's variables and the pooled two-sample t test's variables:
1. MSE = sp²
2. F = t²
3. p-value for ANOVA = p-value for pooled two-sample t test
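The first two relationships can be checked numerically. The two samples below are hypothetical, chosen only for illustration; the sketch computes the pooled two-sample t statistic and the ANOVA F statistic separately and compares them.

```python
from statistics import mean, variance

# Hypothetical two-sample data, for illustration only.
group1 = [4.0, 5.0, 6.0, 7.0]
group2 = [6.0, 8.0, 9.0, 10.0, 12.0]
n1, n2 = len(group1), len(group2)

# Pooled two-sample t statistic.
pooled_var = ((n1 - 1) * variance(group1)
              + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
t_stat = (mean(group1) - mean(group2)) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5

# ANOVA F statistic with I = 2 groups, computed from the raw data.
overall = mean(group1 + group2)
ssg = n1 * (mean(group1) - overall) ** 2 + n2 * (mean(group2) - overall) ** 2
sse = (sum((x - mean(group1)) ** 2 for x in group1)
       + sum((x - mean(group2)) ** 2 for x in group2))
msg = ssg / (2 - 1)
mse = sse / (n1 + n2 - 2)
f_stat = msg / mse

print(round(mse, 4), round(pooled_var, 4))
print(round(f_stat, 4), round(t_stat ** 2, 4))
```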
Robustness of ANOVA Procedures
Like the t procedures, the ANOVA procedures are robust against non-normality.
ANOVA is also robust against unequal population variances, especially when the sample sizes are equal or approximately equal.