POST HOC COMPARISON
- Introduction and Definition of Post Hoc Comparison
- Contextualizing the Need: The Omnibus Test Limitation
- The Problem of Multiple Comparisons: Type I Error Inflation
- Assumptions and Prerequisites for Post Hoc Testing
- Overview of Key Post Hoc Procedures
- Tukey’s Honestly Significant Difference (HSD) Test
- The Bonferroni Correction and Related Methods
- Scheffé’s Method and Robustness
- Choosing the Appropriate Test
Introduction and Definition of Post Hoc Comparison
A post hoc comparison, often referred to synonymously as a post hoc contrast, represents a critical class of statistical analyses performed following the initial detection of a statistically significant result in an omnibus test, such as Analysis of Variance (ANOVA) or complex multiple regression analysis. The term itself, derived from Latin, means “after this,” precisely indicating that these comparisons are developed and executed *following* the observation and analysis of the studied data. Unlike planned comparisons, which are formulated based on theoretical hypotheses before data collection begins, post hoc tests are exploratory in nature, designed to pinpoint the precise location of the differences that the initial omnibus test indicated were present somewhere within the group structure.
The fundamental necessity for employing these procedures arises when the null hypothesis of the main statistical model—which states that all group means are equal—is successfully rejected. While rejecting this null hypothesis confirms that at least two of the means are significantly different from one another, it provides no specific detail regarding *which* pairs or combinations of means are responsible for the overall effect. Therefore, the post hoc comparison serves as the necessary follow-up step, partitioning the overall variance to provide more concise and interpretable results than the initial experiment has yielded alone, moving from a general finding of “difference exists” to the specific conclusion of “Group A differs significantly from Group C, but not Group B.”
These comparisons are essential in experimental designs, particularly those involving three or more independent groups or conditions, where researchers seek to understand the nuanced relationships between various levels of an independent variable. Without proper post hoc analysis, the researcher risks an incomplete interpretation of their findings, unable to provide actionable or detailed conclusions about the specific effects of experimental manipulations. The robust application of these methods ensures that the conclusions drawn about specific group differences are statistically sound and appropriately adjusted for the inherent risks associated with performing multiple simultaneous tests.
Contextualizing the Need: The Omnibus Test Limitation
The primary statistical procedures utilized to compare three or more groups, such as the F-test in ANOVA, are known as omnibus tests. An omnibus test evaluates the overall variability across all group means simultaneously, assessing whether the total variance accounted for by the experimental manipulation exceeds the variance attributable to error. When the resulting F-statistic is significant, the researcher concludes that the independent variable has had a measurable effect; however, this conclusion is highly generalized. It merely confirms that $mu_1 = mu_2 = mu_3 = dots = mu_k$ is false, meaning that at least one mean is different from the others, but it fails to specify the pattern of these differences.
Consider an experiment examining the efficacy of three distinct therapeutic interventions (A, B, and C) compared to a control group (D). A significant F-test indicates that the four group means are not identical. This result could stem from numerous scenarios: Intervention A might be superior to all others; only the control group D might differ from the three intervention groups; or perhaps A and B are indistinguishable but both are better than C and D. The omnibus test cannot differentiate between these possibilities. This ambiguity underscores the limitation of the initial test and highlights the indispensable role of post hoc procedures in providing the necessary granular detail required for scientific reporting.
The transition from the omnibus test to post hoc procedures is a logical necessity in the scientific process. The omnibus test acts as a gatekeeper: if the overall effect is not significant, conducting further comparisons is typically unwarranted, as any apparent differences between specific pairs would likely be attributable to random chance. Only upon rejecting the global null hypothesis does the researcher move to the more focused, mean-to-mean comparisons provided by post hoc testing. This sequential approach maintains a structured methodology, ensuring that detailed investigation only occurs when the data broadly support the presence of an effect.
The Problem of Multiple Comparisons: Type I Error Inflation
The most significant challenge inherent in post hoc analysis is the management of the family-wise error rate (FWER), which is the probability of making at least one Type I error across the entire set of comparisons being performed. A Type I error occurs when the researcher incorrectly rejects a true null hypothesis—that is, concluding that a significant difference exists between two means when, in reality, there is none. While the conventional significance level, denoted by $alpha$ (usually set at 0.05), controls the error rate for any single comparison (the per-comparison error rate), this error rate accumulates rapidly as the number of comparisons increases.
If a researcher were to conduct $k$ independent tests, each at an $alpha$ level of 0.05, the probability of avoiding a Type I error across all tests is $(1 – alpha)^k$. Conversely, the probability of making at least one Type I error across the family of tests increases dramatically. For instance, in an experiment with five groups, there are $5(5-1)/2 = 10$ possible pairwise comparisons. If these 10 tests were performed without correction, the FWER would far exceed the desired 0.05, potentially reaching levels where one or more false positives are highly likely. This inflation undermines the reliability of the findings, leading to spurious claims of significance.
To maintain statistical integrity, post hoc methods incorporate various correction techniques designed specifically to control the FWER, ensuring that the probability of making a Type I error across the entire set of comparisons remains at or below the designated alpha level (e.g., 0.05). These adjustments typically involve modifying the critical value required for significance or altering the effective alpha level used for each individual comparison. The choice of which post hoc method to use often hinges on the researcher’s priority: whether they prioritize controlling the FWER stringently (leading to lower statistical power) or prefer a more liberal approach that increases power but accepts a slightly higher risk of Type I errors.
Assumptions and Prerequisites for Post Hoc Testing
Before applying any post hoc procedure, several statistical assumptions must be met, mirroring those required for the initial omnibus test, particularly ANOVA. The validity of the post hoc results is contingent upon the accuracy of these underlying assumptions. The primary prerequisites include the normal distribution of residuals, the independence of observations, and, critically, homogeneity of variance, meaning that the variance within each of the groups being compared should be approximately equal.
If the assumption of homogeneity of variance is severely violated (a common occurrence when sample sizes are unequal), the standard post hoc tests designed for equal variances, such as Tukey’s HSD, may produce unreliable results. In such scenarios, researchers must opt for alternative, more robust tests that do not rely on this assumption, such as the Games-Howell procedure, or utilize adjustments like those provided by Welch’s ANOVA, followed by specialized post hoc tests. Furthermore, the overall significance of the omnibus test is typically considered a non-negotiable prerequisite; conducting post hoc tests when the F-test is non-significant is generally discouraged, as it increases the risk of capitalizing on chance findings, although some specific research traditions may permit limited exploratory testing under certain theoretical justification.
The researcher must also ensure that the design of the experiment matches the requirements of the chosen post hoc technique. For instance, some procedures are specifically designed for pairwise comparisons (comparing every mean against every other mean), while others are suited for more complex non-pairwise contrasts (comparing the average of one set of means against the average of another set). A careful alignment between the research question, the experimental design, and the statistical method is paramount to extracting meaningful and justifiable conclusions from the data. Failure to meet the basic assumptions may require transformation of the data or the use of non-parametric equivalents, if available and appropriate for the research goals.
Overview of Key Post Hoc Procedures
The statistical literature offers a rich variety of post hoc tests, each designed with different statistical rigor and power, making the selection process dependent upon the specific experimental context and the desired control over error rates. These procedures can broadly be categorized based on their method of controlling the FWER and their robustness to assumption violations. Key methods include Tukey’s Honestly Significant Difference (HSD), the Bonferroni correction, and Scheffé’s method, among others like the Newman-Keuls procedure and Duncan’s multiple range test, though the latter two are often viewed as more liberal and less favored in contemporary statistical practice due to concerns about Type I error control.
The decision matrix for selecting a post hoc test involves evaluating several factors: the equality or inequality of sample sizes across groups, whether the researcher is interested only in pairwise comparisons or complex contrasts, and the relative importance of statistical power versus strict error control. Procedures that offer highly stringent control over the FWER, like Bonferroni and Scheffé, often result in lower statistical power, meaning they are less likely to detect a true difference if one exists (increased risk of Type II error). Conversely, more powerful tests, such as Tukey’s HSD under ideal conditions, strike a more balanced approach between Type I and Type II error minimization.
The evolution of statistical software has made the implementation of these complex procedures routine; however, the researcher remains responsible for understanding the mathematical basis and limitations of the chosen method. Misapplication of a procedure—for example, using a method optimized for equal sample sizes when samples are highly unequal—can lead to biased p-values and incorrect scientific conclusions. Therefore, a deep conceptual understanding of the trade-offs inherent in each technique is essential for ethical and accurate reporting.
Tukey’s Honestly Significant Difference (HSD) Test
Tukey’s HSD test is arguably the most widely used and recommended post hoc procedure for pairwise comparisons, particularly when the sample sizes across all groups are equal, a condition known as a balanced design. Developed by John Tukey, the method is rooted in the studentized range distribution ($q$), which accounts for the simultaneous nature of the multiple comparisons. The key advantage of Tukey’s HSD is that it maintains the FWER exactly at the chosen alpha level for all possible pairwise comparisons, assuming the underlying assumptions of ANOVA are met. This strict control makes it a highly reliable tool for identifying true differences among means.
The test calculates a single value, the Honestly Significant Difference (HSD), representing the minimum absolute difference between two group means required for that difference to be considered statistically significant. If the observed difference between any pair of means exceeds the calculated HSD, that difference is declared significant. Because it uses the studentized range statistic, which inherently incorporates the number of groups being compared, Tukey’s HSD is designed specifically for situations where all possible pairwise comparisons are of interest, providing a statistically consistent framework across the entire set of tests.
While Tukey’s HSD is powerful and rigorous for balanced designs, modifications exist for scenarios where sample sizes are unequal, commonly referred to as the Tukey-Kramer procedure. The Tukey-Kramer method adjusts the calculation to accommodate the varying sample sizes, allowing researchers to retain the statistical benefits of the HSD test even in unbalanced experimental settings. Because of its excellent balance between controlling the FWER and maintaining high statistical power relative to more conservative methods, Tukey’s HSD remains the default choice for many researchers conducting standard multi-group experiments.
The Bonferroni Correction and Related Methods
The Bonferroni correction is one of the simplest and most conservative methods used to control the FWER. It is a general method applicable not only to post hoc comparisons in ANOVA but also to any situation involving multiple simultaneous hypothesis tests. The procedure is straightforward: to maintain the family-wise error rate at $alpha$ (e.g., 0.05), the researcher divides the desired FWER by the total number of comparisons ($k$) to determine the new, stricter per-comparison alpha level ($alpha_{new} = alpha / k$).
For example, if 10 pairwise comparisons are planned and the desired FWER is 0.05, each individual test must achieve a p-value less than $0.05 / 10 = 0.005$ to be considered statistically significant. This rigorous adjustment guarantees that the overall probability of committing at least one Type I error across the family of tests is no greater than the original alpha level. While its simplicity and wide applicability are advantages, the Bonferroni correction is highly conservative, especially when the number of comparisons is large or when the individual tests are highly correlated. This conservativeness often leads to a significant loss of statistical power, increasing the risk of failing to detect true differences (Type II errors).
Due to the power limitations of the standard Bonferroni procedure, related, slightly less conservative step-wise methods have been developed, such as the Holm-Bonferroni method (or Holm’s sequential adjustment). Holm’s method is uniformly more powerful than the standard Bonferroni correction while still maintaining strong control over the FWER. It involves ordering the individual p-values from smallest to largest and sequentially applying less stringent adjusted alpha levels, thereby recovering some of the statistical power lost by the basic Bonferroni approach without sacrificing control over Type I error accumulation.
Scheffé’s Method and Robustness
Scheffé’s method is another highly conservative post hoc test, renowned for its robustness and its ability to handle not just pairwise comparisons but also any and all possible complex linear contrasts among the means. A complex contrast might compare the average of treatment groups A and B against the average of treatment groups C and D. Scheffé’s method guarantees that the FWER is controlled at $alpha$ for the infinite set of comparisons and contrasts that could potentially be drawn from the data.
Because Scheffé’s method controls the error rate for every conceivable contrast, it is inherently the most conservative of the major post hoc tests. This high degree of conservatism means that it possesses the lowest statistical power when only simple pairwise comparisons are being considered. Consequently, if a researcher is strictly interested in examining all possible pairwise mean differences, Tukey’s HSD is generally preferred because it offers better power while maintaining the same FWER control for that specific set of comparisons.
However, Scheffé’s method is the appropriate choice when the researcher is interested in exploring complex, non-pairwise contrasts that were not hypothesized a priori, or when the underlying assumptions of ANOVA, particularly homogeneity of variance, are suspect and the sample sizes are unequal. Its robustness to these violations and its ability to handle complex comparisons make it a valuable, albeit infrequently used, tool in exploratory data analysis where the focus is on maximizing protection against Type I errors, even at the cost of reduced power.
Choosing the Appropriate Test
The selection of the appropriate post hoc test is crucial for drawing valid inferences and requires a thoughtful consideration of the experimental design, the statistical assumptions, and the desired balance between Type I and Type II error control. Researchers generally follow a decision hierarchy guided by their specific research goals.
-
If the research goal is to compare all possible pairs of means in an experiment with equal sample sizes and reliable adherence to ANOVA assumptions, Tukey’s HSD is the optimal choice due to its excellent balance of power and FWER control. If sample sizes are unequal, the Tukey-Kramer adjustment should be utilized.
-
If the researcher is concerned about maximum protection against Type I error across a wide range of comparisons, including complex contrasts, or when dealing with highly violated assumptions (though robust alternatives like Games-Howell may be better for severe heterogeneity), Scheffé’s method is recommended, despite its low power for simple pairwise tests.
-
If the researcher has a specific, limited set of pre-planned comparisons that they wish to test post hoc, or if the individual comparisons are highly important, the Bonferroni or, preferably, the Holm-Bonferroni correction may be used. These methods are flexible but typically result in a substantial loss of power compared to Tukey’s HSD when applied to all possible pairwise comparisons.
Ultimately, the expert statistical practitioner understands that no single post hoc test is universally superior. The choice is a strategic one, requiring the researcher to weigh the risks of making a false positive claim (Type I error) against the risk of missing a genuine effect (Type II error), ensuring that the selected method aligns with the ethical and methodological standards of the field.