
SCREE PLOT



SCREE PLOT: Introduction and Definition

The Scree plot stands as a fundamental graphical tool in multivariate statistics, specifically designed for applications involving dimensionality reduction techniques such as Principal Component Analysis (PCA) and Exploratory Factor Analysis (EFA). Fundamentally, it serves as a visual representation of the variance explained by each successive component or factor extracted from a dataset. By plotting the eigenvalues—which quantify the amount of variance accounted for—against their corresponding component numbers, the Scree plot provides an immediate and intuitive method for assessing the relative importance of each component. This visualization is critical for researchers attempting to distill complex, high-dimensional data into a smaller, more manageable set of variables while retaining the maximum amount of original information. Its robust application across diverse fields underscores its utility in simplifying data structures and guiding subsequent analytical decisions.

The primary objective of employing a Scree plot is to aid the analyst in deciding how many components to retain, that is, how many are necessary to adequately explain the structure of the observed data. Retaining too many components leads to overfitting and a loss of parsimony, while retaining too few results in information loss and an incomplete representation of the underlying phenomena. The Scree plot offers an efficient visual diagnostic (not a formal statistical test), allowing researchers to observe where the marginal gain in variance explained diminishes sharply. This visual criterion, valued for its simplicity, makes the Scree plot a staple in disciplines ranging from the physical sciences to the measurement of human behavior.

The name “Scree plot” derives from geology, where “scree” refers to the accumulation of loose rock debris that has fallen from a cliff face, forming a slope or talus at the base. Visually, the plot often resembles this geological formation: a steep initial drop (the cliff face) followed by a flattened, gentle slope (the scree). This characteristic shape provides the necessary visual cues for interpretation. The components that fall on the steep initial part of the curve are deemed highly relevant, explaining a substantial portion of the variance, whereas those falling on the gentle slope, or the “scree,” are considered less important, contributing little more than random noise. The Scree plot is used extensively in quantitative disciplines such as psychometrics and data mining, and more generally wherever factor analysis is applied to identify underlying structure.

Theoretical Foundation: Eigenvalues and Variance Decomposition

The theoretical efficacy of the Scree plot hinges upon the principle of variance decomposition, a core concept in multivariate statistics. When analyzing a dataset characterized by multiple correlated variables, the total variance present across all variables can be mathematically partitioned into a set of orthogonal (uncorrelated) components. These components are derived such that the first component accounts for the largest possible amount of variance, the second accounts for the largest remaining variance uncorrelated with the first, and so forth. This process continues until the total variance is fully decomposed. The components are typically extracted using algorithms associated with Principal Component Analysis (PCA) or Exploratory Factor Analysis (EFA), techniques which aim to restructure the observed variable space into a lower-dimensional component space.

Central to this process are the eigenvalues, sometimes referred to as characteristic roots. An eigenvalue represents the amount of variance explained by its corresponding component or factor. When a PCA or EFA is performed, the output includes a list of eigenvalues, one per extracted component; a full extraction yields as many eigenvalues as there are original variables. These eigenvalues are conventionally presented in descending order of magnitude, so that the components explaining the most variance appear first. Plotting these values provides the visual foundation of the Scree plot, illustrating the rate at which explanatory power diminishes across the components. A large eigenvalue signifies that the component is highly effective in summarizing the data structure, while an eigenvalue close to zero suggests that the component contributes almost nothing.
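As a minimal sketch of how such a list of eigenvalues can be obtained, the following Python example computes the eigenvalues of a correlation matrix with NumPy and orders them from largest to smallest. The function name correlation_eigenvalues and the synthetic dataset (six variables driven by two latent sources) are assumptions introduced purely for illustration; they do not come from any particular study.

```python
import numpy as np

def correlation_eigenvalues(data):
    """Return the eigenvalues of the correlation matrix of `data`, largest first.

    `data` is an (n_samples, n_variables) array.
    """
    corr = np.corrcoef(data, rowvar=False)
    # eigvalsh is intended for symmetric matrices and returns ascending order,
    # so the result is reversed to obtain the conventional descending order.
    return np.linalg.eigvalsh(corr)[::-1]

# Synthetic data (an assumption of this sketch): variables 1-3 follow one
# latent source, variables 4-6 follow another, and all carry random noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
loadings = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
data = latent @ loadings + 0.5 * rng.normal(size=(500, 6))

eigenvalues = correlation_eigenvalues(data)
print(eigenvalues)
# For a correlation matrix the eigenvalues sum to the number of variables.
print(eigenvalues.sum())
```

Because each standardized variable contributes one unit of variance, the eigenvalues of a correlation matrix add up to the number of variables, which makes each eigenvalue directly comparable to the contribution of a single variable.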

The magnitude of the eigenvalues directly informs the analyst about the underlying structure of the data. If a dataset possesses a strong, underlying structure—meaning that the variables are highly correlated and can be effectively summarized by a few latent constructs—the first few eigenvalues will be substantially large, and the subsequent eigenvalues will drop off steeply. This rapid decrease indicates high data efficiency, confirming that the majority of the data’s complexity and variance is captured by only a small subset of the components. Conversely, if the eigenvalues decrease slowly and linearly across many components, it suggests that the data lacks a clear, simple structure, or that the variables are relatively independent, requiring a large number of components to adequately account for the total variance observed. Understanding this relationship between eigenvalue magnitude and variance capture is key to utilizing the Scree plot effectively.

Principal Component Analysis (PCA) and Factor Analysis Context

Although they are often discussed interchangeably in the context of the Scree plot, Principal Component Analysis (PCA) and Factor Analysis (FA) have distinct statistical goals; both, however, rely on the Scree plot for component or factor retention decisions. PCA is fundamentally a data reduction technique aimed at transforming a set of potentially correlated variables into a smaller set of uncorrelated components, preserving as much variance as possible. In PCA, the components are linear combinations of the observed variables, and the goal is simply efficiency and summarization. The Scree plot helps determine the optimal point of variance preservation versus dimensionality reduction, ensuring that the researcher selects the fewest components necessary to capture the bulk of the total system variability.

Factor Analysis, particularly Exploratory Factor Analysis (EFA), operates under a different conceptual model. EFA posits that the observed correlations among variables are due to underlying, unobserved (latent) factors. Unlike PCA, EFA is concerned with identifying these hidden constructs and understanding the causal relationship between the latent factors and the observed variables. In this context, the eigenvalues plotted on the Scree plot represent the variance explained by the latent factors. The decision guided by the Scree plot here is determining the number of meaningful latent constructs that influence the manifest variables, which is a crucial step in scale development and theoretical model testing in fields like psychometrics.

Despite their theoretical differences, both PCA and EFA utilize the Scree plot because the graphical approach provides a necessary counterbalance to purely mathematical retention rules. For instance, in PCA, if the goal is to explain 80% of the variance, the Scree plot visually confirms whether that 80% is achieved smoothly or whether a sharp drop-off suggests that a slightly lower percentage might be more structurally sound. Similarly, in EFA, while a mathematical criterion might suggest eight factors, the Scree plot might reveal that the decrease in explanatory power after the fourth factor is negligible, leading the researcher to prioritize parsimony and theoretical interpretability over purely maximizing variance retention. Thus, the Scree plot acts as a vital bridge between rigorous mathematical computation and practical, interpretative decision-making regarding dimensionality.
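The variance-threshold rule mentioned above (for instance, explaining 80% of the variance) can be sketched in a few lines of Python. The function name components_for_threshold and the six eigenvalues used here are hypothetical values chosen for illustration, not results from a real analysis.

```python
import numpy as np

def components_for_threshold(eigenvalues, threshold=0.80):
    """Return the smallest number of components whose cumulative share of
    variance reaches `threshold`, together with the cumulative shares."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Index of the first cumulative share that is >= threshold; the small
    # tolerance guards against floating-point rounding at the boundary.
    index = int(np.searchsorted(cumulative, threshold - 1e-12))
    count = min(index + 1, len(eigenvalues))
    return count, cumulative

# Hypothetical eigenvalues for six standardized variables (they sum to 6).
example_eigenvalues = [3.2, 1.9, 0.3, 0.25, 0.2, 0.15]
count, cumulative = components_for_threshold(example_eigenvalues, 0.80)
print(count)
print(np.round(cumulative, 3))
```

Reading the cumulative shares alongside the Scree plot shows whether the threshold is crossed on the steep part of the curve or only after several components from the flat tail have been added.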

Construction and Visualization of the Scree Plot

Producing a Scree plot follows directly from the computation of eigenvalues in a multivariate analysis. The first step is to run either a Principal Component Analysis or a Factor Analysis on the dataset in question. This yields the complete set of eigenvalues, whose number corresponds to the number of variables analyzed. Each eigenvalue should represent the variance explained by its corresponding component. The eigenvalues are then sorted in descending order, so that Component 1 (the component explaining the most variance) is listed first, followed sequentially by the others.

The visualization phase involves graphing these sorted eigenvalues. Traditionally, the Scree plot is constructed using a two-dimensional Cartesian plane. The horizontal axis (X-axis) represents the component or factor number (e.g., 1, 2, 3, 4…), indicating the rank of the component in terms of variance explained. The vertical axis (Y-axis) represents the magnitude of the eigenvalue associated with that component, quantifying the amount of variance explained. Each component is plotted as a distinct point corresponding to its eigenvalue magnitude. These points are then connected by a line, creating the characteristic curve. The resulting visual output is highly diagnostic: the steepness of the curve illustrates the relative contribution of each component, with a steep slope signifying a rapid drop in explanatory power.
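The construction just described might be sketched as follows using Matplotlib, which is an assumption of this example since the text does not prescribe any plotting library. The function name scree_plot and the eigenvalues are likewise hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

def scree_plot(eigenvalues, ax=None):
    """Draw eigenvalues against component number and return the Axes."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    component_numbers = np.arange(1, len(eigenvalues) + 1)
    if ax is None:
        _, ax = plt.subplots()
    # One marker per component, joined by a line to form the curve.
    ax.plot(component_numbers, eigenvalues, marker="o")
    ax.set_xticks(component_numbers)
    ax.set_xlabel("Component number")
    ax.set_ylabel("Eigenvalue")
    ax.set_title("Scree plot")
    return ax

# Hypothetical eigenvalues for six standardized variables.
ax = scree_plot([3.2, 1.9, 0.3, 0.25, 0.2, 0.15])
```

Calling ax.figure.savefig with a file name of one's choosing would write the figure to disk; in an interactive session, omitting the backend line and calling plt.show() displays it instead.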

The visual characteristics of the resulting plot provide immediate insights into the data structure. A dataset with a strong, unambiguous factor structure will typically produce a plot where the first few points are dramatically high, followed by a rapid, near-vertical drop. The remaining points will then trace a relatively flat line, forming the “scree” or “rubble” portion of the graph. Conversely, a weak or complex structure might yield a plot where the eigenvalues decrease gradually across many components, making the interpretation more difficult and subjective. The clarity of this visual output is precisely why the Scree plot remains a preferred method for preliminary assessment of dimensionality, even when more statistically robust techniques are available, as it allows the analyst to grasp the underlying data organization quickly.

Interpreting the Scree Plot: The “Elbow” Criterion

The primary method for interpreting the Scree plot and determining the optimal number of components to retain is the “elbow” criterion. The elbow refers to the point on the curve where the slope dramatically changes from steep (indicating high explanatory power) to relatively flat (indicating negligible additional explanatory power). This transition point signifies the boundary between meaningful components and components that are likely capturing mere error variance or random noise within the data. Components situated before or at the elbow are typically retained, as they explain a substantial and meaningful proportion of the systematic variance.

Specifically, the identification of the elbow is based on observing where the eigenvalues decrease rapidly and then subsequently begin to decrease slowly and almost linearly. If the eigenvalues decrease rapidly after the first few components—for example, a large drop between component 1 and component 2, and then a stabilization between component 3 and component 4—the elbow is generally located at the component preceding the stabilization. This scenario strongly suggests that the data is well-described by a few components, as the majority of the information is captured by the initial few factors. Conversely, if the eigenvalues continue to decrease slowly across numerous components, it implies that many components are needed to explain the data, or that the underlying structure is diffuse, making the location of a clear elbow ambiguous.
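Locating the elbow is ordinarily done by eye, but a rough numerical aid can be sketched. The following Python example uses the largest second difference of the eigenvalues as a stand-in for the sharpest bend; this is one possible heuristic assumed here for illustration, not an established rule, and the function name steep_component_count and the eigenvalues are hypothetical.

```python
import numpy as np

def steep_component_count(eigenvalues):
    """Count the components that lie before the sharpest bend in the curve.

    Heuristic only (an assumption of this sketch): the bend is taken to be
    the point with the largest second difference of the eigenvalues.
    """
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    if len(eigenvalues) < 3:
        raise ValueError("need at least three eigenvalues to measure a bend")
    # second_diff[i] measures the bend at the point with 0-based index i + 1.
    second_diff = np.diff(eigenvalues, n=2)
    bend_index = int(np.argmax(second_diff)) + 1
    # Every point before the bend belongs to the steep part of the curve.
    return bend_index

# Hypothetical eigenvalues: two large values, then a nearly flat tail.
n_steep = steep_component_count([3.2, 1.9, 0.3, 0.25, 0.2, 0.15])
print(n_steep)
```

When the eigenvalues decline gradually, the second differences are all small and of similar size, so the result of such a heuristic is as ambiguous as the visual judgment it imitates and should be checked against the plot itself.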

However, a major limitation of the elbow criterion is its inherent subjectivity. Different researchers looking at the same Scree plot might reasonably disagree on the exact location of the elbow, especially when the transition is not sharply defined. This subjectivity means the Scree plot is often best used as a heuristic guide rather than a definitive mathematical rule. Analysts frequently use the Scree plot in conjunction with other criteria, such as theoretical interpretability of the retained factors and other mathematical rules, to validate their decision. Despite this limitation, the visual impact of the Scree plot makes it indispensable for providing context and justification for the component retention decision, particularly in applied research settings where a clear explanation of methodological choices is necessary.

Alternative Criteria for Component Retention

While the visual inspection offered by the Scree plot is intuitive and highly valued, its subjective nature necessitates the use of more rigorous, quantitative criteria to confirm component retention decisions. The most commonly cited alternative is the Kaiser Criterion, or the Eigenvalue-Greater-Than-One rule. This criterion states that only components or factors with an eigenvalue greater than 1.0 should be retained. The rule presupposes an analysis of the correlation matrix, in which every variable is standardized to unit variance: an eigenvalue of 1.0 then represents the variance contributed by a single observed variable, so any component explaining less variance than one variable contributes minimally to the overall structure and should be discarded. Although widely used, the Kaiser criterion is often criticized for over-retaining components, particularly in analyses involving large numbers of variables.
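Applied to a list of eigenvalues, the Kaiser Criterion reduces to a simple count, as the following Python sketch shows. The function name kaiser_count and the eigenvalues are hypothetical, and the sketch assumes the eigenvalues come from a correlation matrix.

```python
import numpy as np

def kaiser_count(eigenvalues, cutoff=1.0):
    """Count the eigenvalues strictly greater than `cutoff`.

    Assumes the eigenvalues were computed from a correlation matrix, so that
    1.0 corresponds to the variance of one standardized variable.
    """
    return int(np.sum(np.asarray(eigenvalues, dtype=float) > cutoff))

# Hypothetical eigenvalues for six standardized variables.
n_kaiser = kaiser_count([3.2, 1.9, 0.3, 0.25, 0.2, 0.15])
print(n_kaiser)
```

An eigenvalue only marginally above 1.0 is retained by this rule while one marginally below is dropped, which is one reason the count is usually read alongside the shape of the Scree plot rather than on its own.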

A more statistically robust method, often favored by modern psychometricians, is Parallel Analysis. Unlike the subjective elbow rule or the often-inflating Kaiser rule, Parallel Analysis compares the actual eigenvalues extracted from the dataset against eigenvalues obtained from randomly generated data with the same number of variables and the same sample size. In practice, many random datasets are typically simulated and their eigenvalues summarized, for example by the mean or an upper percentile at each component position. The rule for retention is straightforward: only components whose actual eigenvalue is larger than the corresponding eigenvalue derived from the random data should be kept. This method effectively filters out components that are merely capturing random sampling error, providing a more accurate and statistically defensible determination of the true underlying dimensionality.
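A minimal sketch of Parallel Analysis in Python is given below. It assumes normally distributed random reference data, eigenvalues taken from correlation matrices, and a 95th-percentile reference curve; these are illustrative choices among several in use. The function name parallel_analysis, its parameters, and the synthetic dataset are hypothetical.

```python
import numpy as np

def parallel_analysis(data, n_simulations=200, percentile=95, seed=0):
    """Compare observed eigenvalues with those of random data of equal shape.

    Returns the number of leading components whose observed eigenvalue
    exceeds the reference value, plus the observed and reference eigenvalues.
    """
    data = np.asarray(data, dtype=float)
    n_samples, n_variables = data.shape
    rng = np.random.default_rng(seed)
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

    simulated = np.empty((n_simulations, n_variables))
    for i in range(n_simulations):
        # Uncorrelated noise with the same sample size and variable count.
        noise = rng.normal(size=(n_samples, n_variables))
        simulated[i] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    reference = np.percentile(simulated, percentile, axis=0)

    exceeds = observed > reference
    # Retain leading components only: stop at the first one that fails.
    n_retain = n_variables if exceeds.all() else int(np.argmin(exceeds))
    return n_retain, observed, reference

# Synthetic data (an assumption of this sketch): two latent sources, each
# driving three of the six observed variables, plus random noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 2))
loadings = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
data = latent @ loadings + 0.5 * rng.normal(size=(500, 6))

n_retain, observed, reference = parallel_analysis(data)
print(n_retain)
```

Plotting the reference eigenvalues on the same axes as the observed ones turns the Scree plot into a direct visual comparison: components are kept up to the point where the two curves cross.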

Researchers are strongly encouraged to employ a multifaceted approach when making component retention decisions. The Scree plot provides the initial visual assessment of the steepness and shape of the variance decay. This visual finding should then be tested against quantitative rules, such as the Kaiser Criterion, and ideally confirmed by Parallel Analysis. Furthermore, the final determination must always be grounded in theoretical coherence. If a statistical criterion suggests retaining a component that makes no theoretical sense or cannot be logically interpreted within the field of study, the researcher must carefully evaluate whether that component truly represents a meaningful latent construct or merely statistical artifact. The interplay among visual heuristics, statistical rules, and theoretical rationale leads to the most defensible research outcomes.

Applications Across Disciplines

The utility of the Scree plot extends across a vast range of quantitative fields, primarily serving as a key diagnostic step in multivariate data analysis. In psychometrics, the application is fundamental: Scree plots are routinely used during the development and validation of psychological tests and scales. When researchers administer a new survey intended to measure constructs like anxiety or intelligence, they use EFA to determine how many latent factors (e.g., specific dimensions of anxiety) underlie the responses to the questionnaire items. The Scree plot visually guides the decision regarding the number of factors to retain, ensuring that the resulting scale structure is both statistically sound and conceptually coherent, thereby validating the scale’s internal structure.

Beyond psychometrics, the Scree plot is indispensable in various fields utilizing data mining and machine learning for dimensionality reduction. In genomics, for example, high-dimensional data sets involving thousands of gene expression variables can be simplified using PCA. The Scree plot helps identify the few principal components that capture the majority of the biological variation, allowing subsequent analyses (like classification or regression) to be performed on a smaller, cleaner set of features, thus reducing computational load and mitigating the risk of overfitting. Similarly, in market research, the plot assists in identifying the key latent components that drive consumer preference, simplifying complex multidimensional preference data into actionable clusters or factors.

Furthermore, in fields like ecology and environmental science, where complex systems involve numerous interrelated variables (e.g., weather patterns, pollution levels, species diversity), PCA is frequently employed to summarize these variables. The Scree plot allows ecologists to swiftly determine the most influential environmental gradients or factors, enabling them to focus their research and modeling efforts on the few components that genuinely explain the systemic variance. This wide applicability demonstrates that the Scree plot is not merely an academic tool but a practical necessity for any researcher aiming to simplify complex systems and extract meaningful, parsimonious insights from large, multivariate datasets.

Conclusion

The Scree plot remains a critical and highly valuable tool in the domain of multivariate statistics, particularly for guiding decisions in Principal Component Analysis and Factor Analysis. It provides an effective graphical representation of the variance explained by each sequential component, plotting eigenvalues in descending order against their corresponding component number. This visualization allows researchers to observe the rate of decline in explanatory power and efficiently estimate the inherent dimensionality of their dataset. Its strength lies in its intuitive nature, providing immediate feedback on whether the data structure is compact (characterized by a steep drop-off) or diffuse (characterized by a gradual decline).

The methodological utility of the Scree plot is centered on the identification of the “elbow” point, which marks the transition from significant variance explained to marginal contributions. This point is interpreted as the optimal number of components that explain the majority of the systematic variance in the data. While the interpretation of the elbow can be subjective, the plot serves as an essential heuristic, which should ideally be triangulated with quantitative metrics such as the Kaiser Criterion (eigenvalues > 1.0) and Parallel Analysis to ensure a robust and defensible component retention decision.

In summary, the Scree plot is widely used in disciplines such as psychometrics and data mining, and wherever factor analysis is applied, to achieve parsimony and clarity in complex data structures. It is a fundamental step in determining the number of factors that underlie a measurement instrument or the optimal number of dimensions for data reduction. By synthesizing visual intuition with mathematical output, the Scree plot empowers analysts to move confidently from high-dimensional complexity to simplified, interpretable models, making it an enduring fixture in the analytical toolkit.
