Introduction to K-R 20
The K-R 20, officially known as the Kuder-Richardson Formula 20, stands as a fundamental statistical measure within psychometrics and measurement theory, designed specifically to estimate the reliability of a test or scale. Reliability, in this context, refers to the consistency and stability of measurement—the degree to which a testing instrument yields similar results upon repeated measurements under consistent conditions. The K-R 20 provides a coefficient of internal consistency reliability, meaning it assesses how well the items within a single test measure the same underlying construct. This measure is indispensable in fields like psychological assessment, educational testing, and medical diagnostics, where the accuracy of instruments dictates critical decisions about individuals or populations. Its development marked a significant advancement, offering a robust alternative to earlier, more cumbersome methods of reliability estimation, particularly for assessments composed of dichotomously scored items.
The application of the K-R 20 is constrained primarily to situations where test items are scored using a dichotomous format, meaning responses can only fall into one of two categories, typically correct or incorrect, or yes or no. This characteristic makes it exceptionally suitable for analyzing achievement tests, ability scales, and certain attitude inventories that utilize binary scoring. Unlike measures designed for continuous rating scales, the K-R 20 leverages the specific mathematical properties inherent in binary data to derive a reliable estimate of internal consistency. Researchers rely heavily on this coefficient to ensure that their instruments are measuring a single, coherent psychological trait, thereby lending empirical support to the validity of any conclusions drawn from the test scores.
Before the introduction of this formula, researchers often relied on the cumbersome split-half method, which involved dividing the test into two halves, correlating the scores, and then adjusting the correlation using the Spearman-Brown prophecy formula. The ingenuity of the K-R 20 lies in its ability to simultaneously consider all possible ways of splitting the test items, thereby yielding an average of all split-half coefficients without the need for multiple calculations or arbitrary divisions. It effectively estimates the expected correlation between two random samples of items drawn from the same domain, providing a powerful, single-administration metric that significantly streamlined the validation process for psychological and educational tests.
Historical Context and Development
The need for a single, comprehensive measure of reliability became acutely apparent in the early 20th century as standardized testing gained prominence. Prior measurement practices, such as test-retest reliability, suffered from practice effects or instability over time, while the aforementioned split-half method introduced subjectivity in how the test was divided. Researchers sought a mathematical solution that could quantify internal consistency objectively and efficiently from a single administration of the test. This intellectual climate set the stage for one of the most important methodological advancements in psychometrics.
The K-R 20 was formally introduced in 1937 by two pioneers of psychometrics, G. Frederic Kuder and Marion W. Richardson. Their seminal work, published in the journal Psychometrika, addressed the fundamental theoretical challenge of estimating test reliability. They recognized that the reliability of a test is intrinsically linked to the consistency of item responses. By comparing the variance attributable to individual items with the variance of the total test score, Kuder and Richardson developed a formula that provided a rigorous, mathematically grounded estimate of internal consistency. Their work gave measurement specialists an immediately useful tool, and it quickly became a cornerstone of test development methodology.
The formula rapidly gained acceptance because it solved several logistical problems simultaneously. It ensured that the reliability estimate was independent of the specific way the test might be split, removing the subjective bias inherent in the split-half method. Furthermore, it provided a lower bound estimate of the true reliability of the test, meaning the actual reliability is likely equal to or higher than the calculated K-R 20 value. This conservative estimation strategy instilled greater confidence in the reported reliability coefficients, solidifying its place as a standard measure in academic and professional research across psychology, education, and health sciences.
Mathematical Definition and Interpretation
The K-R 20 coefficient, denoted as $r_{KR20}$, is derived from a relatively straightforward, yet powerful, mathematical relationship involving the number of items on the test, the variance of the total test score, and the variance associated with each individual item. The formula is expressed as follows:
$$r_{KR20} = \left(\frac{k}{k-1}\right) \left[1 - \frac{\sum p_i q_i}{\sigma_x^2}\right]$$
Here $k$ represents the total number of items on the test; $p_i$ represents the proportion of individuals who answered item $i$ correctly; $q_i$ represents the proportion of individuals who answered item $i$ incorrectly (where $q_i = 1 - p_i$); and $\sigma_x^2$ represents the variance of the observed total test scores. The term $\sum p_i q_i$ is crucial, as $p_i q_i$ represents the variance of a single dichotomous item. Thus, the numerator of the fraction inside the brackets is the sum of the variances of all individual items.
Conceptually, the K-R 20 calculation compares the observed variability of the total test scores ($\sigma_x^2$) against the variability that would be expected if the items were perfectly independent of one another (the sum of item variances, $\sum p_i q_i$). If the test is highly reliable, the items should covary positively, meaning the total test variance should be much larger than the sum of the individual item variances. A large discrepancy between these two variance components results in a high reliability coefficient. Conversely, if the items are measuring different things inconsistently, the total variance will be closer to the sum of the individual variances, resulting in a low K-R 20 value.
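One way to make this comparison explicit is the usual decomposition of the variance of a sum. The symbol $X_i$ for the 0/1 score on item $i$ is a label introduced here for illustration, and the identity assumes the total variance and the item variances are computed with the same divisor (the number of examinees):

$$\sigma_x^2 = \sum_{i=1}^{k} p_i q_i + 2\sum_{i<j} \operatorname{cov}(X_i, X_j)$$

Substituting this into the formula gives an equivalent form in which only the inter-item covariances drive the coefficient:

$$r_{KR20} = \left(\frac{k}{k-1}\right) \frac{2\sum_{i<j} \operatorname{cov}(X_i, X_j)}{\sigma_x^2}$$

When the items are uncorrelated, every covariance is zero and the coefficient is zero; the more strongly the items covary, the larger the share of total variance that the covariance term accounts for.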
The resulting K-R 20 coefficient usually falls between 0 and 1, although it can be negative when the items covary negatively on balance (for example, when some items are miskeyed). A coefficient close to 1.00 signifies extremely high reliability, indicating that the test items are highly consistent and the measurement error is minimal. Conversely, a coefficient near 0.00 suggests that the test is highly inconsistent, and scores are likely due to random error rather than reflecting the underlying construct. In practice, reliability coefficients must typically exceed a certain threshold (often 0.70 or 0.80) to be considered acceptable for research purposes, and values above 0.90 are often required for high-stakes decisions, such as those made in clinical diagnostics or personnel selection. Researchers must interpret the magnitude of the K-R 20 coefficient based on the specific context and purpose of the measurement instrument being evaluated.
Prerequisites and Assumptions
The effective application of the K-R 20 formula relies on fulfilling specific methodological and statistical assumptions. The most critical requirement, as previously noted, is that all items on the test must be dichotomously scored. This means that for every item, there are only two possible outcomes, usually coded as 0 (incorrect/absent) and 1 (correct/present). If a test includes items that utilize a Likert scale or any other continuous range of responses (e.g., 1 to 5), the K-R 20 is mathematically inappropriate and will yield a misleading estimate of reliability. In such cases, Cronbach’s Alpha (Coefficient Alpha), a more generalized reliability measure, must be employed instead.
Another fundamental assumption underlying the K-R 20, shared with Cronbach’s Alpha, is tau-equivalence. Tau-equivalence assumes that all items measure the same latent construct (unidimensionality) and that each item reflects that construct to the same degree, so that any differences in item variances arise from differing amounts of random measurement error rather than from systematic differences in item content or weighting. Strict tau-equivalence is rarely achieved in practice, and when it fails the K-R 20 acts as a lower bound on reliability rather than an exact estimate. If a test is highly heterogeneous (i.e., measuring multiple distinct constructs), the K-R 20 will typically underestimate the true reliability, signaling a problem with the test design rather than solely reflecting measurement inconsistency.
Furthermore, the K-R 20 implicitly assumes that the errors associated with different items are independent of one another. That is, an individual’s error on one item should not influence their error on any other item. Violations of this assumption, such as when items are grouped or follow a sequential pattern that introduces dependency (e.g., a passage followed by multiple related questions), can inflate or deflate the reliability estimate. Researchers utilizing the K-R 20 must carefully construct their instruments to minimize item dependence and strive for a test structure that reflects a single, coherent psychological dimension to ensure the accuracy and interpretability of the resulting coefficient.
Advantages and Limitations
One of the primary advantages of the K-R 20 is its efficiency and objectivity. By requiring only a single administration of the test, it avoids the temporal instability issues associated with test-retest reliability and eliminates the arbitrary nature of item splitting inherent in the split-half method. Furthermore, because the formula calculates the reliability based on the covariance structure of the items, it provides a comprehensive estimate that is statistically superior to measures relying on simple correlation between halves. This computational elegance allows researchers to obtain a robust reliability estimate quickly, accelerating the test development and validation cycle, particularly for large-scale assessments.
Despite its strengths, the K-R 20 has distinct limitations tied to its mathematical derivation. The most significant is its rigid requirement for dichotomous scoring: the item variance term $p_i q_i$ is only defined for binary items, so applying the formula to continuous or polytomous data yields a coefficient that is meaningless for interpreting internal consistency. Moreover, the K-R 20 is sensitive to the spread of item difficulty. If a test contains many items that are either extremely easy (almost everyone answers correctly) or extremely difficult (almost no one answers correctly), the variance of those items ($p_i q_i$) approaches zero. Such items add almost nothing to either the sum of item variances ($\sum p_i q_i$) or the total score variance ($\sigma_x^2$), so they increase the item count $k$ without contributing information about individual differences, and the coefficient then rests almost entirely on the consistency among the medium-difficulty items.
Another practical consideration is the relationship between K-R 20 and test length. Like most reliability estimates, the K-R 20 is influenced by the number of items ($k$). Generally, all else being equal, a longer test will yield a higher reliability coefficient. This is because the random errors associated with individual items tend to cancel each other out over a larger number of items, leading to a more stable total score. While this is not strictly a limitation of the formula itself, researchers must be careful not to confuse high reliability due solely to excessive test length with genuine high internal consistency of the construct being measured. The K-R 20 provides the necessary input for calculating the standard error of measurement, which is arguably the most practical output derived from reliability, quantifying the expected margin of error around an individual’s score.
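The two ideas in this paragraph, the effect of test length and the standard error of measurement, can be sketched with two standard classical-test-theory formulas: the Spearman-Brown prophecy formula and SEM = SD × sqrt(1 − reliability). The function names and the input values (a reliability of 0.80 and a total-score variance of 2.0) are illustrative assumptions, not figures from any real test.

```python
import math


def spearman_brown(reliability, length_factor):
    """Predicted reliability when the test length is multiplied by
    length_factor using comparable items (Spearman-Brown prophecy formula)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def standard_error_of_measurement(total_sd, reliability):
    """SEM = standard deviation of total scores * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1] for the SEM to be defined")
    return total_sd * math.sqrt(1.0 - reliability)


# Hypothetical inputs chosen only for illustration.
r = 0.80
total_sd = math.sqrt(2.0)

for factor in (0.5, 1, 2, 3):
    predicted = spearman_brown(r, factor)
    print(f"length x{factor}: predicted reliability = {predicted:.3f}")

print(f"SEM = {standard_error_of_measurement(total_sd, r):.3f} score points")
```

Under these assumed inputs, doubling the test raises the predicted reliability from 0.80 to roughly 0.89, which illustrates why a high coefficient on a very long test is weaker evidence of item homogeneity than the same coefficient on a short one.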
Calculation and Practical Application
The practical calculation of the K-R 20 involves several distinct steps, which are easily handled by modern statistical software but are essential for understanding the underlying mechanics. The initial step requires calculating the proportion of successful responses ($p_i$) and unsuccessful responses ($q_i$) for every single item in the test sample. This information is crucial because the product $p_i q_i$ represents the variance of that specific binary item. If an item is of moderate difficulty (e.g., $p_i = 0.5$), its variance ($0.5 \times 0.5 = 0.25$) is maximized, contributing maximally to the sum of item variances. Items with very high or very low difficulty contribute very little variance.
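As a quick illustration of how the item variance changes with difficulty, the short sketch below evaluates the product of p and q for a few proportions. The helper name `item_variance` and the chosen proportions are assumptions made for this example.

```python
def item_variance(p):
    """Variance of a 0/1 item whose proportion correct is p (that is, p * q)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return p * (1.0 - p)


# Hypothetical difficulty levels, from very hard to very easy.
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"p = {p:.2f}  ->  p*q = {item_variance(p):.4f}")
```

The values are symmetric around p = 0.5, where the variance peaks at 0.25, and they shrink toward zero as items become very easy or very hard.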
The second vital step involves calculating the variance of the total test scores ($\sigma_x^2$). This is done by summing each individual's item scores to get a total score, and then calculating the variance of these total scores across the entire sample. This total test variance represents the overall spread of performance observed in the sample. The magnitude of $\sigma_x^2$ is central to the K-R 20 calculation because it serves as the denominator in the final reliability term. A larger total variance suggests greater individual differences in the underlying trait, which, when coupled with a relatively low sum of item variances, indicates strong internal consistency.
Once the sum of item variances ($\sum p_i q_i$) and the total test variance ($\sigma_x^2$) are calculated, these values, along with the number of items ($k$), are substituted into the K-R 20 formula. In practical application, the K-R 20 is most frequently used in the context of educational assessment, such as high-stakes certification exams or standardized achievement tests, where items are definitively scored as correct or incorrect. It provides test developers with crucial evidence to support the claim that the test items function cohesively. For example, a researcher developing a screening tool for a specific cognitive deficit might use K-R 20 to confirm that the pass/fail items are consistently measuring that single deficit across the tested population.
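The steps described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `kr20`, the response matrix, and the choice of the population variance (dividing by the number of examinees, which matches the p × q form of the item variance) are all assumptions made for the example.

```python
def kr20(scores):
    """K-R 20 for a list of rows, each row holding one examinee's 0/1 item scores.

    Uses the population variance (divide by n) for the total scores so that it
    is on the same footing as the item variances p_i * q_i.
    """
    n = len(scores)
    k = len(scores[0])
    if k < 2:
        raise ValueError("K-R 20 needs at least two items")
    if any(value not in (0, 1) for row in scores for value in row):
        raise ValueError("K-R 20 requires dichotomous (0/1) item scores")

    # Step 1: proportion correct per item and the sum of item variances.
    p = [sum(row[i] for row in scores) / n for i in range(k)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)

    # Step 2: variance of the total scores.
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores have zero variance; K-R 20 is undefined")

    # Step 3: substitute into the formula.
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical responses: 5 examinees (rows) by 4 items (columns).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(f"K-R 20 = {kr20(scores):.3f}")
```

For this made-up matrix the sum of item variances is 0.80 and the total-score variance is 2.0, so the coefficient works out to (4/3) × (1 − 0.4) = 0.80.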
Relationship to Other Reliability Measures
The K-R 20 is intrinsically linked to other measures of internal consistency, most notably the older split-half method. Prior to the K-R 20, split-half reliability was the primary means of estimating internal consistency. However, since the K-R 20 effectively computes the mean of all possible split-half reliabilities, it is considered a far more comprehensive and stable estimate than any single split-half correlation, even one corrected by the Spearman-Brown formula. The K-R 20 offered a superior mathematical framework that eliminated the element of chance introduced by arbitrary splitting of test items.
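The "mean of all possible split-halves" property can be checked numerically on a small example. The sketch below assumes an even number of items, equal-sized halves, and the Flanagan-Rulon form of the split-half coefficient (which does not require the two halves to have equal variances); with the Spearman-Brown correction the match is only exact when the half variances are equal. The function names and the data are hypothetical.

```python
from itertools import combinations


def pop_variance(values):
    """Population variance (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def kr20(scores):
    """K-R 20 from rows of 0/1 item scores, using population variances."""
    n, k = len(scores), len(scores[0])
    p = [sum(row[i] for row in scores) / n for i in range(k)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)
    var_total = pop_variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def split_half(scores, half_a):
    """Flanagan-Rulon split-half coefficient for one division of the items."""
    k = len(scores[0])
    half_b = [i for i in range(k) if i not in half_a]
    a = [sum(row[i] for i in half_a) for row in scores]
    b = [sum(row[i] for i in half_b) for row in scores]
    total = [x + y for x, y in zip(a, b)]
    return 2 * (1 - (pop_variance(a) + pop_variance(b)) / pop_variance(total))


def mean_split_half(scores):
    """Average the split-half coefficient over every equal-sized split."""
    k = len(scores[0])
    if k % 2 != 0:
        raise ValueError("this sketch assumes an even number of items")
    # Keep item 0 in the first half so each split is counted exactly once.
    splits = [(0,) + rest for rest in combinations(range(1, k), k // 2 - 1)]
    values = [split_half(scores, half_a) for half_a in splits]
    return sum(values) / len(values)


# Hypothetical responses: 5 examinees (rows) by 4 items (columns).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(f"mean of all split-halves = {mean_split_half(scores):.3f}")
print(f"K-R 20                   = {kr20(scores):.3f}")
```

With four items there are three distinct splits; their coefficients differ from one another, but their average coincides with the K-R 20 value for the same data.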
Crucially, the K-R 20 is mathematically a special case of Cronbach’s Alpha ($\alpha$). Coefficient Alpha is the generalized formula for estimating internal consistency, designed for tests where items are scored polytomously or continuously (e.g., Likert scales). When the item responses are strictly dichotomous, the calculation for Cronbach’s Alpha simplifies and becomes algebraically equivalent to the Kuder-Richardson Formula 20. Therefore, while modern statistical software often reports Cronbach’s Alpha regardless of the item format, researchers must understand that if their data are binary, the reported Alpha value is indeed the K-R 20 coefficient. This relationship underscores the foundational importance of Kuder and Richardson’s original work.
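The equivalence is easy to confirm on binary data. The following sketch computes Coefficient Alpha from general item variances and K-R 20 from the proportions correct, then compares the two. The function names and the data are assumptions for illustration; the match relies on using the same variance divisor for the items and for the total score, since that factor cancels in the ratio.

```python
def pop_variance(values):
    """Population variance (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def cronbach_alpha(scores):
    """Coefficient Alpha from general item variances (any numeric item scores)."""
    k = len(scores[0])
    item_vars = [pop_variance([row[i] for row in scores]) for i in range(k)]
    var_total = pop_variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / var_total)


def kr20(scores):
    """K-R 20 from the proportions correct on 0/1 items."""
    n, k = len(scores), len(scores[0])
    p = [sum(row[i] for row in scores) / n for i in range(k)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)
    var_total = pop_variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical binary responses: 6 examinees (rows) by 5 items (columns).
scores = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
]

alpha = cronbach_alpha(scores)
kr = kr20(scores)
print(f"alpha = {alpha:.6f}, K-R 20 = {kr:.6f}, difference = {abs(alpha - kr):.2e}")
```

Any difference printed here is floating-point rounding only, because the variance of a 0/1 item is exactly the product of its proportion correct and proportion incorrect.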
Another related, though less frequently used, measure is the Kuder-Richardson Formula 21 (K-R 21). The K-R 21 is a further simplification of the K-R 20, designed for computational ease in the era before widespread computing power. The K-R 21 makes the strong and often unrealistic assumption that all items in the test have equal difficulty (i.e., the proportion correct, $p$, is the same for every item). Because this assumption is rarely met in real-world testing, the K-R 21 typically yields a reliability coefficient that is lower than the K-R 20. Consequently, while K-R 21 is computationally faster, K-R 20 is generally preferred as it does not require the strict assumption of equal item difficulty, making it a more accurate estimate of true internal consistency.
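For comparison, the K-R 21 shortcut replaces the sum of item variances with a term based only on the mean total score M, commonly written as (k/(k−1)) × [1 − M(k − M)/(k × variance of total scores)]. The sketch below computes both coefficients on the same data. The function names and the response matrix are hypothetical, and population variances are assumed throughout.

```python
def pop_variance(values):
    """Population variance (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def kr20(scores):
    """K-R 20: uses each item's own proportion correct."""
    n, k = len(scores), len(scores[0])
    p = [sum(row[i] for row in scores) / n for i in range(k)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)
    var_total = pop_variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def kr21(scores):
    """K-R 21: assumes every item has the same difficulty, so only the mean
    total score and the total-score variance are needed."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / len(totals)
    var_total = pop_variance(totals)
    return (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * var_total))


# Hypothetical responses with unequal item difficulties (0.8, 0.6, 0.4, 0.2).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(f"K-R 20 = {kr20(scores):.3f}")
print(f"K-R 21 = {kr21(scores):.3f}")
```

Because the item difficulties in this made-up example are unequal, K-R 21 comes out below K-R 20; the two would coincide only if every item had the same proportion correct.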
Conclusion and Modern Usage
The K-R 20 remains an indispensable tool in the modern psychometric toolkit, securing its legacy as one of the most significant contributions to the quantitative analysis of measurement instruments. Its precise mathematical definition and clear interpretive guidelines ensure that test developers have a reliable method for vetting the internal structure of assessments comprising binary items. While Cronbach’s Alpha may dominate discussions about internal consistency for scales utilizing continuous response formats, the K-R 20 continues to be the definitive standard for evaluating the reliability of traditional achievement and aptitude tests.
In contemporary research, the K-R 20 facilitates the rigorous development of scales used in high-stakes environments, such as certification exams, standardized university entrance tests, and epidemiological screening instruments. The consistency it measures is crucial for establishing validity—the assurance that a test not only measures something consistently but measures what it purports to measure. Without a high K-R 20 coefficient, any claims about the validity or utility of a dichotomously scored test are severely undermined, emphasizing its role as a necessary prerequisite for sound psychological and educational measurement.
Ultimately, the Kuder-Richardson Formula 20 provides a powerful, single-administration, lower-bound estimate of reliability based on the internal structure of a test. Its enduring relevance, nearly a century after its inception, highlights the foundational strength of its statistical derivation. Researchers should continue to use the K-R 20 whenever they evaluate instruments where responses are scored dichotomously, ensuring the consistency of the items and the trustworthiness of the resulting scores that inform critical decisions about individuals and policy.
References
- Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The theoretical status of the Kuder-Richardson 20 reliability coefficient. Psychometrika, 69(2), 131-144. https://doi.org/10.1007/bf02296331
- Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2(3), 151-160. https://doi.org/10.1007/BF02289233
- Yuan, K. H., & Bentler, P. M. (1999). A comparison of three indices of fit: The Kuder-Richardson 20, the Root Mean Square Error of Approximation, and the Normed Fit Index. Structural Equation Modeling, 6(2), 183-203. https://doi.org/10.1080/10705519909540119
Cite this article
Mohammed looti (2025). K-R 20. Encyclopedia of psychology. Retrieved from https://encyclopedia.arabpsychology.com/k-r-20/
Mohammed looti. "K-R 20." Encyclopedia of psychology, 3 Dec. 2025, https://encyclopedia.arabpsychology.com/k-r-20/.