Cook’s D (Cook’s distance) is a statistic that measures the influence of a single data point on a fitted regression model. It is calculated for each data point of a dataset, and it summarizes how much the model’s fitted values change when that data point is removed and the model is refit. Data points with a large Cook’s D are the ones whose removal would change the fit the most.

In a regression model, a residual is the difference between an observed value and the value the model fits to it. The mean squared error (MSE) is the sum of squared residuals divided by n - p, where n is the number of data points and p is the number of estimated coefficients, including the intercept. Cook’s D for a data point is the sum of the squared changes in all n fitted values when that point is deleted, divided by p times the MSE of the full model.

The formula for Cook’s D for data point i is:

D_i = Σ_j (ŷ_j - ŷ_j(i))² / (p × MSE)

where ŷ_j is the fitted value for data point j from the regression model with all data points, ŷ_j(i) is the fitted value for data point j from the regression model refit without data point i, and the sum runs over all n data points.

The same quantity can be computed without refitting the model:

D_i = (e_i² / (p × MSE)) × h_ii / (1 - h_ii)²

where e_i is the residual of data point i and h_ii is its leverage, the i-th diagonal element of the hat matrix. A data point therefore has a large Cook’s D when it combines a large residual with high leverage.
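As an illustration, the following is a minimal NumPy sketch of Cook’s D for an ordinary least squares fit, computed two ways: by refitting the model with each data point deleted, and from residuals and leverages without refitting. The function names, the simulated data, and the planted shift in the last observation are all assumptions made for this example; the design matrix X is assumed to have full column rank and to include an intercept column.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D from residuals and leverages (no refitting).

    X is the n-by-p design matrix, intercept column included.
    """
    n, p = X.shape
    # Leverages are the diagonal of the hat matrix; with X = QR they are
    # the squared row norms of Q.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                  # residuals of the full model
    mse = e @ e / (n - p)             # mean squared error
    return (e**2 / (p * mse)) * h / (1.0 - h) ** 2

def cooks_distance_refit(X, y):
    """Cook's D by deleting each data point in turn and refitting."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    mse = np.sum((y - fitted) ** 2) / (n - p)
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        # Squared change in all n fitted values, scaled by p * MSE.
        d[i] = np.sum((fitted - X @ beta_i) ** 2) / (p * mse)
    return d

# Simulated data (hypothetical): a straight line plus noise, with the
# last observation shifted upward to make it influential.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=x.size)
y[-1] += 8.0
X = np.column_stack([np.ones_like(x), x])

d_closed = cooks_distance(X, y)
d_refit = cooks_distance_refit(X, y)
print(np.round(d_closed, 3))
```

The two functions should agree up to floating-point error. The refitting version follows the definition directly, while the leverage version avoids n separate fits and is the form usually used in practice.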

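As a rule of thumb, a Cook’s D value greater than 1 indicates that the data point is influential. A stricter convention flags values greater than 4/n, where n is the number of data points. An influential point is not necessarily an outlier in the response variable, because high leverage combined with a moderate residual can also produce a large Cook’s D. The decision of whether to remove a data point from a regression model should not be based solely on the value of Cook’s D. Other factors, such as the distribution of the data points and the research objectives, should also be considered.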
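A cutoff of this kind can be applied as in the short sketch below. The function name and the Cook’s D values are hypothetical and chosen only for illustration. Besides the cutoff of 1, the sketch includes 4/n, another commonly cited convention, as an assumed alternative. The result is a list of candidates for closer inspection, not a list of points to delete.

```python
import numpy as np

def flag_influential(d, rule="one"):
    """Return the indices whose Cook's D exceeds a rule-of-thumb cutoff."""
    d = np.asarray(d, dtype=float)
    cutoffs = {
        "one": 1.0,                  # D > 1
        "four_over_n": 4.0 / d.size, # D > 4/n, a stricter convention
    }
    return np.flatnonzero(d > cutoffs[rule])

# Hypothetical Cook's D values for 20 data points.
d = np.full(20, 0.02)
d[7] = 0.25
d[19] = 1.40

flag_one = flag_influential(d, "one")
flag_4n = flag_influential(d, "four_over_n")
print(flag_one)   # indices above 1
print(flag_4n)    # indices above 4/20 = 0.2
```

With these values the cutoff of 1 flags only the last point, while 4/n also flags the point at index 7, which shows how much the choice of convention can change the set of flagged points.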

Cook’s D is a useful statistic for finding the individual data points that most affect a regression model. It is best used to select points for closer inspection, not as an automatic rule for deleting data.

References

Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15-18.

Fox, J. (2008). Applied regression analysis and generalized linear models. Sage Publications.