
3.4.5 Residuals and Model Fit

Analyzing residuals to assess how well a linear model fits the data and identify outliers or patterns.

Definition

A residual is the difference between an observed value and the value predicted by a linear regression model. For a data point \((x_i, y_i)\), the residual is \(e_i = y_i - \hat{y}_i\), where \(\hat{y}_i\) is the prediction from the regression line \(\hat{y} = a + bx\), with intercept \(a\) and slope \(b\). Residuals measure goodness of fit: they show how accurate the model's predictions are and whether the assumptions of linear regression hold.

A well-fitting linear model has residuals scattered randomly around zero with no apparent pattern, which indicates that the line captures the relationship in the data. Residual plots (scatter plots of residuals against fitted values or against the predictor variable) help to identify outliers, detect departures from linearity, check for constant variance, and judge whether a linear model is appropriate for the data.
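As a minimal sketch of the definition, the following Python fits a least-squares line and computes each residual as observed minus predicted. The data values and the helper names `fit_line` and `residuals` are made up for illustration and are not part of the source material.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


def residuals(xs, ys, a, b):
    """Residual e_i = y_i - y_hat_i (observed minus predicted)."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical data, chosen only to keep the arithmetic simple.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
res = residuals(xs, ys, a, b)
print([round(e, 2) for e in res])  # approximately [-0.8, 0.6, 1.0, -0.6, -0.2]
```

For a least-squares line with an intercept, the residuals sum to zero (up to rounding), so a nonzero sum is a quick sign of an arithmetic slip.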

Key Formulas

  • \(e_i = y_i - \hat{y}_i\)
  • \(\hat{y} = a + bx\)
  • \(\text{Sum of Squared Residuals (SSE)} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2\)
  • \(R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\)
  • \(\text{Standard Error of Residuals} = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n-2}}\)
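The formulas above can be checked numerically. The sketch below assumes the observed values and the fitted values are already available as two lists; the function name `fit_summary` and the numbers are illustrative, not taken from the source.

```python
import math


def fit_summary(ys, y_hats):
    """Return (SSE, R^2, standard error of residuals) for a simple linear fit."""
    n = len(ys)
    y_bar = sum(ys) / n
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))  # sum of squared residuals
    sst = sum((y - y_bar) ** 2 for y in ys)                      # total sum of squares
    r_squared = 1 - sse / sst
    std_err = math.sqrt(sse / (n - 2))  # n - 2: two estimated parameters (a and b)
    return sse, r_squared, std_err


# Hypothetical observed values and predictions from the line y_hat = 2.2 + 0.6x
# evaluated at x = 1, 2, 3, 4, 5.
ys = [2, 4, 5, 4, 5]
y_hats = [2.8, 3.4, 4.0, 4.6, 5.2]

sse, r_squared, std_err = fit_summary(ys, y_hats)
print(round(sse, 3), round(r_squared, 3), round(std_err, 3))
```

Here SSE is 2.4 and SST is 6, so \(R^2 = 1 - 2.4/6 = 0.6\): the line accounts for 60% of the variation in \(y\).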

Common Mistakes

  • ⚠️ Confusing the direction of residuals: Students often forget that a residual is (observed - predicted), not (predicted - observed). A positive residual means the actual value is above the regression line, while a negative residual means it is below.
  • ⚠️ Misinterpreting residual plots: Students may fail to recognize patterns in residual plots that indicate violations of linearity or constant variance assumptions. For example, a curved pattern suggests the relationship is not linear, and a fan-shaped pattern suggests non-constant variance (heteroscedasticity).
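  • ⚠️ Incorrectly identifying outliers: Students sometimes confuse outliers (points with a large residual, that is, far from the regression line in the y-direction) with influential points (points whose removal would substantially change the slope or intercept, often because their x-value is extreme). A point can be an outlier without being influential, and an influential point can have a small residual because it pulls the line toward itself.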
  • ⚠️ Misunderstanding \(R^2\) interpretation: Students may incorrectly interpret \(R^2\) as the proportion of residuals explained by the model, rather than the proportion of variation in the response variable explained by the predictor variable. Additionally, they may not recognize that a high \(R^2\) does not guarantee the model is appropriate if residual plots show patterns.
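To illustrate the last point together with the warning about residual patterns, here is a small sketch that fits a straight line to data that is actually quadratic. The data set (\(y = x^2\) for \(x = 1, \dots, 6\)) and the helper name `fit_line` are assumptions chosen for the example.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical curved data: y = x^2, so a straight line is the wrong model.
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
res = [y - (a + b * x) for x, y in zip(xs, ys)]

y_bar = sum(ys) / len(ys)
sse = sum(e ** 2 for e in res)
sst = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - sse / sst

# Sign of each residual in order of increasing x.
signs = ["+" if e > 0 else "-" for e in res]
print(round(r_squared, 3))  # high, above 0.95
print(signs)                # positive at both ends, negative in the middle
```

The residuals are positive at both ends and negative in the middle, a curved pattern, even though \(R^2\) is above 0.95. A high \(R^2\) alone therefore does not show that a linear model is appropriate; the residual plot still has to be checked.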