3.4.3 Linear Regression and Line of Best Fit
Finding and interpreting the equation of the line of best fit to model linear relationships between variables.
Definition
Linear regression is a statistical method for modeling the linear relationship between a dependent variable (response variable) and one or more independent variables (explanatory variables). The line of best fit, also called the regression line or least squares regression line, is the line that minimizes the sum of the squared residuals, where each residual is the vertical distance between an observed data point and the value the line predicts for it. For a simple linear regression with one independent variable, the equation of the line of best fit is \(\hat{y} = a + bx\), where \(\hat{y}\) is the predicted value, \(x\) is the independent variable, \(b\) is the slope, and \(a\) is the y-intercept. The line of best fit always passes through the point \((\bar{x}, \bar{y})\), where \(\bar{x}\) and \(\bar{y}\) are the means of the independent and dependent variables, respectively.
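One way to see why the line passes through \((\bar{x}, \bar{y})\) is to substitute \(x = \bar{x}\) into the equation and use the intercept formula \(a = \bar{y} - b\bar{x}\) from this section. This is a short check, not a full derivation of the least squares estimates:

\[
\hat{y} = a + b\bar{x} = (\bar{y} - b\bar{x}) + b\bar{x} = \bar{y}
\]

So whenever the input equals the mean of \(x\), the predicted value equals the mean of \(y\).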
Key Formulas
- \(\hat{y} = a + bx\)
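- \(b = r \cdot \frac{s_y}{s_x}\), where \(r\) is the correlation coefficient and \(s_x\), \(s_y\) are the standard deviations of \(x\) and \(y\)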
- \(a = \bar{y} - b\bar{x}\)
- \(r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}\)
- \(\text{residual} = y_i - \hat{y}_i\)
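The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the data set and the helper name `fit_line` are hypothetical, chosen only so the arithmetic is easy to follow, and the sample standard deviation (divisor \(n - 1\)) is assumed for \(s_x\) and \(s_y\). The ratio \(s_y / s_x\) is the same with either divisor.

```python
import math

# Hypothetical sample data, used only for illustration
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def fit_line(xs, ys):
    """Return (a, b, r) for the least squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Sums of squared deviations and cross products
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Correlation coefficient r
    r = sxy / math.sqrt(sxx * syy)

    # Sample standard deviations (assumed divisor: n - 1)
    s_x = math.sqrt(sxx / (n - 1))
    s_y = math.sqrt(syy / (n - 1))

    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b, r


a, b, r = fit_line(xs, ys)

# Residuals are actual minus predicted
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"y_hat = {a:.2f} + {b:.2f}x, r = {r:.4f}")
print("residuals:", [round(e, 2) for e in residuals])
```

For this data the fitted line is \(\hat{y} = 2.2 + 0.6x\), and the residuals sum to zero (up to rounding), which is a useful sanity check for any least squares fit that includes an intercept.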
Common Mistakes
- ⚠️ Confusing the slope formula: students often forget that the slope \(b = r \cdot \frac{s_y}{s_x}\) depends on both the correlation coefficient and the ratio of the standard deviations, not on the correlation coefficient alone.
- ⚠️ Misinterpreting the y-intercept: the intercept is the predicted value at \(x = 0\). When \(x = 0\) lies outside the range of the data, students may wrongly treat the intercept as meaningful, or fail to recognize that it may have no practical significance in context.
- ⚠️ Incorrectly calculating residuals: students sometimes compute residuals as \(\hat{y}_i - y_i\) (predicted minus actual) instead of the correct \(y_i - \hat{y}_i\) (actual minus predicted), which flips the sign and reverses the interpretation.
- ⚠️ Assuming causation from correlation: students often conclude that a strong linear relationship means one variable causes the other, but correlation alone does not establish a causal relationship.
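The first mistake can be made concrete with a small experiment: rescaling \(y\) changes the slope but leaves the correlation coefficient unchanged. The sketch below assumes a hypothetical data set and a hypothetical helper `slope_and_r`. It multiplies every \(y\) value by 10, as a change of units would.

```python
import math

# Hypothetical data; ys_scaled mimics a change of units (e.g. cm to mm)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
ys_scaled = [10 * y for y in ys]


def slope_and_r(xs, ys):
    """Return (b, r) using b = r * s_y / s_x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    # The (n - 1) divisors cancel in the ratio s_y / s_x
    b = r * math.sqrt(syy / sxx)
    return b, r


b1, r1 = slope_and_r(xs, ys)
b2, r2 = slope_and_r(xs, ys_scaled)

print(f"original: b = {b1:.2f}, r = {r1:.4f}")
print(f"scaled:   b = {b2:.2f}, r = {r2:.4f}")
```

The correlation is identical in both runs, while the slope is ten times larger for the rescaled data. So \(r\) on its own does not determine the slope.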