3.4.2 Correlation and Association

Understanding positive, negative, and no correlation; assessing the strength and direction of linear relationships.

Definition

Association is any relationship between two variables in a dataset. Correlation is narrower: it measures the strength and direction of a linear relationship between two quantitative variables. A positive correlation indicates that as one variable increases, the other tends to increase; a negative correlation indicates that as one variable increases, the other tends to decrease; and no correlation (or zero correlation) indicates no linear relationship. The correlation coefficient \(r\) quantifies this relationship on a scale from \(-1\) to \(1\): the sign gives the direction, values closer to \(-1\) or \(1\) indicate stronger linear relationships (with \(r = \pm 1\) meaning all points lie exactly on a line), and values near \(0\) indicate a weak or absent linear relationship. Correlation does not imply causation: a strong correlation between two variables does not mean that one causes the other.

Key Formulas

  • \(r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}\)
  • \(r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)\)
  • \(-1 \leq r \leq 1\)
  • \(r^2 = \text{coefficient of determination (proportion of variance explained)}\)
  • \(\text{slope of regression line} = r \cdot \frac{s_y}{s_x}\)
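The formulas above can be checked numerically. The following is a minimal sketch in plain Python, not a reference implementation; the function names (`pearson_r`, `pearson_r_from_z_scores`, `regression_slope`) and the small `hours`/`scores` dataset are made up for illustration. It computes \(r\) with both the sum-of-products form and the standardized (z-score) form, then derives \(r^2\) and the regression slope.

```python
import math


def _sums_of_squares(xs, ys):
    """Return (Sxy, Sxx, Syy): sums of products and squares of deviations from the means."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return sxy, sxx, syy


def pearson_r(xs, ys):
    """First formula: sum of cross-deviations over the product of root sums of squares."""
    sxy, sxx, syy = _sums_of_squares(xs, ys)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))


def pearson_r_from_z_scores(xs, ys):
    """Second formula: average product of z-scores, using sample standard deviations."""
    n = len(xs)
    _, sxx, syy = _sums_of_squares(xs, ys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sxx / (n - 1))
    s_y = math.sqrt(syy / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


def regression_slope(xs, ys):
    """Slope of the least-squares line of y on x: r * s_y / s_x."""
    _, sxx, syy = _sums_of_squares(xs, ys)
    # s_y / s_x equals sqrt(Syy / Sxx) because the (n - 1) factors cancel.
    return pearson_r(xs, ys) * math.sqrt(syy / sxx)


# Hypothetical data, invented for this example.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = pearson_r(hours, scores)
r_z = pearson_r_from_z_scores(hours, scores)
slope = regression_slope(hours, scores)

print(f"r = {r:.4f}, r (z-score form) = {r_z:.4f}")
print(f"r^2 = {r ** 2:.2f}, slope = {slope:.2f}")
```

For this dataset, \(\sum(x_i - \bar{x})(y_i - \bar{y}) = 6\), \(\sum(x_i - \bar{x})^2 = 10\), and \(\sum(y_i - \bar{y})^2 = 6\), so \(r = 6/\sqrt{60} \approx 0.775\), \(r^2 = 0.6\), and the slope is \(0.6\). Both forms of the formula give the same \(r\).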

Common Mistakes

  • ⚠️ Confusing correlation with causation: Students often incorrectly conclude that a strong correlation between two variables means one causes the other, ignoring the possibility of confounding variables or reverse causation.
  • ⚠️ Misinterpreting the correlation coefficient: Students may read \(r = 0.5\) as a "50% relationship", which it is not. The share of variability accounted for is given by \(r^2\) (the coefficient of determination), the proportion of variance in one variable explained by its linear relationship with the other. For \(r = 0.5\), \(r^2 = 0.25\), so only 25% of the variance is explained.
  • ⚠️ Ignoring the context of outliers: A single outlier can dramatically affect the correlation coefficient, yet students often fail to recognize this influence or consider whether the outlier should be investigated separately.
  • ⚠️ Treating a near-zero \(r\) as proof of no relationship: Because \(r\) measures only linear association, students may calculate a correlation coefficient near zero and conclude there is no relationship, when in fact a strong nonlinear (curved) relationship exists between the variables. A scatterplot should be checked before interpreting \(r\).
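Two of the pitfalls above, outlier influence and hidden nonlinear relationships, are easy to see numerically. The sketch below uses small made-up datasets and a hypothetical helper `pearson_r`; it is meant as an illustration, not as a general-purpose routine.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Pitfall: a single outlier can dominate r.
# Five invented points with no linear trend...
base_x = [1, 2, 3, 4, 5]
base_y = [3, 1, 2, 1, 3]
r_without_outlier = pearson_r(base_x, base_y)

# ...plus one extreme point far from the rest.
r_with_outlier = pearson_r(base_x + [20], base_y + [20])

# Pitfall: r near zero despite a perfect nonlinear relationship.
# y is exactly x squared, on x values symmetric about zero.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

print(f"without outlier: r = {r_without_outlier:.3f}")
print(f"with outlier:    r = {r_with_outlier:.3f}")
print(f"y = x^2:         r = {r_curve:.3f}")
```

In the first dataset, \(r\) is 0 for the five original points but rises to about 0.97 once the single point \((20, 20)\) is added. In the second, \(y\) is completely determined by \(x\), yet \(r\) is 0 because the relationship is symmetric and curved rather than linear. In both cases a scatterplot reveals what the coefficient alone hides.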