3.6.4 Correlation and Causation¶
Distinguishing between correlation and causation, and evaluating whether statistical relationships imply causal relationships.
Definition¶
Correlation and causation are two distinct kinds of relationship between variables. Correlation is a statistical measure of the strength and direction of the linear relationship between two variables, quantified by the correlation coefficient \(r\), which ranges from \(-1\) to \(1\). A value of \(r = 1\) indicates a perfect positive linear relationship, \(r = -1\) a perfect negative linear relationship, and \(r = 0\) no linear relationship (a nonlinear relationship may still exist). Causation, by contrast, means that changes in one variable (the independent variable) produce changes in another variable (the dependent variable).
The fundamental principle is that correlation does not imply causation: two variables can be strongly correlated without one causing the other. This can happen because of:
- Confounding variables: a third variable influences both correlated variables.
- Reverse causation: the assumed effect is actually the cause.
- Coincidence: the correlation arises purely by chance.
- Indirect relationships: the variables are related only through intermediate variables, so the link between them is not direct.
Establishing causation generally requires controlled experiments with randomization and careful consideration of alternative explanations. Observational data showing a correlation are not enough on their own.
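The reference values of \(r\) described above can be checked on a tiny example. The sketch below is illustrative only: the helper name `pearson_r` and the data are assumptions made for this example, not part of the original notes. It shows that points on a rising line give \(r \approx 1\), points on a falling line give \(r \approx -1\), and a perfectly determined but nonlinear relationship (\(y = x^2\) on a symmetric range) gives \(r \approx 0\).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))

# Hypothetical data: a symmetric range of x values.
xs = [-3, -2, -1, 0, 1, 2, 3]

r_positive = pearson_r(xs, [2 * x + 1 for x in xs])   # points on a rising line
r_negative = pearson_r(xs, [-2 * x + 1 for x in xs])  # points on a falling line
r_parabola = pearson_r(xs, [x ** 2 for x in xs])      # y depends on x, but not linearly

print(r_positive, r_negative, r_parabola)
```

The last case is a reminder that \(r = 0\) rules out a linear relationship only: here \(y\) is completely determined by \(x\), yet the correlation coefficient does not detect it.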
Key Formulas¶
- \(r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}\)
- \(r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)\)
- \(r^2 = \frac{\text{explained variation}}{\text{total variation}}\) (the coefficient of determination, for a least-squares line fitted to \(y\) on \(x\))
- \(-1 \leq r \leq 1\)
- The correlation coefficient \(r\) measures linear association, not causation.
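As a rough numerical check of the formulas above, the following sketch computes \(r\) both ways (sum-of-products form and standardized-score form) and compares \(r^2\) with the ratio of explained to total variation for a least-squares line. The five data points and all variable names are made up for illustration.

```python
import math

# Hypothetical sample data (not from the original notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products about the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Formula 1: sum-of-products form.
r_products = sxy / (math.sqrt(sxx) * math.sqrt(syy))

# Formula 2: average product of standardized scores,
# using sample standard deviations (divisor n - 1).
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
r_zscores = sum(
    ((xi - mean_x) / s_x) * ((yi - mean_y) / s_y) for xi, yi in zip(x, y)
) / (n - 1)

# Coefficient of determination from the least-squares line of y on x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * xi for xi in x]
explained = sum((f - mean_y) ** 2 for f in fitted)
explained_ratio = explained / syy  # explained variation / total variation

print(r_products, r_zscores, r_products ** 2, explained_ratio)
```

For this data both forms give \(r \approx 0.775\), and \(r^2\) and the explained-to-total ratio both come out at about \(0.6\), up to floating-point rounding. None of these numbers says anything about whether \(x\) causes \(y\).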
Common Pitfalls¶
- ⚠️ Assuming that a strong correlation (\(|r|\) close to 1) proves that one variable causes another. Students often infer causation from correlation alone, without considering confounding variables or alternative explanations.
- ⚠️ Failing to identify confounding variables that could explain the observed correlation. For example, concluding that ice cream sales cause drowning deaths because both increase in summer, without recognizing that warm weather is the confounding variable.
- ⚠️ Ignoring the possibility of reverse causation. Students may assume the wrong direction of causality (e.g., assuming that depression causes poor sleep when poor sleep may instead contribute to depression).
- ⚠️ Misinterpreting the coefficient of determination (\(r^2\)) as proof of causation. Even if \(r^2 = 0.85\) (85% of the variation in one variable is accounted for by its linear relationship with the other), this describes the strength of the association, not whether the relationship is causal.
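The ice cream and drowning pitfall can be illustrated with a small simulation. This is a sketch under stated assumptions: every number below (temperature range, slopes, noise levels, sample size) is invented for illustration, and the helper names `pearson_r` and `residuals` are hypothetical. Temperature drives both variables, and neither variable affects the other. Removing the linear effect of temperature from each variable and correlating what is left (a partial correlation) is one simple way to check whether the association survives once the confounder is accounted for.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))

def residuals(ys, xs):
    """Residuals of ys after fitting a least-squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated data: temperature is the confounder. All coefficients are made up.
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream_sales = [20 * t + rng.gauss(0, 50) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

# Raw correlation: strong, although neither variable influences the other.
r_raw = pearson_r(ice_cream_sales, drownings)

# Correlation after removing the linear effect of temperature from both.
r_adjusted = pearson_r(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)

print(f"raw r = {r_raw:.3f}, adjusted for temperature r = {r_adjusted:.3f}")
```

In this setup the raw correlation is strongly positive, while the temperature-adjusted correlation is close to zero, which is what one would expect when a third variable explains the association. Adjusting for a measured confounder does not by itself establish causation, since unmeasured confounders may remain.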