3.4.6 Causation vs. Correlation¶
Distinguishing between correlation and causation, and understanding the limitations of inferring cause and effect from observational data.
Definition¶
Causation vs. Correlation is the distinction between two types of relationships between variables:
Correlation refers to a statistical relationship between two variables in which they tend to move together in a predictable pattern. The correlation coefficient \(r\) measures the strength and direction of a linear relationship and ranges from \(-1\) to \(1\). A correlation of \(r = 0.8\) indicates that the variables move together strongly, but it does not imply that one causes the other.
Causation means that changes in one variable (the independent variable) directly produce changes in another variable (the dependent variable). Establishing it requires a plausible mechanism of action and temporal precedence: the cause must occur before the effect.
Key Distinction: A strong correlation between two variables does NOT prove that one variable causes changes in the other. There are several alternative explanations for an observed correlation:
- Reverse Causation: Variable B might cause Variable A, not the other way around
- Confounding Variables: A third variable C might cause changes in both A and B, creating a spurious correlation
- Coincidence: The correlation might be purely due to chance, especially with large datasets
Inferential Limitations: Observational data (data collected without experimental manipulation) can only establish correlation. To establish causation, researchers typically need:
1. A randomized controlled experiment in which the independent variable is manipulated
2. Control of confounding variables
3. A plausible mechanism explaining the causal relationship
4. Temporal precedence (the cause occurs before the effect)
5. A dose-response relationship (more of the cause produces more of the effect)
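The confounding scenario above can be made concrete with a small simulation. The sketch below (assuming NumPy; the variable names `c`, `a`, `b` and the seed are illustrative choices, not from the source) generates two variables that never influence each other but are both driven by a shared confounder, producing a strong spurious correlation:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Confounder C drives both A and B; A and B never influence
# each other, yet they end up strongly correlated through C.
c = rng.normal(size=n)                   # e.g., daily temperature
a = c + rng.normal(scale=0.5, size=n)    # e.g., ice cream sales
b = c + rng.normal(scale=0.5, size=n)    # e.g., drowning incidents

# Correlation between A and B is strong despite no causal link
r_spurious = np.corrcoef(a, b)[0, 1]
print(f"r(A, B) = {r_spurious:.2f}")
```

With these noise levels the observed \(r\) is roughly \(0.8\), which on its own is indistinguishable from a genuine causal relationship; only knowledge of the data-generating process (or an experiment) reveals the confounder.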
Core Formulas¶
- \(r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}\)
- \(\text{Correlation coefficient: } -1 \leq r \leq 1\)
- \(\text{Coefficient of determination: } R^2 = r^2 \text{ (proportion of variance explained)}\)
- \(\hat{y} = a + bx \text{ (regression line does not imply causation)}\)
- \(\text{Confounding occurs when } C \rightarrow A \text{ and } C \rightarrow B \text{ (a third variable } C \text{ drives both)}\)
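The first two formulas can be implemented directly. This sketch (assuming NumPy; `pearson_r` and the sample data are illustrative, not from the source) computes \(r\) from the definitional formula and \(R^2\) as its square:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation from the definitional formula:
    sum of products of deviations over the product of the
    root sums of squared deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)          # strength/direction of linear relationship
r_squared = r**2             # proportion of variance explained
```

For this data \(r \approx 0.775\) and \(R^2 = 0.6\): the regression line explains 60% of the variance in \(y\), but that number says nothing about whether \(x\) causes \(y\).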
Common Pitfalls¶
- ⚠️ Assuming that a strong correlation coefficient (e.g., \(r = 0.92\)) proves causation. Students often conclude 'X causes Y' simply because they are highly correlated, ignoring the possibility of confounding variables or reverse causation.
- ⚠️ Failing to identify confounding variables in observational studies. For example, concluding that ice cream consumption causes drowning deaths without recognizing that warm weather is a confounding variable that increases both.
- ⚠️ Misinterpreting the coefficient of determination \(R^2\) as proof of causation. A high \(R^2\) value means the regression model fits the data well, but it does not establish a causal relationship.
- ⚠️ Overlooking reverse causation. Students may assume Variable A causes Variable B when the data actually shows Variable B causes Variable A (e.g., does depression cause poor sleep, or does poor sleep cause depression?).
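One standard way to probe the confounding pitfall is to regress the confounder out of both variables and correlate the residuals (a partial correlation). The sketch below (assuming NumPy; the `residualize` helper and simulated data are illustrative, not from the source) shows the ice-cream-style correlation collapsing once the "temperature" confounder is controlled for:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
c = rng.normal(size=n)                   # confounder (e.g., temperature)
a = c + rng.normal(scale=0.5, size=n)    # e.g., ice cream consumption
b = c + rng.normal(scale=0.5, size=n)    # e.g., drowning deaths

def residualize(v, c):
    """Remove the least-squares linear effect of c from v."""
    slope = np.cov(v, c, ddof=0)[0, 1] / np.var(c)
    return v - slope * c

r_raw = np.corrcoef(a, b)[0, 1]
r_partial = np.corrcoef(residualize(a, c), residualize(b, c))[0, 1]
# r_raw is strong; r_partial is near zero once C is controlled
```

This only works when the confounder is measured; unmeasured confounders are exactly why observational data alone cannot establish causation.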