3.3.5 Outliers and Data Cleaning

Identifying outliers using statistical methods and understanding their impact on summary statistics.

Definition

An outlier is a data point that lies far from the other observations in a dataset. Outliers can be identified using statistical methods such as the Interquartile Range (IQR) method or z-score analysis. Under the IQR method, a data point is typically considered an outlier if it falls below \(Q_1 - 1.5 \times IQR\) or above \(Q_3 + 1.5 \times IQR\), where \(Q_1\) is the first quartile, \(Q_3\) is the third quartile, and \(IQR = Q_3 - Q_1\). Alternatively, using z-scores, a value is often flagged as an outlier if \(|z| > 3\) (or sometimes \(|z| > 2.5\)).

Outliers can result from measurement errors, data entry mistakes, or genuine extreme values. Understanding them is crucial because they can substantially distort summary statistics such as the mean, standard deviation, and correlation coefficients. Data cleaning involves identifying, investigating, and appropriately handling outliers: removing them, transforming them, or keeping them if they represent valid observations.
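The IQR rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `iqr_outliers` and the `scores` data are made up for the example, and the quartile convention (`method="inclusive"`) is an assumption, since textbooks and software compute quartiles in slightly different ways and may give slightly different fences.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 * IQR rule."""
    # Quartile conventions vary; method="inclusive" is an assumption for this sketch.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr  # lower fence: Q1 - 1.5 * IQR
    upper = q3 + 1.5 * iqr  # upper fence: Q3 + 1.5 * IQR
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers

# Hypothetical data: nine typical scores and one extreme value.
scores = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]
lower, upper, outliers = iqr_outliers(scores)
print(lower, upper, outliers)  # 9.375 24.375 [45]
```

Here \(Q_1 = 15\), \(Q_3 = 18.75\), and \(IQR = 3.75\), so only the value 45 falls outside the fences.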

Key Formulas

  • \(\text{Outlier if } x < Q_1 - 1.5 \times IQR \text{ or } x > Q_3 + 1.5 \times IQR\)
  • \(IQR = Q_3 - Q_1\)
  • \(z = \frac{x - \bar{x}}{s}\)
  • \(\text{Outlier if } |z| > 3 \text{ (or } |z| > 2.5\text{)}\)
  • \(\text{Effect on mean: } \bar{x}_{\text{with outlier}} \neq \bar{x}_{\text{without outlier}}\) (the mean is pulled toward the outlier)
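The z-score formulas above can be tried out with the following sketch. The function name `z_score_outliers` and the data are hypothetical, and the sample standard deviation \(s\) is assumed to use \(n - 1\) in the denominator. The example also shows why the choice of threshold matters: the extreme value inflates both \(\bar{x}\) and \(s\), so in a small sample an obvious outlier may not reach \(|z| > 3\).

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose |z| exceeds the threshold."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [x for x in values if abs((x - mean) / s) > threshold]

# Hypothetical data: nine typical scores and one extreme value.
scores = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]
print(z_score_outliers(scores, threshold=3.0))  # []
print(z_score_outliers(scores, threshold=2.5))  # [45]
```

For this data the value 45 has \(z \approx 2.75\), so it is flagged at the 2.5 threshold but not at 3.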

Common Pitfalls

  • ⚠️ Automatically removing all outliers without investigating their cause—outliers may represent legitimate extreme values that should be retained for accurate analysis
  • ⚠️ Confusing the IQR method with the z-score method, or applying incorrect thresholds (e.g., using \(|z| > 1\) instead of \(|z| > 3\)) to identify outliers
  • ⚠️ Failing to recognize that outliers have a much larger effect on the mean than on the median, leading to incorrect conclusions about which summary statistic is more robust
  • ⚠️ Not considering the context of the data when deciding whether to remove outliers—in some cases (like income data), extreme values are valid and important for the analysis
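The difference in robustness between the mean and the median can be checked numerically. The sketch below uses made-up data and removes the extreme value only to compare the two statistics; it is not a recommendation to delete the point.

```python
import statistics

# Hypothetical data: nine typical scores and one extreme value.
scores = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]
cleaned = [x for x in scores if x != 45]  # dropped for comparison only

mean_shift = statistics.mean(scores) - statistics.mean(cleaned)
median_shift = statistics.median(scores) - statistics.median(cleaned)

print(f"mean shift:   {mean_shift:.2f}")    # mean shift:   2.88
print(f"median shift: {median_shift:.2f}")  # median shift: 0.50
```

A single extreme value moves the mean by almost 3 points but the median by only half a point, which is why the median is usually described as the more robust measure of center.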