3.6.3 Statistical Measures and Distributions

Analyzing measures of central tendency and spread (mean, median, mode, standard deviation) and understanding normal distributions.

Definition

Statistical measures and distributions are fundamental tools for analyzing and interpreting data. Measures of central tendency describe the center of a dataset: the mean (\(\bar{x}\)) is the arithmetic average of all values; the median is the middle value of the ordered data (the average of the two middle values when the number of values is even); and the mode is the most frequently occurring value. Measures of spread describe how widely the data are dispersed: the range is the difference between the maximum and minimum values; the variance (\(s^2\) for a sample, \(\sigma^2\) for a population) is the average squared deviation from the mean; and the standard deviation (\(s\) or \(\sigma\)) is the square root of the variance, which expresses spread in the original units. A normal (Gaussian) distribution is a symmetric, bell-shaped probability distribution characterized by its mean (\(\mu\)) and standard deviation (\(\sigma\)). Under it, approximately 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three (the empirical rule). The z-score expresses a data point as the number of standard deviations it lies from the mean, which allows comparison across different datasets.
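
The measures of center and range defined above can be computed directly. The following is a minimal sketch using Python's standard `statistics` module; the `scores` list is a hypothetical dataset chosen only for illustration.

```python
import statistics

# Hypothetical dataset, chosen only for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(scores)          # arithmetic average: sum / count
median = statistics.median(scores)      # even count, so the two middle values are averaged
mode = statistics.mode(scores)          # most frequently occurring value
data_range = max(scores) - min(scores)  # maximum minus minimum

print(f"mean={mean}, median={median}, mode={mode}, range={data_range}")
```

For this list the mean is 5, the median is 4.5 (the average of 4 and 5), the mode is 4, and the range is 7.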

Key Formulas

  • \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\)
  • \(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2\) (population variance) or \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\) (sample variance)
  • \(\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}\) (population standard deviation) or \(s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}\) (sample standard deviation)
  • \(z = \frac{x - \mu}{\sigma}\) (z-score)
  • \(P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 0.68; P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.95; P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 0.997\) (empirical rule)
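
As a rough illustration of the formulas above, the sketch below computes the variance and standard deviation with both denominators, a z-score, and the empirical-rule probabilities. The dataset and the helper names `z_score` and `prob_within` are hypothetical; `NormalDist` is from Python's standard `statistics` module (Python 3.8+).

```python
import math
from statistics import NormalDist

# Hypothetical dataset, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations from the mean, shared by both variance formulas.
ss = sum((x - x_bar) ** 2 for x in data)

pop_variance = ss / n            # divide by n: data treated as the whole population
sample_variance = ss / (n - 1)   # divide by n - 1: data treated as a sample
pop_sd = math.sqrt(pop_variance)
sample_sd = math.sqrt(sample_variance)


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


# z-score of the value 9, treating the data as the whole population.
z = z_score(9, x_bar, pop_sd)


def prob_within(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for a normal distribution."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.4f}")
```

The three probabilities come out close to 0.68, 0.95, and 0.997, which is where the empirical rule's rounded percentages come from.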

Common Pitfalls

  • ⚠️ Confusing population standard deviation (\(\sigma\)) with sample standard deviation (\(s\)): using \(n\) instead of \(n-1\) in the denominator when calculating sample variance/standard deviation, which leads to underestimating variability in a sample
  • ⚠️ Misinterpreting the meaning of standard deviation: thinking it represents the average deviation from the mean (which would be mean absolute deviation) rather than the square root of average squared deviations
  • ⚠️ Incorrectly applying the empirical rule: assuming that 68%, 95%, and 99.7% of the data fall within 1, 2, and 3 standard deviations for any distribution, when the rule holds (and only approximately) for normal distributions
  • ⚠️ Making errors with z-scores: forgetting to subtract the mean before dividing by standard deviation, or using the wrong mean/standard deviation values when comparing data from different populations
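
Some of the pitfalls above are easier to see with numbers. The sketch below is only an illustration under made-up data: the lists `data` and `skewed` and the exam means and standard deviations are all hypothetical.

```python
import statistics

# Hypothetical dataset, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(data)

# Mean absolute deviation and standard deviation are different quantities.
mad = sum(abs(x - mean) for x in data) / len(data)
sd = statistics.pstdev(data)
print(f"mean absolute deviation={mad}, standard deviation={sd}")

# The empirical rule can fail badly on non-normal (here, heavily skewed) data.
skewed = [1] * 9 + [100]
mu = statistics.mean(skewed)
sigma = statistics.pstdev(skewed)
inside = [x for x in skewed if mu - sigma <= x <= mu + sigma]
share_within_1sd = len(inside) / len(skewed)
print(f"share within one sd: {share_within_1sd}")

# When comparing across datasets, each z-score must use the mean and
# standard deviation of the dataset that the value came from.
z_a = (85 - 70) / 10   # score 85 on exam A (assumed mean 70, sd 10)
z_b = (80 - 65) / 5    # score 80 on exam B (assumed mean 65, sd 5)
print(f"z on exam A={z_a}, z on exam B={z_b}")
```

Here the mean absolute deviation is 1.5 while the standard deviation is 2.0, 90% of the skewed data lies within one standard deviation rather than about 68%, and the lower raw score on exam B is the stronger result relative to its own group.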