Definition

IID stands for Independent and Identically Distributed, a fundamental assumption in statistics and machine learning.

  • Independent → each data point does not depend on any other.
  • Identically Distributed → all data points come from the same probability distribution.

In short:

$X_1, X_2, \dots, X_n \overset{\text{i.i.d.}}{\sim} P$


Breakdown

  1. Independence
    • Knowing one sample gives no information about another.
    • Example: flipping a fair coin 10 times → each flip is independent.
  2. Identical Distribution
    • All samples follow the same distribution (same mean, variance, etc.).
    • Example: every coin flip has the same probability $P(H) = 0.5, P(T) = 0.5$.
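
Both properties can be checked empirically on simulated flips. The following is a minimal sketch using only Python's standard library; the seed, the sample size, and the variable names are illustrative assumptions, not part of the definition.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible
flips = [rng.random() < 0.5 for _ in range(20_000)]  # True = heads

# Identical distribution: the heads rate should look the same in any slice.
first_half = sum(flips[:10_000]) / 10_000
second_half = sum(flips[10_000:]) / 10_000

# Independence: the previous flip should not change the odds of the next one.
pairs = list(zip(flips, flips[1:]))
after_heads = [nxt for prev, nxt in pairs if prev]
after_tails = [nxt for prev, nxt in pairs if not prev]
p_heads_after_heads = sum(after_heads) / len(after_heads)
p_heads_after_tails = sum(after_tails) / len(after_tails)

print(f"heads rate, first half:   {first_half:.3f}")
print(f"heads rate, second half:  {second_half:.3f}")
print(f"P(heads | previous heads): {p_heads_after_heads:.3f}")
print(f"P(heads | previous tails): {p_heads_after_tails:.3f}")
```

All four numbers should land close to 0.5. A large gap between the two halves would suggest the distribution changed; a large gap between the two conditional rates would suggest the flips are not independent.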

Examples

  • IID data (good case)
    • Simple random sampling from a large population (e.g., survey respondents chosen randomly).
  • Non-IID data (bad case)
    • Time series: today’s stock price depends on yesterday’s → not independent.
    • Changing population: early customers vs. late customers may have different distributions → not identically distributed.
    • Grouped data: multiple entries from the same patient → correlated, not independent.
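
The time-series case can be illustrated with a toy random walk. In the sketch below the daily changes are drawn IID, but the price levels built from them are strongly correlated from one day to the next. The random-walk model, the starting level of 100, and the helper `lag1_correlation` are assumptions made for illustration, not a model of real stock prices.

```python
import random

def lag1_correlation(xs):
    """Sample correlation between each value and the value that follows it."""
    a, b = xs[:-1], xs[1:]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(1)  # fixed seed so the run is reproducible
steps = [rng.gauss(0, 1) for _ in range(5000)]  # IID daily changes

# Each price is yesterday's price plus today's change.
prices = []
level = 100.0
for step in steps:
    level += step
    prices.append(level)

steps_corr = lag1_correlation(steps)    # expected to be near 0
prices_corr = lag1_correlation(prices)  # expected to be near 1
print(f"lag-1 correlation of changes: {steps_corr:.3f}")
print(f"lag-1 correlation of prices:  {prices_corr:.3f}")
```

A lag-1 correlation near zero does not prove independence, but a value near one is clear evidence against it.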

Why It Matters

  • Many methods (linear regression, logistic regression, classical hypothesis tests, standard neural network training) assume IID data.
  • The IID assumption simplifies the theory: the classical law of large numbers and central limit theorem are stated for IID samples.
  • Violations (non-IID data) → biased estimates, standard errors that are too small, overconfident predictions.
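
To see where the overconfidence comes from, compare the variance of the sample mean with and without independence. The sketch below assumes, purely for illustration, that every sample has variance $\sigma^2$ and that every pair of samples has the same correlation $\rho$; these symbols are introduced here and are not part of the definition above.

$\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}$ (IID samples)

$\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}\,\bigl[1 + (n-1)\rho\bigr]$ (equal pairwise correlation $\rho$)

With $n = 100$ and $\rho = 0.1$ the bracket equals 10.9, so the true standard error is about 3.3 times what the IID formula reports. A confidence interval computed under the IID assumption would be too narrow by that factor.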

In Machine Learning

  • Training Data: Often assumed IID (each sample independent, same distribution).
  • Reality: Often violated (autocorrelation in time series, drift in production, leakage when correlated samples land in both train and test sets).
  • Handling non-IID data requires special methods (time-series CV, grouped CV, domain adaptation).
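
The two cross-validation variants can be sketched by hand. The splitters below are a simplified stdlib illustration, and `grouped_folds`, `forward_chaining_folds`, and the patient IDs are hypothetical names invented for this example. In practice, scikit-learn's `GroupKFold` and `TimeSeriesSplit` provide the same ideas.

```python
from collections import defaultdict

def grouped_folds(groups, n_folds):
    """Yield (train_idx, test_idx) so that no group appears on both sides."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    group_ids = sorted(by_group)
    for k in range(n_folds):
        # Every n_folds-th group goes to the test side of fold k.
        test_groups = set(group_ids[k::n_folds])
        test = [i for g in group_ids if g in test_groups for i in by_group[g]]
        train = [i for g in group_ids if g not in test_groups for i in by_group[g]]
        yield train, test

def forward_chaining_folds(n_samples, n_folds):
    """Yield (train_idx, test_idx) where training data always precedes test data."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = list(range(0, k * fold_size))
        test = list(range(k * fold_size, min((k + 1) * fold_size, n_samples)))
        yield train, test

patients = ["a", "a", "b", "b", "b", "c", "d", "d"]  # hypothetical patient IDs
group_splits = list(grouped_folds(patients, n_folds=2))
time_splits = list(forward_chaining_folds(n_samples=10, n_folds=4))

for train, test in group_splits:
    print("grouped  train:", train, "test:", test)
for train, test in time_splits:
    print("temporal train:", train, "test:", test)
```

The grouped splitter keeps all rows from one patient on the same side of the split, which prevents correlated rows from leaking into the test set. The forward-chaining splitter never trains on data that comes after the test window, which mirrors how a time-series model is used in production.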

Summary
IID = Independent and Identically Distributed.

  • Independent = no sample carries information about another (a stronger condition than zero correlation).
  • Identically distributed = same probability distribution.
  • The assumption makes the math tractable, but it is often violated in real-world data (time series, grouped data, distribution shifts).