Definition
IID stands for Independent and Identically Distributed, a fundamental assumption in statistics and machine learning.
- Independent → each data point does not depend on any other.
- Identically Distributed → all data points come from the same probability distribution.
In short:
$X_1, X_2, \dots, X_n \overset{\text{i.i.d.}}{\sim} P(X)$
Breakdown
- Independence
- Knowing one sample gives no information about another.
- Example: flipping a fair coin 10 times → each flip is independent.
- Identical Distribution
- All samples follow the same distribution (same mean, variance, etc.).
- Example: every coin flip has the same probability $P(H) = 0.5, P(T) = 0.5$.
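The coin-flip example above can be simulated directly. This is a minimal sketch using only the standard library: each call to `random.random()` is independent of the others and draws from the same uniform distribution, so the flips are IID by construction.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Each flip is independent and has the same distribution: P(H) = P(T) = 0.5.
flips = ["H" if random.random() < 0.5 else "T" for _ in range(10)]
print(flips)

# With many IID flips, the empirical frequency of heads approaches 0.5.
n = 100_000
heads = sum(random.random() < 0.5 for _ in range(n))
print(heads / n)  # close to 0.5
```

Note that independence here comes from the generator: no flip's outcome is computed from any previous flip.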
Examples
- IID data (good case)
- Random sampling from a population (e.g., survey respondents chosen randomly).
- Non-IID data (bad case)
- Time series: today’s stock price depends on yesterday’s → not independent.
- Changing population: early customers vs. late customers may have different distributions → not identically distributed.
- Grouped data: multiple entries from the same patient → correlated, not independent.
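The time-series case can be made concrete by comparing lag-1 correlation. Below is a small sketch (standard library only, hypothetical helper `lag1_corr`) contrasting IID Gaussian noise with an AR(1) series, where each value is built from the previous one, mimicking the "today depends on yesterday" behavior of prices.

```python
import random

random.seed(1)

def lag1_corr(xs):
    """Sample correlation between consecutive observations x[t] and x[t+1]."""
    n = len(xs) - 1
    a, b = xs[:-1], xs[1:]
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / (va * vb) ** 0.5

# IID samples: consecutive values are uncorrelated.
iid = [random.gauss(0, 1) for _ in range(5000)]

# AR(1) series: each value depends on the previous one -> not independent.
ar = [0.0]
for _ in range(4999):
    ar.append(0.9 * ar[-1] + random.gauss(0, 1))

print(round(lag1_corr(iid), 2))  # near 0
print(round(lag1_corr(ar), 2))   # near 0.9
```

A strong lag-1 correlation is direct evidence that the independence assumption fails for the series.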
Why It Matters
- Many algorithms and procedures (linear regression, logistic regression, hypothesis tests, standard neural-network training) assume IID samples.
- IID assumption makes probability theory simpler (law of large numbers, central limit theorem).
- Violations (non-IID data) → biased estimates, overconfident predictions.
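The law of large numbers mentioned above is easy to see empirically: under IID sampling, the sample mean converges to the true mean as the sample grows. A minimal sketch with a fair six-sided die (true mean $3.5$):

```python
import random

random.seed(2)

# Law of large numbers: the sample mean of IID draws converges to the true mean.
true_mean = 3.5  # mean of a fair six-sided die: (1 + 2 + ... + 6) / 6
for n in (10, 1_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, sum(rolls) / n)  # approaches 3.5 as n grows
```

This convergence guarantee is exactly what breaks down when samples are correlated or drawn from shifting distributions.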
In Machine Learning
- Training Data: Often assumed IID (each sample independent, same distribution).
- Reality: Often violated (autocorrelation in time series, drift in production, data leakage).
- Handling non-IID requires special methods (time-series CV, grouped CV, domain adaptation).
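Time-series CV from the list above can be sketched in a few lines. This is a simplified forward-chaining splitter (analogous in spirit to scikit-learn's `TimeSeriesSplit`; the function name and fold sizing here are illustrative assumptions): training indices always precede test indices in time, so the model never peeks at the future.

```python
def time_series_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with each test fold strictly
    after its training fold, preserving temporal order."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = list(range(0, k * fold))            # everything up to the cutoff
        test = list(range(k * fold, (k + 1) * fold))  # the next block in time
        yield train, test

for train, test in time_series_splits(10, 3):
    print("train:", train, "test:", test)
# train: [0, 1] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
# train: [0, 1, 2, 3, 4, 5] test: [6, 7]
```

Contrast this with ordinary random K-fold CV, which shuffles indices and is only justified when the samples are IID.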
Summary
IID = Independent and Identically Distributed.
- Independent = samples carry no information about one another (a stronger condition than mere zero correlation).
- Identically distributed = same probability distribution.
- Assumption makes math/statistics tractable, but often violated in real-world data (time series, grouped data, distribution shifts).
