1. Definition
- Group Sequential Testing is a statistical method that allows researchers to analyze data at several points (interim looks) during an experiment, before the final sample size is reached.
- At each interim analysis, you can decide whether to:
  - Stop early for efficacy (the treatment effect is clearly established),
  - Stop early for futility (the trial is unlikely to show a benefit even if it runs to completion), or
  - Continue until the next look or the final horizon.
It is widely used in clinical trials and increasingly in A/B testing where early stopping can save resources.
2. Why Not Just Peek?
- If you peek at the data repeatedly and test at the full α each time, you inflate the Type I error rate (false positives).
- Example: With α = 0.05 per look and no adjustment, the chance of at least one “significant” result when there is no true effect is about 14% with 5 equally spaced looks, about 19% with 10, and above 30% with 50.
Group sequential designs solve this problem by using α-spending rules that control the overall Type I Error across multiple looks.
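
The inflation from unadjusted peeking can be checked with a small Monte Carlo simulation. The sketch below is illustrative only: it assumes normally distributed observations with known variance 1, no true effect, equally sized batches, and a two-sided z-test at every look. The function name and its parameters are made up for this example.

```python
import math
import random
from statistics import NormalDist


def peeking_false_positive_rate(n_looks, n_per_look=100, n_sims=20000,
                                alpha=0.05, seed=0):
    """Estimate P(at least one look has |z| > cutoff) when the null is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # unadjusted two-sided cutoff
    batch_sd = math.sqrt(n_per_look)
    hits = 0
    for _sim in range(n_sims):
        running_sum, n = 0.0, 0
        for _look in range(n_looks):
            # The sum of n_per_look independent N(0, 1) draws is N(0, n_per_look),
            # so a single draw stands in for a whole batch of null observations.
            running_sum += rng.gauss(0.0, batch_sd)
            n += n_per_look
            z = running_sum / math.sqrt(n)  # z-statistic on all data so far
            if abs(z) > z_crit:
                hits += 1  # a naive analyst would stop and declare a win here
                break
    return hits / n_sims


if __name__ == "__main__":
    for looks in (1, 5, 10):
        print(looks, "looks:", round(peeking_false_positive_rate(looks), 3))
```

With a single look the estimate should sit near the nominal 5%. It climbs steadily as looks are added, even though nothing about the data has changed.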
3. How It Works
- Plan in advance how many interim analyses (“groups”) you will have.
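- Use an α-spending function to allocate significance thresholds across interim looks.
  - With conservative rules such as O’Brien-Fleming, early looks require much stricter significance cutoffs (e.g., p < 0.001 or smaller).
  - Later looks are more lenient, and the final threshold stays close to the overall α.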
- Stop early if results cross thresholds.
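
The steps above amount to a simple decision rule applied at each look. The sketch below assumes a two-sided test summarized by a z-statistic and a pre-computed list of efficacy boundaries. The boundary values are approximately those of a four-look O'Brien-Fleming design, and the futility bounds are hypothetical placeholders, not values from any standard design.

```python
def interim_decision(z, look, efficacy_bounds, futility_bounds=None):
    """Return the action for a given look (0-based index).

    efficacy_bounds: one z cutoff per look, including the final analysis.
    futility_bounds: optional z cutoffs for the interim looks only
                     (hypothetical, non-binding in this sketch).
    """
    is_final = look == len(efficacy_bounds) - 1
    if abs(z) >= efficacy_bounds[look]:
        return "stop_efficacy"
    if is_final:
        return "no_effect_detected"
    if futility_bounds is not None and abs(z) <= futility_bounds[look]:
        return "stop_futility"
    return "continue"


# Approximate O'Brien-Fleming z boundaries for 4 equally spaced looks.
EFFICACY = [4.05, 2.86, 2.34, 2.02]
# Illustrative futility cutoffs for the three interim looks (assumed values).
FUTILITY = [0.0, 0.3, 0.8]

if __name__ == "__main__":
    print(interim_decision(1.2, 0, EFFICACY, FUTILITY))  # early look, modest z
    print(interim_decision(3.0, 1, EFFICACY, FUTILITY))  # crosses look-2 bound
```

Keeping the boundaries as plain data separate from the decision function makes it easy to fix them in the analysis plan before any data arrive.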
4. Common α-Spending Rules
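- O’Brien-Fleming: Very strict early (tiny nominal α), more lenient later; the final threshold is only slightly below the overall α.
- Pocock: The same nominal significance level at every look (e.g., about 0.018 at each of 4 looks for an overall two-sided α = 0.05), so it is easier to stop early but the final threshold is much stricter than α.
- Lan-DeMets: A spending-function approach in which cumulative α is spent as a function of the information fraction. The number and timing of looks need not be fixed exactly in advance, and the function can be chosen to approximate O’Brien-Fleming or Pocock boundaries.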
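
To make the spending idea concrete, the sketch below evaluates the two standard Lan-DeMets spending functions at a given information fraction t (the share of the planned sample observed so far). It returns the cumulative α spent by time t, not the nominal per-look threshold; turning spent α into exact boundaries needs numerical integration over the joint distribution of the test statistics, which is not shown here. The function names are chosen for this example.

```python
import math
from statistics import NormalDist

_NORM = NormalDist()


def obf_type_spending(t, alpha=0.05):
    """O'Brien-Fleming-type: alpha(t) = 2 - 2 * Phi(z_{alpha/2} / sqrt(t))."""
    z = _NORM.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _NORM.cdf(z / math.sqrt(t)))


def pocock_type_spending(t, alpha=0.05):
    """Pocock-type: alpha(t) = alpha * ln(1 + (e - 1) * t)."""
    return alpha * math.log(1 + (math.e - 1) * t)


if __name__ == "__main__":
    for t in (0.25, 0.5, 0.75, 1.0):
        print(f"t={t:.2f}  OBF-type={obf_type_spending(t):.5f}  "
              f"Pocock-type={pocock_type_spending(t):.5f}")
```

Both functions spend the full α at t = 1. The O'Brien-Fleming-type function holds almost all of it back until late in the experiment, while the Pocock-type function spends it much more evenly.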
5. Example – Clinical Trial
- Testing new drug vs placebo.
- Planned sample size = 1,000 patients.
- Interim analyses every 250 patients.
- α = 0.05 total (5%).
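- O’Brien-Fleming rule (two-sided nominal p-value thresholds for four equally spaced looks):
  - Look 1 (250 patients): p < 0.00005
  - Look 2 (500 patients): p < 0.0042
  - Look 3 (750 patients): p < 0.0194
  - Final (1000 patients): p < 0.0430
The nominal thresholds do not add up to 0.05 because the test statistics at successive looks are correlated; the boundaries are chosen so that the overall chance of crossing any of them under the null is 5%.
If the p-value at 500 patients is 0.003, it is below the look-2 threshold of 0.0042 → stop early, conclude efficacy. A p-value of 0.008 at that look would not be enough, and the trial would continue to 750 patients.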
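
As a check on any hand-written schedule, the classical O'Brien-Fleming thresholds can be computed directly. In that design the z boundary at look k of K is C·√(K/k), where C is a tabulated constant. The sketch below assumes C ≈ 2.024, the published value for K = 4 equally spaced looks at two-sided α = 0.05; other designs need a different constant, and the function name is chosen for this example.

```python
import math
from statistics import NormalDist


def obrien_fleming_thresholds(n_looks=4, final_z=2.024):
    """Return (look, z boundary, two-sided nominal p) for each look.

    final_z is the tabulated constant for the chosen design. The default
    is only valid for 4 equally spaced looks at two-sided alpha = 0.05.
    """
    norm = NormalDist()
    rows = []
    for k in range(1, n_looks + 1):
        z_k = final_z * math.sqrt(n_looks / k)  # boundary shrinks as data accrue
        p_k = 2 * (1 - norm.cdf(z_k))           # equivalent nominal p-value
        rows.append((k, z_k, p_k))
    return rows


if __name__ == "__main__":
    for look, z_k, p_k in obrien_fleming_thresholds():
        print(f"look {look}: |z| >= {z_k:.3f}  (nominal p < {p_k:.5f})")
```

The first-look threshold is far below 0.001, which is why an O'Brien-Fleming design rarely stops at the first look unless the effect is very large.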
6. Application to A/B Testing
- Instead of waiting until the fixed horizon, you can:
  - Check results at pre-planned intervals (e.g., every 10k visitors).
  - Stop the test early if one variant is clearly better (efficacy) or clearly not worth continuing (futility).
- Saves traffic and time while keeping false positive rates under control.
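
A minimal sketch of how this might look for a conversion-rate A/B test: a pooled two-proportion z-test is run on the cumulative counts at each pre-planned look and compared with that look's boundary. The visitor and conversion counts are made-up illustration data, and the boundaries are approximate four-look O'Brien-Fleming values; none of this comes from a real experiment.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-statistic for B versus A."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # no variation yet, so no evidence either way
    return (conv_b / n_b - conv_a / n_a) / se


def run_sequential_test(looks, z_bounds):
    """Walk through the looks in order; return (stopping look or None, last z)."""
    z = 0.0
    for i, (conv_a, n_a, conv_b, n_b) in enumerate(looks):
        z = two_proportion_z(conv_a, n_a, conv_b, n_b)
        if abs(z) >= z_bounds[i]:
            return i + 1, z  # 1-based look at which the boundary was crossed
    return None, z


# Cumulative (conversions_A, visitors_A, conversions_B, visitors_B) per look,
# i.e. one look per 10k total visitors. Hypothetical numbers.
LOOKS = [
    (500, 5000, 540, 5000),
    (1000, 10000, 1130, 10000),
    (1500, 15000, 1690, 15000),
    (2000, 20000, 2250, 20000),
]
Z_BOUNDS = [4.05, 2.86, 2.34, 2.02]  # approximate O'Brien-Fleming, 4 looks

if __name__ == "__main__":
    stopped_at, z = run_sequential_test(LOOKS, Z_BOUNDS)
    print("stopped at look:", stopped_at, "z =", round(z, 2))
```

In this made-up data the difference at the first look is too small to cross the strict early boundary, while the second look crosses its boundary and the test ends with half the planned traffic.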
7. Comparison with Traditional & Adaptive Tests
| Method | Stopping Rule | Type I Error Control | Efficiency |
|---|---|---|---|
| Traditional A/B (Fixed-Horizon) | Stop only at end | Yes | May waste resources |
| Naive Peeking | Stop anytime p < α | No (inflates false positives) | Risky |
| Group Sequential Testing | Pre-planned interim looks | Yes (via α-spending) | More efficient |
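| Adaptive/Bandit Methods | Continuous adjustment of traffic allocation | Not a frequentist guarantee by default (typically Bayesian or regret bounds) | Often most efficient at limiting exposure to the worse variant |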
8. Key Takeaways
- Group Sequential Testing = preplanned multiple analyses of accumulating data.
- Uses α-spending to control false positives.
- Allows early stopping (saves time, money, traffic).
- Standard in clinical trials, useful in A/B testing with resource constraints.
In short:
Group Sequential Testing is a method that allows early looks at data with controlled error rates, using α-spending rules like O’Brien-Fleming or Pocock. It’s more efficient than traditional fixed-horizon testing, but requires careful planning.
