1. Definition

  • Group Sequential Testing is a statistical method that allows researchers to analyze data at several points (interim looks) during an experiment, before the final sample size is reached.
  • At each interim analysis, you can decide whether to:
    1. Stop early for efficacy (treatment clearly works),
    2. Stop early for futility (treatment clearly doesn’t work), or
    3. Continue until the next look or final horizon.

It is widely used in clinical trials and increasingly in A/B testing where early stopping can save resources.


2. Why Not Just Peek?

  • If you peek at the data repeatedly without adjustment, you inflate the risk of Type I Error (false positives).
  • Example: Testing at α = 0.05 on every look, the chance of at least one “significant” result purely by chance rises to roughly 14% with 5 equally spaced looks and roughly 25% with 20 looks.

Group sequential designs solve this problem by using α-spending rules that control the overall Type I Error across multiple looks.
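
The inflation from unadjusted peeking can be checked with a small simulation. The sketch below is illustrative only: it assumes a two-sided z-test with known variance, equally spaced looks, and no true effect, and the function name `peeking_false_positive_rate` is made up for this example.

```python
import random
from statistics import NormalDist

def peeking_false_positive_rate(n_looks, alpha=0.05, n_sims=20000, seed=0):
    """Estimate the chance of at least one 'significant' look under the null.

    Assumes a z-test with known variance and equally spaced looks: each
    look adds one independent N(0, 1) increment to the running sum.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # unadjusted two-sided cutoff
    false_positives = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for k in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / k ** 0.5  # z-statistic after k equal-sized groups
            if abs(z) > z_crit:
                false_positives += 1  # a naive peeker would stop here
                break
    return false_positives / n_sims

if __name__ == "__main__":
    for looks in (1, 5, 20):
        print(looks, "looks:", round(peeking_false_positive_rate(looks), 3))
```

With a single look the estimate should sit near the nominal 5%; adding looks pushes it well above that.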


3. How It Works

  1. Plan in advance how many interim analyses (“groups”) you will have.
  2. Use an α-spending function to allocate significance thresholds across interim looks.
    • Early looks require stricter significance cutoffs (e.g., p < 0.001).
    • Later looks are more lenient.
  3. Stop early if results cross thresholds.
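
The three steps above can be sketched as a simple monitoring loop. This is a minimal illustration, not a full design tool: the function name `monitor` is hypothetical, only efficacy boundaries are checked, and the z-boundaries used in the usage lines are the classical O'Brien-Fleming values for four equally spaced looks (two-sided α = 0.05) as given in standard tables.

```python
def monitor(z_by_look, z_bounds):
    """Check observed z-statistics against pre-planned boundaries, in order.

    z_by_look: z-statistics observed so far, one per completed look.
    z_bounds:  pre-planned efficacy boundary for every planned look.
    Returns (look_number, decision).
    """
    if len(z_by_look) > len(z_bounds):
        raise ValueError("more looks observed than were planned")
    for look, (z, bound) in enumerate(zip(z_by_look, z_bounds), start=1):
        if abs(z) >= bound:
            return look, "stop"  # boundary crossed at this look
    if len(z_by_look) == len(z_bounds):
        return len(z_by_look), "end"  # final look reached, never crossed
    return len(z_by_look), "continue"

# Assumed boundaries: strict at the first look, close to 1.96 at the last.
OBF_BOUNDS = [4.049, 2.863, 2.337, 2.024]

if __name__ == "__main__":
    print(monitor([1.2, 2.1], OBF_BOUNDS))       # hypothetical z-values
    print(monitor([1.2, 2.1, 2.5], OBF_BOUNDS))  # hypothetical z-values
```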

4. Common α-Spending Rules

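  • O’Brien-Fleming: Very strict early (tiny α), more lenient later; the final threshold ends up close to the unadjusted α.
  • Pocock: The same nominal significance level at each look (moderately strict throughout, so the final threshold is noticeably stricter than the unadjusted α).
  • Lan-DeMets: A flexible α-spending function approach in which the number and timing of looks need not be fixed in advance; its spending functions can be chosen to approximate O’Brien-Fleming or Pocock boundaries.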
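
To make the difference between the rules concrete, here is a sketch of the two commonly used Lan-DeMets spending functions: the O'Brien-Fleming-type function α*(t) = 2(1 − Φ(z_{α/2} / √t)) and the Pocock-type function α*(t) = α · ln(1 + (e − 1)t), where t is the fraction of the planned information (sample) observed so far. The function names are chosen for this example. Note that these return the cumulative α spent by time t, not the per-look p-value thresholds; turning cumulative spend into thresholds requires numerical integration, which dedicated packages handle.

```python
import math
from statistics import NormalDist

_NORM = NormalDist()

def obf_type_spending(t, alpha=0.05):
    """Cumulative two-sided alpha spent by information fraction t (0 < t <= 1)."""
    z = _NORM.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _NORM.cdf(z / math.sqrt(t)))

def pocock_type_spending(t, alpha=0.05):
    """Cumulative alpha spent by information fraction t (0 < t <= 1)."""
    return alpha * math.log(1 + (math.e - 1) * t)

if __name__ == "__main__":
    for t in (0.25, 0.5, 0.75, 1.0):
        print(f"t={t:.2f}  OBF-type={obf_type_spending(t):.5f}  "
              f"Pocock-type={pocock_type_spending(t):.5f}")
```

Both functions spend the full α by t = 1; the O'Brien-Fleming-type function spends almost nothing early, while the Pocock-type function spends it far more evenly.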

5. Example – Clinical Trial

  • Testing new drug vs placebo.
  • Planned sample size = 1,000 patients.
  • Interim analyses every 250 patients.
  • α = 0.05 total (5%, two-sided).
  • O’Brien-Fleming rule (approximate nominal p-value thresholds for four equally spaced looks):
    • Look 1 (250 patients): p < 0.00005
    • Look 2 (500 patients): p < 0.004
    • Look 3 (750 patients): p < 0.019
    • Final (1000 patients): p < 0.043

If the p-value at 500 patients = 0.003 → stop early, conclude efficacy.
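
For reference, the classical O'Brien-Fleming thresholds for this kind of design can be computed as follows. This is a sketch under stated assumptions: four equally spaced looks, a two-sided test, and a final-look critical value of 2.024 taken from standard tables for overall α = 0.05. Boundaries at earlier looks scale as z_k = 2.024 · √(K/k). Exact values depend on the design (one- or two-sided, classical or spending-function version), so they can differ from rounded illustrative figures.

```python
from statistics import NormalDist

def obf_nominal_thresholds(n_looks=4, final_z=2.024):
    """Classical O'Brien-Fleming z-boundaries and two-sided nominal p-values.

    final_z is the critical value at the last look; 2.024 is the tabulated
    constant for 4 equally spaced looks and overall two-sided alpha = 0.05.
    """
    norm = NormalDist()
    rows = []
    for k in range(1, n_looks + 1):
        z_bound = final_z * (n_looks / k) ** 0.5    # stricter at early looks
        p_threshold = 2 * (1 - norm.cdf(z_bound))   # two-sided nominal p-value
        rows.append((k, z_bound, p_threshold))
    return rows

if __name__ == "__main__":
    for look, z_bound, p_threshold in obf_nominal_thresholds():
        print(f"Look {look}: |z| >= {z_bound:.3f}  (p < {p_threshold:.5f})")
```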


6. Application to A/B Testing

  • Instead of waiting until the fixed horizon, you can:
    • Check results at pre-planned intervals (e.g., every 10k visitors).
    • Stop the test early if one variant is clearly better (efficacy) or clearly not worth continuing (futility).
  • Saves traffic and time while keeping false positive rates under control.
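
In an A/B test, an interim look might be evaluated roughly as below. This is a sketch, not a production implementation: it assumes a pooled two-proportion z-test, checks only the efficacy boundary (futility rules are omitted), and uses made-up conversion counts. The boundary of 2.863 in the usage lines is the classical O'Brien-Fleming value for the second of four equally spaced looks at two-sided α = 0.05; in practice it would come from whatever design was planned before the test started.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-statistic for variant B versus variant A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

def interim_check(conv_a, n_a, conv_b, n_b, z_bound):
    """Return (z, crossed) for one pre-planned look with boundary z_bound."""
    z = two_proportion_z(conv_a, n_a, conv_b, n_b)
    return z, abs(z) >= z_bound

if __name__ == "__main__":
    # Hypothetical second look: 10,000 visitors per variant.
    z, crossed = interim_check(500, 10_000, 600, 10_000, z_bound=2.863)
    print(f"z = {z:.2f}, stop early: {crossed}")
```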

7. Comparison with Traditional & Adaptive Tests

| Method | Stopping Rule | Type I Error Control | Efficiency |
| --- | --- | --- | --- |
| Traditional A/B (Fixed-Horizon) | Stop only at end | Yes | May waste resources |
| Naive Peeking | Stop anytime p < α | No (inflates false positives) | Risky |
| Group Sequential Testing | Pre-planned interim looks | Yes (via α-spending) | More efficient |
| Adaptive/Bandit Methods | Continuous adjustment | Different (Bayesian or regret bounds) | Most efficient |

8. Key Takeaways

  • Group Sequential Testing = preplanned multiple analyses of accumulating data.
  • Uses α-spending to control false positives.
  • Allows early stopping (saves time, money, traffic).
  • Standard in clinical trials, useful in A/B testing with resource constraints.

In short:
Group Sequential Testing is a method that allows early looks at data with controlled error rates, using α-spending rules like O’Brien-Fleming or Pocock. It can reach a decision sooner than traditional fixed-horizon testing, but the number of looks and the stopping boundaries must be planned before the experiment starts.