1. Definition
- Bandit algorithms are a family of algorithms for solving the exploration–exploitation trade-off in decision-making.
- They are often called Multi-Armed Bandits (MAB) (like slot machines with multiple arms).
- Goal: Maximize cumulative reward over time by deciding which option (“arm”) to play, balancing:
- Exploration: Try different options to learn their payoff.
- Exploitation: Pick the best-known option so far to maximize reward.
2. Why It Matters
- In A/B testing: you split traffic evenly between options until significance is reached → can “waste” traffic on worse options.
- In Bandit algorithms: traffic allocation adapts dynamically → better options get more users earlier → less revenue loss.
- Used in: online ads, recommendations, pricing optimization, clinical trials, game design, etc.
3. The Core Problem (Exploration vs Exploitation)
- Example: 3 slot machines (arms).
- Arm A → 5% win rate.
- Arm B → 7% win rate.
- Arm C → 6% win rate.
- If you always pick B (best-known), you might miss discovering C is better long-term.
- If you keep exploring randomly, you waste reward.
Bandit algorithms balance both.
4. Popular Bandit Algorithms
1. ε-Greedy
- With probability ε (e.g., 10%), explore a random arm.
- With probability 1 − ε, exploit the current best arm.
- Simple but effective.
2. Upper Confidence Bound (UCB)
- Picks the option with the best upper confidence interval for expected reward.
- Formula balances mean reward and uncertainty.
- Favors exploration when uncertainty is high, but shifts to exploitation as data grows.
3. Thompson Sampling (Bayesian Bandit)
- Models each arm’s reward as a probability distribution (e.g., Beta distribution for Bernoulli rewards).
- Randomly samples from each posterior distribution → selects arm with highest sampled value.
- Naturally balances exploration & exploitation.
- Proven to be highly efficient and widely used in practice.
4. Softmax (Boltzmann Exploration)
- Probability of selecting an arm is proportional to its estimated reward.
- Better arms are more likely but not guaranteed.
- Controlled by a temperature parameter (balances randomness vs greediness).
5. Advantages Over A/B Testing
- A/B Testing:
- Static traffic split (e.g., 50/50).
- Wastes traffic on losing variants until significance.
- Bandits:
- Adaptive allocation → better variants get more users sooner.
- Maximizes cumulative reward during the experiment, not just after.
6. Example: Website Optimization
Suppose an e-commerce site tests 3 call-to-action buttons:
- A: “Buy Now”
- B: “Shop Today”
- C: “Get Yours”
- In an A/B/n test → equal 33/33/33 split until significance.
- With a bandit:
- Start equal.
- After some clicks, B shows higher conversion.
- Algorithm shifts more traffic to B.
- Eventually, maybe C outperforms B → algorithm adjusts again.
End result: more conversions during the test, not just after.
7. Limitations
- More complex to implement than A/B testing.
- Harder to explain to non-technical stakeholders (less “clean” significance reporting).
- Works best when:
- Rewards are immediate and measurable (clicks, conversions).
- Environment is relatively stable (no rapid user preference changes).
8. Comparison: Bandit vs A/B Test
| Feature | A/B Test | Bandit Algorithm |
|---|---|---|
| Traffic split | Fixed (50/50, 25/25/25, etc.) | Adaptive |
| Objective | Identify best option (post-test) | Maximize reward during test |
| Sample size needed | Larger (for significance) | Smaller (adapts early) |
| Risk of revenue loss | Higher (bad variant gets traffic until end) | Lower |
| Interpretability | Simple (p-values, significance) | More complex (Bayesian/posterior, dynamic) |
In short:
Bandit algorithms are adaptive testing methods that balance exploration and exploitation to maximize rewards while learning. Unlike A/B tests that split traffic equally, bandits dynamically allocate more traffic to better-performing options, reducing wasted opportunities.
