1. Definition

  • Bandit algorithms are a family of algorithms for solving the exploration–exploitation trade-off in decision-making.
  • They are often called Multi-Armed Bandits (MAB) (like slot machines with multiple arms).
  • Goal: Maximize cumulative reward over time by deciding which option (“arm”) to play, balancing:
    • Exploration: Try different options to learn their payoff.
    • Exploitation: Pick the best-known option so far to maximize reward.

2. Why It Matters

  • In A/B testing: you split traffic evenly between options until significance is reached → can “waste” traffic on worse options.
  • In Bandit algorithms: traffic allocation adapts dynamically → better options get more users earlier → less revenue loss.
  • Used in: online ads, recommendations, pricing optimization, clinical trials, game design, etc.

3. The Core Problem (Exploration vs Exploitation)

  • Example: 3 slot machines (arms).
    • Arm A → 5% win rate.
    • Arm B → 7% win rate.
    • Arm C → 6% win rate.
  • If you always pick B (best-known), you might miss discovering C is better long-term.
  • If you keep exploring randomly, you waste reward.
    Bandit algorithms balance both.

4. Popular Bandit Algorithms

1. ε-Greedy

  • With probability ε (e.g., 10%), explore a random arm.
  • With probability 1 − ε, exploit the current best arm.
  • Simple but effective.

2. Upper Confidence Bound (UCB)

  • Picks the option with the best upper confidence interval for expected reward.
  • Formula balances mean reward and uncertainty.
  • Favors exploration when uncertainty is high, but shifts to exploitation as data grows.

3. Thompson Sampling (Bayesian Bandit)

  • Models each arm’s reward as a probability distribution (e.g., Beta distribution for Bernoulli rewards).
  • Randomly samples from each posterior distribution → selects arm with highest sampled value.
  • Naturally balances exploration & exploitation.
  • Proven to be highly efficient and widely used in practice.

4. Softmax (Boltzmann Exploration)

  • Probability of selecting an arm is proportional to its estimated reward.
  • Better arms are more likely but not guaranteed.
  • Controlled by a temperature parameter (balances randomness vs greediness).

5. Advantages Over A/B Testing

  • A/B Testing:
    • Static traffic split (e.g., 50/50).
    • Wastes traffic on losing variants until significance.
  • Bandits:
    • Adaptive allocation → better variants get more users sooner.
    • Maximizes cumulative reward during the experiment, not just after.

6. Example: Website Optimization

Suppose an e-commerce site tests 3 call-to-action buttons:

  • A: “Buy Now”
  • B: “Shop Today”
  • C: “Get Yours”
  • In an A/B/n test → equal 33/33/33 split until significance.
  • With a bandit:
    • Start equal.
    • After some clicks, B shows higher conversion.
    • Algorithm shifts more traffic to B.
    • Eventually, maybe C outperforms B → algorithm adjusts again.

End result: more conversions during the test, not just after.


7. Limitations

  • More complex to implement than A/B testing.
  • Harder to explain to non-technical stakeholders (less “clean” significance reporting).
  • Works best when:
    • Rewards are immediate and measurable (clicks, conversions).
    • Environment is relatively stable (no rapid user preference changes).

8. Comparison: Bandit vs A/B Test

FeatureA/B TestBandit Algorithm
Traffic splitFixed (50/50, 25/25/25, etc.)Adaptive
ObjectiveIdentify best option (post-test)Maximize reward during test
Sample size neededLarger (for significance)Smaller (adapts early)
Risk of revenue lossHigher (bad variant gets traffic until end)Lower
InterpretabilitySimple (p-values, significance)More complex (Bayesian/posterior, dynamic)

In short:
Bandit algorithms are adaptive testing methods that balance exploration and exploitation to maximize rewards while learning. Unlike A/B tests that split traffic equally, bandits dynamically allocate more traffic to better-performing options, reducing wasted opportunities.