Thompson Sampling

Thompson Sampling is a Bayesian bandit algorithm that maintains a belief over possible reward models, samples from that belief, and chooses the action that looks best under the sampled model.
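The loop described above can be sketched for the classic Bernoulli-reward case with Beta priors. This is an illustrative sketch, not an implementation from any of the cited sources; the function name, arguments, and the choice of uniform Beta(1, 1) priors are all assumptions made for the example.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling (illustrative sketch).

    `true_means` holds the unknown Bernoulli reward probabilities of
    each arm; in a real deployment these are of course not observable.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k  # adds to the Beta(1, 1) prior's alpha
    failures = [0] * k   # adds to the Beta(1, 1) prior's beta
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its posterior belief...
        samples = [rng.betavariate(1 + successes[i], 1 + failures[i])
                   for i in range(k)]
        # ...and act greedily under the sampled model.
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures
```

Because each arm is chosen exactly when its posterior sample is the largest, the probability of playing an arm equals the posterior probability that it is optimal, which is the probability-matching behaviour discussed below.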

Key points

  • Slivkins introduces Thompson Sampling in the chapter on Bayesian bandits, describing it as an important algorithm known to perform well in both theory and practice [src-019].
  • The algorithm turns uncertainty into action randomisation: arms with plausible high reward are sampled more often, while uncertainty naturally drives exploration [src-019].
  • In the book’s structure, Thompson Sampling connects Bayesian priors, regret analysis, and prior-independent variants for the original Stochastic Bandits setting [src-019].
  • Practically, Thompson Sampling is often appealing because its exploration is probability-matched rather than rule-based: the probability of choosing an arm tracks how likely it is to be optimal under the current belief [src-019].
  • Yildirim explains the contextual-bandit version as a model that returns posterior distributions over expected outcomes for condition-context pairs, then chooses the condition most likely to produce the highest expected outcome for the observed context [src-021].
  • In practice, the model update often happens in batches rather than after every individual sample, reducing noisy updates in live experimentation systems [src-021].
  • AB Tasty presents Thompson Sampling as a practical bandit option for experimentation traffic allocation: stronger-performing variations receive more visits, while weaker-performing variations receive fewer [src-022].
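The batched-update pattern mentioned above can be sketched as a small variant of Beta-Bernoulli Thompson Sampling: within a batch, arms are drawn from a frozen posterior, and observed rewards are folded into the Beta parameters only at batch boundaries. This is a hypothetical sketch under those assumptions, not code from the cited sources; all names and the batch size are illustrative.

```python
import random

def batched_thompson(true_means, horizon, batch_size=100, seed=0):
    """Thompson Sampling with batched posterior updates (hypothetical sketch).

    Rewards observed during a batch are buffered and only merged into the
    posterior every `batch_size` rounds, mirroring live experimentation
    systems that update the model in batches rather than per sample.
    """
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1] * k  # Beta posterior parameters, uniform Beta(1, 1) priors
    beta = [1] * k
    pending = [[0, 0] for _ in range(k)]  # per-arm buffered (successes, failures)
    for t in range(horizon):
        # Sampling uses the posterior as of the last batch boundary.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        pending[arm][0 if reward else 1] += 1
        if (t + 1) % batch_size == 0:
            # Flush buffered observations into the posterior.
            for i in range(k):
                alpha[i] += pending[i][0]
                beta[i] += pending[i][1]
                pending[i] = [0, 0]
    return alpha, beta
```

The trade-off is that exploration within a batch is blind to the batch's own data, but the posterior sees fewer, less noisy updates.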

Source references

  • [src-019] Aleksandrs Slivkins — “Introduction to Multi-Armed Bandits” (2019-04-15; revised 2024-04-03)
  • [src-021] Ugur Yildirim — “An Overview of Contextual Bandits” (2024-02-02)
  • [src-022] AB Tasty — “Multi-Armed Bandits: A/B Testing with Fewer Regrets”