Thompson Sampling
Thompson Sampling is a Bayesian bandit algorithm that maintains a belief over possible reward models, samples from that belief, and chooses the action that looks best under the sampled model.
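As a concrete illustration of this sample-then-act loop, here is a minimal sketch for the Bernoulli-reward case with Beta(1, 1) priors. The Bernoulli/Beta model, the simulated `true_means`, and the horizon are assumptions made for the example, not details taken from the sources.

```python
# A minimal Thompson Sampling sketch, assuming Bernoulli rewards with
# Beta(1, 1) priors (an illustrative modelling choice, not from the sources).
import random


def thompson_sampling(true_means, horizon=1000):
    """Run Thompson Sampling; `true_means` simulates the unknown arms."""
    k = len(true_means)
    successes = [1] * k  # Beta alpha parameters, prior Beta(1, 1)
    failures = [1] * k   # Beta beta parameters

    for _ in range(horizon):
        # Sample one plausible mean per arm from its posterior ...
        sampled = [random.betavariate(successes[i], failures[i]) for i in range(k)]
        # ... then act greedily with respect to the sampled model.
        arm = max(range(k), key=lambda i: sampled[i])

        reward = 1 if random.random() < true_means[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return successes, failures


if __name__ == "__main__":
    print(thompson_sampling([0.3, 0.5, 0.7]))
```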
Key points
- Slivkins introduces Thompson Sampling in the chapter on Bayesian bandits, describing it as an important algorithm known to perform well in both theory and practice [src-019].
- The algorithm turns uncertainty into action randomisation: arms with plausible high reward are sampled more often, while uncertainty naturally drives exploration [src-019].
- In the book’s structure, Thompson Sampling connects Bayesian priors, regret analysis, and prior-independent variants, all within the original Stochastic Bandits setting [src-019].
- Practically, Thompson Sampling is often appealing because its exploration is probability-matched rather than rule-based: the probability of choosing an arm tracks how likely that arm is to be optimal under the current belief [src-019] (illustrated in the probability-matching sketch after this list).
- Yildirim explains the contextual-bandit version as a model that returns posterior distributions over expected outcomes for condition-context pairs, then chooses the condition most likely to produce the highest expected outcome for the observed context [src-021] (see the contextual sketch after this list).
- In practice, the model is often updated in batches rather than after every individual sample, which reduces noisy updates in live experimentation systems [src-021] (see the batched-update sketch after this list).
- AB Tasty presents Thompson Sampling as a practical bandit option for experimentation traffic allocation: stronger-performing variations receive more visits, while poor-performing variations receive fewer [src-022].
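To make the probability-matching point concrete, the sketch below estimates, via repeated posterior draws, how often each arm would be selected: the fraction of draws in which an arm is the argmax approximates the probability Thompson Sampling assigns to playing it. The Beta posterior parameters here are invented for illustration.

```python
# A sketch of probability matching: the fraction of posterior draws in which
# each arm is the argmax estimates the probability that Thompson Sampling
# plays that arm. Posterior parameters below are made-up illustrative values.
import random


def choice_probabilities(alphas, betas, draws=100_000):
    k = len(alphas)
    wins = [0] * k
    for _ in range(draws):
        sampled = [random.betavariate(alphas[i], betas[i]) for i in range(k)]
        wins[max(range(k), key=lambda i: sampled[i])] += 1
    return [w / draws for w in wins]


# Arm 0 looks strong but uncertain; arm 1 is well-estimated and mediocre.
print(choice_probabilities(alphas=[3, 50], betas=[2, 50]))
```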
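One common way to realise the contextual version Yildirim describes is a Bayesian linear regression per condition (arm). The Gaussian posterior, the `lam` ridge prior, and the `sigma2` noise variance below are modelling assumptions made for this sketch, not details from [src-021].

```python
# A hedged sketch of contextual Thompson Sampling using one Bayesian linear
# regression per condition. The Gaussian likelihood, ridge prior, and noise
# level are illustrative assumptions, not details from the sources.
import numpy as np


class LinearTSArm:
    def __init__(self, dim, lam=1.0, sigma2=1.0):
        self.sigma2 = sigma2            # assumed reward-noise variance
        self.A = lam * np.eye(dim)      # posterior precision (ridge prior)
        self.b = np.zeros(dim)          # accumulated context * outcome sums

    def sample_expected_outcome(self, context):
        cov = self.sigma2 * np.linalg.inv(self.A)
        mean = np.linalg.solve(self.A, self.b)
        theta = np.random.multivariate_normal(mean, cov)  # posterior draw
        return float(context @ theta)

    def update(self, context, outcome):
        self.A += np.outer(context, context)
        self.b += outcome * context


def choose_condition(arms, context):
    # Pick the condition whose sampled expected outcome is highest for this
    # context; posterior sampling supplies the exploration.
    draws = [arm.sample_expected_outcome(context) for arm in arms]
    return int(np.argmax(draws))


arms = [LinearTSArm(dim=3) for _ in range(2)]  # two hypothetical conditions
ctx = np.array([1.0, 0.2, -0.5])               # an observed context
print(choose_condition(arms, ctx))
```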
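Finally, a hedged sketch of the batched-update pattern: visitors within each batch are allocated by sampling from the current posteriors, so stronger variations naturally receive more traffic, and the posteriors are refreshed only once per batch. The batch size and conversion rates are illustrative assumptions.

```python
# A sketch of batched posterior updates: allocation within a batch uses the
# posteriors as of the last refresh, and Beta parameters are updated once per
# batch. Batch size and conversion rates are illustrative assumptions.
import random


def run_batched(true_rates, batch_size=500, n_batches=20):
    k = len(true_rates)
    alphas, betas = [1] * k, [1] * k
    for _ in range(n_batches):
        pending = [[0, 0] for _ in range(k)]  # per-variation (successes, failures)
        for _ in range(batch_size):
            sampled = [random.betavariate(alphas[i], betas[i]) for i in range(k)]
            v = max(range(k), key=lambda i: sampled[i])
            if random.random() < true_rates[v]:
                pending[v][0] += 1
            else:
                pending[v][1] += 1
        for i in range(k):  # one posterior refresh per batch
            alphas[i] += pending[i][0]
            betas[i] += pending[i][1]
    return alphas, betas


print(run_batched([0.04, 0.05, 0.06]))
```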
Related entities
Related concepts
- Multi-Armed Bandits
- Exploration-Exploitation Trade-off
- Stochastic Bandits
- Contextual Bandits
- Epsilon-Greedy
- Upper Confidence Bound
- Dynamic Traffic Allocation