Multi-Armed Bandits
Multi-armed bandits are a framework for algorithms that make repeated decisions under uncertainty, learning from partial feedback while balancing exploration of uncertain actions against exploitation of actions that currently look best.
Key points
- Slivkins presents bandits as a simple but powerful framework for decisions over time under uncertainty, with applications spanning computer science, operations research, economics, and statistics [src-019].
- The central structure is sequential choice with feedback: the algorithm chooses an arm/action, observes feedback for that choice, and must use that evidence to improve later choices [src-019].
- The book treats bandits as a family of models rather than one method: IID rewards, Bayesian priors, similarity information, full-feedback and adversarial variants, linear/semi-bandit feedback, contextual bandits, games, budget constraints, and incentive-compatible exploration [src-019].
- For product and AI systems, the practical question is not “use bandits or not”, but which assumptions match the environment: stable rewards, adversarial changes, contextual signals, budget constraints, or strategic participants [src-019].
- In the sponsored-search formulation, the same bandit loop becomes query-conditioned: the system observes a query, chooses an ad, receives click/no-click feedback, and accumulates regret against an oracle that knows the best ad for each query [src-020].
- Yildirim draws a practitioner’s distinction: an A/B test uses static allocation, a multi-armed bandit adapts allocation globally, and a contextual bandit adapts allocation by user or context characteristics [src-021].
- AB Tasty frames experimentation in terms of conversion regret: bandits reduce the cost of sending traffic to underperforming variants by reallocating traffic during the test instead of waiting for a fixed A/B test to finish [src-022].
- For conversion-rate optimisation, bandits are especially useful when the goal is short-term reward maximisation under time pressure, such as limited offers, short-lived content, many variants, or high opportunity cost per lost conversion [src-022].
- Hightouch places multi-armed bandits inside AI Decisioning as the mechanism for balancing exploration of new marketing actions with exploitation of actions that already drive outcomes [src-023].
- Hightouch’s reinforcement-learning article positions bandits as the next layer after reinforcement learning: once the system can learn from outcomes, bandits decide which actions to try next without testing everything equally or freezing on the first thing that works [src-024].
- Hightouch’s marketer-focused MAB article adds the combinatorial testing point: 5 subject lines, 4 send times, 3 offer types, and 2 creative templates already create 5 × 4 × 3 × 2 = 120 possible combinations, which is why manually A/B testing one or two variables at a time is too slow [src-025].
- In marketing, each arm can be a subject line, send time, creative template, offer, channel, content theme, or a combination of these decision dimensions [src-025].
- Standard MABs optimise toward the best option on average; they become true 1:1 personalisation only when combined with individual customer data through Contextual Bandits [src-025].
- Hightouch’s contextual-bandit article sharpens the boundary: standard MABs answer “what works best for everyone?”, while contextual bandits answer “what works best for this person, right now?” [src-026].
- Braze adds a platform-operations view: a bandit is a decisioning approach inside tools such as Braze, with models such as Upper Confidence Bound and Thompson Sampling run automatically after the marketer defines an outcome such as clicks or conversions [src-027].
- In Braze’s marketing framing, A/B testing validates overall direction while bandits adapt and refine live campaign elements such as images, headlines, buttons, offers, channels, and timing [src-027].
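The choose-observe-update loop and the combinatorial arm space described in the key points can be sketched in a few lines of Python. This is a minimal illustration, not the method of any cited source or vendor: the arm labels (`subject_0`, `time_0`, and so on), the function name `thompson_sampling`, and the conversion rates in the usage note are all hypothetical. It assumes binary rewards (convert or not) and a uniform Beta(1, 1) prior per arm, which is one common way to set up Thompson Sampling.

```python
import itertools
import random

# Hypothetical decision dimensions, sized to match the 5 x 4 x 3 x 2 example.
subject_lines = [f"subject_{i}" for i in range(5)]
send_times = [f"time_{i}" for i in range(4)]
offer_types = [f"offer_{i}" for i in range(3)]
templates = [f"template_{i}" for i in range(2)]

# Each arm is one combination of the four dimensions: 120 arms in total.
arms = list(itertools.product(subject_lines, send_times, offer_types, templates))


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Bernoulli Thompson Sampling.

    true_rates holds assumed per-arm conversion rates, used only to
    simulate feedback; the algorithm itself never reads them directly.
    Returns (pull counts per arm, total reward).
    """
    rng = random.Random(seed)
    n = len(true_rates)
    successes = [0] * n
    failures = [0] * n
    total_reward = 0
    for _ in range(rounds):
        # Draw a plausible conversion rate for each arm from its Beta posterior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(n)
        ]
        # Play the arm whose sampled rate is highest.
        arm = max(range(n), key=samples.__getitem__)
        # Observe feedback for the chosen arm only (partial feedback).
        reward = 1 if rng.random() < true_rates[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        total_reward += reward
    pulls = [successes[i] + failures[i] for i in range(n)]
    return pulls, total_reward
```

For example, `thompson_sampling([0.05, 0.10, 0.30], rounds=5000)` simulates three arms with made-up conversion rates; over enough rounds most pulls should concentrate on the strongest arm, while weaker arms still receive occasional exploratory traffic. Running the same loop over all 120 entries of `arms` would need far more traffic per arm, which is the practical pressure behind the combinatorial point above.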
Related entities
- Aleksandrs Slivkins — author of the ingested textbook source
- Braze
Related concepts
- Exploration-Exploitation Trade-off
- Stochastic Bandits
- Thompson Sampling
- Contextual Bandits
- Adversarial Bandits
- Bandits with Knapsacks
- Lipschitz Contextual Bandits
- Sponsored Search Ad Ranking
- Treatment Personalisation
- Dynamic Traffic Allocation
- A/B Testing vs Bandits
- AI Decisioning
- Reinforcement Learning for Marketing
- Marketing Bandit Optimisation
- Agentic Marketing
- Customer Feature Matrix
- Intelligent Selection
Source references
- [src-019] Aleksandrs Slivkins — “Introduction to Multi-Armed Bandits” (2019-04-15; revised 2024-04-03)
- [src-020] Tyler Lu, David Pál, Martin Pál — “Contextual Multi-Armed Bandits” (AISTATS 2010)
- [src-021] Ugur Yildirim — “An Overview of Contextual Bandits” (2024-02-02)
- [src-022] AB Tasty — “Multi-Armed Bandits: A/B Testing with Fewer Regrets”
- [src-023] Hightouch — “Under the hood of AI Decisioning, part one: Overcoming the personalization gap”
- [src-024] Hightouch — “Under the hood of AI Decisioning, part two: Reinforcement learning”
- [src-025] Hightouch — “Under the hood of AI Decisioning, part three: Multi-armed bandits”
- [src-026] Hightouch — “Under the hood of AI Decisioning, part four: Contextual bandits”
- [src-027] Team Braze — “What is a multi-armed bandit? Smarter experimentation for real-time marketing”