Multi-Armed Bandits

Multi-armed bandits are a framework for algorithms that make repeated decisions under uncertainty, learning from partial feedback while balancing exploration of uncertain actions against exploitation of actions that currently look best.
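As a rough illustration of that balance, the sketch below simulates an epsilon-greedy policy, one of the simplest bandit strategies. It is not taken from any of the sources cited on this page. The function name, the arm success rates, the number of rounds, and the value of epsilon are all assumptions chosen for the example.

```python
import random


def epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit on Bernoulli arms.

    true_rates are hypothetical per-arm success probabilities that the
    algorithm never sees directly; it only observes 0/1 rewards for the
    arm it chose (partial feedback).
    Returns (pull counts per arm, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms      # how often each arm was chosen
    means = [0.0] * n_arms    # running estimate of each arm's reward
    total_reward = 0

    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm that currently looks best.
            arm = max(range(n_arms), key=lambda a: means[a])

        # Feedback is observed only for the chosen arm.
        reward = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # incremental mean
        total_reward += reward

    return pulls, total_reward


# Made-up success rates for three arms; the third is the best.
pulls, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
print(pulls, total_reward)
```

With a fixed epsilon the policy keeps exploring at the same rate forever, which is one reason methods such as Upper Confidence Bound and Thompson Sampling are often preferred in practice.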

Key points

  • Slivkins presents bandits as a simple but powerful framework for decisions over time under uncertainty, with applications spanning computer science, operations research, economics, and statistics [src-019].
  • The central structure is sequential choice with feedback: the algorithm chooses an arm/action, observes feedback for that choice, and must use that evidence to improve later choices [src-019].
  • The book treats bandits as a family of models rather than one method: IID rewards, Bayesian priors, similarity information, full-feedback and adversarial variants, linear/semi-bandit feedback, contextual bandits, games, budget constraints, and incentive-compatible exploration [src-019].
  • For product and AI systems, the practical question is not “use bandits or not”, but which assumptions match the environment: stable rewards, adversarial changes, contextual signals, budget constraints, or strategic participants [src-019].
  • In the sponsored-search formulation, the same bandit loop becomes query-conditioned: the system observes a query, chooses an ad, receives click/no-click feedback, and accumulates regret against an oracle that knows the best ad for each query [src-020].
  • Yildirim draws a practitioner distinction: an A/B test uses static allocation, a multi-armed bandit adapts allocation globally, and a contextual bandit adapts allocation by user/context characteristics [src-021].
  • AB Tasty frames experimentation in terms of conversion regret: bandits reduce the cost of sending traffic to underperforming variants by reallocating traffic during the test instead of waiting for a fixed A/B test to finish [src-022].
  • For conversion-rate optimisation, bandits are especially useful when the goal is short-term reward maximisation under time pressure, such as limited offers, short-lived content, many variants, or high opportunity cost per lost conversion [src-022].
  • Hightouch places multi-armed bandits inside AI Decisioning as the mechanism for balancing exploration of new marketing actions with exploitation of actions that already drive outcomes [src-023].
  • Hightouch’s RL article positions bandits as the next layer after reinforcement learning: once the system can learn from outcomes, bandits decide which actions to try next without testing everything equally or freezing on the first thing that works [src-024].
  • Hightouch’s marketer-focused MAB article adds the combinatorial testing point: 5 subject lines, 4 send times, 3 offer types, and 2 creative templates already create 5 × 4 × 3 × 2 = 120 possible combinations, which is why manual A/B testing of one or two variables at a time is too slow [src-025].
  • In marketing, each arm can be a subject line, send time, creative template, offer, channel, content theme, or a combination of these decision dimensions [src-025].
  • Standard MABs optimise toward the option that performs best on average; they become true 1:1 personalisation only when combined with individual customer data through contextual bandits [src-025].
  • Hightouch’s contextual-bandit article sharpens the boundary: standard MABs answer “what works best for everyone?”, while contextual bandits answer “what works best for this person, right now?” [src-026].
  • Braze adds a platform-operations view: a bandit is a decisioning approach inside tools such as Braze, with models such as Upper Confidence Bound and Thompson Sampling run automatically after the marketer defines an outcome such as clicks or conversions [src-027].
  • In Braze’s marketing framing, A/B testing validates overall direction while bandits adapt and refine live campaign elements such as images, headlines, buttons, offers, channels, and timing [src-027].
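Two of the points above lend themselves to a small simulation: the combinatorial growth of arms in the marketing example, and the way a bandit reduces regret by reallocating traffic during a test. The sketch below is illustrative only and does not represent how Hightouch, Braze, or AB Tasty implement their products. The option labels are placeholders, the conversion rates are invented, the Beta(1, 1) prior is an assumption, and regret is measured as expected (pseudo-)regret against the best arm.

```python
import itertools
import random

# Decision dimensions from the marketing example: 5 x 4 x 3 x 2 options.
# The labels are placeholders, not real campaign values.
dimensions = {
    "subject_line": [f"subject_{i}" for i in range(5)],
    "send_time": [f"time_{i}" for i in range(4)],
    "offer_type": [f"offer_{i}" for i in range(3)],
    "template": [f"template_{i}" for i in range(2)],
}
arms = list(itertools.product(*dimensions.values()))
print(len(arms))  # 120


def thompson_sampling(true_rates, rounds, rng):
    """Beta-Bernoulli Thompson Sampling.

    Returns (pulls per arm, expected regret against the best arm).
    true_rates are hypothetical conversion rates used only to simulate
    feedback and to score regret; the policy never reads them directly.
    """
    n = len(true_rates)
    alpha = [1] * n  # 1 + observed conversions (Beta(1, 1) prior, assumed)
    beta = [1] * n   # 1 + observed non-conversions
    pulls = [0] * n
    best = max(true_rates)
    regret = 0.0
    for _ in range(rounds):
        # Sample a plausible rate per arm and play the highest sample.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n)]
        arm = max(range(n), key=samples.__getitem__)
        if rng.random() < true_rates[arm]:
            alpha[arm] += 1
        else:
            beta[arm] += 1
        pulls[arm] += 1
        regret += best - true_rates[arm]
    return pulls, regret


def static_split_regret(true_rates, rounds):
    """Expected regret of a fixed, even A/B/n traffic split."""
    best = max(true_rates)
    average = sum(true_rates) / len(true_rates)
    return rounds * (best - average)


# Three variants with made-up conversion rates.
rates = [0.05, 0.10, 0.20]
pulls, ts_regret = thompson_sampling(rates, 10_000, random.Random(0))
print(pulls, round(ts_regret, 1), round(static_split_regret(rates, 10_000), 1))
```

The simulation uses only three arms to keep the output readable. Running the same policy over all 120 combinations would need far more traffic per arm, which is part of why the sources point toward contextual or structured variants for large action spaces.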

Source references

  • [src-019] Aleksandrs Slivkins — “Introduction to Multi-Armed Bandits” (2019-04-15; revised 2024-04-03)
  • [src-020] Tyler Lu, David Pál, Martin Pál — “Contextual Multi-Armed Bandits” (AISTATS 2010)
  • [src-021] Ugur Yildirim — “An Overview of Contextual Bandits” (2024-02-02)
  • [src-022] AB Tasty — “Multi-Armed Bandits: A/B Testing with Fewer Regrets”
  • [src-023] Hightouch — “Under the hood of AI Decisioning, part one: Overcoming the personalization gap”
  • [src-024] Hightouch — “Under the hood of AI Decisioning, part two: Reinforcement learning”
  • [src-025] Hightouch — “Under the hood of AI Decisioning, part three: Multi-armed bandits”
  • [src-026] Hightouch — “Under the hood of AI Decisioning, part four: Contextual bandits”
  • [src-027] Team Braze — “What is a multi-armed bandit? Smarter experimentation for real-time marketing”

Robin Cartier perspective

This page is part of Robin Cartier's working AI knowledge graph: a practical research layer for production AI, recommendation systems, experimentation, GEO, and agentic web readiness.

The useful next step is to connect this concept back to applied product leadership and operating models.
