Contextual Bandits

Contextual bandits are bandit problems in which each decision arrives with observable context, and the algorithm learns from partial feedback which actions work best in which contexts.

Key points

  • Slivkins describes contextual bandits as a middle ground between IID and adversarial bandits: rewards can change across rounds, but the change is explained by observed contexts [src-019].
  • The model is useful when actions should be personalised or conditioned on features, such as recommending content, choosing offers, or adapting an interface to a user/session context [src-019].
  • The learning problem is harder than Stochastic Bandits because the algorithm must learn a policy from contexts to actions, not just estimate one reward distribution per arm [src-019].
  • Contextual bandits sit near reinforcement learning conceptually but keep the feedback loop shorter: each round is a context, an action, and an observed reward, rather than a long-horizon Markov decision process [src-019]. A minimal version of this loop is sketched in code after this list.
  • Lu, Pál, and Pál make the model concrete through sponsored search: a query arrives as context, an ad is chosen as the action, and click feedback is the observed payoff [src-020].
  • Their Lipschitz version adds a metric structure over contexts and actions, allowing reward estimates to generalise across similar queries and similar ads [src-020]; the Lipschitz condition is sketched after this list.
  • Yildirim frames contextual bandits as dynamic Treatment Personalisation: traffic allocation changes over time and by user context, not just globally as in a context-free bandit [src-021].
  • A contextual bandit is often the practical middle ground between static A/B testing and full multi-step reinforcement learning: useful when actions do not materially change future system state [src-021].
  • In production experimentation, contextual bandits also need Offline Policy Evaluation so teams can test candidate policies against logged data before live deployment [src-021]; a minimal such estimator is sketched after this list.
  • Hightouch frames contextual bandits as the personalisation extension of standard bandits: instead of finding the single best option for everyone, they use customer context such as purchase history and demographic information to choose for the individual [src-023].
  • Hightouch’s MAB article makes the same limitation explicit from the other side: standard multi-armed bandits find winning strategies faster but still optimise for the average customer, so contextual bandits are needed for Sarah-versus-Marcus style individual differences [src-025].
  • Hightouch’s contextual-bandit article makes the operational pipeline explicit: customer data becomes a Customer Feature Matrix, possible actions are combined across dimensions, a model predicts the expected reward of each action, and the bandit selects the highest-reward action for that individual [src-026]; the selection step is sketched after this list.
  • The article emphasises dual learning: audience-level patterns help similar customers and cold-start cases, while individual-level updates preserve exceptions such as a high-LTV customer who ignores discounts [src-026].
  • At enterprise scale, contextual-bandit systems must process hundreds of customer features and thousands of possible actions, coordinate multiple campaigns, avoid overwhelming customers, and connect with existing ML models [src-026].
  • Braze describes contextual bandits as an added intelligence layer over standard bandits, using signals such as location, device, and past engagement to choose what performs best for a customer in a specific moment [src-027].
  • In Braze’s AI Decisioning Studio framing, contextual bandits work alongside Intelligent Selection and predictive scoring rather than replacing the MAB allocation layer [src-027].
  • Statsig positions contextual multi-armed bandits as an acceleration tool for shallow A/B decisions where the decision rule is binary, such as picking the winner and killing the loser based on conversion rate [src-031].
  • Statsig also cautions that contextual bandits are less appropriate for careful calibration studies where inference quality matters more than traffic shifting [src-031].
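
Code sketches

The sketches below are illustrative Python written to match the descriptions above; none is an implementation from a cited source.

The round-by-round loop can be made concrete with a LinUCB-style linear bandit: score each arm's estimated reward for the observed context plus an exploration bonus, act, then update only the chosen arm. The arm count, context dimension, and simulated reward weights here are assumptions for illustration.

    import numpy as np

    class LinUCB:
        """Per-arm ridge regression with an upper-confidence exploration bonus."""

        def __init__(self, n_arms, dim, alpha=1.0):
            self.alpha = alpha
            # Per-arm sufficient statistics: A = I + sum(x x^T), b = sum(r x).
            self.A = [np.eye(dim) for _ in range(n_arms)]
            self.b = [np.zeros(dim) for _ in range(n_arms)]

        def choose(self, x):
            scores = []
            for A, b in zip(self.A, self.b):
                A_inv = np.linalg.inv(A)
                theta = A_inv @ b                            # point estimate of arm weights
                bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # optimism for rarely-tried arms
                scores.append(theta @ x + bonus)
            return int(np.argmax(scores))

        def update(self, arm, x, reward):
            # Only the chosen arm is updated: this is the partial feedback.
            self.A[arm] += np.outer(x, x)
            self.b[arm] += reward * x

    # Hypothetical simulation: two arms whose payoffs depend on a 3-d context.
    rng = np.random.default_rng(0)
    true_w = [np.array([0.8, 0.1, 0.0]), np.array([0.0, 0.2, 0.9])]
    bandit = LinUCB(n_arms=2, dim=3)
    for t in range(1000):
        x = rng.random(3)                                  # observed context
        arm = bandit.choose(x)                             # action
        reward = true_w[arm] @ x + rng.normal(0.0, 0.1)    # observed reward
        bandit.update(arm, x, reward)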
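
The Lipschitz structure from Lu, Pál, and Pál [src-020] says expected payoffs change by at most L per unit of distance in a joint metric over (context, action) pairs, so a logged observation at one (query, ad) pair bounds the payoff at nearby pairs. The metric, the constant L, and the noiseless-reward simplification below are assumptions made for clarity.

    import numpy as np

    def lipschitz_upper_bound(history, x, a, L=1.0):
        """Tightest Lipschitz-consistent upper bound on the payoff at (x, a).

        history: list of ((context_vec, action_vec), reward) pairs.
        Assumes noiseless rewards for clarity; with noise, one would bound
        estimated means plus a confidence term instead.
        """
        bounds = []
        for (x_old, a_old), r in history:
            d = np.linalg.norm(x - x_old) + np.linalg.norm(a - a_old)
            bounds.append(r + L * d)  # payoff rises at most L per unit distance
        return min(bounds) if bounds else float("inf")

    # One logged (query, ad) observation bounds a nearby, never-tried pair.
    history = [((np.array([0.0]), np.array([0.0])), 0.5)]
    print(lipschitz_upper_bound(history, np.array([0.2]), np.array([0.1])))  # 0.8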
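
One standard way to do the Offline Policy Evaluation mentioned above is inverse propensity scoring over logged data; the log format and policy signature below are assumptions for illustration.

    def ips_value(logs, candidate_policy):
        """Inverse-propensity-scoring estimate of a candidate policy's value.

        logs: iterable of (context, logged_action, reward, logging_prob),
        where logging_prob is the probability the logging policy assigned
        to the action it actually took.
        """
        total, n = 0.0, 0
        for context, action, reward, prob in logs:
            n += 1
            # A logged reward counts only when the candidate would have taken
            # the same action, re-weighted by how likely the logger was to.
            if candidate_policy(context) == action:
                total += reward / prob
        return total / n if n else 0.0

    # Hypothetical logs: the candidate policy always picks action 1.
    logs = [({"segment": "new"}, 1, 1.0, 0.5), ({"segment": "new"}, 0, 0.0, 0.5)]
    print(ips_value(logs, lambda ctx: 1))  # 1.0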
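
The selection step of the Hightouch-style pipeline [src-026] enumerates candidate actions across dimensions, scores each with a reward model against one customer's feature row, and takes the argmax. The feature names, action dimensions, and hand-written scorer are placeholders, not Hightouch's actual schema or model.

    from itertools import product

    # Combine possible actions across dimensions, as the pipeline describes.
    channels = ["email", "sms"]
    offers = ["10_percent_off", "free_shipping", "no_discount"]
    actions = list(product(channels, offers))

    def predicted_reward(features, action):
        # Stand-in for a trained reward model; a real system would call the
        # learned predictor here rather than hand-written rules.
        channel, offer = action
        score = 0.1
        if features["opens_email"] and channel == "email":
            score += 0.3
        if features["discount_sensitive"] and offer != "no_discount":
            score += 0.2
        return score

    # One row of the feature matrix for one customer (illustrative fields).
    customer = {"opens_email": True, "discount_sensitive": False}
    best = max(actions, key=lambda a: predicted_reward(customer, a))
    print(best)  # the highest predicted-reward action for this individual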

Source references

  • [src-019] Aleksandrs Slivkins — “Introduction to Multi-Armed Bandits” (2019-04-15; revised 2024-04-03)
  • [src-020] Tyler Lu, Dávid Pál, Martin Pál — “Contextual Multi-Armed Bandits” (AISTATS 2010)
  • [src-021] Ugur Yildirim — “An Overview of Contextual Bandits” (2024-02-02)
  • [src-023] Hightouch — “Under the hood of AI Decisioning, part one: Overcoming the personalization gap”
  • [src-025] Hightouch — “Under the hood of AI Decisioning, part three: Multi-armed bandits”
  • [src-026] Hightouch — “Under the hood of AI Decisioning, part four: Contextual bandits”
  • [src-027] Team Braze — “What is a multi-armed bandit? Smarter experimentation for real-time marketing”
  • [src-031] Yuzheng Sun — “Speeding up A/B tests with discipline”