Contextual Bandits
Contextual bandits are bandit problems where each decision arrives with observable context, and the algorithm learns which actions work best for which contexts from partial feedback.
Key points
- Slivkins describes contextual bandits as a middle ground between IID and adversarial bandits: rewards can change across rounds, but the change is explained by observed contexts [src-019].
- The model is useful when actions should be personalised or conditioned on features, such as recommending content, choosing offers, or adapting an interface to a user/session context [src-019].
- The learning problem is harder than Stochastic Bandits because the algorithm must learn a policy mapping contexts to actions, not just estimate one reward distribution per arm (see the per-round loop sketched after this list) [src-019].
- Contextual bandits sit near reinforcement learning conceptually but keep the feedback loop shorter: each round is a context, action, and observed reward rather than a long-horizon Markov decision process [src-019].
- Lu, Pál, and Pál make the model concrete through sponsored search: a query arrives as context, an ad is chosen as the action, and click feedback is the observed payoff [src-020].
- Their Lipschitz version adds a metric structure over contexts and actions, allowing reward estimates to generalise across similar queries and similar ads (a version of the smoothness condition is written out after this list) [src-020].
- Yildirim frames contextual bandits as dynamic Treatment Personalisation: traffic allocation changes over time and by user context, not just globally as in a context-free bandit [src-021].
- A contextual bandit is often the practical middle ground between static A/B testing and full multi-step reinforcement learning: useful when actions do not materially change future system state [src-021].
- In production experimentation, contextual bandits also need Offline Policy Evaluation so teams can test candidate policies against logged data before live deployment (an IPS-style estimator is sketched after this list) [src-021].
- Hightouch frames contextual bandits as the personalisation extension of standard bandits: instead of finding the single best option for everyone, they use customer context such as purchase history and demographic information to choose for the individual [src-023].
- Hightouch’s MAB article makes the same point from the other side: standard multi-armed bandits find winning strategies faster but still optimise for the average customer, so contextual bandits are needed for Sarah-versus-Marcus style individual differences [src-025].
- Hightouch’s contextual-bandit article makes the operational pipeline explicit: customer data becomes a Customer Feature Matrix, candidate actions are combined across dimensions, a model predicts expected reward for each, and the bandit selects the highest-reward action for that individual (see the selection sketch after this list) [src-026].
- The article emphasises dual learning: audience-level patterns help similar customers and cold-start cases, while individual-level updates preserve exceptions such as a high-LTV customer who ignores discounts [src-026].
- At enterprise scale, contextual-bandit systems must process hundreds of customer features and thousands of possible actions, coordinate multiple campaigns so customers are not overwhelmed, and connect with existing ML models [src-026].
- Braze describes contextual bandits as an added intelligence layer over standard bandits, using signals such as location, device, and past engagement to choose what performs best for a customer in a specific moment [src-027].
- In Braze’s AI Decisioning Studio framing, contextual bandits work alongside Intelligent Selection and predictive scoring rather than replacing the MAB allocation layer [src-027].
- Statsig positions contextual multi-armed bandits as an acceleration tool for shallow A/B decisions where the decision rule is binary, such as picking the winner and killing the loser based on conversion rate [src-031].
- Statsig also cautions that contextual bandits are less appropriate for careful calibration studies where inference quality matters more than traffic shifting [src-031].
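A minimal sketch of the per-round loop described above: a context arrives, one action is chosen, and only that action's reward is observed. The epsilon-greedy rule and the per-arm linear reward model are illustrative assumptions, not a method prescribed by the cited sources.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, dim, epsilon = 3, 5, 0.1

# Per-arm ridge-regression state: A = X^T X + I, b = X^T r.
A = [np.eye(dim) for _ in range(n_arms)]
b = [np.zeros(dim) for _ in range(n_arms)]

def choose(context):
    """Epsilon-greedy over per-arm linear reward estimates."""
    if rng.random() < epsilon:
        return int(rng.integers(n_arms))               # explore
    estimates = [context @ np.linalg.solve(A[a], b[a]) for a in range(n_arms)]
    return int(np.argmax(estimates))                   # exploit

def update(arm, context, reward):
    """Rank-one update of the chosen arm's statistics."""
    A[arm] += np.outer(context, context)
    b[arm] += reward * context

# Toy environment: each arm has a hidden weight vector.
true_w = rng.normal(size=(n_arms, dim))
for t in range(2000):
    x = rng.normal(size=dim)                           # observed context
    a = choose(x)
    r = true_w[a] @ x + rng.normal(scale=0.1)          # partial feedback: only arm a's reward
    update(a, x, r)
```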
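The Lipschitz structure from [src-020] can be written as a smoothness condition on expected payoffs. The notation below (mu for expected payoff, D_X and D_Y for the context and action metrics) is a reconstruction for illustration rather than the paper's exact symbols.

```latex
% Lipschitz condition on expected payoffs (notation illustrative):
% for contexts x, x' with metric D_X and actions a, a' with metric D_Y,
\[
  \lvert \mu(x, a) - \mu(x', a') \rvert
    \;\le\; D_X(x, x') + D_Y(a, a'),
\]
% so a payoff observed at (x', a') constrains the estimate at any
% nearby pair, letting similar queries and similar ads share data.
```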
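One common form of Offline Policy Evaluation is the inverse propensity scoring (IPS) estimator. The sketch below assumes logs of (context, action, reward, logging probability) tuples and a candidate policy that returns action probabilities; it is a generic estimator, not a specific tool from the cited sources.

```python
def ips_value(logs, target_policy):
    """Inverse-propensity-scoring estimate of a candidate policy's value.

    logs: iterable of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy assigned to
          the logged action in that context.
    target_policy: function context -> dict mapping action -> probability.
    """
    total, n = 0.0, 0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) often
        # the candidate policy would have taken the logged action.
        weight = target_policy(context).get(action, 0.0) / logging_prob
        total += weight * reward
        n += 1
    return total / n if n else 0.0
```

Variants such as clipped IPS or doubly robust estimators are often preferred in practice because plain IPS has high variance when logging probabilities are small.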
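The Hightouch-style pipeline (feature matrix in, candidate actions combined across dimensions, a reward model scoring each, highest score selected) reduces to an argmax per customer. Every feature name, action dimension, and scoring rule below is a hypothetical placeholder.

```python
from itertools import product

# Hypothetical action dimensions combined into candidate actions,
# mirroring the "combine actions across dimensions" step.
CHANNELS = ["email", "push"]
OFFERS = ["10pct_off", "free_shipping", "none"]
ACTIONS = list(product(CHANNELS, OFFERS))

def predict_reward(features, action):
    """Stand-in for the trained reward model; any scorer fits here."""
    channel, offer = action
    score = 0.1 * features.get("recent_purchases", 0)
    if offer != "none" and features.get("discount_responsive", False):
        score += 0.2
    if channel == "push" and not features.get("app_installed", False):
        score -= 0.5   # don't push to customers without the app
    return score

def select_action(features):
    """Pick the highest-predicted-reward action for this customer."""
    return max(ACTIONS, key=lambda a: predict_reward(features, a))

customer = {"recent_purchases": 3, "discount_responsive": True,
            "app_installed": False}
print(select_action(customer))   # ('email', '10pct_off')
```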
Related concepts
- Multi-Armed Bandits
- Exploration-Exploitation Trade-off
- Stochastic Bandits
- Adversarial Bandits
- Thompson Sampling
- Lipschitz Contextual Bandits
- Query-Ad-Clustering
- Sponsored Search Ad Ranking
- Treatment Personalisation
- Epsilon-Greedy
- Upper Confidence Bound
- Offline Policy Evaluation
- AI Decisioning
- Personalisation Gap
- Marketing Bandit Optimisation
- Customer Feature Matrix
- Intelligent Selection
- A/B Test Acceleration
- Sequential Testing
Source references
- [src-019] Aleksandrs Slivkins — “Introduction to Multi-Armed Bandits” (2019-04-15; revised 2024-04-03)
- [src-020] Tyler Lu, David Pál, Martin Pál — “Contextual Multi-Armed Bandits” (AISTATS 2010)
- [src-021] Ugur Yildirim — “An Overview of Contextual Bandits” (2024-02-02)
- [src-023] Hightouch — “Under the hood of AI Decisioning, part one: Overcoming the personalization gap”
- [src-025] Hightouch — “Under the hood of AI Decisioning, part three: Multi-armed bandits”
- [src-026] Hightouch — “Under the hood of AI Decisioning, part four: Contextual bandits”
- [src-027] Team Braze — “What is a multi-armed bandit? Smarter experimentation for real-time marketing”
- [src-031] Yuzheng Sun — “Speeding up A/B tests with discipline”