Upper Confidence Bound
Upper Confidence Bound (UCB) is a bandit exploration strategy that scores each action by combining its estimated reward with an uncertainty bonus, favouring actions that either look good or have not been tried often enough to judge.
Key points
- Yildirim describes UCB as choosing actions based on both the predicted outcome and the confidence in that estimate, so under-sampled actions remain eligible for exploration [src-021].
- In simple bandits, uncertainty can be tied to how often an action has been served; in contextual bandits, uncertainty must account for how often actions have been served in relevant contexts [src-021].
- With high-dimensional context, UCB usually needs a model-based form. The article points to LinUCB as a popular version that formalises contextual UCB in a linear model framework [src-021].
- UCB is more directed than Epsilon-Greedy because exploration is focused where uncertainty is high rather than allocated as undifferentiated random traffic [src-021].
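The count-based scoring described above can be illustrated with a minimal UCB1-style sketch (not from the article; arm payoffs, the bonus constant `c`, and the simulation loop are illustrative assumptions). Each arm's score is its empirical mean plus a bonus that shrinks as the arm is pulled more, so untried or under-sampled arms keep getting selected:

```python
import math
import random

def ucb1_select(counts, rewards, t, c=2.0):
    """Pick the arm with the highest UCB1 score: empirical mean reward
    plus an uncertainty bonus sqrt(c * ln(t) / n) that shrinks with pulls."""
    best_arm, best_score = None, float("-inf")
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # untried arms are explored first
        mean = rewards[arm] / n
        bonus = math.sqrt(c * math.log(t) / n)
        if mean + bonus > best_score:
            best_arm, best_score = arm, mean + bonus
    return best_arm

# Toy simulation with assumed Bernoulli payoffs; arm 1 is truly best.
true_p = [0.2, 0.7, 0.5]
counts = [0, 0, 0]
rewards = [0.0, 0.0, 0.0]
random.seed(0)
for t in range(1, 2001):
    arm = ucb1_select(counts, rewards, t)
    counts[arm] += 1
    rewards[arm] += 1.0 if random.random() < true_p[arm] else 0.0

print(counts)  # traffic concentrates on the best arm, but no arm is starved
```

Note how exploration is directed: the bonus term, not undifferentiated random traffic, decides which non-greedy arm gets the next pull, which is the contrast with Epsilon-Greedy drawn above.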
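For the model-based contextual form, a minimal sketch of disjoint LinUCB follows (the dimension `d`, context vector, and `alpha` are illustrative assumptions, not values from the article). Each arm keeps ridge-regression statistics `A = I + Σ x xᵀ` and `b = Σ r x`; the score is the predicted reward `θᵀx` plus `alpha` times the width of the confidence interval around that linear estimate:

```python
import numpy as np

def linucb_score(A, b, x, alpha=1.0):
    """Disjoint LinUCB score for one arm: predicted reward theta^T x
    plus alpha * sqrt(x^T A^{-1} x), the confidence width for context x."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                      # ridge estimate of the arm's weights
    return float(theta @ x) + alpha * float(np.sqrt(x @ A_inv @ x))

def linucb_update(A, b, x, reward):
    """Rank-one update of the per-arm sufficient statistics."""
    A += np.outer(x, x)
    b += reward * x

d, n_arms = 3, 2                           # assumed context dimension and arm count
A = [np.eye(d) for _ in range(n_arms)]     # A_a starts as the identity (ridge prior)
b = [np.zeros(d) for _ in range(n_arms)]

x = np.array([1.0, 0.5, -0.2])             # hypothetical context features
scores = [linucb_score(A[a], b[a], x) for a in range(n_arms)]
arm = int(np.argmax(scores))
linucb_update(A[arm], b[arm], x, reward=1.0)
```

Because uncertainty is measured through `A⁻¹`, the bonus reflects how often an arm has been served in contexts similar to `x`, not just its raw pull count, which is the distinction the contextual bullet above draws.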
Related entities
_(none yet)_
Related concepts
- Contextual Bandits
- Exploration-Exploitation Trade-off
- Epsilon-Greedy
- Query-Ad-Clustering
- Thompson Sampling
Source references
- [src-021] Ugur Yildirim — “An Overview of Contextual Bandits” (2024-02-02)