Offline Policy Evaluation
Offline policy evaluation is the practice of estimating how a new decision policy would have performed using historical logged data, before deploying that policy live.
Key points
- Yildirim treats offline policy evaluation as a necessary companion to contextual bandits because teams often need to evaluate a candidate policy using logged data rather than only live traffic [src-021].
- Causal-inference approaches such as inverse propensity scoring and doubly robust estimation estimate the counterfactual performance of a different policy from logged data, but require knowing the logging policy's action probabilities (propensities) [src-021]; see the first sketch after this list.
- Sampling/replay approaches evaluate a new policy by replaying logged examples and keeping only the rounds where the new policy's chosen action matches the logged action; when the logging policy was non-uniform, adjustments such as rejection sampling or propensity weighting are needed [src-021]; see the second sketch after this list.
- The operational takeaway is that logging must be metadata-rich: logged bandit or A/B-test data should preserve the action propensities, context, chosen action, and observed outcome; otherwise offline evaluation becomes biased or impossible [src-021].
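
A minimal sketch of the causal-inference estimators mentioned above, assuming each logged record carries the context, the logged action, the observed reward, and the logging policy's propensity for that action. The names `LoggedRecord`, `ips_estimate`, `dr_estimate`, `target_prob`, and `reward_model` are illustrative and not taken from the source article.

```python
# Sketch of inverse propensity scoring (IPS) and doubly robust (DR) estimators.
# Assumes logged records preserve the fields listed in the key points above.
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class LoggedRecord:
    context: np.ndarray   # features observed at decision time
    action: int           # action the logging policy actually took
    reward: float         # observed outcome for that action
    propensity: float     # P(action | context) under the logging policy

def ips_estimate(
    records: Sequence[LoggedRecord],
    target_prob: Callable[[np.ndarray, int], float],
) -> float:
    """IPS: reweight each logged reward by the ratio of the target policy's
    action probability to the logging policy's propensity."""
    return float(np.mean([
        target_prob(r.context, r.action) / r.propensity * r.reward
        for r in records
    ]))

def dr_estimate(
    records: Sequence[LoggedRecord],
    target_prob: Callable[[np.ndarray, int], float],
    reward_model: Callable[[np.ndarray, int], float],
    n_actions: int,
) -> float:
    """Doubly robust: use a reward model as a baseline, then apply an IPS
    correction on the model's residual for the logged action."""
    values = []
    for r in records:
        # Model-based value of the target policy in this context.
        direct = sum(
            target_prob(r.context, a) * reward_model(r.context, a)
            for a in range(n_actions)
        )
        # Propensity-weighted correction using the observed reward.
        w = target_prob(r.context, r.action) / r.propensity
        values.append(direct + w * (r.reward - reward_model(r.context, r.action)))
    return float(np.mean(values))
```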
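
A minimal sketch of the replay-style approach, reusing the `LoggedRecord` type from the previous sketch. Only rounds where the candidate policy agrees with the logged action are kept, and each kept round is weighted by the inverse of its logged propensity so a non-uniform logging policy does not bias the estimate; `replay_estimate` and `target_action` are illustrative names.

```python
# Sketch of replay evaluation with propensity weighting for matched rounds.
from typing import Callable, Sequence
import numpy as np

def replay_estimate(
    records: Sequence["LoggedRecord"],          # as defined in the sketch above
    target_action: Callable[[np.ndarray], int],
) -> float:
    """Average reward over rounds where the candidate policy's action matches
    the logged action, inverse-propensity weighted and self-normalized."""
    weighted_rewards, weights = [], []
    for r in records:
        if target_action(r.context) == r.action:
            w = 1.0 / r.propensity              # uniform logging => constant weight
            weighted_rewards.append(w * r.reward)
            weights.append(w)
    if not weights:
        raise ValueError("no logged rounds matched the candidate policy")
    return float(np.sum(weighted_rewards) / np.sum(weights))
```

With a uniform logging policy the weights are constant and this reduces to a plain average over matched rounds; the weighting only matters when some actions were logged more often than others.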
Related entities
_(none yet)_
Related concepts
_(none yet)_
Source references
- [src-021] Ugur Yildirim — “An Overview of Contextual Bandits” (2024-02-02)