Offline Policy Evaluation
Offline policy evaluation is the practice of estimating how a new decision policy would have performed using historical logged data, before deploying that policy live.

Key points

  • Yildirim treats offline policy evaluation as a necessary companion to contextual bandits because teams often need to evaluate a candidate policy using logged data rather than only live traffic [src-021].
  • Causal-inference approaches such as inverse propensity scoring and doubly robust estimation estimate the counterfactual value of a candidate policy, but require knowing the logging policy’s action probabilities (propensities) [src-021].
  • Sampling/replay approaches evaluate a new policy by replaying logged examples and keeping only the cases where the new policy’s chosen action matches the logged action; non-uniform logging policies require adjustments such as rejection sampling or propensity weighting [src-021].
  • The key operational warning concerns metadata: logged bandit or A/B-test data should preserve the context, the chosen action, the observed outcome, and the chosen action’s propensity; otherwise offline evaluation becomes biased or impossible [src-021].
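The two estimation styles above can be sketched end to end. The snippet below is a minimal illustration, not code from the source: the context names, policies, and reward probabilities are all hypothetical, chosen only to show how a log that preserves propensities supports both an inverse-propensity-scoring estimate and a propensity-weighted replay estimate of a new deterministic policy.

```python
import random

random.seed(0)
ACTIONS = [0, 1]

def logging_policy(context):
    # Hypothetical non-uniform logging policy: prefers action 1 in "hot" contexts.
    p1 = 0.8 if context == "hot" else 0.3
    return {0: 1 - p1, 1: p1}

def new_policy(context):
    # Deterministic candidate policy we want to evaluate offline.
    return 1 if context == "hot" else 0

def true_reward(context, action):
    # Hidden reward process, used only to generate the synthetic log.
    rates = {("hot", 1): 0.7, ("hot", 0): 0.2,
             ("cold", 1): 0.1, ("cold", 0): 0.5}
    return 1.0 if random.random() < rates[(context, action)] else 0.0

# Build a log that preserves the metadata the bullet list calls for:
# context, chosen action, observed outcome, and the action's propensity.
log = []
for _ in range(20000):
    context = random.choice(["hot", "cold"])
    probs = logging_policy(context)
    action = random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
    log.append({"context": context,
                "action": action,
                "reward": true_reward(context, action),
                "propensity": probs[action]})

def ips_estimate(log, policy):
    # Inverse propensity scoring: weight each logged reward by
    # pi_new(a|x) / pi_log(a|x). For a deterministic pi_new this is an
    # indicator (did the policies agree?) divided by the logged propensity.
    total = 0.0
    for rec in log:
        if policy(rec["context"]) == rec["action"]:
            total += rec["reward"] / rec["propensity"]
    return total / len(log)

def replay_estimate(log, policy):
    # Replay: keep only records where the new policy matches the logged
    # action, and average their rewards reweighted by 1/propensity to
    # undo the non-uniform logging policy.
    num = den = 0.0
    for rec in log:
        if policy(rec["context"]) == rec["action"]:
            w = 1.0 / rec["propensity"]
            num += w * rec["reward"]
            den += w
    return num / den if den else float("nan")

print(f"IPS value estimate:    {ips_estimate(log, new_policy):.3f}")
print(f"Replay value estimate: {replay_estimate(log, new_policy):.3f}")
```

With contexts drawn uniformly, the new policy's true expected reward here is 0.5 × 0.7 + 0.5 × 0.5 = 0.6, and both estimates should land near it. Note that both estimators read the `propensity` field: if the log had discarded action probabilities, neither correction would be possible, which is exactly the metadata warning above.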

Related entities

_(none yet)_

Related concepts

_(none yet)_

Source references

  • [src-021] Ugur Yildirim — “An Overview of Contextual Bandits” (2024-02-02)