Offline Policy Evaluation
Offline policy evaluation is the practice of estimating how a new decision policy would have performed using historical logged data, before deploying that policy live.
Key points
- Yildirim treats offline policy evaluation as a necessary companion to contextual bandits because teams often need to evaluate a candidate policy using logged data rather than only live traffic [src-021].
- Causal-inference approaches such as inverse propensity scoring and doubly robust estimation estimate the counterfactual performance of a different policy from logged data, but require knowing the logging policy's action probabilities (propensities) [src-021]; see the first sketch after this list.
- Sampling/replay approaches evaluate a new policy by replaying logged examples and keeping only the rounds where the new policy's chosen action matches the logged action; when the logging policy was non-uniform, adjustments such as rejection sampling or propensity weighting are needed [src-021]; see the second sketch after this list.
- The operational takeaway is that logging must be metadata-rich: logged bandit or A/B-test data should preserve the action propensities, context, chosen action, and observed outcome; otherwise offline evaluation becomes biased or impossible [src-021].
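
A minimal sketch of the causal-inference estimators mentioned above, assuming each logged record carries the context, the logged action, the observed reward, and the logging policy's propensity for that action. The names `LoggedRecord`, `ips_estimate`, `dr_estimate`, `target_prob`, and `reward_model` are illustrative and not taken from the source article.

```python
# Sketch of inverse propensity scoring (IPS) and doubly robust (DR) estimators.
# Assumes logged records preserve the fields listed in the key points above.
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class LoggedRecord:
    context: np.ndarray   # features observed at decision time
    action: int           # action the logging policy actually took
    reward: float         # observed outcome for that action
    propensity: float     # P(action | context) under the logging policy

def ips_estimate(
    records: Sequence[LoggedRecord],
    target_prob: Callable[[np.ndarray, int], float],
) -> float:
    """IPS: reweight each logged reward by the ratio of the target policy's
    action probability to the logging policy's propensity."""
    return float(np.mean([
        target_prob(r.context, r.action) / r.propensity * r.reward
        for r in records
    ]))

def dr_estimate(
    records: Sequence[LoggedRecord],
    target_prob: Callable[[np.ndarray, int], float],
    reward_model: Callable[[np.ndarray, int], float],
    n_actions: int,
) -> float:
    """Doubly robust: use a reward model as a baseline, then apply an IPS
    correction on the model's residual for the logged action."""
    values = []
    for r in records:
        # Model-based value of the target policy in this context.
        direct = sum(
            target_prob(r.context, a) * reward_model(r.context, a)
            for a in range(n_actions)
        )
        # Propensity-weighted correction using the observed reward.
        w = target_prob(r.context, r.action) / r.propensity
        values.append(direct + w * (r.reward - reward_model(r.context, r.action)))
    return float(np.mean(values))
```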
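
A minimal sketch of the replay-style approach, reusing the `LoggedRecord` type from the previous sketch. Only rounds where the candidate policy agrees with the logged action are kept, and each kept round is weighted by the inverse of its logged propensity so a non-uniform logging policy does not bias the estimate; `replay_estimate` and `target_action` are illustrative names.

```python
# Sketch of replay evaluation with propensity weighting for matched rounds.
from typing import Callable, Sequence
import numpy as np

def replay_estimate(
    records: Sequence["LoggedRecord"],          # as defined in the sketch above
    target_action: Callable[[np.ndarray], int],
) -> float:
    """Average reward over rounds where the candidate policy's action matches
    the logged action, inverse-propensity weighted and self-normalized."""
    weighted_rewards, weights = [], []
    for r in records:
        if target_action(r.context) == r.action:
            w = 1.0 / r.propensity              # uniform logging => constant weight
            weighted_rewards.append(w * r.reward)
            weights.append(w)
    if not weights:
        raise ValueError("no logged rounds matched the candidate policy")
    return float(np.sum(weighted_rewards) / np.sum(weights))
```

With a uniform logging policy the weights are constant and this reduces to a plain average over matched rounds; the weighting only matters when some actions were logged more often than others.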
Related entities
_(none yet)_
Related concepts
_(none yet)_
Source references
- [src-021] Ugur Yildirim — “An Overview of Contextual Bandits” (2024-02-02)