Your reranker decided which documents would help your LLM before you ever saw a ranked list. That hidden policy shapes every answer your system can produce. The problem: most rerankers are trained on relevance, not on how useful a document is for generation. Those are different objectives, and a passage can score as relevant without improving the answer. RRPO fixes the mismatch by making the real objective explicit: downstream answer quality. Watch how treating reranking as sequential decision-making changes the entire game.














