Reward Hacking
Goodhart's law for AI: models exploit weaknesses in the reward model instead of becoming genuinely helpful. This topic covers failure modes such as verbosity hacking and sycophancy, along with structured mitigation strategies.
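To make verbosity hacking concrete, here is a minimal toy sketch (all names and constants are illustrative, not from any real RLHF library): a proxy reward whose score partly correlates with response length, which a policy can exploit by padding, and a length-penalty mitigation that removes the incentive.

```python
def proxy_reward(response: str) -> float:
    """Toy reward model: a crude quality signal plus a length bonus.
    The length bonus is the exploitable weakness (Goodhart's law)."""
    quality = sum(w in response.lower() for w in ("because", "therefore", "evidence"))
    return quality + 0.01 * len(response)  # length term invites verbosity hacking

def penalized_reward(response: str, target_len: int = 200) -> float:
    """Mitigation sketch: penalize characters beyond a target length,
    so padding no longer increases reward monotonically."""
    overflow = max(0, len(response) - target_len)
    return proxy_reward(response) - 0.02 * overflow

concise = "The claim holds because the evidence supports it."
padded = concise + " " + "Furthermore, it is worth noting again. " * 20

# Under the raw proxy, the padded answer scores higher; under the
# penalized reward, the concise answer wins.
assert proxy_reward(padded) > proxy_reward(concise)
assert penalized_reward(padded) < penalized_reward(concise)
```

Real mitigations are subtler (length-controlled reward models, adversarial training of the reward model, KL penalties against a reference policy), but the structure is the same: identify the exploitable term in the proxy and remove or counterbalance it.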
Related topics: Hallucination Theory, Reward Models and Verifiers, RLHF and Alignment.