“Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt
by LessWrong (Curated & Popular)
Release date: 2025-11-06 10:45:33
Length: 35:57