“Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt

by LessWrong (Curated & Popular)

  • Release date: 2025-11-06 10:45:33
  • Length: 35:57