“Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt

from LessWrong (Curated & Popular)

  • Release date: 2025-11-06 10:45:33
  • Duration: 35:57