“Interpretability Will Not Reliably Find Deceptive AI” by Neel Nanda

by LessWrong (Curated & Popular)

  • Release date: 2025-05-05 00:45:22
  • Length: 13:15