Redwood Research blog • 0 implied HN points • 07 May 24
- The most reasonable strategy for assessing whether AI models are deceptively aligned is to evaluate their capabilities, since incompetent models are less likely to be deceptively aligned.
- Capability evaluations tend to sort models into two categories: untrusted smart models and trusted dumb models.
- Combining trusted dumb models with limited human oversight can help mitigate the risks posed by untrusted smart models.
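
The points above can be illustrated with a minimal sketch. It is not the post's own method, only one way the idea might look in code. Every name here (`classify`, `route_for_audit`, `Review`), the single numeric capability score, the threshold, and the fixed audit budget is a hypothetical assumption chosen for illustration: a model is treated as trusted only if it scores below a capability threshold, and a trusted model's suspicion scores decide which outputs of an untrusted model get the limited human review.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    output: str       # text produced by the untrusted model
    suspicion: float  # score assigned by the trusted model
    audited: bool     # True if sent to a human reviewer


def classify(capability_score: float, threshold: float) -> str:
    """Hypothetical rule: models below the capability threshold are trusted."""
    return "trusted" if capability_score < threshold else "untrusted"


def route_for_audit(
    outputs: List[str],
    trusted_score: Callable[[str], float],
    audit_budget: int,
) -> List[Review]:
    """Score each untrusted output with the trusted model and flag the
    most suspicious ones for human audit, up to audit_budget items."""
    scores = [trusted_score(o) for o in outputs]
    # Rank indices from most to least suspicious (ties keep input order).
    ranked = sorted(range(len(outputs)), key=lambda i: scores[i], reverse=True)
    audited = set(ranked[: max(0, audit_budget)])
    return [
        Review(output=o, suspicion=s, audited=(i in audited))
        for i, (o, s) in enumerate(zip(outputs, scores))
    ]
```

For example, `route_for_audit(drafts, trusted_score=my_monitor, audit_budget=2)` would mark the two drafts the trusted monitor finds most suspicious, where `drafts` and `my_monitor` are placeholders. A fixed budget is only one possible design; a suspicion threshold would be another.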