AI safety takes • 58 implied HN points • 17 Oct 23
- Researchers are using sparse autoencoders to find interpretable features in neural networks.
- Language models struggle to learn reversals: a model trained on 'A is B' often fails to infer 'B is A', which highlights a limitation in how they generalize from training data.
- Concerns about AI deception are driving efforts to address it, including studies on lie detection in black-box language models.