Using influence functions to understand the origin of dangerous behavior in AI models can lead to training safer AI.
Gradient-based attacks have become effective at attacking language models, and the attacks can even transfer between different models.
Evaluating moral beliefs encoded in large language models can reveal inconsistencies and uncertainties, with safety-tuned models showing stronger preferences.
Humans are composed of mechanical parts, but that doesn't mean we are only machines.
Technology can help us increase human freedom through advanced tools like genetic engineering and brain implants.
Understanding our mechanistic origins can lead to self-improvement and greater self-definition, moving us toward a posthuman condition of self-creation.