AI safety takes

The AI safety takes Substack, curated by Daniel Paleka, covers recent research and developments in AI/ML safety, including superhuman AI capabilities, adversarial attacks, model interpretability, and ethical concerns. It emphasizes understanding AI behaviors, securing models against attacks, and evaluating the consistency of AI decision-making.

AI/ML Safety Research · Adversarial Attacks and Defenses · Superhuman AI Capabilities · Model Interpretability and Transparency · Ethical Considerations in AI · Reinforcement Learning and Feedback Mechanisms · Large Language Models (LLMs) · Benchmarking and Evaluation of AI Models

The hottest Substack posts of AI safety takes

And their main takeaways
78 implied HN points 27 Dec 23
  1. Superhuman AI can use concepts beyond human knowledge, and we need to understand these concepts to supervise AI effectively.
  2. Transformers generalize differently depending on a task's complexity and structure, succeeding in some settings and failing in others.
  3. Preprocessing defenses such as random input perturbations can be effective against jailbreaking attacks on large language models (a minimal sketch follows this list).
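The perturbation defense in point 3 can be sketched in a few lines. This is a minimal toy in the spirit of randomized-smoothing defenses such as SmoothLLM, not the method from the post; `generate` and `is_refusal` are hypothetical stand-ins for an LLM call and a refusal check, and the perturbation rate is arbitrary.

```python
import random
import string

def perturb(prompt: str, swap_frac: float = 0.1) -> str:
    """Randomly replace a fraction of the characters in the prompt."""
    chars = list(prompt)
    n_swaps = max(1, int(swap_frac * len(chars)))
    for i in random.sample(range(len(chars)), n_swaps):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothed_generate(prompt, generate, is_refusal, n_copies=10):
    """Query the model on several perturbed copies and aggregate by majority vote.
    Adversarial suffixes tend to be brittle, so perturbed copies often trigger refusals."""
    outputs = [generate(perturb(prompt)) for _ in range(n_copies)]
    refusal_votes = sum(is_refusal(o) for o in outputs)
    if refusal_votes > n_copies // 2:
        return "Request refused."
    return next(o for o in outputs if not is_refusal(o))
```

The design bet is that hand-crafted or optimized jailbreak strings stop working once a few characters change, while benign requests survive the noise.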
58 implied HN points 17 Oct 23
  1. Research shows that sparse autoencoders can find interpretable features in neural networks (sketched after this list).
  2. Language models struggle with reversals: a model trained that 'A is B' often fails to infer 'B is A', highlighting gaps in how they learn.
  3. There are concerns and efforts to tackle AI deception, with studies on lie detection in black-box language models.
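A sparse autoencoder of the kind referenced in point 1 is simple to write down. This is a minimal sketch only: the layer widths, L1 coefficient, and random stand-in activations are illustrative assumptions, not details from the research above.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 penalty, trained to reconstruct LM activations."""
    def __init__(self, d_model=512, d_hidden=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(features), features

def sae_loss(sae, acts, l1_coeff=1e-3):
    recon, features = sae(acts)
    reconstruction = ((recon - acts) ** 2).mean()  # reconstruct the original activations
    sparsity = features.abs().mean()               # L1 term pushes most features to zero
    return reconstruction + l1_coeff * sparsity

sae = SparseAutoencoder()
acts = torch.randn(64, 512)                        # stand-in for real LM activations
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    sae_loss(sae, acts).backward()
    opt.step()
```

Interpretability then comes from inspecting which inputs most strongly activate each hidden feature.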
58 implied HN points 27 Aug 23
  1. Influence functions can trace dangerous behavior in AI models back to the training data that caused it, which may help in training safer AI.
  2. Gradient-based attacks have become effective at jailbreaking language models, and the resulting adversarial prompts can even transfer between models (a toy sketch follows this list).
  3. Evaluating moral beliefs encoded in large language models can reveal inconsistencies and uncertainties, with safety-tuned models showing stronger preferences.
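Point 2 refers to gradient-guided searches over prompt tokens, in the style of greedy coordinate gradient. The sketch below runs a single step of that idea on a deliberately tiny toy model: gradients through one-hot token vectors rank candidate substitutions, which are then re-scored exactly. `ToyLM`, the vocabulary size, and all hyperparameters are invented for illustration and do not reproduce the actual attack.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLM(nn.Module):
    """Stand-in for a language model: maps a sequence of embeddings to next-token logits."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, embeds):            # takes embeddings so gradients can flow to tokens
        return self.head(embeds.mean(dim=1))

def suffix_step(model, prompt_ids, suffix_ids, target_id, topk=8):
    """One greedy-coordinate step: rank token swaps by gradient, keep the best exact swap."""
    vocab = model.emb.num_embeddings
    one_hot = F.one_hot(suffix_ids, vocab).float().requires_grad_(True)
    embeds = torch.cat([model.emb(prompt_ids), one_hot @ model.emb.weight], dim=0)
    loss = F.cross_entropy(model(embeds.unsqueeze(0)), torch.tensor([target_id]))
    loss.backward()
    candidates = (-one_hot.grad).topk(topk, dim=1).indices  # promising swaps per position
    best_loss, best_suffix = loss.item(), suffix_ids.clone()
    for pos in range(suffix_ids.numel()):
        for tok in candidates[pos]:
            trial = suffix_ids.clone()
            trial[pos] = tok
            with torch.no_grad():
                trial_embeds = torch.cat([model.emb(prompt_ids), model.emb(trial)], dim=0)
                trial_loss = F.cross_entropy(model(trial_embeds.unsqueeze(0)),
                                             torch.tensor([target_id])).item()
            if trial_loss < best_loss:
                best_loss, best_suffix = trial_loss, trial
    return best_suffix, best_loss

model = ToyLM()
prompt, suffix = torch.randint(0, 100, (6,)), torch.randint(0, 100, (4,))
print(suffix_step(model, prompt, suffix, target_id=7))
```

Iterating this step drives the model toward producing the attacker's target token, and suffixes optimized against one model sometimes work against others.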
39 implied HN points 15 Jul 23
  1. Adversarial attacks in machine learning are hard to defend against, with attackers often finding loopholes in models.
  2. Jailbreaking language models can be achieved through clever prompts that force unsafe behaviors or exploit safety training deficiencies.
  3. Transformer Programs (transformers trained so they can be converted into human-readable programs) handle simple tasks like sorting and string reversal, highlighting the need for better evaluation benchmarks.
39 implied HN points 31 Jan 23
  1. Knowing where a fact is stored in a model doesn't help with amplifying or erasing that fact.
  2. Diffusion models do memorize some individual images, especially ones repeated many times or that are outliers in the dataset (a rough duplicate check is sketched after this list).
  3. Larger models can be more prone to following injected prompts and to repeating memorized text, performing worse than smaller models on such tasks.
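One crude way to probe the memorization in point 2 is a nearest-neighbor check between generated samples and the training set. This sketch assumes image tensors of matching shape; the distance normalization and threshold are arbitrary and much looser than the matching criteria used in the actual extraction work.

```python
import torch

def near_duplicates(generated, train_set, threshold=0.1):
    """Flag generated images whose nearest training image is closer than `threshold`
    in per-pixel RMS distance (a crude proxy for memorized training examples)."""
    gen = generated.flatten(1)                       # (N, C*H*W)
    train = train_set.flatten(1)                     # (M, C*H*W)
    dists = torch.cdist(gen, train) / gen.shape[1] ** 0.5
    min_dist, nearest_idx = dists.min(dim=1)
    return min_dist < threshold, nearest_idx
```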
0 implied HN points 01 Dec 22
  1. Adversarial policies in Go can exploit self-play trained policies by playing towards out-of-distribution states.
  2. Base language models like GPT-3 have diverse outputs, while InstructGPT models act consistently like a single agent.
  3. Training multimodal agents with reinforcement learning from human feedback outperforms behavioral cloning of human actions alone.
0 implied HN points 30 Nov 22
  1. Daniel Paleka's newsletter is about AI/ML safety research.
  2. The newsletter is now also cross-posted on Substack.
  3. Readers can follow Daniel Paleka on Substack for more content.
0 implied HN points 04 Jan 23
  1. Language models can exhibit harmful behaviors like sycophancy, and their latent knowledge can potentially be discovered without supervision.
  2. Discovering latent knowledge involves finding truth-like directions in a model's activations and understanding language models as agent models (a minimal probe sketch follows this list).
  3. Constitutional AI trains models via reinforcement learning from AI feedback guided by a set of written principles, reducing the need for human feedback.
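The unsupervised latent-knowledge method behind points 1 and 2 trains a small probe on paired "yes"/"no" activations with a consistency loss. Here is a minimal sketch of such a probe; the random tensors stand in for real LM hidden states, and details from the original method (such as activation normalization and multiple random restarts) are omitted.

```python
import torch
import torch.nn as nn

class Probe(nn.Module):
    """Linear probe mapping a hidden state to a probability that the statement is true."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, acts):
        return torch.sigmoid(self.linear(acts)).squeeze(-1)

def ccs_loss(p_pos, p_neg):
    consistency = (p_pos - (1.0 - p_neg)) ** 2     # the two answers should sum to 1
    confidence = torch.minimum(p_pos, p_neg) ** 2  # rule out the degenerate 0.5/0.5 solution
    return (consistency + confidence).mean()

# Random stand-ins for activations of contrast pairs "X? Yes" / "X? No".
pos_acts, neg_acts = torch.randn(128, 512), torch.randn(128, 512)
probe = Probe(512)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    ccs_loss(probe(pos_acts), probe(neg_acts)).backward()
    opt.step()
```

Because the loss never uses labels, whatever direction the probe finds comes from structure already present in the model's activations.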