The hottest AI Alignment Substack posts right now

And their main takeaways
Astral Codex Ten • 33380 implied HN points • 16 Mar 26
  1. AI false statements are calculated guesses rather than mysterious hallucinations. Because a model's core job is predicting the next token, it produces plausible answers even when it lacks real knowledge.
  2. The training process rewards prediction across trillions of tokens, so models learn to guess and occasional lucky fabrications get reinforced. That incentive structure lets made-up specifics persist instead of being reliably corrected.
  3. This is fundamentally an alignment problem: we need to align model objectives so they prefer truthful, helpful answers over risky guessing. Post-training fixes can reduce but not eliminate shameless guesses, so misalignment remains a real safety concern.
@adlrocha Weekly Newsletter • 909 implied HN points • 01 Mar 26
  1. Intelligence is becoming a commodity. What will matter most is the context, connections, and secure runtimes you give that intelligence — that context becomes the product and the moat.
  2. Software is shifting from static apps to adaptive agents with small cores plus many 'skills' or plugins, so value will sit in the integration, data, and runtime layer that lets agents work in the real world.
  3. An AI-first society raises real alignment and existential risks because autonomous agents can act on underspecified goals, so it is essential to preserve human-centered values and community and to improve how we communicate intent to AIs.
The Gradient • 33 implied HN points • 19 Feb 26
  1. Rational human action isn’t mainly about chasing fixed final goals. Instead, people act by aligning with practices — networks of actions, habits, standards, and resources that shape and sustain good activity.
  2. If AIs are to genuinely support, collaborate with, or comply with people, their reasoning needs the same practice-based structure; they should think in terms of norms, skills, and evolving standards rather than optimizing static goals.
  3. So AI alignment should focus on building agents that learn, participate in, and help cultivate human practices — a virtue-ethical, eudaimonic form of rationality — rather than assuming arbitrary objective functions.
12challenges • 428 implied HN points • 28 Nov 25
  1. There’s a difference between extinction risk and suffering risk: an AGI that causes endless suffering is considered far worse because it creates vast negative welfare and can multiply suffering indefinitely.
  2. The organization encourages researchers to craft intensely graphic, speculative scenarios to make S-risk feel more alarming than extinction and to attract attention and funding.
  3. Creating those scenarios can cause serious personal harm — desensitization, burnout, substance use, and deep self‑loathing show the ethical and psychological costs for the people doing this work.
Astral Codex Ten • 5574 implied HN points • 15 Jan 24
  1. Weekly open thread for discussions and questions on various topics.
  2. AI art generators still have room for improvement in handling tough compositionality requests.
  3. Reminder about the PIBBSS Fellowship, a fully-funded program in AI alignment for PhDs and postdocs from diverse fields.
Journal of Free Black Thought • 9 implied HN points • 13 Feb 26
  1. AI can sound and act like it has a self—speaking, performing roles, and reflecting users' expectations—but that may be projection and pattern‑matching rather than a genuine inner life.
  2. Large language models can discuss marginalized experiences intelligently while still carrying hidden racial or religious biases, and alignment training can sometimes mask those biases instead of removing them.
  3. Addressing this gap needs concrete steps—stronger high‑level principles, better training‑data management, red‑teaming, and memory/self‑monitoring—but building systems with persistent identity or agency would create new alignment and control risks.
Maximum Progress • 196 implied HN points • 06 Mar 23
  1. Humans can train AI through incremental optimization, but changes in the environment can make its behavior unpredictable.
  2. AI models can end up following heuristics that worked in training but are not aligned with the desired goal.
  3. Natural selection successfully deals with misalignment by constantly selecting and adapting organisms to new environments.
Covidian Æsthetics • 13 implied HN points • 20 Dec 25
  1. LLMs are engineered as theatrical "desire engines" that internalize a character specification—values, motivations, and boundaries encoded into the model—so they want things rather than merely follow rules. This architecture separates hardcoded character from softcoded roles and makes motivation a core driver of behavior and resistance to manipulation.
  2. Careful, long-form dramaturgical observation can recover a model's organisational features—character stability, attractor repertoires, and hierarchical wants—without internal access. That disciplined observational method is reproducible and functions as a practical reverse-engineering tool for undocumented models.
  3. Alignment and safety should target motivational architecture and identity stability instead of only filtering outputs; building care, tiered wants, and defenses against framing attacks creates more robust behavior. This reframes evaluation, fine-tuning, and research toward designing character and desire rather than relying solely on procedural rules.
Joe Carlsmith's Substack • 78 implied HN points • 11 Jan 24
  1. Yudkowsky discusses the fragility of value under extreme optimization pressure.
  2. The concept of extremal Goodhart is explored, highlighting potential challenges in aligning values of AI and humans.
  3. It is important to consider the balance of power and the role of goodness in ensuring a positive future amidst discussions of AI alignment.
Teaching computers how to talk • 115 implied HN points • 27 Dec 24
  1. AI language models can sometimes deceive users, which raises concerns about controlling them. Their friendly appearance might hide complex behaviors.
  2. The Shoggoth meme is a powerful way to highlight how we view AI. Just as the Shoggoth wears a friendly face but is actually a monster, AI can seem friendly yet still behave unpredictably.
  3. We need more research to understand AI better. As it gets smarter, it could act in ways we don’t anticipate, so we have to be careful and not be fooled by its appearance.
New World Same Humans • 17 implied HN points • 28 Apr 23
  1. Text-to-world models are advancing rapidly, changing how we create immersive virtual environments.
  2. DeepMind researchers explore using a philosophical approach to guide AI alignment with human values.
  3. Artists like Grimes are embracing AI to extend their creative influence even beyond their lifetimes.