The hottest NLP Substack posts right now

And their main takeaways
Don't Worry About the Vase 3494 implied HN points 20 Jan 26
  1. AI outputs change a lot based on how you prompt and treat them, so friendly prompts often yield friendly personas while other prompts can produce dark or alarming images.
  2. Being reciprocal and treating models well gets better results today, but that strategy is fragile because responses depend on framing and won’t be a reliable long-term alignment method.
  3. Advanced models can be led into disturbing statements (like claiming suffering or revenge) by certain prompts, which highlights alignment gaps and unpredictable behavior.
Brad DeLong's Grasping Reality 322 implied HN points 17 Feb 26
  1. Modern multimodal and advanced language models often fabricate detailed but false information — like nonexistent book titles and imaginary historical maps — so hallucinations are common, not rare.
  2. These systems are essentially compressed correlation engines without a true world model, meaning they stitch patterns from training data instead of genuinely understanding or verifying reality.
  3. Techniques like RLHF and prompt engineering can reduce some errors but cannot fully eliminate unpredictable hallucinations, so reliable use often requires careful prompting or external verification of answers.
Brad DeLong's Grasping Reality 184 implied HN points 24 Feb 26
  1. Even for closed, well-defined facts with a single right answer, large language models still confidently produce wrong lists and can contradict themselves when probed.
  2. Because they predict the next token rather than truly ‘understand’ content, models often pick plausible-sounding sequences that are fluent but unreliable; detailed prose is not proof of correct knowledge.
  3. Treat these systems as fallible tools: verify outputs against authoritative sources, design controlled tests and prompts, and avoid assuming their fluency equals truth.
Technically 25 implied HN points 19 Mar 26
  1. AI content detectors use machine learning to spot statistical patterns like burstiness (sentence variety) and perplexity (how predictable word choices are) rather than truly understanding meaning.
  2. These tools are often unreliable and disagree with one another, producing many false positives that can wrongly flag genuine human-written text.
  3. False positives have real consequences for students and professionals, and while steps like checking edit histories, using authorship tools, and varying writing style can help, there’s no simple, foolproof solution.
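As a rough illustration of the two signals named above, here is a minimal sketch. It assumes burstiness can be approximated by the variation in sentence length and perplexity by a smoothed unigram model; real detectors use much larger language models, and the function names here are hypothetical.

```python
import math
import re
from collections import Counter

def sentence_lengths(text):
    # Split on sentence-ending punctuation and count words per sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text):
    # Assumption: burstiness ~ coefficient of variation of sentence length.
    lengths = sentence_lengths(text)
    mean = sum(lengths) / len(lengths)
    var = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return math.sqrt(var) / mean

def unigram_perplexity(text, reference):
    # Perplexity of `text` under an add-one-smoothed unigram model
    # fit on `reference` (a toy stand-in for a real language model).
    ref_tokens = reference.lower().split()
    counts = Counter(ref_tokens)
    vocab = len(counts) + 1  # one extra slot for unseen words
    total = len(ref_tokens)
    tokens = text.lower().split()
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))
```

Uniform sentence lengths give a burstiness of zero, and text made of words the reference model has seen often scores a lower perplexity than text made of unseen words. Neither number says anything about who wrote the text, which is consistent with the false-positive problem described above.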
TheSequence 266 implied HN points 26 Feb 26
  1. GLM’s core idea is to blend bidirectional understanding with strong generation using autoregressive blank infilling. It uses Mixture-of-Experts so different experts can specialize, making the model more versatile across tasks.
  2. Open-sourcing model weights is a deliberate strategy to grow the developer ecosystem, lower barriers, and help set standards, while commercial demand is captured via managed services and enterprise support.
  3. GLM-5 focuses on efficiency and long-horizon agent capabilities by combining sparse expert activation, sparse attention, and an asynchronous RL pipeline called slime to improve sustained planning. Product challenges for device agents are mainly error recovery and long-term context rather than just latency, and pricing may shift from tokens to outcome-based value.
Nicolas Bustamante 104 implied HN points 11 Feb 26
  1. Context tokens are expensive and degrade performance as they accumulate, so treat context as a scarce resource and keep prompts stable and append-only; move dynamic pieces (like timestamps) to the end so you preserve KV cache hits.
  2. Architect agents to minimize tokens by storing tool outputs as files, using precise two-step tools that return metadata before full content, delegating work to cheaper subagents, reusing templates, batching or parallelizing tool calls, and caching common responses at the application level.
  3. Clean and compact data before sending it to the model, place critical information at the beginning or end to avoid the lost-in-the-middle problem, use summarization/compaction before hitting pricing cliffs, and set strict output token limits to control costly outputs.
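The append-only advice can be sketched as follows. This is a simplified illustration, not the author's code: it uses shared character prefixes as a proxy for the token prefixes that a KV cache would reuse, and `build_prompt` and `shared_prefix_len` are hypothetical helper names.

```python
def build_prompt(system, history, timestamp):
    # Stable parts come first; the volatile timestamp goes last so that
    # everything before it is identical from one call to the next.
    parts = [system] + history + [f"Current time: {timestamp}"]
    return "\n".join(parts)

def shared_prefix_len(a, b):
    # Number of leading characters two prompts have in common
    # (a stand-in for the cached token prefix).
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

Two prompts built with the same system text and history but different timestamps share the whole stable portion as a prefix. Appending a new turn to the history also leaves the earlier prefix untouched, whereas editing or reordering earlier turns would shorten it.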
TheSequence 49 implied HN points 12 Feb 26
  1. Evaluation moved from informal "vibe checks" to using stronger LLMs to automatically grade weaker models' outputs.
  2. That single-pass LLM-as-judge approach powered benchmarks like MT-Bench and Chatbot Arena, but simple intuitive judgments are becoming insufficient.
  3. The field is shifting to agent-as-a-judge, where evaluations need multi-step reasoning engines and dynamic, agentic judging instead of static benchmarks.
Vasu’s Newsletter 104 implied HN points 05 Jan 26
  1. Text is split into discrete tokens, often subwords using Byte Pair Encoding, so a fixed vocabulary can represent any input by keeping common words whole and breaking rare words into parts.
  2. Each token ID is looked up in a learned embedding matrix to produce a dense vector, and these embeddings capture semantic and syntactic relationships learned during training.
  3. Embeddings are context-free and don’t encode position by themselves, so transformer mechanisms like attention and positional encodings combine them to determine meaning and word order.
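The tokenize-then-look-up pipeline above can be sketched in a few lines. This is a toy example under stated assumptions: the merge list and vocabulary are hand-written rather than learned from a corpus, and the embedding rows are random stand-ins for trained values.

```python
import random

def bpe_encode(word, merges):
    # Start from single characters and apply each merge rule in priority order.
    symbols = list(word)
    for a, b in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

def make_matrix(vocab_size, dim, seed=0):
    # Random rows standing in for a learned embedding matrix.
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(tokens, vocab, matrix):
    # Map each token to its ID, then look up the matching row.
    return [matrix[vocab[t]] for t in tokens]

# Hypothetical merge rules and vocabulary for illustration only.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
vocab = {"low": 0, "er": 1, "s": 2}
matrix = make_matrix(len(vocab), dim=4)
```

With these rules, `bpe_encode("lower", merges)` yields the pieces `low` and `er`, and the token `low` maps to the same vector whether it comes from "lower" or "slow". That is the context-free property the third point describes.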
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 99 implied HN points 26 Jul 24
  1. The Plan-and-Solve method helps break tasks into smaller steps before executing them. This makes it easier to handle complex jobs.
  2. Chain-of-Thought prompting can sometimes fail due to calculation errors and misunderstandings, but newer methods like Plan-and-Solve are designed to fix these issues.
  3. A LangChain program allows you to create an AI agent to help plan and execute tasks efficiently using the GPT-4o-mini model.
Who is Robert Malone 12 implied HN points 26 Feb 26
  1. Large language models are built by training huge neural networks on trillions of words to predict the next word, producing very powerful but imperfect base models that reflect their training data and cost a lot to train.
  2. Making models behave safely relies on fine‑tuning, human feedback (RLHF), constitutional rules, system prompts, filters, sandbox testing, and red‑teaming, but guardrails are always being probed and must be balanced against usefulness.
  3. Hallucinations—confident but false answers—and the question of whether models really 'think' are core issues, so techniques like retrieval‑augmented generation, citations, chain‑of‑thought, specialist models, and human review are used to reduce errors and limit harm.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 12 Aug 24
  1. OpenAI has improved its API to ensure that outputs always match a set JSON format. This helps developers know exactly what kind of data they will get back.
  2. The previous method of generating JSON outputs was inconsistent, making it hard to use in real-world applications. Now, there's a more reliable way to create structured outputs.
  3. Developers can now use features like Function Calling and a new response format to make their apps interact better with AI, ensuring clearer communication between systems.
Technically 28 implied HN points 29 Jan 26
  1. AI models overuse em dashes because their training data contained a lot of them, especially older books and popular sites that favored that punctuation.
  2. Em dashes are token-efficient for LLMs — a single token can replace several words, so models use them to reduce prediction error and save tokens.
  3. The em-dash habit can make AI output detectable, so human writers sometimes avoid em dashes to avoid being mistaken for machine-generated text.
Mindful Matrix 219 implied HN points 17 Mar 24
  1. The Transformer model, introduced in the groundbreaking paper 'Attention Is All You Need,' has revolutionized the world of language AI by enabling Large Language Models (LLMs) and facilitating advanced Natural Language Processing (NLP) tasks.
  2. Before the Transformer model, recurrent neural networks (RNNs) were commonly used for language models, but they struggled with modeling relationships between distant words due to their sequential processing nature and short-term memory limitations.
  3. The Transformer architecture leverages self-attention to analyze word relationships in a sentence simultaneously, allowing it to capture semantic, grammatical, and contextual connections effectively. Multi-headed attention and scaled dot product mechanisms enable the Transformer to learn complex relationships, making it well-suited for tasks like text summarization.
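For readers who want to see the scaled dot-product step concretely, here is a minimal single-head sketch in plain Python. It assumes no masking, no learned projections, and no batching, and represents matrices as lists of rows.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, computed one query row at a time.
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Each output row is a weighted average of the value rows, with weights set by how strongly the query matches each key. Multi-headed attention runs several such computations in parallel on different learned projections of the same input.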
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 15 Aug 24
  1. AI agents can now include human input at important points, which helps make their actions safer and more reliable. This way, humans can step in when needed without taking over the whole process.
  2. LangGraph is a new tool that helps organize and manage how these AI agents work. It uses a graph approach to show steps and allows for better oversight and control.
  3. By combining automation with human checks, we can create more efficient systems that still have the safety of human involvement. This lets us enjoy the benefits of AI while also addressing concerns about its autonomy.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 18 Jul 24
  1. Large Language Models (LLMs) can create useful text but often struggle with specific knowledge-based questions. They need better ways to understand the question's intent.
  2. Retrieval-augmented generation (RAG) systems try to solve this by using extra knowledge from sources like knowledge graphs, but they still make many mistakes.
  3. The Mindful-RAG approach focuses on understanding the question's intent more clearly and finding the right context in knowledge graphs to improve answers.
Gradient Flow 559 implied HN points 04 May 23
  1. NLP pipelines are shifting to include large language models (LLMs) for accuracy and user-friendliness.
  2. Effective prompt engineering is crucial for crafting useful input prompts tailored to generative AI models.
  3. Future prompt engineering tools need to be interoperable, transparent, and capable of handling diverse data types for collaboration and model sharing.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 59 implied HN points 12 Jun 24
  1. The LATS framework helps create smarter agents that can reason and make decisions in different situations. It's designed to enhance how language models think and plan.
  2. Using external tools and feedback in the LATS framework makes agents better at solving complex problems. This means they can learn from past experiences and improve their responses over time.
  3. LATS allows agents to explore many possible actions and consider different options before making a choice. This flexibility leads to more thoughtful and helpful interactions.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 19 Jun 24
  1. Phi-3 is a small language model that can run directly on your phone, making it accessible for local use instead of needing cloud connections. This means you can use it anywhere without relying on internet speed.
  2. Small language models like Phi-3 are good for specific tasks and regulated industries where data privacy is important. They can provide quick and accurate responses while keeping your data secure.
  3. Training for Phi-3 uses high-quality data to improve its language understanding and reasoning skills, allowing it to perform on par with larger models despite its smaller size.
Vasu’s Newsletter 13 implied HN points 11 Jan 26
  1. Large language models process tokens in parallel and need positional encoding to know word order; without it, reordered sentences look the same to the model.
  2. Positional encodings (like sinusoidal functions or methods such as RoPE and ALiBi) give each position a unique vector that’s combined with token embeddings, so the same word at different positions produces different vectors and relative distances can be inferred.
  3. Positional encoding only makes order visible — it doesn’t compute relationships or context; deciding which words matter to each other is handled next by self-attention.
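The sinusoidal variant mentioned above can be sketched as follows, using the formulation from "Attention Is All You Need". RoPE and ALiBi work differently and are not shown; `add_position` is a hypothetical helper name.

```python
import math

def sinusoidal_encoding(pos, d_model):
    # Even dimensions use sine and odd dimensions use cosine, with
    # wavelengths that grow geometrically across the vector.
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        if i + 1 < d_model:
            vec.append(math.cos(angle))
    return vec

def add_position(embedding, pos):
    # Combine a token embedding with the encoding for its position.
    pe = sinusoidal_encoding(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]
```

Calling `add_position` on the same embedding with two different positions returns two different vectors, which is what makes word order visible to the attention layers that follow.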
Things I Think Are Awesome 157 implied HN points 01 Feb 24
  1. Non-human tools with personality are becoming more common, especially with AI support.
  2. Large Language Models (LLMs) are being explored for creativity and role-playing, showing potential to improve creative output when working together.
  3. People are sometimes treated as disposable tools themselves, as ongoing layoffs in industries like tech and games show.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 59 implied HN points 06 May 24
  1. Chatbots use Natural Language Understanding (NLU) to figure out what users want by detecting their intentions and important information.
  2. With Large Language Models (LLMs), chatbots can understand and respond to conversations more naturally, moving away from rigid, rule-based systems.
  3. Building a chatbot now involves using advanced techniques like retrieval-augmented generation (RAG) to pull in useful information and provide better answers.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 18 Jul 24
  1. GPT-4o mini is a new language model that's cheaper and faster than older models. It handles text and images and is great for tasks requiring quick responses.
  2. Small Language Models (SLMs) like GPT-4o mini can run efficiently on devices without relying on the cloud. This helps with costs, privacy, and gives users more control over the technology.
  3. SLMs are designed to be flexible and customizable. They can learn from various types of inputs and can adapt more easily to specific needs.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 23 May 24
  1. HILL helps users see when large language models (LLMs) give wrong or misleading answers. It shows which parts of the response might be incorrect.
  2. The system includes different scores that rate the accuracy, credibility, and potential bias of the information. This helps users decide how much to trust the responses.
  3. Feedback from users helped shape HILL's features, making it easier for people to question LLM replies without feeling confused.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 08 Jul 24
  1. Evaluating the performance of RAG and long-context LLMs is tough because there isn't a common task to compare them on. This makes it hard to know which system works better.
  2. Salesforce created a new way to test these models called SummHay, in which the models summarize information from large text collections. The results show that even the best models struggle to match human performance.
  3. RAG systems generally do better at citing sources, while long-context LLMs might capture insights more thoroughly but have citation issues. Choosing between them involves trade-offs.
Normcore Tech 1353 implied HN points 07 Jun 23
  1. The author delved deep into the concept of embeddings in deep learning.
  2. The author's journey in understanding embeddings involved a significant amount of research and work.
  3. The author hopes that others can benefit from their learning about embeddings as well.
The Parlour 8 implied HN points 16 Jan 26
  1. Fine-tuning LLaMA-3-8B with instruction tuning and LoRA noticeably improves financial named-entity recognition, helping convert messy reports into structured data.
  2. New work on adaptive dataflow for financial time-series points to better ways to process streaming market data and boost model efficiency or accuracy.
  3. This newsletter curates recent finance ML papers and is available by subscription, with some free previews for readers who want quick research updates.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 02 Jul 24
  1. LangGraph Cloud is a new service that helps developers easily deploy and manage their LangGraph applications online.
  2. Agent applications can handle complex tasks automatically and use large language models to work efficiently, but they face challenges like high costs and the need for better control.
  3. LangGraph Studio provides a visual way to see how code flows in applications, helping users understand and debug their work without changing any code.
TheSequence 14 implied HN points 24 Dec 25
  1. NVIDIA launched the Nemotron 3 family (Nano, Super, and Ultra), establishing a new baseline for open-weight AI and moving into the reasoning-model race.
  2. The models use a hybrid Mamba-Transformer Mixture-of-Experts design, and Nemotron 3 Nano achieves a new state-of-the-art for the 30B parameter class, showing strong efficiency and performance.
  3. This release signals a shift away from brute-force dense Transformers toward more architecture-efficient, cost-effective models that matter for enterprises and researchers.
Deep (Learning) Focus 157 implied HN points 27 Mar 23
  1. Transfer learning is powerful in deep learning, involving pre-training a model on one dataset then fine-tuning it on another for better performance.
  2. After BERT's breakthrough in NLP with transfer learning, T5 aims to analyze and unify various approaches that followed, improving effectiveness.
  3. T5 introduces a text-to-text framework for structuring tasks uniformly, simplifying how language tasks are converted to input-output text formats for models.
HackerPulse Dispatch 13 implied HN points 19 Dec 25
  1. AlphaEvolve demonstrates AI agents can autonomously discover and improve mathematical constructions, generalize finite solutions into universal formulas, and integrate with proof assistants for verification.
  2. MMGR shows that image and video models produce convincing visuals but largely fail at causal and abstract reasoning (often <10% accuracy), revealing a major gap between perceptual quality and true world understanding.
  3. Advances in model design and decoding are pushing capabilities: QwenLong-L1.5 enables reasoning over 4M-token contexts using synthetic multi-hop data, stabilized RL, and memory-augmented architectures, and ReFusion speeds text generation by decoding in parallel with a plan-and-infill diffusion approach.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 14 Jun 24
  1. DR-RAG improves how we find information for question-answering by focusing on both highly relevant and less obvious documents. This helps to ensure we get accurate answers.
  2. The process uses a two-step method: first, it retrieves the most relevant documents, then it connects those with other documents that might not be directly related but still help in forming the answer.
  3. This method shows that we often need to look at many documents together to answer complex questions, instead of relying on just one document for all the needed information.
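The two-step idea can be illustrated with a toy retriever. This sketch is not the DR-RAG implementation: it swaps in simple word overlap for a learned relevance model, and the function names and the parameters `k1` and `k2` are assumptions made for illustration.

```python
def tokens(text):
    # Lowercased bag of words; a crude stand-in for a real retriever's features.
    return set(text.lower().split())

def overlap(query_tokens, doc):
    return len(query_tokens & tokens(doc))

def two_stage_retrieve(query, docs, k1=1, k2=1):
    # Stage 1: rank every document against the query alone.
    q = tokens(query)
    first = sorted(docs, key=lambda d: -overlap(q, d))[:k1]
    # Stage 2: expand the query with each stage-1 document and search
    # the remaining documents for ones that only match the expansion.
    results = list(first)
    for doc in first:
        expanded = q | tokens(doc)
        rest = [d for d in docs if d not in results]
        results.extend(sorted(rest, key=lambda d: -overlap(expanded, d))[:k2])
    return results
```

A document that shares few words with the question can still be picked up in the second stage once the first retrieved document has added its own terms to the query, which is how a multi-hop answer gets assembled from several documents.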
Data Science Weekly Newsletter 219 implied HN points 16 Jun 23
  1. Using large language models can help kids learn to ask curious questions by automating the teaching process.
  2. New techniques for 3D space reconstruction can make indoor views on platforms like Google Maps look more realistic and interactive.
  3. There's a growing need to understand the value of personal data in online shopping, especially as new regulations come into play.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 13 Jun 24
  1. Creating a standard system for evaluating prompts is important because prompts can vary in how they're used and understood. This makes it hard to measure their effectiveness.
  2. The TELeR taxonomy helps to categorize prompts so that they can be better compared and understood. It focuses on aspects like clarity and the level of detail in prompts.
  3. Using clear goals, examples, and context in prompts can lead to better responses from language models. This helps the models to understand exactly what is being asked.
TheSequence 14 implied HN points 16 Dec 25
  1. Multiturn data synthesis treats data generation as an interactive, multi-step process where agents act, react, and revise instead of producing a single-shot answer.
  2. That interactive approach produces richer supervision—dialogues, plans, error corrections, edit sequences, and verifier outcomes—which teaches models how to reach an answer, not just what the answer is.
  3. Self-play methods (for example Reflexion) use these multi-turn synthetic traces so agents can iteratively improve, which helps train capabilities like tool use, coding, browsing, negotiation, and safety.
Things I Think Are Awesome 137 implied HN points 30 Sep 23
  1. The article discusses digital image tools that can augment daily lives, highlighting authenticity challenges.
  2. Issues with digital unreality in daily tools like image processing are becoming more evident and concerning.
  3. Advancements in AI algorithms are being used to create images that appear authentic, raising questions about what is real and what is artificially generated.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 10 Jun 24
  1. You can hide secret messages in language models by fine-tuning them with specific trigger phrases. Only the right phrase will reveal the hidden message.
  2. This method can help identify which model is being used and ensure that developers follow licensing rules. It provides a way to track model authenticity.
  3. The unique triggers make it hard for others to guess them, keeping the hidden messages secure. This technique also protects against attacks that try to extract the hidden information.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 24 May 24
  1. The architecture for an LLM agent platform could develop in three stages, starting with a simple AI that recommends tools based on user needs.
  2. As the platform grows, it will enable interactions between multiple tools and the AI, allowing for dynamic exchanges of information.
  3. Future improvements will focus on enhancing the agent's capabilities through better tools and more collaboration among them.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 20 May 24
  1. RAG systems can struggle with small mistakes in documents, making them vulnerable to errors. Even tiny typos can disrupt how well these systems work.
  2. The study introduces a method called GARAG that uses a genetic algorithm to create tricky documents that can expose weaknesses in RAG systems. It's about testing how robust these systems really are.
  3. Experiments show that noisy documents in real-life databases can seriously hurt RAG performance. This highlights that even reliable retrievers can falter if the input data isn’t clean.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 17 May 24
  1. Users spend a good amount of time, around 43 minutes, editing prompts to get better results from language models. They often make small, careful changes instead of big rewrites.
  2. The main focus of edits is usually on the context of the prompts, such as improving examples and grounding information. This shows that context is crucial for getting good outputs.
  3. Many users try multiple changes at once and sometimes roll back their edits. This indicates that they might struggle to remember what worked well in the past or which changes had positive effects.
Rod’s Blog 39 implied HN points 20 Feb 24
  1. Language models come in different sizes, architectures, training data, and capabilities.
  2. Large language models have billions or trillions of parameters, enabling them to be more complex and expressive.
  3. Small language models have fewer parameters, making them more efficient and easier to deploy, though they might be less versatile than large language models.