The hottest data science Substack posts right now

And their main takeaways
TheSequence 161 implied HN points 30 Jan 25
  1. GPT models are becoming more advanced in reasoning and problem-solving, not just generating text. They are now synthesizing programs and refining their results.
  2. There's a focus on understanding how these models work internally through ideas like hypothesis search and program synthesis. This helps in grasping the real innovation they bring.
  3. Reinforcement learning is a key technique used by newer models to improve their outputs. This shows that they are evolving and getting better at what they do.
Data Science Weekly Newsletter 179 implied HN points 30 Jun 23
  1. Data scientists are sharing tips on how to make their scientific data more accessible and useful. This helps others to understand and use the data better.
  2. There are many discussions happening about the benefits and drawbacks of large language models (LLMs) like ChatGPT. Some people believe they are amazing, while others think they aren't very helpful.
  3. Naming things in programming can be tough, but there are resources and books that can help. Learning the right naming conventions can improve coding practices.
Data Science Weekly Newsletter 199 implied HN points 02 Jun 23
  1. Data drift doesn't always hurt model performance, so it's important to analyze the context before reacting to it.
  2. Work on solving bigger problems as you grow in your career, instead of waiting for difficult tasks to be handed to you.
  3. To improve a model's reasoning skills, reward it for each correct step in problem-solving, not just the final answer.
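The third takeaway contrasts rewarding only the final answer with rewarding each correct step. A minimal sketch of that difference follows. The function names, the averaging rule, and the toy solution are assumptions made for illustration, not the method of the linked post.

```python
from typing import List


def outcome_reward(final_correct: bool) -> float:
    """Outcome supervision: a single reward based only on the final answer."""
    return 1.0 if final_correct else 0.0


def process_reward(step_correct: List[bool]) -> float:
    """Process supervision: credit each correct step, then average.

    Averaging is an illustrative choice; other aggregation rules are possible.
    """
    if not step_correct:
        return 0.0
    return sum(step_correct) / len(step_correct)


# Hypothetical solution with four reasoning steps; only the last one is wrong.
steps = [True, True, True, False]
# Assumption for this toy case: the final answer is right only if the last step is.
final_ok = steps[-1]

print(outcome_reward(final_ok))  # 0.0
print(process_reward(steps))     # 0.75
```

The outcome reward gives this mostly correct solution no signal at all, while the step-level reward still credits the three valid steps.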
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 59 implied HN points 09 Feb 24
  1. The study compared answers from humans, a basic LLM, and an LLM that uses RAG to see which is most accurate in healthcare. The LLM with RAG performed the best.
  2. Using RAG, the model was much quicker than humans, taking only about 15-20 seconds. Humans took around 10 minutes to respond.
  3. GPT-4, especially with RAG, showed high accuracy and can support doctors by providing fast and reliable answers, but humans should still check the information.
Brad DeLong's Grasping Reality 69 implied HN points 25 Jun 25
  1. Machines, like large language models, can imitate human language because they find patterns hidden in how we express ourselves. They simplify the chaos of our words into something easier to understand.
  2. Even though these models are good at predicting responses, they struggle with truly understanding the world. They can replicate language well, but grasping the deeper meaning remains a challenge.
  3. The hope is that with better training and understanding causal relationships, these models could evolve to not only imitate but truly comprehend the world around them.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 25 Mar 24
  1. Choosing technology depends on what you need to achieve. Focus on the specific requirements of the problem to find the right solution.
  2. Retrieval-Augmented Generation (RAG) is often more effective than Fine-Tuning for knowledge base tasks. It allows for quick searches and better accuracy.
  3. RAG systems are easier to update with new information compared to Fine-Tuned models. You can simply add new data without complex adjustments.
Data Science Weekly Newsletter 279 implied HN points 02 Feb 23
  1. The newsletter is now hosted on Substack and remains free for everyone. A paid option is available for more features and interactions.
  2. Data teams need to build trust with stakeholders to effectively measure their value and justify their budgets. Having good relationships is more important than just metrics.
  3. Understanding MLOps is crucial for the industry. It involves not only the tools but also the culture and practices around machine learning operations.
Mind Prison 73 implied HN points 17 Jun 25
  1. AI hallucinations happen because AI relies on patterns from limited data, which can't cover everything. This means AI will always make mistakes when trying to understand things outside its knowledge.
  2. We need to treat all AI outputs with caution since they can all be hallucinations. It's important to check and verify what the AI says, especially in critical situations.
  3. The issue of hallucinations is built into how AI works, so trying to completely fix them isn't possible. Instead, we should focus on verifying AI results to ensure reliability.
Gonzo ML 189 implied HN points 29 Nov 24
  1. There's a special weight in large language models called the 'super weight.' If you remove it, the model's performance crashes dramatically, showing just how crucial it is.
  2. Super weights are linked to what's called 'super activations,' meaning they help generate better text. Without them, the model struggles to create coherent sentences.
  3. Finally, researchers found ways to identify and protect these super weights during the model training and quantization processes. This makes the model more efficient and retains its quality.
Gonzo ML 63 implied HN points 06 Jul 25
  1. Small weight updates during model training can lead to better results, especially since large weights might hold key features that we don't want to change.
  2. Using a method called NanoAdam, we can focus on smaller weights, which allows for more efficient memory usage and better performance during fine-tuning.
  3. It seems that large gradients often come from small weights, suggesting that sometimes it’s smarter to update these smaller weights instead of the larger ones.
LLMs for Engineers 59 implied HN points 30 Jan 24
  1. Fine-tuned open-source models like Llama and Mistral can produce accurate feedback, similar to high-performing custom models. They're a great option for companies needing control over their data.
  2. Using tools like Axolotl and Modal makes it easier to fine-tune these models. They help create customized training jobs and simplify deploying models across multiple GPUs.
  3. Fine-tuning significantly improves the clarity and structure of the model's output. It reduces irrelevant information, allowing for cleaner, more useful results.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 21 Mar 24
  1. Chain-of-Instructions (CoI) fine-tuning allows models to handle complex tasks by breaking them down into manageable steps. This means that a task can be solved one part at a time, making it easier to follow.
  2. This new approach improves the model's ability to understand and complete instructions it hasn't encountered before. It's like teaching a student to tackle complex problems by showing them how to approach each smaller task.
  3. Training with minimal human supervision leads to efficient dataset creation that can empower models to reason better. It's as if the model learns on its own, becoming smarter and more capable through well-designed training.
Mindful Modeler 299 implied HN points 27 Sep 22
  1. Predictions can change the very outcome they predict, a phenomenon called performative prediction. The resulting shift in the data can hurt model performance.
  2. Performative prediction is common but often overlooked, affecting tasks like rent prediction and churn modeling.
  3. To deal with performative prediction, consider achieving performative stability, retraining models frequently, and reframing tasks as reinforcement learning.
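Frequent retraining and performative stability can be illustrated with a toy simulation. The sketch below assumes a hypothetical linear response: deploying a prediction `theta` shifts the average outcome to `base + eps * theta` (for example, a published rent estimate nudging asking prices). The names `induced_mean`, `base`, and `eps` are invented for this example.

```python
def induced_mean(theta: float, base: float = 10.0, eps: float = 0.5) -> float:
    """Average outcome observed after deploying prediction theta.

    Hypothetical linear response: the prediction itself moves the outcome.
    """
    return base + eps * theta


def repeated_retraining(theta0: float, rounds: int = 50) -> float:
    """Retrain on data induced by the previously deployed model, round after round."""
    theta = theta0
    for _ in range(rounds):
        # Under squared error, the best constant predictor is the data mean.
        theta = induced_mean(theta)
    return theta


theta_star = repeated_retraining(0.0)
print(round(theta_star, 6))  # 20.0
# At a performatively stable point, retraining no longer changes the model.
print(abs(induced_mean(theta_star) - theta_star) < 1e-9)  # True
```

With `eps` below 1 the loop settles at `base / (1 - eps)`. A stronger feedback effect (`eps` of 1 or more) would make this simple retraining loop diverge.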
Gonzo ML 126 implied HN points 23 Feb 25
  1. Gemini 2.0 models can analyze research papers quickly and accurately, supporting large amounts of text. This means they can handle complex documents like academic papers effectively.
  2. The DeepSeek-R1 model shows that strong reasoning abilities can be developed in AI without the need for extensive human guidance. This could change how future models are trained and developed.
  3. Distilling knowledge from larger models into smaller ones allows for efficient and accessible AI that can perform well on various tasks, which is useful for many applications.
TheSequence 175 implied HN points 09 Dec 24
  1. RAG techniques combine the power of language models with external data to improve accuracy. This means AI can give better answers by using real-world information.
  2. Advanced methods like Small to Slide RAG make it easier for AI to work with visual data, like slides and images. This helps AI understand complex information that is not just text.
  3. ColPali is a new approach that focuses on visuals directly, avoiding mistakes from converting images to text. It's useful for areas like design and technical documents, ensuring important details are not missed.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 18 Mar 24
  1. Long context windows (LCWs) and retrieval-augmented generation (RAG) serve different purposes and won’t replace each other. LCWs work well when asking multiple questions at once, while RAG is better for separate inquiries.
  2. Using LCWs can get really expensive because they involve processing a lot of data at once. In contrast, RAG uses smaller, focused data chunks, which helps keep costs down.
  3. Research shows that LLMs perform better when important information is at the start or end of a long context. So, relying only on LCWs can lead to problems since crucial details may get overlooked.
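Since models attend best to the start and end of a long context, one common mitigation is to reorder retrieved chunks so the most relevant ones sit at the two ends. The sketch below is an illustration under that assumption, not something the post prescribes, and `reorder_for_long_context` is a hypothetical helper name.

```python
from typing import List


def reorder_for_long_context(chunks_by_relevance: List[str]) -> List[str]:
    """Place the most relevant chunks at the two ends of the context.

    The input is sorted most-relevant first. Chunks are dealt alternately
    to the front and the back, so the least relevant land in the middle.
    """
    front: List[str] = []
    back: List[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-most relevant chunk comes last.
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant chunk
print(reorder_for_long_context(ranked))  # ['A', 'C', 'E', 'D', 'B']
```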
Data Science Weekly Newsletter 239 implied HN points 23 Feb 23
  1. The 2023 MAD landscape provides insights into machine learning and data trends. It has sections on the current market, infrastructure, and AI trends.
  2. A new tool called PyGWalker turns Pandas dataframes into easy-to-explore visual interfaces. It's great for beginners wanting to visualize their data without technical hassle.
  3. Cleaning data is essential for reliable research findings. New methods are being shared to improve and standardize the data cleaning process, making it more efficient.
davidj.substack 59 implied HN points 25 Jun 25
  1. Snowflake and Databricks are using a semantic layer, which helps make data easier to understand and access. This is a shift from older methods that relied heavily on text-based commands.
  2. The rise of AI has changed what businesses need from their analytics tools. Now, having a semantic layer is a must for companies that want to stay competitive in agentic analytics.
  3. Headless business intelligence is fading away as companies now blend traditional analytics with smarter, AI-driven tools. This could change how data warehouses and BI tools work together in the future.
Gonzo ML 126 implied HN points 08 Feb 25
  1. DeepSeek-V3 was trained on 14.8 trillion tokens, a large corpus that helps it learn better and understand more languages. The data includes more math and programming examples for better performance.
  2. The training process has two main parts: pre-training and post-training. After learning the basics, it gets fine-tuned to enhance its ability to follow instructions and improve its reasoning skills.
  3. DeepSeek-V3 has shown impressive results in benchmarks, often performing better than other models despite having fewer parameters, making it a strong competitor in the AI field.
Artificial Ignorance 117 implied HN points 25 Feb 25
  1. Claude 3.7 introduces a new way to control reasoning, letting users choose how much reasoning power they want. This makes it easier to tailor the AI’s responses to fit different needs.
  2. The competition in AI models is heating up, with many companies launching similar features. This means users can expect similar quality and capabilities regardless of which AI they choose.
  3. Anthropic is focusing on making Claude better for real-world tasks, rather than just excelling in benchmarks. This is important for businesses looking to use AI effectively.
Data Science Weekly Newsletter 239 implied HN points 09 Feb 23
  1. Big Data is changing, and it's not as big a deal as we thought. Hardware is getting better faster than data sizes are growing.
  2. Research in AI can be learned just like a sport. It's about practicing skills like designing experiments and writing papers.
  3. Data Analytics can really help businesses understand their performance and make smarter decisions. It’s all about using data to solve problems and anticipate future issues.
TheSequence 14 implied HN points 26 Nov 25
  1. Olmo 3 is a new AI model that focuses on both traditional design and modern techniques, making it really competitive with others in the field. It pays attention to how it's built, trained, and shared with the public.
  2. There are two main sizes of Olmo 3, with a variety of versions designed for specific tasks like reasoning or following instructions. Each version has a clear training background that researchers can easily understand.
  3. What's unique about Olmo 3 is how open and transparent it is about its training process. This helps other researchers see exactly how it learns and improves.
Teaching computers how to talk 178 implied HN points 04 Nov 24
  1. Hallucinations in AI mean the models can give wrong answers and still seem confident. This overconfidence is a big problem, making it hard to trust what they say.
  2. OpenAI's SimpleQA helps check how often AI gets facts right. The results show that many times the AI doesn't know when it’s wrong.
  3. The way AI models are built makes it hard for them to recognize their own errors. Improvements are needed, but current models have limited ability to tell when they are unsure.
Interconnected 138 implied HN points 03 Jan 25
  1. DeepSeek-V3 is an AI model that performs as well as or better than other top models while costing much less to train. This means its developers are getting great results without spending a lot of money.
  2. The AI community is buzzing about DeepSeek's advancements, but there seems to be less excitement about it in China than in other countries. This might show a difference in how AI news is perceived globally.
  3. DeepSeek has a few unique advantages that set it apart from other AI labs. Understanding these can help clarify what their success means for the broader AI competition between the US and China.
Technology Made Simple 99 implied HN points 04 Apr 23
  1. Reducing the number of features in your data can improve performance and keep costs down in machine learning processes.
  2. Active learning focuses on prioritizing data points for efficient machine learning model training.
  3. Using filters and simpler models for specific tasks can lead to better performance and cost savings compared to always using large, powerful models in AI.
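The idea of putting filters and simpler models in front of a large model can be sketched as a two-stage cascade. Everything below is hypothetical: the keyword rule, the 0.8 confidence threshold, and the `expensive_model` stub, which stands in for a costly model call.

```python
from typing import Callable, Tuple


def cheap_model(text: str) -> Tuple[str, float]:
    """Hypothetical keyword filter returning (label, confidence)."""
    if "free money" in text.lower():
        return "spam", 0.95
    return "ham", 0.5


def expensive_model(text: str) -> str:
    """Stub standing in for a large, costly model (assumption for this sketch)."""
    return "spam" if "prize" in text.lower() else "ham"


def cascade(
    text: str,
    cheap: Callable[[str], Tuple[str, float]] = cheap_model,
    expensive: Callable[[str], str] = expensive_model,
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Return (label, stage); escalate only when the cheap stage is unsure."""
    label, confidence = cheap(text)
    if confidence >= threshold:
        return label, "cheap"
    return expensive(text), "expensive"


print(cascade("FREE MONEY inside"))         # ('spam', 'cheap')
print(cascade("You have won a prize"))      # ('spam', 'expensive')
print(cascade("Meeting moved to 3pm"))      # ('ham', 'expensive')
```

The savings come from how often the cheap stage answers confidently, so the threshold is the main knob to tune against accuracy.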
The Counterfactual 99 implied HN points 25 Sep 23
  1. Researchers often use survey data to understand human behavior, but collecting reliable human responses can be complicated and expensive. Using large language models (LLMs) like GPT-4 could make this process easier and cheaper.
  2. LLMs can sometimes produce responses that closely match the average opinions of many people. In some cases, their answers were actually closer to the average response than individual human judgments were.
  3. While LLMs can be helpful in gathering data quickly and inexpensively, it's important to be careful. They might not always be accurate or representative of all viewpoints, so it's wise to compare LLM results with human responses to ensure quality.
TheSequence 126 implied HN points 31 Jan 25
  1. Augmented SBERT (AugSBERT) improves sentence scoring tasks by using data augmentation to create more sentence pairs. This means it can perform better even when there's not much training data available.
  2. Traditional methods like cross-encoders and bi-encoders have limitations, like being slow or needing a lot of data. AugSBERT addresses these issues, making it more efficient for large-scale tasks.
  3. The approach combines the strengths of different models to enhance performance, especially in specific domains. It shows significant improvements over existing models, making it a useful tool for various natural language processing applications.
Mike Talks AI 98 implied HN points 19 May 23
  1. Consider a hybrid approach for data science teams to balance the strengths of both centralized and decentralized setups.
  2. Some companies are experimenting with intentionally rotating between centralized and decentralized structures every few years.
  3. Switching between centralization and decentralization periodically allows for exploration and scalability of diverse ideas within data science teams.
The Tech Buffet 79 implied HN points 19 Nov 23
  1. Creating a good dataset is important to evaluate your LLM-based applications. You can use LLMs to generate questions and answers from your data, which helps in building a reliable test set.
  2. Running your application over this dataset helps you see how well it retrieves information and generates answers. Keeping track of the documents it finds will make your evaluation easier.
  3. Finally, you should measure how well your application retrieves relevant documents and how good the answers are. This will help you understand what works best and where you can improve.
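For the retrieval side of that evaluation, two widely used metrics are hit rate at k and mean reciprocal rank (MRR). The sketch below assumes each test question has exactly one relevant document. The question and document IDs are made up, and the post may use different metrics.

```python
from typing import Dict, List


def hit_rate(retrieved: Dict[str, List[str]], relevant: Dict[str, str], k: int = 3) -> float:
    """Fraction of questions whose relevant document appears in the top k results."""
    hits = sum(1 for q, doc in relevant.items() if doc in retrieved[q][:k])
    return hits / len(relevant)


def mean_reciprocal_rank(retrieved: Dict[str, List[str]], relevant: Dict[str, str]) -> float:
    """Average of 1/rank of the relevant document (0 when it is not retrieved)."""
    total = 0.0
    for q, doc in relevant.items():
        ranked = retrieved[q]
        if doc in ranked:
            total += 1.0 / (ranked.index(doc) + 1)
    return total / len(relevant)


# Hypothetical test set: question -> the one document that answers it.
relevant = {"q1": "d1", "q2": "d5", "q3": "d9"}
# Hypothetical retriever output: question -> ranked document IDs.
retrieved = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d4", "d5", "d6"],
    "q3": ["d7", "d8", "d2"],
}

print(round(hit_rate(retrieved, relevant), 3))  # 0.667
print(mean_reciprocal_rank(retrieved, relevant))  # 0.5
```

Answer quality needs a separate measure, such as human review or a model-based grader, which this sketch does not cover.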
Data Science Weekly Newsletter 199 implied HN points 23 Mar 23
  1. This week's newsletter shares useful links in data science, machine learning, and AI. It's a great way to stay updated in these fields.
  2. One highlighted article discusses the importance of prompt engineering in interacting with language models. It's about how to communicate effectively with AI for desired results.
  3. There's also a report on how generative models like GPT might impact jobs. It shows that many workers could see changes in their tasks due to AI advancements.
Gradient Flow 199 implied HN points 15 Dec 22
  1. The recommended book of the year is a comprehensive guide for data scientists and data teams, offering practical advice and real-world insights in using data science effectively and ethically.
  2. ActivityPub is a W3C standard and decentralized social networking protocol, gaining traction as a viable alternative to centralized services for community building.
  3. SkyPilot, a newly launched project, presents a unified interface for running machine learning workloads on any cloud, catering to the need for cost-effective cloud computing in the coming year.
The Beep 39 implied HN points 25 Feb 24
  1. Multimodal search lets you look for information using different types of data like text, images, and audio at the same time. This makes finding what you need much easier and faster.
  2. Embeddings are lists of numbers (vectors) that represent words, images, or sounds so computers can compare them. They help machines learn about relationships and contexts in the data they process.
  3. Using vector databases, we can store these embeddings efficiently. This technology enables smarter applications like image searches or recognizing songs quickly.
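How embeddings and a vector store fit together can be shown with a tiny in-memory example. The three-dimensional vectors below are invented by hand, and `TinyVectorStore` is a hypothetical class that scans every item. Real vector databases use learned embeddings and approximate indexes instead.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Hypothetical brute-force store: keeps vectors and scans all of them."""

    def __init__(self) -> None:
        self._items: Dict[str, List[float]] = {}

    def add(self, key: str, vector: List[float]) -> None:
        self._items[key] = vector

    def search(self, query: List[float], top_k: int = 1) -> List[Tuple[str, float]]:
        scored = [(key, cosine(query, vec)) for key, vec in self._items.items()]
        return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_k]


store = TinyVectorStore()
# Made-up embeddings; in practice an embedding model would produce these.
store.add("cat photo", [0.9, 0.1, 0.0])
store.add("dog photo", [0.8, 0.3, 0.1])
store.add("jazz song", [0.0, 0.2, 0.9])

best_key, _ = store.search([0.05, 0.1, 0.95])[0]
print(best_key)  # jazz song
```

Because text, images, and audio can all be mapped into the same vector space, one query vector can match items of any type, which is what makes the search multimodal.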
Think Future 79 implied HN points 02 Nov 23
  1. Expertise matters when interpreting data findings: without it, data can lead to nonsensical conclusions.
  2. Be cautious about drawing conclusions from data alone: critical thinking is essential to avoid analysis errors, as in the case of Tripadvisor's BBQ city rankings.
  3. Consult longtime experts before accepting data-driven findings as 'rock-solid': having seasoned professionals review results can help prevent misinterpretations and errors.
Basta’s Notes 122 implied HN points 13 Jan 25
  1. Machine learning models are good at spotting patterns that humans might miss. This means they can make predictions and organize data in ways that are impressive and often very useful.
  2. However, machine learning can struggle with unclear or messy data. This fuzziness can lead to mistakes, like misidentifying objects or giving unexpected results.
  3. Not every problem needs a machine learning solution, and sometimes simpler methods work better and are more effective. It's important to think carefully about whether machine learning is truly the best tool for the job.
TheSequence 14 implied HN points 16 Nov 25
  1. World models are becoming more advanced, moving from simple image recognition to creating interactive 3D environments that agents can explore. This change means we need new tools and data to support these rich, dynamic models.
  2. AI coding tools are becoming essential for software development, with companies raising significant funds to enhance these technologies. This shift indicates that AI will play a crucial role in making coding more efficient and collaborative.
  3. Recent advancements in large language models are focused on making them more controllable and aligned with users' needs, improving their reliability for real-world applications.
Data Science Weekly Newsletter 199 implied HN points 16 Feb 23
  1. Visual analytics can help make deep learning models easier to understand. Researchers are working to fill gaps and challenges in this area.
  2. AI tools like ChatGPT might change how we visualize data in the future. They could make it easier to find and interpret information quickly.
  3. A new optimizer called Lion is proposed for training deep neural networks. It uses less memory than existing optimizers like Adam because it tracks only momentum.
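The memory saving in Lion comes from its update rule, which stores a single momentum value per parameter where Adam stores two running averages. Below is a minimal pure-Python sketch of one Lion step on a flat list of parameters. The function name and toy numbers are illustrative, and a real implementation would operate on tensors.

```python
from typing import List, Tuple


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(
    params: List[float],
    grads: List[float],
    momentum: List[float],
    lr: float = 1e-4,
    beta1: float = 0.9,
    beta2: float = 0.99,
    weight_decay: float = 0.0,
) -> Tuple[List[float], List[float]]:
    """Apply one Lion update and return (new_params, new_momentum)."""
    new_params: List[float] = []
    new_momentum: List[float] = []
    for p, g, m in zip(params, grads, momentum):
        # Only the sign of the interpolated direction is used, so every
        # parameter moves by the same magnitude lr (plus weight decay).
        update = sign(beta1 * m + (1 - beta1) * g)
        new_params.append(p - lr * (update + weight_decay * p))
        # The momentum buffer is the only optimizer state that is stored.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


# Toy values chosen for illustration.
params, momentum = lion_step([1.0, -2.0], [0.5, -0.25], [0.0, 0.0], lr=0.1)
print(params)    # approximately [0.9, -1.9]
print(momentum)  # approximately [0.005, -0.0025]
```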
Kndrej’s Substack 3 HN points 14 Aug 24
  1. Breaking into machine learning (ML) requires not just basic knowledge but also a deep understanding of the math and engineering behind models. Completing online courses is only a starting point.
  2. Internships and real project experience are crucial for landing a job in ML. It's important to have skills that stand out, like publications or open-source contributions.
  3. Interview preparation is key; practicing coding challenges and understanding ML concepts is necessary to succeed. Networking and applying quickly to job postings can improve your chances.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 03 May 24
  1. Fine-tuning large language models (LLMs) can help them better understand and use long pieces of text. This means they can make sense of information not just at the start and end but also in the middle.
  2. The 'lost-in-the-middle' problem happens because LLMs often overlook important details in the middle of texts. Training them with more focused examples can help address this issue.
  3. The IN2 training approach emphasizes that crucial information can be found anywhere in long texts. It uses specially created question-answer pairs to teach models to pay attention to all parts of the context.