The hottest data science Substack posts right now

And their main takeaways
The Tech Buffet 159 implied HN points 04 Sep 23
  1. Building a custom chatbot helps in getting accurate answers from specific internal data without the risk of it making things up. This is especially useful for specialized knowledge.
  2. Using a chatbot saves time and makes it super easy to find information quickly, boosting productivity for users.
  3. You can keep improving and updating the bot as your data changes, and you have full control over privacy by using open-source tools.
TheSequence 98 implied HN points 20 Jun 25
  1. V-JEPA 2 is an advanced AI model from Meta that improves how machines learn about the world without needing labeled data. It builds on the original V-JEPA framework and aims for better understanding and modeling of environments.
  2. The new version scales up the architecture and improves the training methods, allowing the AI to make predictions about its surroundings more effectively. This could lead to smarter and more capable AI systems.
  3. With V-JEPA 2, we are moving closer to creating AI that can think and act on its own, resembling human-like reasoning. This is an exciting step towards achieving more advanced AI technologies.
TheSequence 91 implied HN points 01 Jul 25
  1. Multi-agent benchmarks are important now because they test how AI agents can work together, unlike old methods that focused on just one agent at a time.
  2. These new benchmarks help us see how well AI can handle tasks that involve teamwork and communication in changing environments.
  3. As AI gets better, understanding how these systems interact will be key to unlocking smarter, more capable AI behavior.
Data Science Weekly Newsletter 259 implied HN points 26 May 23
  1. AI has great potential to improve our lives but also comes with risks if misused. It's important to balance optimism and caution.
  2. Tools like Copilot in Power BI make it easier for users to analyze and visualize data by allowing them to communicate their needs in plain language.
  3. The 'Curse of Dimensionality' shows that adding more features (dimensions) can sometimes confuse models instead of helping them make better predictions.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 59 implied HN points 11 Mar 24
  1. Small Language Models (SLMs) can effectively handle specific tasks without needing to be large. They are more focused on doing certain jobs well rather than trying to be everything at once.
  2. The Orca 2 model aims to enhance the reasoning abilities of smaller models, helping them outperform even bigger models when reasoning tasks are involved. This shows that size isn't everything.
  3. Training with tailored synthetic data helps smaller models learn better strategies for different tasks. This makes them more efficient and useful in various applications.
Normcore Tech 1155 implied HN points 28 Feb 23
  1. The social media landscape is changing, with platforms like Twitter and Facebook losing users to newer platforms like TikTok.
  2. Users are moving to private, fragmented social spaces on platforms like Discord and Mastodon.
  3. Creators are struggling to stand out amid the mass creation of art enabled by tools like ChatGPT and Stable Diffusion.
TheSequence 119 implied HN points 16 May 25
  1. Leaderboards in AI help direct research by showing who is doing well, but they can also create problems. They might not show the whole picture of how models really perform.
  2. The Chatbot Arena is a way to judge AI models based on user choices, but it has issues that make it unfair. Some big labs can take advantage of the system more than smaller ones.
  3. To make AI evaluations better, there need to be rules that ensure fairness and transparency. This way, everyone gets a fair chance in the AI race.
Data Science Weekly Newsletter 199 implied HN points 28 Jul 23
  1. Large language models use complex methods like word vectors and transformers to understand language, but this can be explained simply without heavy math. They need a lot of data to perform well.
  2. Using AI tools like ChatGPT for real-world programming tasks can streamline the coding process, as it allows for a more focused workflow without switching between different resources.
  3. Building effective data storage systems, like Amazon S3, involves overcoming interesting challenges and nuances, demonstrating the amazing technology behind big data management.
The Tech Buffet 39 implied HN points 23 Apr 24
  1. Weaviate is a powerful vector database that helps in creating advanced AI applications. It's useful for managing large amounts of data and performing semantic searches efficiently.
  2. When working with Weaviate, you can easily load and index data, allowing for quick access to information. This makes it easier to build systems that need to handle a lot of data quickly.
  3. Weaviate supports different search methods like vector search, keyword search, and hybrid search. This way, you can find the most relevant results based on your needs.
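  To illustrate the hybrid search idea in takeaway 3, here is a minimal sketch of score fusion in plain Python. It is not Weaviate's API or its actual fusion algorithm: the function names, the `alpha` weight, and the scores are hypothetical, and min-max normalization is just one simple way to put vector and keyword scores on a common scale.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(vector_scores, keyword_scores, alpha=0.5):
    """Rank documents by a weighted blend of two score sets.

    alpha=1.0 uses only the vector scores, alpha=0.0 only the keyword scores.
    A document missing from one score set contributes 0 for that set.
    """
    v, k = min_max(vector_scores), min_max(keyword_scores)
    fused = {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }
    # Highest fused score first; ties are broken by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Made-up scores for four documents.
vector_scores = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1}   # semantic similarity
keyword_scores = {"b": 4.0, "c": 5.0, "d": 1.0}            # keyword relevance

ranking = hybrid_rank(vector_scores, keyword_scores, alpha=0.5)
print(ranking)
```

  With equal weighting, document "b" ranks first because it scores well on both signals, while "a" (vector only) and "c" (mostly keyword) fall behind it.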
Democratizing Automation 245 implied HN points 26 Nov 24
  1. Effective language model training needs attention to detail and technical skills. Small issues can have complex causes that require deep understanding to fix.
  2. As teams grow, strong management becomes essential. Good managers can prioritize the right tasks and keep everyone on track for better outcomes.
  3. Long-term improvements in language models come from consistent effort. It’s important to avoid getting distracted by short-term goals and instead focus on sustainable progress.
Data Science Weekly Newsletter 299 implied HN points 06 Apr 23
  1. Understanding linear programming can help solve complex problems using Python. It's useful in various fields and can optimize outcomes.
  2. MLOps is closely related to data engineering, showing that managing data for machine learning involves more engineering than initially thought.
  3. The new pandas 2.0 version has exciting features like the Apache Arrow backend, which will enhance its performance and capabilities.
From the New World 188 implied HN points 28 Jan 25
  1. DeepSeek has released a new AI model called R1, which can answer tough scientific questions. This model has quickly gained attention, competing with major players like OpenAI and Google.
  2. There's ongoing debate about the authenticity of DeepSeek's claimed training costs and performance. Many believe that its reported costs and results might not be completely accurate.
  3. DeepSeek has implemented several innovations to enhance its AI models. These optimizations have helped them improve performance while dealing with hardware limits and developing new training techniques.
Data Science Weekly Newsletter 319 implied HN points 09 Mar 23
  1. The newsletter shares interesting links about data science, machine learning, and AI each week. It’s a good way to keep up with new trends and knowledge in the field.
  2. There's a discussion on what databases should do but often don’t. Understanding these gaps can help you improve your data projects by knowing what to build yourself.
  3. AI's impact on jobs and industries is being researched, especially how language models like ChatGPT could change certain occupations. It's important to understand how AI can affect your career choices.
Interconnected 246 implied HN points 18 Nov 24
  1. The scaling law for AI models might be losing effectiveness, meaning that simply using more data and compute power may not lead to significant improvements like it did before.
  2. US export controls on AI technology may become less impactful over time, as diminishing returns on AI model scaling could lessen the advantages of having the most advanced hardware.
  3. If AI development slows down, the urgency around a potential 'AI doomsday' scenario may decrease, allowing for a more balanced competition between the US and China in AI advancements.
Data Science Weekly Newsletter 219 implied HN points 23 Jun 23
  1. AI technology is advancing quickly and can even cover public meetings, but we need to think carefully about its readiness for everyday use.
  2. Engineers can improve their people skills and interactions by applying the same problem-solving mindset they use in their technical work.
  3. Generative AI is becoming important in data science for creating synthetic data, which helps in privacy and enhances analysis without losing useful information.
TheSequence 77 implied HN points 16 Jul 25
  1. Kimi K2 is a huge open source AI model with a trillion parameters, which makes it very powerful. It's important to know about advancements like this, especially as they can change how we use AI.
  2. The model uses a special design called Mixture-of-Experts that improves its efficiency. This means it can perform tasks better by only activating the parts it needs to.
  3. Kimi K2 shows strong performance in areas like coding and reasoning. This highlights how rapidly AI is evolving, and we need to keep up with newer developments from around the world.
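  The "only activating the parts it needs" idea in takeaway 2 can be sketched as top-k expert routing. This is a toy illustration, not Kimi K2's actual router: the experts are simple hypothetical functions, and the router logits are passed in directly rather than produced by a learned gating network.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, k=2):
    """Run only the k experts with the highest router logits.

    Returns the weighted output and the indices of the experts that ran.
    """
    # Indices of the k largest logits (ties keep index order).
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Renormalize the selected logits so the mixing weights sum to 1.
    weights = softmax([router_logits[i] for i in top])
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top


# Four hypothetical "experts"; a real model would use neural sub-networks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]

output, used = moe_forward(3.0, [0.1, 2.0, -1.0, 2.0], experts, k=2)
print(output, used)  # 7.5 [1, 3]
```

  Only two of the four experts are evaluated for this input, which is where the efficiency gain comes from: compute scales with k rather than with the total number of experts.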
TheSequence 84 implied HN points 03 Jul 25
  1. Circuits are important for understanding how AI works, especially in transformer models. They help researchers see how different parts of the model work together.
  2. The circuits approach looks at groups of neurons that interact to perform tasks, not just single neurons. This helps in understanding the flow of information in AI.
  3. While circuits show promise for making AI more understandable, they might not be the only solution. There's still a lot to explore about how to really interpret these complex models.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 14 Jun 24
  1. DR-RAG improves how we find information for question-answering by focusing on both highly relevant and less obvious documents. This helps to ensure we get accurate answers.
  2. The process uses a two-step method: first, it retrieves the most relevant documents, then it connects those with other documents that might not be directly related but still help in forming the answer.
  3. This method shows that we often need to look at many documents together to answer complex questions, instead of relying on just one document for all the needed information.
Gradient Flow 259 implied HN points 26 Jan 23
  1. As general-purpose models become widely used, developers need tools that help them pick models that fit their needs and understand model limitations.
  2. Data science teams are tackling automation, and early examples target aspects of projects like modeling and coding assistance, but further advancements are needed.
  3. There's a shortage of research and tools for experimentation and optimization in data science, creating opportunities for entrepreneurs to deliver innovative solutions.
The Tech Buffet 99 implied HN points 18 Dec 23
  1. You can automate the testing of Retrieval-Augmented Generation (RAG) systems without needing to label data yourself. This makes it faster and easier to evaluate their performance.
  2. Generating synthetic datasets with questions and answers allows you to test how well your RAG performs. This method helps you understand the effectiveness of your application and provides useful insights.
  3. Using various metrics is key to evaluating your RAG accurately. This way, you assess different aspects of performance, ensuring you get a well-rounded view of how your system is doing.
Data Science Weekly Newsletter 219 implied HN points 16 Jun 23
  1. Using large language models can help kids learn to ask curious questions by automating the teaching process.
  2. New techniques for 3D space reconstruction can make indoor views on platforms like Google Maps look more realistic and interactive.
  3. There's a growing need to understand the value of personal data in online shopping, especially as new regulations come into play.
Technically 24 implied HN points 11 Nov 25
  1. Reinforcement Learning from Human Feedback (RLHF) makes AI models like ChatGPT more helpful by showing them what good answers look like. It teaches them how to be useful assistants instead of just being knowledgeable.
  2. Before RLHF, AI models could give correct but irrelevant answers, like a toddler with a lot of knowledge but no idea how to apply it. They often generated strange or confusing responses.
  3. The process of RLHF includes humans ranking AI-generated answers, which helps refine the models. This way, they learn to be more concise and relevant to our needs.
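  One common way to turn the human rankings in takeaway 3 into a training signal is a pairwise loss on a reward model: the loss is small when the answer humans preferred gets the higher reward. The sketch below is an assumption for illustration, not necessarily what the post describes, and the reward values are made up.

```python
import math


def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """Compute -log(sigmoid(r_preferred - r_rejected)).

    Uses the identity -log(sigmoid(m)) = log(1 + exp(-m)).
    """
    margin = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-margin))


# Reward model agrees with the human ranking: small loss.
agree = pairwise_ranking_loss(reward_preferred=2.0, reward_rejected=0.0)

# Reward model disagrees with the human ranking: large loss.
disagree = pairwise_ranking_loss(reward_preferred=0.0, reward_rejected=2.0)

print(round(agree, 3), round(disagree, 3))
```

  Minimizing this loss over many ranked pairs pushes the reward model to score preferred answers higher, and that reward model is then used to fine-tune the assistant.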
Basta’s Notes 753 HN points 15 Sep 23
  1. Sometimes, valuable projects end abruptly without much recognition or lasting impact.
  2. It's important to focus on creating business value with your work, rather than building impressive but ultimately unnecessary solutions.
  3. Every piece of code you write as an engineer is legacy and may not last forever, so focus on learning from each project's outcome.
The Tech Buffet 139 implied HN points 10 Oct 23
  1. RAG systems can produce impressive results but require careful tuning to be reliable in real-world applications. Just copying and pasting code won't necessarily work for complex use cases.
  2. Understanding the RAG framework is important, as it involves various components like data loaders, splitters, and embedding models. Each part plays a crucial role in generating accurate answers.
  3. Using frameworks like LangChain can simplify the process of prototyping RAG systems, but they still need thoughtful configuration to function effectively in production.
Data Science Weekly Newsletter 239 implied HN points 19 May 23
  1. Absence of evidence can often serve as strong evidence of absence, and this idea can be explored with Bayesian methods.
  2. Natural language processing is being used to analyze global supply chains, helping create networks from news articles.
  3. It's crucial to understand the unique challenges and opportunities in personalizing search results, as seen with Netflix's approach.
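  The Bayesian point in takeaway 1 can be made concrete with a small calculation. This is a minimal sketch with made-up numbers, and the function and parameter names are hypothetical: if a search would very likely have found something that exists, then finding nothing sharply lowers the probability that it exists.

```python
def posterior_present_given_no_detection(prior, detection_rate, false_alarm_rate=0.0):
    """Bayes' rule for P(present | nothing detected).

    prior            -- P(present) before searching
    detection_rate   -- P(detected | present)
    false_alarm_rate -- P(detected | absent)
    """
    p_nothing_given_present = 1.0 - detection_rate
    p_nothing_given_absent = 1.0 - false_alarm_rate
    numerator = p_nothing_given_present * prior
    evidence = numerator + p_nothing_given_absent * (1.0 - prior)
    return numerator / evidence


# A sensitive search (finds it 90% of the time when it is there).
strong = posterior_present_given_no_detection(prior=0.5, detection_rate=0.9)

# A weak search (finds it only 10% of the time when it is there).
weak = posterior_present_given_no_detection(prior=0.5, detection_rate=0.1)

print(round(strong, 2), round(weak, 2))  # 0.09 0.47
```

  With the sensitive search, a null result drops the probability from 50% to about 9%. With the weak search it barely moves, so absence of evidence is strong evidence of absence only when the search had a good chance of succeeding.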
High ROI Data Science 357 implied HN points 27 Feb 23
  1. Many data scientists in companies that don't prioritize data science end up doing basic reporting and analytics.
  2. Technical management in such companies often lacks the understanding and incentives to support data initiatives.
  3. Navigating a lack of data culture and strategy in a company requires significant effort but can lead to valuable career opportunities.
LatchBio 82 implied HN points 27 Jun 25
  1. LatchBio has created a massive cell atlas with 30 million samples covering 150 diseases and 200 tissues. This helps researchers access diverse biological data easily.
  2. They partnered with Pythia Biosciences and Miraomics to enhance data curation and improve how this information is delivered to users.
  3. The introduction of a new Python framework helps scientists curate data more efficiently, making it easier to handle complex biological information.
TheSequence 14 implied HN points 16 Dec 25
  1. Multiturn data synthesis treats data generation as an interactive, multi-step process where agents act, react, and revise instead of producing a single-shot answer.
  2. That interactive approach produces richer supervision—dialogues, plans, error corrections, edit sequences, and verifier outcomes—which teaches models how to reach an answer, not just what the answer is.
  3. Self-play methods (for example Reflexion) use these multi-turn synthetic traces so agents can iteratively improve, which helps train capabilities like tool use, coding, browsing, negotiation, and safety.
Data Science Weekly Newsletter 219 implied HN points 09 Jun 23
  1. Data modeling in data science is complex and often messy, making it hard to get reliable answers. This issue highlights the need for better practices and understanding in this area.
  2. There are ongoing discussions about the realities of working in data science. Sharing these experiences can help others prepare for the challenges they may face.
  3. Generative AI is a big topic right now, and there are frameworks being developed to help organizations strategize its use effectively. Exploring these can guide businesses in adopting AI responsibly.
Aziz et al. Paper Summaries 59 implied HN points 20 Mar 24
  1. Step Back Prompting helps models think about big ideas before answering questions. This method shows better results than other prompting techniques.
  2. Even with Step Back Prompting, models still find it tricky to put all their reasoning together. Many errors come from the final reasoning step which can be complicated.
  3. Not every question works well with Step Back Prompting. Some questions need quick, specific answers instead of a longer thought process.
Data Science Weekly Newsletter 279 implied HN points 30 Mar 23
  1. This week's newsletter features discussions on AI and its potential risks, highlighting different viewpoints on the future of technology.
  2. Career development in data science is important. There are resources and talks from experts that focus on skills that help you succeed in this field.
  3. New updates in the Tidyverse can improve your coding experience in data science, making it easier and more efficient to work with data.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 11 Jun 24
  1. Tree of Thoughts (ToT) is a new way to solve complex problems with language models by exploring multiple ideas instead of just one.
  2. It breaks down problems into smaller 'thoughts' and evaluates different paths, similar to how humans think through problems.
  3. ToT allows models to understand not just the solution but also the reasoning behind it, making decision-making more deliberate.
Gradient Flow 199 implied HN points 23 Mar 23
  1. Alignment in AI is crucial to ensure that AI systems behave in beneficial and secure ways by aligning goals with human values and objectives.
  2. To start aligning AI systems effectively, teams can use methodologies like human-in-the-loop testing, adversarial training, model interpretability, and value alignment algorithms.
  3. Emphasizing alignment early on in AI development can help teams avoid ethical and legal issues and build trust with stakeholders and users by formalizing existing practices and expanding alignment tools.
Gonzo ML 189 implied HN points 04 Jan 25
  1. The Large Concept Model (LCM) aims to improve how we understand and process language by focusing on concepts instead of just individual words. This means thinking at a higher level about what ideas and meanings are being conveyed.
  2. LCM uses a system called SONAR to convert sentences into a stable representation that can be processed and then translated back into different languages or forms without losing the original meaning. This creates flexibility in how we communicate.
  3. This approach can handle long documents more efficiently because it represents ideas as concepts, making processing easier. This could improve applications like summarization and translation, making them more effective.
Franz likes to code 1 HN point 16 Sep 24
  1. Google Correlate was a tool for finding related search patterns, similar to Google Trends, but it was shut down in 2019.
  2. You can create a personal alternative using publicly available data, like Wikipedia page views, by scraping and analyzing it with Python.
  3. Using methods like similarity searches and cosine distance, you can identify articles that have similar view patterns to a given topic.
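  A minimal sketch of the similarity step in takeaway 3, using only the standard library. The page-view numbers and article titles are made up, and the helper names are hypothetical; the original post may structure this differently. Cosine distance is simply 1 minus the cosine similarity computed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric series."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat an all-zero series as similar to nothing.
    return dot / (norm_a * norm_b)


def most_similar(target, views, top_n=2):
    """Return the titles whose view pattern is closest to the target's."""
    scores = {
        title: cosine_similarity(views[target], series)
        for title, series in views.items()
        if title != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical view counts over six consecutive periods per article.
page_views = {
    "Ski": [90, 80, 20, 5, 5, 30],
    "Snowboard": [85, 75, 25, 10, 5, 35],
    "Sunscreen": [5, 10, 40, 90, 95, 30],
}

print(most_similar("Ski", page_views, top_n=1))  # ['Snowboard']
```

  "Snowboard" follows nearly the same seasonal curve as "Ski", so it comes out on top, while "Sunscreen" peaks at the opposite time of year and scores low.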
Democratizing Automation 150 implied HN points 19 Feb 25
  1. New datasets for deep learning models are appearing, but choosing the right one can be tricky.
  2. China is leading in AI advancements by releasing strong models with easy-to-use licenses.
  3. Many companies are developing reasoning models that improve problem-solving by using feedback and advanced training methods.
Enterprise AI Trends 337 implied HN points 11 Jul 24
  1. AI spending is still worth it because it can help big cloud providers move data to their services. This could open up a big opportunity for revenue, making the investment seem less risky.
  2. Most of the useful AI work happens behind the scenes and isn't visible to the public. This means many people might underestimate how much AI is actually helping businesses already.
  3. Companies are really committed to using generative AI and are treating it as a top priority. This commitment means we'll likely see more successful projects in the future.
TheSequence 189 implied HN points 29 Dec 24
  1. Artificial intelligence is moving from preference tuning to reward optimization for better alignment with human values. This change aims to improve how models respond to our needs.
  2. Preference tuning has its limits because it can't capture all the complexities of human intentions. Researchers are exploring new reward models to address these limitations.
  3. Recent models like OpenAI's o3 and Tülu 3 showcase this evolution, showing how AI can become more effective and nuanced in understanding and generating language.
Tech Talks Weekly 19 implied HN points 28 Jun 24
  1. The Tech Talks Weekly shares new tech conference talks each week, so you can catch up on the latest ideas without scrolling through messy video lists.
  2. This week features talks from major events like the React Summit and PyCon, covering a variety of topics in programming and tech.
  3. You can help grow the Tech Talks community by sharing it with friends and filling out a short form to provide feedback.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 02 Apr 24
  1. As RAG systems evolve, they are integrating more smart features to enhance their effectiveness. This means they are not just providing basic responses but are becoming more advanced and adaptable.
  2. The challenges with RAG include static rules for retrieving data and the problem of excessive tokens during processing. These issues can slow down performance and reduce efficiency.
  3. FIT-RAG is addressing these challenges with new tools, like a special document scorer and token reduction strategies, to improve how information is retrieved and used. This helps RAG systems provide better answers while using fewer resources.