The hottest data science Substack posts right now

And their main takeaways
Don't Worry About the Vase • 1164 implied HN points • 19 Dec 24
  1. The release of o1 into the API is significant. It enables developers to build applications with its capabilities, making it more accessible for various uses.
  2. Anthropic released an important paper about alignment issues in AI. It highlights some worrying behaviors in large language models that need more awareness and attention.
  3. There are still questions about how effectively AI tools are being used. Many people might not fully understand what AI can do or how to use it to enhance their work.
Data Science Weekly Newsletter • 259 implied HN points • 22 Mar 24
  1. Data storytelling is important for sharing insights, and AI can help people create better stories. The research looks at how different tools assist in each storytelling stage.
  2. Switching from R to Python in data science isn't just about learning new syntax; it's a mindset change. New Python tools can help make this transition smoother for users coming from R's tidyverse.
  3. Emerging technologies often face skepticism, as seen throughout history. New inventions have raised concerns about their impact, but they eventually become part of everyday life.
TheSequence • 42 implied HN points • 13 Jan 26
  1. Synthetic data generation is moving from ad-hoc scripts to full-fledged infrastructure frameworks that handle large-scale, repeatable data production.
  2. After human-written corpora are saturated, synthetic data becomes the main way to keep scaling foundation models — effectively a "second scaling law" for AI.
  3. Commercial stacks like NVIDIA's Nemotron-4 paired with NeMo are being positioned as turnkey synthetic data foundries for modern model training.
Data Science Weekly Newsletter • 379 implied HN points • 02 Feb 24
  1. Forecasting in data science is challenging because time series data can be non-stationary. Using the right evaluation methods can help bridge the gap between traditional and modern forecasting techniques.
  2. It's important to consider the smartness of your data structures. Creating overly complicated dashboards that ultimately just produce simple outputs may not be the best use of time.
  3. There are clear distinctions between well-built data pipelines and amateur setups. Understanding what makes a pipeline production-grade can improve the quality and reliability of data processing.
Democratizing Automation • 467 implied HN points • 04 Jun 25
  1. Next-gen reasoning models will focus on skills, calibration, strategy, and abstraction. These abilities help the models solve complex problems more effectively.
  2. Calibrating how difficult a problem is will help models avoid overthinking and make solutions faster and more enjoyable for users.
  3. Planning is crucial for future models. They need to break down complex tasks into smaller parts and manage context effectively to improve their problem-solving abilities.
Data Science Weekly Newsletter • 339 implied HN points • 09 Feb 24
  1. Satellite data is important for machine learning and should be treated as a unique area of research. Recognizing this can help improve how we use this data.
  2. Many data science and machine learning projects fail from the start due to common mistakes. Learning from past experiences can help increase the chances of success.
  3. Open source software plays a crucial role in advancing AI technology. It's important to support and protect open source AI from regulations that could harm its progress.
Democratizing Automation • 435 implied HN points • 09 Jun 25
  1. Reinforcement learning (RL) is getting better at solving tougher tasks, but it's not easy. There's a need for new discoveries and improvements to make these complex tasks manageable.
  2. Continual learning is important for AI, but it raises concerns about safety and can lead to unintended consequences. We need to approach this carefully to ensure the technology is beneficial.
  3. Using RL in sparser domains presents challenges, as the lack of clear reward signals makes improvement harder. Simple methods have worked before, but it’s uncertain if they will work for more complex tasks.
Security Is • 159 implied HN points • 02 May 24
  1. AI doesn't really fix security problems well. Many times, the technology just doesn't work in the tough, unpredictable environments that security deals with.
  2. The best results in security often come from simple, clear procedures, not from complex machine learning models. Basic rules can solve most problems effectively.
  3. Generative AI can help with minor tasks but isn't a magic solution for security. It might even confuse people about important issues, rather than clarify them.
Fish Food for Thought • 47 implied HN points • 31 Dec 25
  1. When tools make tasks cheaper and easier, we usually do more of those tasks, not less; efficiency expands demand and creates new uses.
  2. Automation tends to shift work, not eliminate it — machines handle repetitive parts while people take on harder, higher-value tasks like interpretation, edge cases, and oversight.
  3. AI will grow opportunities for engineers and data scientists by increasing the amount of software and systems to build, maintain, secure, and govern, shifting work toward architecture, judgment, and integration rather than rote coding.
Philosophy bear • 393 implied HN points • 24 Jun 25
  1. It's important to understand what Large Language Models (LLMs) can currently do and limit excessive philosophical concerns. Focusing on their real capabilities helps us appreciate their strengths and weaknesses better.
  2. Critics often overlook the achievements of LLMs, making broad claims without specific evidence of what these models can't do. A careful look at their limitations and abilities is needed for a fair assessment.
  3. When thinking about LLMs, we should be cautious about using complex concepts like 'thinking' or 'creativity.' It's better to focus on what these models can actually accomplish instead of getting caught up in vague definitions.
Technically • 24 implied HN points • 27 Jan 26
  1. Coding agents are the fastest-growing use case, with companies spending heavily on sandbox-based tooling and using the same tech for things like reinforcement learning.
  2. LLM inference is moving toward self-hosting with open-source models and inference engines so businesses can tune offline, online, and semi-online workloads, and spending on these open-source stacks has surged.
  3. Science and B2B production use cases are steadily growing, showing AI is maturing from experiments into real enterprise deployments and driving rising infrastructure spend.
Data Science Weekly Newsletter • 159 implied HN points • 26 Apr 24
  1. Evaluating AI models can be expensive, but tools like lm-buddy and Prometheus make it feasible on cheaper hardware.
  2. Installing and deploying LLaMA 3 is made simple with clear guides that cover everything from setup to scaling effectively.
  3. Understanding best practices in machine learning is essential, and resources like the 'Rules of Machine Learning' provide valuable guidelines for beginners.
SeattleDataGuy’s Newsletter • 365 implied HN points • 19 Jun 25
  1. It's better to work with other experienced engineers early in your career. This way, you can learn from their decisions and improve your skills more quickly.
  2. Don't get distracted by flashy tech trends or buzzwords. Focus on solving real business problems instead of getting caught up in the hype.
  3. Communication is key in data roles. Make sure you understand your audience and always lead with the main point when sharing your work.
Data Science Weekly Newsletter • 419 implied HN points • 22 Dec 23
  1. Generative AI is changing how we work with tools, improving the Human-Tool Interface. This can help us use technology in ways we never could before.
  2. Support Vector Machines (SVMs) can be very effective for prediction tasks, often outperforming other models in error rates. However, they aren’t as commonly used, possibly due to their complexity.
  3. Deep multimodal fusion is useful in surgical training. It helps classify feedback from experienced surgeons to trainees by combining different types of data like text, audio, and video.
Mindful Modeler • 339 implied HN points • 23 Jan 24
  1. Quantile regression can be used for robust modeling to handle outliers and predict tail behavior, helping in scenarios where underestimation or overestimation leads to loss.
  2. Quantile regression is the right choice when you need to predict a specific quantile, such as an upper quantile, in scenarios like bread sales where under- or overestimating has a financial cost.
  3. Quantile regression can also be utilized for uncertainty quantification, and combining it with conformal prediction can improve coverage, making it useful for understanding and managing uncertainty in predictions.
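A minimal sketch of the loss that quantile regression minimizes, usually called the pinball loss. The function name and the bread-sales numbers are illustrative assumptions, not taken from the post.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss for a quantile level q in (0, 1).

    Under-predictions are weighted by q and over-predictions by (1 - q),
    so a high q pushes the model toward predicting upper quantiles.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# Hypothetical daily bread sales (loaves) and two constant forecasts.
sales = [80, 95, 100, 110, 140]
low_forecast = [90] * len(sales)
high_forecast = [130] * len(sales)

# At q = 0.9, running out of bread (under-prediction) costs more than
# leftovers, so the higher forecast gets the smaller loss.
loss_low = pinball_loss(sales, low_forecast, 0.9)
loss_high = pinball_loss(sales, high_forecast, 0.9)
```

With `q = 0.5` the two error directions are weighted equally and the loss reduces to half the mean absolute error, which is why the median is the robust special case.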
John Ball inside AI • 39 implied HN points • 24 Jul 24
  1. You don't need many words to communicate in a new language. Just a small vocabulary can help you get by in everyday conversations.
  2. For understanding most spoken and written text, around 2000 words are usually enough. This covers about 80% of regular communication.
  3. Machine learning and AI can benefit from understanding language like humans do, by learning new words in context rather than just relying on a large vocabulary.
Tech Talks Weekly • 59 implied HN points • 26 Jul 24
  1. Tech Talks Weekly is a free email newsletter that shares recent talks from dozens of tech conferences. It's a great way to catch up on what you missed!
  2. Readers can participate by filling out a short form to help improve the content. This makes it a community-driven resource.
  3. The newsletter highlights popular talks each week, making it easier for people to discover valuable insights from experts in tech.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 119 implied HN points • 16 May 24
  1. AI agents can make decisions and take actions based on their environment. They operate at different levels of complexity, with level one being simple rule-based systems.
  2. Currently, AI agents are improving rapidly, sitting at levels two and three, where they can automate tasks and manage sequences of actions effectively.
  3. The future of AI agents is bright, as they will be more integrated into various industries, but we need to consider issues like accountability and ethics when designing and implementing them.
Data Science Weekly Newsletter • 139 implied HN points • 03 May 24
  1. Reusing data analysis work can save time and help teams focus on building new capabilities instead of just repeating old ones.
  2. Open-source models can be a better choice than proprietary ones for developing AI applications because they can be cheaper and faster.
  3. Causal machine learning helps predict treatment outcomes by personalizing clinical decisions based on individual patient data.
Abstraction • 39 implied HN points • 02 Jan 26
  1. Forecasting bots can run continuously, answer many questions, and be scored in real time, turning forecasting from a slow craft into a fast, repeatable process.
  2. Large, scored tournaments and shared datasets will let people empirically test different methods and finally learn which forecasting approaches actually work at scale.
  3. Simple heuristics get you most of the way there, but reaching the frontier requires deeper techniques and open sharing of methods to accelerate progress.
SeattleDataGuy’s Newsletter • 365 implied HN points • 05 Jun 25
  1. Hype around data and AI can distract companies from their real goals. It's important to focus on what data can actually do for your business, instead of getting lost in the trend.
  2. Most businesses don't rely on data as their main product. Even if data can improve their operations, it’s not their primary focus, so the challenge is making data truly useful.
  3. Companies often look up to big tech for data strategies, but they have different resources. Chasing after their methods without understanding your own needs can lead to a misguided strategy.
The Counterfactual • 599 implied HN points • 28 Jul 23
  1. Large language models, like ChatGPT, work by predicting the next word based on patterns they learn from tons of text. They don’t just use letters like we do; they convert words into numbers to understand their meanings better.
  2. These models handle the many meanings of words by changing their representation based on context. This means that the same word could have different meanings depending on how it's used in a sentence.
  3. The training of these models does not require labeled data. Instead, they learn by guessing the next word in a sentence and adjusting their processes based on whether they are right or wrong, which helps them improve over time.
Abstraction • 29 implied HN points • 14 Jan 26
  1. Do a pre-mortem: assume the forecast is wrong and list plausible ways it could fail (like cancellations, acquisitions, or shifted definitions) so you don’t miss important paths.
  2. Run a sanity check to make sure the probability fits basic world knowledge and common sense, and correct obvious errors like using the wrong base rate.
  3. Make these checks the final gate: if either one flags a problem, rework the forecast or use a different approach before submitting.
Data Science Weekly Newsletter • 119 implied HN points • 10 May 24
  1. Time-series analysis and Gaussian processes are powerful tools for interpreting data. They allow for flexibility and control in modeling data, making them essential for data practitioners.
  2. Understanding A/B testing is crucial for making informed business decisions. Using a reliable experimentation system can save time and lead to better results.
  3. New advancements in AI and data science are enhancing applications in various fields, like biomedical research and recommendation systems. These innovations help combine human creativity with machine learning capabilities.
The AI Frontier • 119 implied HN points • 09 May 24
  1. Open LLMs, like Llama 3, are getting really good and can perform well in many tasks. This improvement makes them a strong option for various applications.
  2. Fine-tuning open LLMs is becoming more attractive because of their improved quality and lower costs. This means smaller, specialized models can be more easily developed and used.
  3. However, open models likely won't surpass OpenAI's offerings. The proprietary models have a big advantage, but open LLMs can still thrive by focusing on efficiency and specific use cases.
Abstraction • 34 implied HN points • 07 Jan 26
  1. Do a quick "broken leg" check first because a decisive news event can resolve a question immediately and save the time and cost of running the full forecasting pipeline.
  2. Be cautious: a wrongly triggered broken-leg update is dangerous since proper scoring heavily penalizes confident incorrect forecasts, so false positives can wipe out gains.
  3. Treat it as an empirical trade-off: implement a news-based detector, clearly define what "overwhelmingly resolves" means, track when it fires, and tune thresholds, confidence damping, or disable it if blowouts outweigh the savings.
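The trade-off above can be made concrete with a proper scoring rule. The sketch below uses the logarithmic score as one example; the post does not say which rule the bot is scored with, and the function names and the 0.80 and 0.99 probabilities are hypothetical.

```python
import math


def log_score(p_yes, outcome):
    """Logarithmic score: ln of the probability assigned to what happened.

    Scores are <= 0, and higher (closer to 0) is better.
    """
    p = p_yes if outcome else 1.0 - p_yes
    return math.log(p)


def break_even_false_positive_rate(p_base, p_confident):
    """Share of wrongly triggered updates at which jumping from p_base to
    p_confident stops improving the expected log score."""
    # Score gained when the confident update turns out to be right.
    gain = log_score(p_confident, True) - log_score(p_base, True)
    # Score lost when the confident update turns out to be wrong.
    loss = log_score(p_base, False) - log_score(p_confident, False)
    return gain / (gain + loss)


# Hypothetical case: the full pipeline would say 0.80, the detector jumps to 0.99.
rate = break_even_false_positive_rate(0.80, 0.99)
```

Under these assumed numbers the break-even rate is about 0.066, so the shortcut only pays off if fewer than roughly one in fifteen triggers is a false positive. Damping the confident probability toward the base forecast raises that tolerance.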
Data Science Weekly Newsletter • 179 implied HN points • 29 Mar 24
  1. SQL is seen as an easier way to write relational algebra, but it's not ideal for building new query tools. Understanding its limits can help in learning and using SQL better.
  2. Many successful companies have developed their own AI models, showing a trend in the tech industry. Knowing about these companies can give insights into future developments in AI.
  3. Binary vector search methods can save a lot of memory compared to traditional methods. However, it's important to balance memory savings with maintaining accuracy.
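A rough sketch of where the memory savings come from, assuming embeddings are binarized by sign and compared with Hamming distance. This is a common approach, but the linked article's exact method is not described here, so the helper names and thresholds are assumptions.

```python
def binarize(vec):
    """Pack a float vector into an int, one bit per dimension (1 if value > 0)."""
    bits = 0
    for v in vec:
        bits = (bits << 1) | (1 if v > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two packed vectors."""
    return bin(a ^ b).count("1")


def bytes_float32(dim):
    """Storage for one float32 embedding of the given dimension."""
    return dim * 4


def bytes_binary(dim):
    """Storage for one binarized embedding, rounded up to whole bytes."""
    return (dim + 7) // 8


# A 1024-dimensional embedding shrinks from 4096 bytes to 128 bytes (32x).
saving = bytes_float32(1024) // bytes_binary(1024)

# Nearest neighbour by Hamming distance over a tiny made-up index.
query = binarize([0.5, -0.2, 0.1, 0.9])
index = {"a": binarize([0.4, -0.1, 0.3, 0.7]), "b": binarize([-0.6, 0.8, -0.2, 0.1])}
best = min(index, key=lambda name: hamming(query, index[name]))
```

The accuracy cost comes from discarding magnitudes, which is why such systems often re-rank the top binary matches using the original float vectors.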
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 39 implied HN points • 18 Jul 24
  1. Large Language Models (LLMs) can create useful text but often struggle with specific knowledge-based questions. They need better ways to understand the question's intent.
  2. Retrieval-augmented generation (RAG) systems try to solve this by using extra knowledge from sources like knowledge graphs, but they still make many mistakes.
  3. The Mindful-RAG approach focuses on understanding the question's intent more clearly and finding the right context in knowledge graphs to improve answers.
Owen’s Substack • 59 implied HN points • 19 Jul 24
  1. Triplex is a new tool that helps create knowledge graphs quickly and cheaply. It's much cheaper to use than older methods, making it easier for more people to utilize.
  2. This tool is small enough to run on regular laptops, which means you don't need powerful computers to build knowledge graphs. This makes technology more accessible to everyone.
  3. Triplex is open-source, allowing anyone to use and improve it. The community can experiment with it freely and innovate new ways to organize and understand information.
Data Science Weekly Newsletter • 199 implied HN points • 14 Mar 24
  1. Serverless computing can handle big tasks without limits, but it also brings challenges like managing large uploads effectively.
  2. Art careers can be influenced by the reputation of institutions, with established artists facing less access to elite spaces early on compared to newcomers.
  3. Learning about LLM evaluation metrics can help improve understanding and performance when working with large language models.
The Algorithmic Bridge • 605 implied HN points • 28 Feb 25
  1. GPT-4.5 is not as impressive as expected, but it's part of a plan for bigger advancements in the future. OpenAI is using this model to build a better foundation for what's to come.
  2. Despite being larger and more expensive, GPT-4.5 isn't leading in new capabilities compared to older models. It's more focused on creativity and communication, which might not appeal to all users.
  3. OpenAI wants to improve the basic skills of AI rather than just aiming for high scores in tests. This step back is meant to ensure future models are smarter and more capable overall.
The AI Frontier • 159 implied HN points • 04 Apr 24
  1. Current methods for evaluating large language models (LLMs) are not effective because they try to give one-size-fits-all answers. Each LLM is better suited for different tasks, so we need evaluations that reflect that.
  2. It’s important to look at specific skills of LLMs, like how well they follow instructions or retrieve information. This will help users understand which model works best for their needs.
  3. We need more detailed benchmarks that assess individual capabilities rather than general performance scores. This way, developers can make smarter choices when selecting LLMs for their projects.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 13 Aug 24
  1. RAG Foundry is an open-source framework that helps make the use of Retrieval-Augmented Generation systems easier. It brings together data creation, model training, and evaluation into one workflow.
  2. This framework allows for the fine-tuning of large language models like Llama-3 and Phi-3, improving their performance with better, task-specific data.
  3. There is a growing trend in using synthetic data for training models, which helps create tailored datasets that match specific needs or tasks better.
Data Science Weekly Newsletter • 359 implied HN points • 15 Dec 23
  1. Learning about causal models is important in data analysis because it helps explain what caused the data. This understanding can improve how we interpret results using Bayesian methods.
  2. There's growing concern over data privacy as tools like Dropbox add AI features. Users are worried their private files could be used for AI training, even though companies deny this.
  3. Netflix recently held a Data Engineering Forum to share best practices. They discussed ways to improve data pipelines and processing, which could benefit many in the data engineering community.
SeattleDataGuy’s Newsletter • 341 implied HN points • 27 May 25
  1. Apache Iceberg might seem appealing, but it won't automatically solve your data problems. It's important to really understand what issues you're trying to address before jumping in.
  2. Switching to new tools like Iceberg won't fix a broken data strategy. The focus should be on delivering real business value, not just adopting the latest technology.
  3. If your data team is already doing well and looking to improve, Iceberg could be useful. But make sure it's the right fit for your specific challenges instead of following trends.
Technically • 14 implied HN points • 05 Feb 26
  1. Modern generative models mirror pathways in the human brain, and many researchers believe leveraging that similarity could be key to much stronger AI.
  2. Real cloud-spend data shows the fastest-growing AI use cases are coding agents, low-latency LLM inference, and computational biology, while AI art and video generation have plateaued as the market professionalizes.
  3. Models overuse em dashes mainly because of their training data and tokenization quirks—older texts and auto-converted punctuation make the em dash common—and this highlights how dataset quality and representativeness drive model behavior.
Abstraction • 29 implied HN points • 09 Jan 26
  1. A single probability for a time window needs a decay model because where the probability mass sits across the window determines how much chance remains as time passes.
  2. Probability can follow different hazard patterns—constant (linear decay), increasing (back-loaded, like last‑minute negotiations), decreasing (front‑loaded, like ceasefires), or event‑driven—and each pattern changes how fast the cumulative probability is consumed over time.
  3. The forecasting bot classifies which hazard applies (defaulting to constant when unsure) and uses that to update remaining probability as time elapses, but this is a refinement that can be misclassified and matters most for long‑horizon questions.
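A minimal sketch of such a time-decay update, assuming the original probability mass is spread across the window by a shape function and the forecast is conditioned on nothing having happened yet. The shape labels follow the list above, but the specific curves (linear, squared, inverted square) are illustrative assumptions, not the bot's actual model.

```python
def remaining_probability(p_initial, elapsed_fraction, shape="constant"):
    """Probability the event still happens in the rest of the window,
    given that it has not happened so far.

    p_initial:        forecast for the whole window, made at its start
    elapsed_fraction: share of the window already elapsed, in [0, 1]
    shape:            how the probability mass is spread over the window
    """
    x = min(max(elapsed_fraction, 0.0), 1.0)
    if shape == "constant":        # mass spread evenly (linear decay)
        consumed = x
    elif shape == "increasing":    # back-loaded: most mass near the deadline
        consumed = x ** 2
    elif shape == "decreasing":    # front-loaded: most mass early on
        consumed = 1.0 - (1.0 - x) ** 2
    else:
        raise ValueError(f"unknown shape: {shape}")

    mass_left = p_initial * (1.0 - consumed)    # P(event in the remaining time)
    no_event_yet = 1.0 - p_initial * consumed   # P(no event so far)
    if no_event_yet == 0.0:
        return 0.0
    return mass_left / no_event_yet             # conditional probability


# Halfway through the window, a 40% forecast decays differently per shape.
halfway = {s: remaining_probability(0.4, 0.5, s)
           for s in ("constant", "increasing", "decreasing")}
```

With these assumed curves, the same 40% forecast at the halfway point becomes 25% under the constant shape, about 33% when back-loaded, and about 14% when front-loaded, which illustrates why a misclassified shape matters more on long-horizon questions.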
Data Science Weekly Newsletter • 139 implied HN points • 12 Apr 24
  1. This newsletter provides links and updates about data science, AI, and machine learning. It's a helpful resource for anyone wanting to stay informed in this field.
  2. One article teaches how to handle real questions using Python, which is great for people wanting practical coding skills. Another discusses techniques to make sure AI outputs stay on task.
  3. The newsletter also features resources and courses to help people learn and improve their skills in data science and related areas. It's a good place to find learning opportunities.