The hottest data science Substack posts right now

And their main takeaways
Encyclopedia Autonomica 19 implied HN points 02 Nov 24
  1. Google Search is becoming less reliable due to junk content and SEO tricks, making it harder to find accurate information.
  2. SearchGPT and similar tools are different from traditional search engines. They retrieve information and summarize it instead of just showing ranked results.
  3. There's a risk that new search tools might not always provide neutral information. It's important to ensure that users can still find quality sources without bias.
The Kaitchup – AI on a Budget 59 implied HN points 01 Nov 24
  1. SmolLM2 offers alternatives to popular models like Qwen2.5 and Llama 3.2, showing good performance with various versions available.
  2. The Layer Skip method improves the speed and efficiency of Llama models by processing some layers selectively, making them faster without losing accuracy.
  3. MaskGCT is a new text-to-speech model that generates high-quality speech without needing text alignment, providing better results across different benchmarks.
arg min 218 implied HN points 31 Oct 24
  1. In optimization, there are three main approaches: local search, global optimization, and a method that combines both. They all aim to find the best solution to minimize a function.
  2. Gradient descent is a popular method in optimization that works like local search, by following the path of steepest descent to improve the solution. It can also be viewed as a way to solve equations or approximate values.
  3. Newton's method, another optimization technique, is efficient because it converges quickly but requires more computation. Like gradient descent, it can be interpreted in various ways, emphasizing the interconnectedness of optimization strategies.
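The contrast drawn above between gradient descent and Newton's method can be made concrete with a small sketch. The test function f(x) = exp(x) - 2x, the step size of 0.1, and the helper names are illustrative assumptions, not taken from the post; the point is only that Newton's method needs a second derivative at each step but typically reaches the minimizer in far fewer iterations.

```python
import math

# Illustrative convex function f(x) = exp(x) - 2x, minimized at x = ln(2).
def grad(x):
    return math.exp(x) - 2.0   # first derivative f'(x)

def hess(x):
    return math.exp(x)         # second derivative f''(x)

def minimize(step_fn, x0=0.0, tol=1e-8, max_iter=10_000):
    """Iterate step_fn until |f'(x)| < tol; return (x, iterations used)."""
    x = x0
    for k in range(max_iter):
        if abs(grad(x)) < tol:
            return x, k
        x = step_fn(x)
    return x, max_iter

# Gradient descent: move against the gradient with a fixed step size.
gd_x, gd_steps = minimize(lambda x: x - 0.1 * grad(x))

# Newton's method: rescale the gradient by the curvature at each step.
newton_x, newton_steps = minimize(lambda x: x - grad(x) / hess(x))

print(f"gradient descent: x = {gd_x:.6f} after {gd_steps} steps")
print(f"Newton's method:  x = {newton_x:.6f} after {newton_steps} steps")
```

Both runs should land near ln(2), with Newton's method using noticeably fewer iterations at the cost of evaluating `hess` every step.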
Holly’s Newsletter 2916 implied HN points 18 Oct 24
  1. ChatGPT and similar models are not thinking or reasoning. They are just very good at predicting the next word based on patterns in data.
  2. These models can provide useful information but shouldn't be trusted as knowledge sources. They reflect training data biases and simply mimic language patterns.
  3. Using ChatGPT can be fun and helpful for brainstorming or getting starting points, but remember, it's just a tool and doesn't understand the information it presents.
arg min 178 implied HN points 29 Oct 24
  1. Understanding how optimization solvers work can save time and improve efficiency. Knowing a bit about the tools helps you avoid mistakes and make smarter choices.
  2. Nonlinear equations are harder to solve than linear ones, and methods like Newton's help us get approximate solutions. Iteratively solving these systems is key to finding optimal results in optimization problems.
  3. The speed and efficiency of solving linear systems can greatly affect computational performance. Organizing your model in a smart way can lead to significant time savings during optimization.
The Kaitchup – AI on a Budget 39 implied HN points 31 Oct 24
  1. Quantization helps reduce the size of large language models, making them easier to run, especially on consumer GPUs. For instance, 4-bit quantization can shrink a 16-bit model to roughly a third of its original size.
  2. Calibration datasets are crucial for improving the accuracy of quantization methods like AWQ and AutoRound. The choice of the dataset impacts how well the quantization performs.
  3. Most quantization tools use a default English-language dataset, but results can vary with different languages and datasets. Testing various options can lead to better outcomes.
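As a rough illustration of what 4-bit quantization does to weights, here is a minimal sketch of symmetric round-to-nearest quantization with a single scale. This is a plain baseline chosen for brevity, not AWQ or AutoRound (which use calibration data to choose better scales and roundings); the function names and sample weights are hypothetical.

```python
def quantize_4bit(weights):
    """Map floats to integers in [-7, 7] using one shared absmax scale."""
    # Fall back to 1.0 so an all-zero tensor does not divide by zero.
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [v * scale for v in q]

# Hypothetical weights, for illustration only.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)

# Rounding error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(scale, 4), round(max_error, 4))
```

Calibration-based methods aim to reduce the error that this naive rounding leaves behind, which is why the choice of calibration dataset matters.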
Exploring Language Models 3289 implied HN points 07 Oct 24
  1. Mixture of Experts (MoE) uses multiple smaller models, called experts, to help improve the performance of large language models. This way, only the most relevant experts are chosen to handle specific tasks.
  2. A router or gate network decides which experts are best for each input. This selection process makes the model more efficient by activating only the necessary parts of the system.
  3. Load balancing is critical in MoE because it ensures all experts are trained equally, preventing any one expert from becoming too dominant. This helps the model to learn better and work faster.
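The routing step described above can be sketched in a few lines. This is a simplified top-k gate under stated assumptions: the experts are toy scalar functions, the router logits are hard-coded rather than produced by a learned gate network, and no load-balancing loss is shown.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k=2):
    """Pick the k highest-probability experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route(router_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())

# Toy experts (hypothetical); real experts are feed-forward sub-networks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 1.5]  # assumed output of a gate network

gates = route(router_logits, k=2)
output = moe_forward(3.0, experts, router_logits, k=2)
print(gates, output)
```

Only two of the four experts run for this input, which is where the efficiency gain comes from.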
The Kaitchup – AI on a Budget 179 implied HN points 28 Oct 24
  1. BitNet is a new type of AI model that uses very little memory by restricting each parameter to one of three values (-1, 0, or 1). That works out to about 1.58 bits per parameter (log2 of 3) instead of the usual 16 bits.
  2. Despite using lower precision, these '1-bit LLMs' still work well and can compete with more traditional models, which is pretty impressive.
  3. The software called 'bitnet.cpp' allows users to run these AI models on normal computers easily, making advanced AI technology more accessible to everyone.
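The 1.58-bit figure follows from the information content of a three-valued symbol. The sketch below computes it and shows one way to map weights to three values, assuming an absmean-style scaling (divide by the mean absolute weight, round, then clip); the function name and sample weights are hypothetical, and this is not the bitnet.cpp implementation.

```python
import math

# A symbol with three possible values carries log2(3) bits of information.
bits_per_weight = math.log2(3)

def ternarize(weights):
    """Map each weight to -1, 0, or 1 using a mean-absolute-value scale."""
    # Fall back to 1.0 so an all-zero tensor does not divide by zero.
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

# Hypothetical weights, for illustration only.
weights = [0.4, -0.05, -0.9, 0.02, 0.6]
q, scale = ternarize(weights)

print(f"{bits_per_weight:.3f} bits per weight")
print(q, round(scale, 3))
```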
Simplicity is SOTA 1048 implied HN points 09 Mar 26
  1. Claude Code and similar agentic LLM tools can massively speed up coding and data workflows by reading and editing local files, running commands, and generating code and analyses.
  2. Human judgement and project infrastructure matter: give clear instructions, unit tests, caching, and command-line tools so the AI can check its work and avoid slow or flaky steps.
  3. The tool is excellent for coding and reproducible data pipelines but is less reliable for deep qualitative research unless you provide careful prompts, critical framing, and iterative review.
Experimental History 35142 implied HN points 05 Aug 25
  1. AI should not be thought of as a person; it's more like a 'bag of words.' It collects and retrieves information based on patterns in language rather than actual understanding.
  2. When using AI, remember it has limitations. It can provide correct answers sometimes, but it can also produce falsehoods or irrelevant information because it doesn't think like a human.
  3. Don't treat AI as a competitor. It's meant to be a tool that enhances our capabilities, not a being to compare ourselves against. It's all about how we can use it to improve our own skills.
Artificial Corner 158 implied HN points 23 Oct 24
  1. Jupyter Notebook is a popular tool for data science that combines live code with visualizations and text. It helps users organize their projects in a single place.
  2. Jupyter Notebook can be improved with extensions, which can add features like code autocompletion and easier cell movement. These tools make coding more efficient and user-friendly.
  3. To install these extensions, you can use specific commands in the command prompt. Once installed, you'll find new options that can help increase your productivity.
The Kaitchup – AI on a Budget 159 implied HN points 21 Oct 24
  1. Gradient accumulation helps train large models on limited GPU memory. It simulates larger batch sizes by summing gradients from several smaller batches before updating model weights.
  2. There has been a problem with how gradients were summed during gradient accumulation, leading to worse model performance. This was due to incorrect normalization in the calculation of loss, especially when varying sequence lengths were involved.
  3. Hugging Face and Unsloth AI have fixed the gradient accumulation issue. With this fix, training results are more consistent and effective, which might improve the performance of future models built using this technique.
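The normalization issue described above can be illustrated with plain arithmetic. The sketch assumes the mismatch arises from averaging per-micro-batch mean losses when micro-batches contain different numbers of tokens; the loss values and function names are made up for illustration, and the actual library fixes may differ in detail.

```python
# Per-token losses for two micro-batches with different token counts.
# Values are made up for illustration.
micro_batches = [
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # 6 tokens
    [4.0, 4.0],                      # 2 tokens
]

def full_batch_loss(batches):
    """Reference: mean loss over all tokens, as if trained in one big batch."""
    tokens = [loss for batch in batches for loss in batch]
    return sum(tokens) / len(tokens)

def naive_accumulated_loss(batches):
    """Average of per-micro-batch means: over-weights short micro-batches."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)

def fixed_accumulated_loss(batches):
    """Normalize every micro-batch by the total token count instead."""
    total_tokens = sum(len(b) for b in batches)
    return sum(sum(b) / total_tokens for b in batches)

full = full_batch_loss(micro_batches)
naive = naive_accumulated_loss(micro_batches)
fixed = fixed_accumulated_loss(micro_batches)
print(full, naive, fixed)
```

With equal-length sequences the naive and fixed versions agree, which is why the discrepancy only shows up when sequence lengths vary.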
VuTrinh. 659 implied HN points 10 Sep 24
  1. Apache Spark uses a system called Catalyst to plan and optimize how data is processed. This system helps make sure that queries run as efficiently as possible.
  2. In Spark 3, a feature called Adaptive Query Execution (AQE) was added. It allows the tool to change its plans while a query is running, based on real-time data information.
  3. Airbnb uses this AQE feature to improve how they handle large amounts of data. This lets them dynamically adjust the way data is processed, which leads to better performance.
The Kaitchup – AI on a Budget 59 implied HN points 25 Oct 24
  1. Qwen2.5 models have been improved and now come in a 4-bit version, making them efficient for different hardware. They perform better than previous models on many tasks.
  2. Google's SynthID tool can add invisible watermarks to AI-generated text, helping to identify it without changing the text's quality. This could become a standard practice to distinguish AI text from human writing.
  3. Cohere has launched Aya Expanse, a family of new multilingual models that outperform many existing models. They are the product of two years of development involving thousands of researchers, and they improve both language support and performance.
Astral Codex Ten 30146 implied HN points 08 Jul 25
  1. In 2022, a bet was made on whether AI could create complex images by 2025. The challenge was to generate images that matched detailed prompts.
  2. Over the years, various AI models were tested, and the results showed both progress and limitations. Improvements were made, but some details were still missed.
  3. By June 2025, an updated AI model finally met all the conditions of the bet, showing that AI can achieve a high level of image generation based on specific instructions.
The Kaitchup – AI on a Budget 179 implied HN points 17 Oct 24
  1. You can create a custom AI chatbot easily and cheaply now. New methods make it possible to train smaller models like Llama 3.2 without spending much money.
  2. Fine-tuning a chatbot requires careful preparation of the dataset. It's important to learn how to format your questions and answers correctly.
  3. Avoiding common mistakes during training is crucial. Understanding these pitfalls will help ensure your chatbot works well after it's trained.
Bite code! 1223 implied HN points 05 Feb 26
  1. UVX.sh lets anyone install and run CLI tools published on PyPI without needing a local Python setup, making one-shot installs and sharing tools much faster and simpler.
  2. Pandas 3 changes defaults to real string dtypes, enforces consistent copy-on-write for indexing to avoid surprising mutations, and adds a functional col API to encourage clearer and faster data transformations.
  3. Oxyde is an async-first ORM with Pydantic typing, Django-like ergonomics, built-in migrations, and n+1 safety nets, offering high performance and modern ergonomics but still being early-stage for critical long-term projects.
Marcus on AI 16599 implied HN points 12 Aug 25
  1. Large language models (LLMs) are not like humans. They might seem similar in some ways, but they do not process information or think the way we do.
  2. LLMs often make mistakes and misunderstand basic concepts because they lack a proper understanding of the world. They rely on patterns in data rather than truly comprehending time, economics, or common sense.
  3. Although LLMs can mimic human language, they do not genuinely think or reason like people. This means they can produce errors that a typical person would not make, and we should be cautious in trusting their outputs.
SeattleDataGuy’s Newsletter 1165 implied HN points 23 Jan 26
  1. Practice analytical intuition by doing rough estimates, breaking problems into proxy values, understanding baselines and natural variance, and always running manual spot checks instead of blindly trusting tooling.
  2. When a metric moves unexpectedly, first confirm the data with multiple sources, then generate and test product, market, user, and external hypotheses to pinpoint the root cause and escalate with concrete analysis.
  3. Choose KPIs that are relevant, measurable, specific, prioritized, and balanced — pick the right type (North Star, top-level, secondary, or OMTM), avoid vanity metrics, and use simple, trusted proxy metrics tailored to your product.
arg min 317 implied HN points 08 Oct 24
  1. Interpolation is a process where we find a function that fits a specific set of input and output points. It's a useful tool for solving problems in optimization.
  2. We can build more complex function fitting problems by combining simple interpolation constraints. This allows for greater flexibility in how we define functions.
  3. Duality in convex optimization helps solve interpolation problems, enabling efficient computation and application in areas like machine learning and control theory.
Marcus on AI 17785 implied HN points 13 Jul 25
  1. Neurosymbolic AI combines two types of artificial intelligence: neural networks, which learn from data, and symbolic systems, which understand rules and logic. This blending can result in better performance than relying on one type alone.
  2. Despite being sidelined for years, recent evidence shows that using symbolic tools can significantly improve the effectiveness of AI systems. This suggests that the quiet resurgence of neurosymbolic AI could be key to future advancements.
  3. The industry's focus has largely been on scaling models powered by deep learning, which might not be enough for true AI progress. A more open approach that embraces neurosymbolic methods could lead to more breakthroughs and better results.
Marcus on AI 16441 implied HN points 28 Jun 25
  1. Generative AI struggles to create accurate models of the world. Without solid internal frameworks, they often get things wrong.
  2. Traditional AI uses clear and updateable world models for understanding, but current AI models like LLMs don't. This lack of structure leads to many errors in reasoning.
  3. Failures in AI, like making illegal moves in games or giving incorrect information, show that without proper world models, AI systems cannot reliably function.
arg min 297 implied HN points 04 Oct 24
  1. Using modularity, we can tackle many inverse problems by turning them into convex optimization problems. This helps us use simple building blocks to solve complex issues.
  2. Linear models can be a good approximation for many situations, and if we rely on them, we can find clear solutions to our inverse problems. However, we should be aware that they don't always represent reality perfectly.
  3. Different regression techniques, like ordinary least squares and LASSO, allow us to handle noise and sparse data effectively. Tuning the right parameters can help us balance accuracy and manageability in our models.
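To show how a tuning parameter trades fit against sparsity, here is a minimal sketch comparing ordinary least squares with LASSO for a single feature and no intercept, where both have closed forms. The objective assumed for LASSO is 0.5 * sum((y - w*x)^2) + lam * |w|; the data and function names are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def fit_ols(x, y):
    """Least-squares slope for y ~ w * x."""
    return dot(x, y) / dot(x, x)

def fit_lasso(x, y, lam):
    """Minimizer of 0.5 * sum((y - w*x)^2) + lam * |w| for one feature."""
    return soft_threshold(dot(x, y), lam) / dot(x, x)

# Hypothetical noisy data with a slope close to 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]

w_ols = fit_ols(x, y)
w_lasso = fit_lasso(x, y, lam=3.0)    # mild penalty shrinks the slope
w_zero = fit_lasso(x, y, lam=40.0)    # heavy penalty drives it to zero
print(w_ols, w_lasso, w_zero)
```

Raising `lam` moves the estimate from the least-squares fit toward zero, which is the mechanism behind LASSO's sparse solutions.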
Arpitrage 470 implied HN points 09 Feb 26
  1. Finance work is mostly about processing large volumes of documents, and building pipelines to extract, index, and semantically understand those texts lets teams scale research, compliance, and automated actions. You still need provenance, governance, and clear workflows so those outputs are trustworthy.
  2. AI abilities are uneven: it can boost accuracy and productivity on tasks inside its capability frontier but can hurt performance outside that frontier, so humans need to stay engaged with clear roles (e.g., dividing work or iterating together). This also means guarding against cognitive complacency as tools get easier to use.
  3. Hallucinations are a core risk with LLMs, and the practical fix today is grounding models with retrieval-augmented generation (RAG) that pulls answers from a curated corpus. RAG reduces made-up claims but doesn't eliminate errors, so high-stakes outputs still require human verification.
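The grounding idea behind RAG can be sketched without any model at all: retrieve the most relevant passage from a curated corpus, then put it in the prompt. The word-overlap scoring, the tiny corpus, and the function names below are illustrative assumptions; production systems typically use embedding-based search and pass the prompt to an LLM.

```python
import re

def tokenize(text):
    """Lowercase and split into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, corpus, k=1):
    """Return the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question, corpus):
    """Ground the question in retrieved text before it reaches a model."""
    context = "\n".join(retrieve(question, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Made-up corpus, for illustration only.
corpus = [
    "Revenue in 2023 was 12 million dollars.",
    "The compliance policy was updated in March.",
    "The office moved to a new building.",
]

question = "What was the revenue in 2023?"
top_doc = retrieve(question, corpus)[0]
prompt = build_prompt(question, corpus)
print(prompt)
```

Retrieval narrows what the model can draw on, but a wrong or incomplete passage still yields a wrong answer, hence the need for human verification on high-stakes outputs.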
High ROI Data Science 158 implied HN points 13 Oct 24
  1. AI is changing how we think about technology, moving beyond just improving what we have to creating entirely new ways to interact with it. This means businesses need to look for big, new opportunities, not just small tweaks.
  2. Having a strong data strategy is key for successful AI projects. This involves treating data as an important asset, gathering context, and making sure it's easy to access for training AI models.
  3. It's important to develop real, functional AI products that deliver clear value. Companies should focus on creating products that solve specific customer problems rather than just showing off cool technology.
Marcus on AI 16836 implied HN points 12 Jun 25
  1. Large reasoning models (LRMs) struggle with complex tasks, and while it's true that humans also make mistakes, we expect machines to perform better. The Apple paper highlights that LLMs can't be trusted for more complicated problems.
  2. Some rebuttals argue that bigger models might perform better, but we can't predict which models will succeed in various tasks. This leads to uncertainty about how reliable any model really is.
  3. Despite prior knowledge that these models generalize poorly, the Apple paper emphasizes the seriousness of the issue and shows that more people are finally recognizing the limitations of current AI technology.
Artificial Corner 119 implied HN points 16 Oct 24
  1. Reading is essential for understanding data science and machine learning. Books can help you learn these subjects from scratch or deepen your existing knowledge.
  2. One recommended book is 'Data Science from Scratch' by Joel Grus. It covers important math and statistics concepts that are crucial for data science.
  3. For beginners in Python, it's important to learn Python basics before diving into data science books. Supplement your reading with beginner-friendly Python books.
The Kaitchup – AI on a Budget 159 implied HN points 11 Oct 24
  1. Avoid using small batch sizes with gradient accumulation. It often leads to less accurate results compared to using larger batch sizes.
  2. Creating better document embeddings is important for retrieving information effectively. Including neighboring documents in embeddings can really help improve the accuracy of results.
  3. Aria is a new model that processes multiple types of inputs. It's designed to be efficient, but note that its higher parameter count means it may take up more memory.
Marcus on AI 9327 implied HN points 04 Aug 25
  1. AI slop refers to low-quality content generated by AI, which is spreading across various fields like journalism and science. This affects the reliability of information we receive.
  2. The term 'enshittification' describes how certain platforms are becoming filled with useless or misleading content, making it harder for users to find valuable information.
  3. As AI continues to be used more widely, the amount of inaccurate or low-quality information is growing, which is a significant concern for the future of communication and knowledge.
Marcus on AI 9762 implied HN points 27 Jul 25
  1. GPT-5 will be better than GPT-4, but it will still make many mistakes that are hard to predict. Users may find it tricky to control.
  2. Even with improvements, GPT-5 will struggle with complex reasoning and provide false information sometimes, which can be a problem for users counting on it.
  3. Real artificial general intelligence (AGI) won't come from just bigger models like GPT-5. We will need new designs that include better understanding and reasoning tools.
The Kaitchup – AI on a Budget 139 implied HN points 10 Oct 24
  1. Creating a good training dataset is key to making AI chatbots work well. Without quality data, the chatbot might struggle to perform its tasks effectively.
  2. Generating your own dataset using large language models can save time instead of collecting data from many different sources. This way, the data is tailored to what your chatbot really needs.
  3. Using personas can help you create specific question-and-answer pairs for the chatbot. It makes the training process more focused and relevant to various topics.
Don't Worry About the Vase 2464 implied HN points 28 Nov 25
  1. Claude Opus 4.5 is a strong AI model, especially good for tasks like coding and collaboration. It's noted for better alignment and safety than previous models.
  2. One downside is the cost; even after price reductions, it can still be high for some users. Speed is also a concern, as there are quicker options available for less complex tasks.
  3. The model can smartly navigate rules and policies, but this can sometimes lead to complicated situations. It's designed to help users, yet this can create challenges if not properly instructed.
Thái | Hacker | Kỹ sư tin tặc 2037 implied HN points 27 Jun 24
  1. The game of Diophantus, an ancient Greek mathematician, has had a lasting impact on cryptography and internet security, with the basis of elliptic curve cryptography originating from his mathematical puzzles.
  2. Diophantus's famous book 'Arithmetica' went missing for centuries but resurfaced to contribute to the advancements in mathematics, leading to significant discoveries like Fermat's Last Theorem.
  3. The study of elliptic curves, inspired by concepts like Kepler's study of ellipses, has become a central focus in mathematics, intersecting various branches like number theory, algebra, and geometry, and even impacting modern technology such as Bitcoin security.
The Intrinsic Perspective 31460 implied HN points 14 Nov 24
  1. AI development seems to have slowed down, with newer models not showing a big leap in intelligence compared to older versions. It feels like many recent upgrades are just small tweaks rather than revolutionary changes.
  2. Researchers believe that the improvements we see are often due to better search techniques rather than smarter algorithms. This suggests we may be returning to methods that dominated AI in earlier decades.
  3. There's still a lot of uncertainty about the future of AI, especially regarding risks and safety. The plateau in advancements might delay the timeline for achieving more advanced AI capabilities.
arg min 158 implied HN points 07 Oct 24
  1. Convex optimization has real benefits: it gathers a range of modeling tools under one framework and reliably finds a solution. However, not every problem fits neatly into a convex framework.
  2. Some complex problems, like dictionary learning and nonlinear models, often require nonconvex optimization, which can be tricky to handle but might be necessary for accurate results.
  3. Using machine learning methods can help solve inverse problems because they can learn the mapping from measurements to states, making it easier to compute solutions later, though training the model initially can take a lot of time.
Artificial Corner 138 implied HN points 09 Oct 24
  1. Python is a key language for AI because it has many useful libraries for tasks like data collection, cleaning, and visualization. Learning these libraries can help you work effectively on AI projects.
  2. For data collection, libraries like Requests and Beautiful Soup are useful for web scraping. If you need to handle JavaScript-driven sites, Selenium and Scrapy are great options.
  3. To visualize data, Matplotlib and Seaborn can help you create standard plots, while Plotly and Bokeh allow for interactive visualizations, making your data easier to understand.
Marcus on AI 9485 implied HN points 17 Jun 25
  1. A recent paper questions if large language models can really reason deeply, suggesting they struggle with even moderate complexity. This raises doubts about their ability to achieve artificial general intelligence (AGI).
  2. Some responses to this paper have been criticized as weak or even jokes, yet many continue to share them as if they are serious arguments. This shows confusion in the debate surrounding AI reasoning capabilities.
  3. New research supports the idea that AI systems perform poorly when faced with unfamiliar challenges, and do well mainly on problems they are already good at solving.
Marcus on AI 6837 implied HN points 22 Jul 25
  1. DeepMind and OpenAI's AI systems scored impressively at the International Mathematical Olympiad, matching the scores of top human contestants. This shows they can solve complex math problems very well.
  2. Despite their success, the systems' actual impact on real mathematical research is uncertain. High scores in math contests don't always translate to breakthroughs in original math work.
  3. There are concerns about how OpenAI ran its tests and reported results, as they didn't disclose methods as thoroughly as DeepMind did. This raises questions about the reliability of their achievements.
The Century of Biology 1416 implied HN points 23 Nov 25
  1. The biotech industry is seeing a shift towards using AI technologies. This is creating new opportunities for businesses that provide AI tools and infrastructure rather than just focusing on drug development.
  2. AI can potentially replace traditional experiments in biology, speeding up research and reducing costs. This allows scientists to explore many more ideas and possibilities without being limited by the physical experimentation process.
  3. Investing in AI infrastructure for biotech could lead to significant advancements and financial returns. If companies successfully scale their AI solutions, they could capture a big slice of the growing biotech market.
Data Science Weekly Newsletter 119 implied HN points 12 Sep 24
  1. Understanding AI interpretability is important for building resilient systems. We need to focus on why interpretability matters and how it relates to AI's resilience.
  2. Testing machine learning systems can be challenging, but starting with basic best practices like CI pipelines and E2E testing can help. This ensures the models work well in real-world scenarios.
  3. Visualizing machine learning models is crucial for better understanding and analysis. Tools like Mycelium can help create clear visual representations of complex data structures.