The hottest Data science Substack posts right now

And their main takeaways

WeKnow-RAG

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 39 implied HN points • 16 Aug 24

🕹 Technology AI Data science Machine Learning Natural Language Processing Information Retrieval

WeKnow-RAG uses a smart approach to gather information that mixes simple facts from its knowledge base with data found on the web. This helps improve the accuracy of answers given to users.
This system includes a self-check feature, which allows it to assess how confident it is in the information it provides. This helps to reduce mistakes and improve quality.
Knowledge Graphs are important because they organize information in a clear way, allowing the system to find the right data quickly and effectively, no matter what type of question is asked.

American Economic History: Introduction

Brad DeLong's Grasping Reality • 253 implied HN points • 22 Jan 25

🚌 Education History Economics Teaching Analysis Data science

The course will focus on American economic history without trying to create a single, simple story. Instead, it will look at different themes and questions week by week.
An important question will be whether America is exceptional and in what ways. This can help us better understand history and economics.
Students will not only learn about historical events but also get a taste of data science to analyze economic models and improve their analytical skills.

Inductive biases - a better way to think about machine learning?

Mindful Modeler • 159 implied HN points • 11 Jun 24

🕹 Technology Machine Learning Data science Book Writing

Hyperparameter settings can drastically change inductive biases within machine learning models.
Machine learning algorithms represent a collection of inductive biases that influence model outcomes.
Understanding inductive biases is crucial for comprehending the robustness, interpretability, and plausibility of machine learning models.

Do Prediction Technologies Help Novices or Experts More?

New Things Under the Sun • 224 implied HN points • 27 Jan 25

🕹 Technology AI Innovation Research Data science

AI can help both beginners and experts, but it depends on the tasks they are working on. Sometimes, beginners gain more because AI levels the playing field.
In some cases, experts benefit more from AI. They can solve complex problems that AI cannot, while beginners still struggle with those.
Prediction tools can make a big difference in innovation fields like mining and drug discovery. The impact varies based on expertise and the types of problems being addressed.

Creating Synthetic Training Data

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 59 implied HN points • 01 Aug 24

🕹 Technology Artificial Intelligence Data science Machine Learning Natural Language Processing Software Development

Creating synthetic data is hard because it's not just about making more data; it also needs to be diverse and varied. It's tough to make sure there are enough different examples.
Using a seed corpus can limit how varied the synthetic data is. If the starting data isn't diverse, the generated data won't be either.
A new approach called Persona Hub uses a billion different personas to create varied synthetic data. This helps in generating high-quality, interesting content across various situations.

Issue #9 - Clarifying Data Terminology

The Data Ecosystem • 159 implied HN points • 09 Jun 24

🕹 Technology Data Management Data Analysis Data Governance Data Literacy Data science

Data can mean many things, from raw collections to curated evidence used in decisions. It's important to define what data means in each situation to avoid confusion.
Poorly defined data terms can lead to problems in data literacy, collection, and management. This can create issues for organizations trying to use data effectively.
Understanding different categories of data, like data types and processing stages, helps in managing and analyzing data better. Knowing these categories makes it easier to communicate and use data in an organization.

Data Science Weekly - Issue 556

Data Science Weekly Newsletter • 79 implied HN points • 18 Jul 24

🕹 Technology Data science Artificial Intelligence Machine Learning Programming Data Engineering

AI research in China is progressing rapidly, but it hasn't received much attention compared to developments in the US. There are many complexities in understanding the implications of this advancement.
There are new methods to improve large language models (LLMs) using production data, which can enhance their performance over time. A structured approach to analyzing data quality can lead to better outcomes.
Evaluating modern machine learning models can be challenging, leading to some questionable research practices. It's important to understand these issues to ensure more accurate and reproducible results.

ModernBERT, the BERT of 2024

Gonzo ML • 63 implied HN points • 19 Dec 24

🕹 Technology AI Machine Learning Natural Language Processing Computing Data science

ModernBERT is a new version of BERT that improves processing speed and memory efficiency. It can handle longer contexts and makes BERT more practical for today's tasks.
The architecture of ModernBERT has been updated with features that enhance performance, like better attention mechanisms and optimized computations. This means it works faster and can process more data at once.
ModernBERT has shown impressive results in various natural language understanding tasks and can compete well against larger models, making it an exciting tool for developers and researchers.

💥 Tech Talks Weekly #28

Tech Talks Weekly • 59 implied HN points • 22 Aug 24

🕹 Technology Conferences Software Data science Artificial Intelligence Web Development

There are lots of new tech talks available from various conferences, making it easier to stay updated with the latest in technology.
You can help shape future content by filling out a quick feedback form, which takes less than 30 seconds.
Tech Talks Weekly offers a free subscription to help reduce the clutter of tech talk content and keep readers informed.

What is an AI recipe?

Sunday Letters • 19 implied HN points • 01 Sep 24

🕹 Technology AI Software Innovation Data science Programming

An AI recipe is a mix of code and AI thinking that helps solve problems. It's not just code or just prompts; it's a combination that guides the AI to achieve a goal.
Finding the right balance between structured code and flexible AI is tricky. This balance can feel similar to figuring out what makes a cake a cake.
As AI improves, the aim is to make these recipes work better and help connect human ideas directly to machine actions.

The CAP Theorem of Clustering: Why Every Algorithm Must Sacrifice Something

Confessions of a Code Addict • 529 implied HN points • 29 Oct 24

🕹 Technology Algorithms Data science Software Engineering Mathematics Research

Clustering algorithms can never be perfect and always require trade-offs. You can't have everything, so you have to choose what matters most for your project.
There are three key properties that clustering should ideally have: scale-invariance, richness, and consistency, but no algorithm can achieve all three simultaneously.
Understanding these sacrifices helps in making better decisions when using clustering methods. Knowing what to prioritize can lead to more effective data analysis.
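As a rough illustration of the trade-off, the sketch below uses a toy clustering rule, single-linkage with a fixed distance cutoff on 1-D points, and shows it giving up scale-invariance: rescaling the same data changes the answer. The function name `threshold_clusters`, the cutoff `r`, and the sample points are made up for this example and do not come from the post.

```python
def threshold_clusters(points, r):
    """Single-linkage with a fixed distance cutoff r on 1-D points:
    sorted neighbours closer than r end up in the same cluster."""
    ordered = sorted(points)
    clusters = [[ordered[0]]]
    for p in ordered[1:]:
        if p - clusters[-1][-1] < r:
            clusters[-1].append(p)   # close enough: extend the current cluster
        else:
            clusters.append([p])     # gap of at least r: start a new cluster
    return clusters


points = [0.0, 1.0, 2.0, 10.0, 11.0]
scaled = [3 * p for p in points]  # same data expressed in different units

print(len(threshold_clusters(points, r=2.5)))  # number of clusters, original units
print(len(threshold_clusters(scaled, r=2.5)))  # number of clusters after rescaling
```

The two counts differ because the cutoff is tied to the units of measurement. Removing the fixed cutoff (for example, always asking for k clusters) restores scale-invariance but gives up one of the other properties instead.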

The latest open artifacts (#7): Alpaca era of reasoning models, China's continued dominance, and tons of multimodal advancements

Democratizing Automation • 150 implied HN points • 19 Feb 25

🕹 Technology AI Machine Learning Open Source Data science Model development

New datasets for deep learning models are appearing, but choosing the right one can be tricky.
China is leading in AI advancements by releasing strong models with easy-to-use licenses.
Many companies are developing reasoning models that improve problem-solving by using feedback and advanced training methods.

Tülu 3: The next era in open post-training

Democratizing Automation • 404 implied HN points • 21 Nov 24

🕹 Technology AI Machine Learning Open Source Data science Software Development

Tülu 3 introduces an open-source approach to post-training models, allowing anyone to improve large language models like Llama 3.1 and reach performance similar to advanced models like GPT-4.
Recent advances in preference tuning and reinforcement learning help achieve better results with well-structured techniques and new synthetic datasets, making open post-training more effective.
The development of these models is pushing the boundaries of what can be done in language model training, indicating a shift in focus towards more innovative training methods.

Data Science Weekly - Issue 549

Data Science Weekly Newsletter • 159 implied HN points • 31 May 24

🕹 Technology Data science Artificial Intelligence Machine Learning Data Engineering Cloud Computing Software Development

Mediocre machine learning can be very risky for businesses, as it may lead to significant financial losses. Companies need to ensure their ML products are reliable and efficient.
Understanding logistic regression can be made easier by using predicted probabilities. This approach helps in clearly presenting data analysis results, especially to those who may not be familiar with technical terms.
Data quality management is becoming essential in today's data-driven world. It's important to keep track of how data is tested and monitored to maintain trust and accuracy in business decisions.
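The logistic regression point can be sketched in a few lines: rather than reporting a coefficient on the log-odds scale, report the predicted probability at a few representative inputs. The intercept, coefficient, and input values below are hypothetical and chosen only to illustrate the idea, not taken from the linked article.

```python
import math


def predicted_probability(intercept, coef, x):
    """Convert a logistic regression's linear predictor into a probability."""
    log_odds = intercept + coef * x
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical fitted model: log-odds of churn = -2.0 + 0.8 * support_tickets
intercept, coef = -2.0, 0.8

for tickets in (0, 2, 5):
    p = predicted_probability(intercept, coef, tickets)
    print(f"{tickets} tickets -> predicted churn probability {p:.2f}")
```

A statement like "customers with 5 tickets have a much higher predicted churn probability than customers with none" is usually easier for a non-technical audience than "each ticket adds 0.8 to the log-odds".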

Make softmax great again

Gonzo ML • 126 implied HN points • 06 Nov 24

🕹 Technology Artificial Intelligence Machine Learning Data science Neural Networks Transformers

Softmax is widely used in machine learning, especially in transformers, to turn numbers into probabilities. However, it struggles when dealing with new kinds of data that the model hasn't seen before.
The sharpness of softmax can fade when there's a lot of input data. This means it sometimes can't make clear predictions about which option is best in bigger datasets.
To improve softmax, researchers suggest using 'adaptive temperature.' This idea helps make the predictions sharper based on the data being processed, leading to better performance in some tasks.
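The fading sharpness is easy to reproduce. In the sketch below one item has a clearly higher logit than all the others, yet the probability it receives shrinks as the number of competing items grows; lowering the temperature restores it. Note the assumptions: the logit values are arbitrary, and the temperature here is a fixed constant picked by hand, which is a simplification and not the adaptive rule proposed by the researchers.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def max_prob(n_items, temperature=1.0):
    """Probability given to the best item when one logit is 5.0 and the rest are 0.0."""
    logits = [5.0] + [0.0] * (n_items - 1)
    return max(softmax(logits, temperature))


for n in (10, 1000, 100000):
    print(n, max_prob(n), max_prob(n, temperature=0.25))
```

With temperature 1.0 the winning item dominates among 10 inputs but receives well under 1% of the mass among 100,000. With the lower temperature it stays dominant at every size shown.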

The Sequence Opinion #480: What is GPT-o1 Actually Doing?

TheSequence • 161 implied HN points • 30 Jan 25

🕹 Technology AI Machine Learning Deep Learning Software Development Data science

GPT models are becoming more advanced in reasoning and problem-solving, not just generating text. They are now synthesizing programs and refining their results.
There's a focus on understanding how these models work internally through ideas like hypothesis search and program synthesis. This helps in grasping the real innovation they bring.
Reinforcement learning is a key technique used by newer models to improve their outputs. This shows that they are evolving and getting better at what they do.

AI Is A Car That Everyone Expects To Be A Spaceship

Beekey’s Substack • 59 implied HN points • 24 Jul 24

🕹 Technology AI Machine Learning Data science Automation Software Development

AI has made great improvements, especially with tasks that involve generating human-like responses and art. However, many people are getting carried away with the hype about its capabilities.
Machine learning allows AI to recognize patterns in data, but it doesn't actually understand content like a human does. This means it can make mistakes that a human wouldn't.
The idea of creating Artificial General Intelligence (AGI) from current AI is questionable because we still don't fully understand how human intelligence works. It's not just about being faster; something fundamental is still missing.

Data Science Weekly - Issue 553

Data Science Weekly Newsletter • 99 implied HN points • 27 Jun 24

🕹 Technology Data science Artificial Intelligence Machine Learning Data Engineering Data Visualization

Data visualization can show important patterns, like changes in night and daylight globally. Understanding these trends helps us appreciate our environment better.
In AI engineering, simplifying data preparation is crucial. Many new AI applications can be built without structured data, which might lead to rushed expectations about their effectiveness.
Aquaculture technology is evolving with better methods to track and analyze fish behavior. New approaches like deep learning are making monitoring more accurate and efficient.

Data Science Weekly - Issue 547

Data Science Weekly Newsletter • 179 implied HN points • 17 May 24

🕹 Technology Data science AI Machine Learning Data Visualization Software Development

Learning Rust programming can be made easy with exercises designed for beginners, even if you know another language already. You’ll work through small tasks to build confidence.
Data scientists need to learn how to work with databases to scale their analytics. Many face challenges when transitioning to this part of their work.
There are helpful tools, like Data Wrangler for VS Code, that simplify data cleaning and analysis. These tools help generate code automatically as you work with your data.

Data Science Weekly - Issue 541

Data Science Weekly Newsletter • 279 implied HN points • 05 Apr 24

🕹 Technology Data science AI Machine Learning Software Development Data Engineering

AI agents have unique challenges that traditional laws may not effectively solve. New rules and systems are needed to ensure they are managed properly.
JS-Torch is a new JavaScript library that makes deep learning easier for developers familiar with PyTorch. It allows building and training neural networks directly in the browser.
Data acquisition is crucial for AI start-ups to succeed. There are strategies outlined to help these businesses gather the right data efficiently.

DeepSeek V3 and R1

From the New World • 188 implied HN points • 28 Jan 25

🕹 Technology AI Machine Learning Computing Innovation Data science

DeepSeek has released a new AI model called R1, which can answer tough scientific questions. This model has quickly gained attention, competing with major players like OpenAI and Google.
There's ongoing debate about the authenticity of DeepSeek's claimed training costs and performance. Many believe that its reported costs and results might not be completely accurate.
DeepSeek has implemented several innovations to enhance its AI models. These optimizations have helped them improve performance while dealing with hardware limits and developing new training techniques.

Data Science Weekly - Issue 543

Data Science Weekly Newsletter • 219 implied HN points • 19 Apr 24

🕹 Technology Data science Machine Learning AI Analytics Data Engineering

Statistical ideas have a big impact on the world. Learning about important papers can help us understand how statistics shape modern research and decision-making.
Machine Learning teams have different roles that face unique challenges. Understanding these personas can help leaders support their teams better.
Vector embeddings can greatly improve search experiences in apps. They make it practical to build features, such as matching queries by meaning, that previously seemed too complex.
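A minimal sketch of embedding-based search: documents and the query are represented as vectors, and results are ranked by cosine similarity. The three-dimensional vectors and document names below are invented for illustration; a real system would obtain vectors from an embedding model and would typically use a dedicated vector index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=2):
    """Return the ids of the top_k documents most similar to the query."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Toy "embeddings" (hypothetical values, not produced by any real model)
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "return an item": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

print(search(query, index))
```

Because ranking depends on direction in the vector space, not shared keywords, "return an item" can rank near "refund policy" even though the two share no words.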

Looking back on AI in 2024

Generating Conversation • 46 implied HN points • 19 Dec 24

🕹 Technology AI Machine Learning Software Development Data science

AI companies need to show clear value to succeed. This means saving money or making profits, not just improving productivity.
Building customer trust is key for AI products. Letting customers test and experience the product firsthand is often more effective than complicated evaluation tools.
User experience with AI tools is really important. Good AI needs to be easy and enjoyable to use, which is a challenge that still needs solving.

Don't "fix" your imbalanced data

Mindful Modeler • 818 implied HN points • 05 Sep 23

🕹 Technology Data science Machine Learning

Avoid trying to fix imbalanced data through sampling methods like oversampling or undersampling. It can distort your model's calibration and reduce information for the majority class.
SMOTE, a common method for imbalanced data, works well only with weak classifiers, not strong ones. It may not be suitable if calibration is crucial for your model.
Consider doing nothing when faced with imbalanced data as a default strategy. Sometimes in machine learning, less is more.
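The calibration point can be made concrete with Bayes' rule. If only a fraction of the majority-class rows is kept, the odds that a well-fitted model reports are inflated by the inverse of that fraction. The sketch below assumes random undersampling of the negative class and an otherwise perfectly calibrated model; the function names and the example numbers are hypothetical.

```python
def prob_after_undersampling(p_true, keep_fraction):
    """Probability a perfectly fitted model would report after the majority
    (negative) class is randomly thinned to keep_fraction of its rows."""
    return p_true / (p_true + keep_fraction * (1.0 - p_true))


def correct_back(p_sampled, keep_fraction):
    """Invert the shift to recover the probability on the original distribution."""
    return keep_fraction * p_sampled / (keep_fraction * p_sampled + 1.0 - p_sampled)


p_true = 0.02          # true probability of the rare class for some input
keep_fraction = 0.05   # keep 5% of majority-class rows

p_reported = prob_after_undersampling(p_true, keep_fraction)
print(p_reported)                               # inflated probability
print(correct_back(p_reported, keep_fraction))  # back to the original scale
```

Here a true 2% risk would be reported as roughly 29%. The ranking of inputs is unchanged, which is why the distortion matters mainly when the probabilities themselves are used.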

to kaggle, or not to kaggle

Mindful Modeler • 379 implied HN points • 13 Feb 24

🕹 Technology Machine Learning Modeling Competitions Data science AI

There are conflicting views on Kaggle - some see it as a playground while others believe it produces top machine learning results.
Participating in Kaggle competitions can be beneficial to learn core supervised machine learning concepts.
The decision to focus on Kaggle competitions should depend on how much daily tasks align with Kaggle-style work.

How to get from evaluation to final model

Mindful Modeler • 279 implied HN points • 19 Mar 24

🕹 Technology Machine Learning Data science Model Deployment Model Evaluation

When moving from model evaluation to the final model, there are various approaches with trade-offs.
Options include training the final model on all data with the best hyperparameters, deploying an ensemble of models, or the lazy approach of picking one of the models trained during cross-validation.
Each approach, such as inside-out, parameter donation, or ensemble, has its pros and cons, highlighting the complexity of transitioning from evaluation to the final model.

Data Science Weekly - Issue 548

Data Science Weekly Newsletter • 139 implied HN points • 24 May 24

🕹 Technology Data science Machine Learning Artificial Intelligence Data Visualization Data Engineering

Good communication is key for statisticians to explain their complex work to non-experts. Finding ways to relate data to everyday situations can make it easier for others to understand.
Using histograms can speed up the training process for gradient boosted machines in data science. This simple technique can improve efficiency significantly.
There are efforts to use machine learning algorithms to detect type 1 diabetes in children earlier. This can help avoid serious health issues by improving recognition of symptoms.
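The histogram idea can be sketched as follows: instead of testing every distinct feature value as a split point, values are bucketed into a small number of bins, per-bin sums are accumulated once, and only the bin boundaries are scanned. This is a simplified, assumption-laden version that uses equal-width bins and a squared-error gain on residuals; production libraries use gradient and Hessian statistics and their own binning schemes. The function name and data are hypothetical.

```python
def best_split_histogram(x, residuals, n_bins=8):
    """Find the best bin-boundary split of feature x for squared-error loss."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    # One pass over the data to build the histogram of residual sums and counts
    for xi, ri in zip(x, residuals):
        b = min(int((xi - lo) / width), n_bins - 1)
        sums[b] += ri
        counts[b] += 1
    total_sum, total_n = sum(sums), len(x)
    best_gain, best_threshold = 0.0, None
    left_sum, left_n = 0.0, 0
    # Scan only the n_bins - 1 boundaries instead of every distinct value
    for b in range(n_bins - 1):
        left_sum += sums[b]
        left_n += counts[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_sum = total_sum - left_sum
        gain = (left_sum ** 2 / left_n + right_sum ** 2 / right_n
                - total_sum ** 2 / total_n)
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain


x = [float(i) for i in range(16)]
residuals = [-1.0 if xi < 8 else 1.0 for xi in x]
print(best_split_histogram(x, residuals))
```

The split search costs one pass over the rows plus a scan over the bins, so its cost no longer grows with the number of distinct feature values.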

June Newsletter

RSS DS+AI Section • 5 implied HN points • 01 Jun 25

🕹 Technology Data science Artificial Intelligence Machine Learning Ethics Research

Ethics and bias in AI are big topics right now. Many people are talking about how to keep AI safe and fair as it becomes more advanced.
There are many exciting developments in AI research, including new tools and methods. For example, some AI can now create new algorithms and even assist in healthcare.
Real-world applications of AI are growing, with many helpful resources and tutorials available. It's becoming easier for people to use AI for practical tasks and projects.

Moving Past RLHF: In 2025 We Will Transition from Preference Tuning to Reward Optimization in Foundation Models

TheSequence • 189 implied HN points • 29 Dec 24

🕹 Technology Artificial Intelligence Machine Learning Neural Networks Modeling Data science

Artificial intelligence is moving from preference tuning to reward optimization for better alignment with human values. This change aims to improve how models respond to our needs.
Preference tuning has its limits because it can't capture all the complexities of human intentions. Researchers are exploring new reward models to address these limitations.
Recent models like GPT-o3 and Tülu 3 showcase this evolution, showing how AI can become more effective and nuanced in understanding and generating language.

📝 Guest Post: Augmented SBERT: A Data Augmentation Method to Enhance Bi-Encoders for Pairwise Sentence Scoring*

TheSequence • 126 implied HN points • 31 Jan 25

🕹 Technology Natural Language Processing Data science Machine Learning Artificial Intelligence Software Development

Augmented SBERT (AugSBERT) improves sentence scoring tasks by using data augmentation to create more sentence pairs. This means it can perform better even when there's not much training data available.
Traditional methods like cross-encoders and bi-encoders have limitations, like being slow or needing a lot of data. AugSBERT addresses these issues, making it more efficient for large-scale tasks.
The approach combines the strengths of different models to enhance performance, especially in specific domains. It shows significant improvements over existing models, making it a useful tool for various natural language processing applications.

LLM Links, 11/21

In My Tribe • 273 implied HN points • 21 Nov 24

🕹 Technology AI Machine Learning Robotics Data science Innovation

There's a debate about AI progress. Some experts think AI models are hitting a limit and may not get much smarter, while others believe we will continue to see significant advancements.
While machine learning can learn from explicit knowledge, it struggles with understanding deeper, unspoken human knowledge. This limitation might prevent AI from reaching the same expertise as human experts.
AI technologies are still showing exciting developments, like robots learning to perform surgeries by watching videos. This points to the potential for AI to revolutionize fields like medicine.

Data Science Weekly - Issue 539

Data Science Weekly Newsletter • 259 implied HN points • 22 Mar 24

🕹 Technology Data science AI Machine Learning Data Engineering Data Visualization

Data storytelling is important for sharing insights, and AI can help people create better stories. The research looks at how different tools assist in each storytelling stage.
Switching from R to Python in data science isn't just about learning new syntax; it's a mindset change. New Python tools can help make this transition smoother for users coming from R's tidyverse.
Emerging technologies often face skepticism, as seen throughout history. New inventions have raised concerns about their impact, but they eventually become part of everyday life.

Data Science Weekly - Issue 532

Data Science Weekly Newsletter • 379 implied HN points • 02 Feb 24

🕹 Technology Data science Machine Learning Artificial Intelligence Data Engineering Data Visualization

Forecasting in data science is challenging because time series data can be non-stationary. Using the right evaluation methods can help bridge the gap between traditional and modern forecasting techniques.
It's worth asking how sophisticated your data structures really need to be. Building an overly complicated dashboard that ultimately produces only simple outputs may not be the best use of time.
There are clear distinctions between well-built data pipelines and amateur setups. Understanding what makes a pipeline production-grade can improve the quality and reliability of data processing.

What’s after GPT-5?

Sector 6 | The Newsletter of AIM • 419 implied HN points • 18 Jan 24

🕹 Technology Artificial Intelligence Machine Learning Software Development Digital innovation Data science

OpenAI is always working on better models to improve AI, and this journey is continuous.
The upcoming GPT-5 model will allow AI to process and create video content.
AI will become capable of completing complex tasks, which will help increase productivity for users.

Data Science Weekly - Issue 533

Data Science Weekly Newsletter • 339 implied HN points • 09 Feb 24

🕹 Technology Data science Machine Learning Artificial Intelligence AI Research Data Engineering

Satellite data is important for machine learning and should be treated as a unique area of research. Recognizing this can help improve how we use this data.
Many data science and machine learning projects fail from the start due to common mistakes. Learning from past experiences can help increase the chances of success.
Open source software plays a crucial role in advancing AI technology. It's important to support and protect open source AI from regulations that could harm its progress.

Security is Not an AI Problem

Security Is • 159 implied HN points • 02 May 24

🕹 Technology AI Cybersecurity Machine Learning Data science Digital Tools

AI doesn't really fix security problems well. Many times, the technology just doesn't work in the tough, unpredictable environments that security deals with.
The best results in security often come from simple, clear procedures, not from complex machine learning models. Basic rules can solve most problems effectively.
Generative AI can help with minor tasks but isn't a magic solution for security. It might even confuse people about important issues, rather than clarify them.

OLMo 2 and building effective teams for training language models

Democratizing Automation • 245 implied HN points • 26 Nov 24

🕹 Technology AI Machine Learning Software Development Data science Open Source

Effective language model training needs attention to detail and technical skills. Small issues can have complex causes that require deep understanding to fix.
As teams grow, strong management becomes essential. Good managers can prioritize the right tasks and keep everyone on track for better outcomes.
Long-term improvements in language models come from consistent effort. It’s important to avoid getting distracted by short-term goals and instead focus on sustainable progress.

AI Roundup 097: Model Mayhem

Artificial Ignorance • 46 implied HN points • 13 Dec 24

🕹 Technology AI Models Machine Learning Software Development Data science Tech Companies

Google has launched new AI models such as Gemini 2.0, which can create text, images, and audio quickly. They also introduced tools to summarize video content and help users with web tasks.
OpenAI released several features, including a text-to-video model named Sora for paying users. They also improved ChatGPT's digital editing tool and added new voice capabilities for video interactions.
Meta and other companies are also advancing in AI with new models for cheaper yet effective performance and tools for watermarking AI-generated videos, showing that competition in AI is heating up.

Data Science Weekly - Issue 544

Data Science Weekly Newsletter • 159 implied HN points • 26 Apr 24

🕹 Technology Data science Machine Learning Artificial Intelligence Data Engineering Data Visualization

Evaluating AI models can be expensive, but tools like lm-buddy and Prometheus make it possible to run evaluations on cheaper hardware.
Installing and deploying LLaMA 3 is made simple with clear guides that cover everything from setup to scaling effectively.
Understanding best practices in machine learning is essential, and resources like the 'Rules of Machine Learning' provide valuable guidelines for beginners.

The types of AI, visually explained 🤖

Year 2049 • 15 implied HN points • 16 Jan 25

🕹 Technology AI Machine Learning Data science Robotics Automation

AI comes in different types, and it's good to know what they are. Understanding the types helps us see how AI works in our daily lives.
Machines learn to become intelligent over time, which is fascinating. This process is important to understand how AI evolves.
It's helpful to share knowledge about AI with others. Teaching friends and family can make everyone more aware of how AI impacts us.