The hottest data science Substack posts right now

And their main takeaways
Category: Top Technology Topics
Don't Worry About the Vase 2419 implied HN points 02 Jan 25
  1. AI is becoming more common in everyday tasks, helping people manage their lives better. For example, using AI to analyze mood data can lead to better mental health tips.
  2. As AI technology advances, there are concerns about job displacement. Jobs in fields like science and engineering may change significantly as AI takes over routine tasks.
  3. The shift of AI companies from non-profit to for-profit models could change how AI is developed and used. It raises questions about safety, governance, and the mission of these organizations.
Don't Worry About the Vase 1881 implied HN points 09 Jan 25
  1. AI can already handle many useful tasks, but many people still don't see its value or know how to use it effectively. It's important to change that mindset.
  2. Companies are realizing that fixed subscription prices for AI services might not be sustainable because usage varies greatly among users.
  3. Many folks are worried about AI despite not fully understanding it. It's crucial to communicate AI's potential benefits and reduce fears around job loss and other concerns.
VuTrinh. 659 implied HN points 10 Sep 24
  1. Apache Spark uses a system called Catalyst to plan and optimize how data is processed. This system helps make sure that queries run as efficiently as possible.
  2. In Spark 3, a feature called Adaptive Query Execution (AQE) was added. It lets Spark revise its query plan while the query is running, based on runtime statistics (a minimal configuration sketch follows this list).
  3. Airbnb uses this AQE feature to improve how they handle large amounts of data. This lets them dynamically adjust the way data is processed, which leads to better performance.
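To make the AQE point concrete, here is a minimal PySpark sketch of turning the feature on. The configuration keys are standard Spark 3 settings, while the file paths, table layout, and column names are placeholders for illustration, not anything from Airbnb's actual setup.

```python
# Minimal PySpark sketch: enabling Adaptive Query Execution (AQE) in Spark 3.x.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # AQE lets Catalyst re-plan a query at runtime using observed statistics.
    .config("spark.sql.adaptive.enabled", "true")
    # Merge small shuffle partitions after a stage finishes.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed partitions so one huge key doesn't stall a join.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)

events = spark.read.parquet("/data/events")  # hypothetical dataset paths
users = spark.read.parquet("/data/users")

# With AQE enabled, the join strategy and shuffle partition counts below can be
# adjusted while the query runs, based on the actual data sizes observed.
counts_by_country = (
    events.join(users, "user_id")
          .groupBy("country")
          .count()
)
counts_by_country.explain()  # the physical plan is wrapped in "AdaptiveSparkPlan"
```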
The Kaitchup – AI on a Budget 59 implied HN points 25 Oct 24
  1. Qwen2.5 models have been improved and now come in a 4-bit version, making them efficient to run on a range of hardware. They perform better than previous models on many tasks (a loading sketch follows this list).
  2. Google's SynthID tool can add invisible watermarks to AI-generated text, helping to identify it without changing the text's quality. This could become a standard practice to distinguish AI text from human writing.
  3. Cohere has launched Aya Expanse, new multilingual models that outperform many existing models. They took two years to develop, involving thousands of researchers, enhancing language support and performance.
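For readers who want to try the 4-bit angle, here is a hedged sketch of loading a quantized model with Hugging Face transformers and bitsandbytes. The repo id and prompt are assumptions, and the post's own 4-bit releases may use a different quantization format (e.g. AWQ or GPTQ) rather than on-the-fly bitsandbytes quantization.

```python
# Sketch: load a causal LM in 4-bit with transformers + bitsandbytes and generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"  # assumed model id for illustration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization scheme
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/accuracy
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

prompt = "Explain gradient accumulation in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```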
Don't Worry About the Vase 2598 implied HN points 26 Dec 24
  1. The new AI model, o3, is expected to improve performance significantly over previous models and is undergoing safety testing. We need to see real-world results to know how useful it truly is.
  2. DeepSeek v3, developed for a low cost, shows promise as an efficient AI model. Its performance could shift how AI models are built and deployed, depending on user feedback.
  3. Many users are realizing that using multiple AI tools together can produce better results, suggesting a trend of combining various technologies to meet different needs effectively.
AI: A Guide for Thinking Humans 196 implied HN points 13 Feb 25
  1. Transformer models like OthelloGPT may have learned to represent the rules and state of simple games, which suggests they can build some kind of world model. This was tested by probing how they predict moves in the game Othello.
  2. While some researchers believe these models are impressive, others think they are not as advanced as human thinking. Instead of forming clear models, LLMs might just use many small rules or heuristics to make decisions.
  3. The evidence for LLMs having complex, abstract world models is still debated. There are hints of this in controlled settings, but they might just be using collections of rules that don't easily adapt to new situations.
The Kaitchup – AI on a Budget 179 implied HN points 17 Oct 24
  1. You can create a custom AI chatbot easily and cheaply now. New methods make it possible to train smaller models like Llama 3.2 without spending much money.
  2. Fine-tuning a chatbot requires careful preparation of the dataset, in particular knowing how to format your question-and-answer pairs correctly (see the sketch after this list).
  3. Avoiding common mistakes during training is crucial. Understanding these pitfalls will help ensure your chatbot works well after it's trained.
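As a rough illustration of the dataset-formatting point, here is a small sketch that renders question-answer pairs with a model's own chat template so fine-tuning sees the same format as inference. The model id and example pairs are placeholders, and the original post may use a different template or training library.

```python
# Sketch: turn raw Q&A pairs into chat-formatted training text.
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # assumed (gated) model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

raw_pairs = [
    {"question": "What does a data engineer do?",
     "answer": "They build and maintain pipelines that move and transform data."},
    {"question": "What is a feature store?",
     "answer": "A central place to store and serve features for ML models."},
]

def to_training_text(pair):
    # Each example becomes a short conversation rendered with the model's
    # chat template, so the fine-tuned model sees a consistent format.
    messages = [
        {"role": "user", "content": pair["question"]},
        {"role": "assistant", "content": pair["answer"]},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)

dataset_texts = [to_training_text(p) for p in raw_pairs]
print(dataset_texts[0])
```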
Marcus on AI 7825 implied HN points 13 Feb 25
  1. OpenAI's plan to just make bigger AI models isn't working anymore. They need to find new ways to improve AI instead of just adding more data and parameters.
  2. The model originally envisioned as GPT-5 has been scaled back to GPT-4.5. This suggests the project hasn't met expectations and isn't a big step forward.
  3. Even if pure scaling isn't the answer, AI development will continue. There are still many ways to create smarter AI beyond just making models larger.
Marcus on AI 7074 implied HN points 09 Feb 25
  1. Just adding more data to AI models isn't enough to achieve true artificial general intelligence (AGI). New techniques are necessary for real advancements.
  2. Combining neural networks with traditional symbolic methods is becoming more popular, showing that blending approaches can lead to better results.
  3. The competition in AI has intensified, making large language models somewhat of a commodity. This could change how businesses operate in the generative AI market.
LatchBio 15 implied HN points 27 Feb 25
  1. Spatial RNA technology helps us see how cells interact in their natural environment. It gives a clearer picture than traditional methods that just show gene activity without their locations.
  2. There are many ways to capture and analyze spatial gene data, like using specially barcoded slides or microfluidic methods. Each approach has its pros and cons depending on what researchers want to study.
  3. Advancements in technology are making it possible to analyze tiny details, like individual cells or even parts of cells. This opens new doors for understanding biology and diseases.
arg min 317 implied HN points 08 Oct 24
  1. Interpolation means finding a function that passes through a given set of input-output points. It's a useful building block for optimization problems (a small worked sketch follows this list).
  2. We can build more complex function fitting problems by combining simple interpolation constraints. This allows for greater flexibility in how we define functions.
  3. Duality in convex optimization helps solve interpolation problems, enabling efficient computation and application in areas like machine learning and control theory.
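A minimal sketch of the idea, assuming a polynomial function class and cvxpy as the solver: among all polynomials that satisfy the interpolation constraints, pick the one with the smallest coefficient norm. This illustrates the general recipe of function fitting under interpolation constraints; it is not the post's exact formulation.

```python
# Sketch: minimum-norm polynomial interpolation as a convex program.
import numpy as np
import cvxpy as cp

x = np.array([0.0, 1.0, 2.0, 3.0])   # inputs
y = np.array([1.0, 2.0, 0.0, 5.0])   # required outputs
degree = 6                            # over-parameterized on purpose

# Vandermonde matrix: row i is (1, x_i, x_i^2, ..., x_i^degree).
A = np.vander(x, degree + 1, increasing=True)

c = cp.Variable(degree + 1)
problem = cp.Problem(
    cp.Minimize(cp.sum_squares(c)),   # prefer "small" functions
    [A @ c == y],                     # interpolation constraints
)
problem.solve()

print("coefficients:", np.round(c.value, 3))
print("fitted values:", np.round(A @ c.value, 3))  # matches y
```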
SeattleDataGuy’s Newsletter 341 implied HN points 27 May 25
  1. Apache Iceberg might seem appealing, but it won't automatically solve your data problems. It's important to really understand what issues you're trying to address before jumping in.
  2. Switching to new tools like Iceberg won't fix a broken data strategy. The focus should be on delivering real business value, not just adopting the latest technology.
  3. If your data team is already doing well and looking to improve, Iceberg could be useful. But make sure it's the right fit for your specific challenges instead of following trends.
Don't Worry About the Vase 2464 implied HN points 12 Dec 24
  1. AI technology is rapidly improving, with advancements coming from many companies, including OpenAI and Google. New capabilities are emerging that let more complex tasks be handled efficiently.
  2. People are starting to think more seriously about the potential risks of advanced AI, including concerns related to AI being used in defense projects. This brings up questions about ethics and the responsibilities of those creating the technology.
  3. AI tools are being integrated into everyday tasks, making things easier for users. People are finding practical uses for AI in their lives, like getting help with writing letters or reading books, making AI more useful and accessible.
ppdispatch 2 implied HN points 13 Jun 25
  1. There's a new multilingual text embedding benchmark called MMTEB that covers over 500 tasks in more than 250 languages. A smaller model surprisingly outperforms much larger ones.
  2. Saffron-1 is a new method designed to make large language models safer and more efficient, especially in resisting attacks.
  3. Harvard released a massive dataset of 242 billion tokens from public domain books, which can help in training language models more effectively.
Enterprise AI Trends 253 implied HN points 31 Jan 25
  1. DeepSeek's release showed that simple reinforcement learning can create smart models. This means you don't always need complicated methods to achieve good results.
  2. Using more computing power can lead to better outcomes when it comes to AI results. DeepSeek's approach hints at cost-saving methods for training large models.
  3. OpenAI is still a major player in the AI field, even though some people think DeepSeek and others will take over. OpenAI's early work has helped it stay ahead despite new competition.
TheSequence 49 implied HN points 05 Jun 25
  1. AI models are becoming super powerful, but we don't fully understand how they work. Their complexity makes it hard to see how they make decisions.
  2. There are new methods being explored to make these AI systems more understandable, including using other AI to explain them. This is a fresh approach to tackle AI interpretability.
  3. The debate continues about whether investing a lot of resources into understanding AI is worth it compared to other safety measures. We need to think carefully about what we risk if we don't understand these machines better.
Marcus on AI 13754 implied HN points 09 Nov 24
  1. LLMs, or large language models, are hitting a point where adding more data and computing power isn't leading to better results. This means companies might not see the improvements they hoped for.
  2. The excitement around generative AI may fade as reality sets in, making it hard for companies like OpenAI to justify their high valuations. This could lead to a financial downturn in the AI industry.
  3. There is a need to explore other AI approaches since relying too heavily on LLMs might be a risky gamble. It might be better to rethink strategies to achieve reliable and trustworthy AI.
Democratizing Automation 324 implied HN points 27 May 25
  1. Claude 4 is a strong AI model from Anthropic, focused on coding and software tasks. It has a unique personality and improved performance over its predecessors.
  2. The benchmarks for Claude 4 might not look impressive compared to others like ChatGPT and Gemini, which could affect its market position. It's crucial for Anthropic to show real-world utility beyond just numbers.
  3. Anthropic aims to lead in software development, but they fall behind in general benchmarks. This may limit their ability to compete with bigger players like OpenAI and Google in the race for advanced AI.
One Useful Thing 2226 implied HN points 09 Dec 24
  1. AI is great for generating lots of ideas quickly. Instead of getting stuck after a few, you can use AI to come up with many different options.
  2. It's helpful to use AI when you have expertise and can easily spot mistakes. You can rely on it to assist with complex tasks without losing track of quality.
  3. However, be cautious using AI for learning or where accuracy is critical. It may shortcut your learning and sometimes make errors that are hard to notice.
Democratizing Automation 277 implied HN points 29 May 25
  1. There is a rise in Chinese AI models that use more open licenses, influencing other models to adopt similar practices. This pressure is especially affecting Western companies like Meta and Google.
  2. Qwen models are becoming more popular for fine-tuning compared to Llama models, with smaller American startups favoring Qwen. These trends show a shift in preferences in the AI community.
  3. The focus in AI is shifting from just model development to creating tools that leverage these models. This means future releases will often be tool-based rather than just about the AI models themselves.
Marcus on AI 7786 implied HN points 06 Jan 25
  1. AGI is still a big challenge, and not everyone agrees it's close to being solved. Some experts highlight many existing problems that have yet to be effectively addressed.
  2. There are significant issues with AI's ability to handle changes in data, which can lead to mistakes in understanding or reasoning. These distribution shifts have been seen in past research.
  3. Many believe that relying solely on large language models may not be enough to improve AI further. New solutions or approaches may be needed instead of just scaling up existing methods.
arg min 297 implied HN points 04 Oct 24
  1. Using modularity, we can tackle many inverse problems by turning them into convex optimization problems. This helps us use simple building blocks to solve complex issues.
  2. Linear models can be a good approximation for many situations, and if we rely on them, we can find clear solutions to our inverse problems. However, we should be aware that they don't always represent reality perfectly.
  3. Different regression techniques, like ordinary least squares and LASSO, let us handle noise and sparse data effectively. Tuning the right parameters helps balance accuracy against model simplicity (a short comparison sketch follows this list).
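Here is a small synthetic comparison of ordinary least squares and LASSO using scikit-learn; the data, noise level, and regularization strength are illustrative choices, not taken from the post.

```python
# Sketch: OLS fits all coefficients; LASSO's L1 penalty zeroes out most of them.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))

true_coef = np.zeros(p)
true_coef[:3] = [2.0, -1.0, 0.5]               # only 3 features actually matter
y = X @ true_coef + 0.1 * rng.normal(size=n)   # linear model plus noise

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.05).fit(X, y)            # alpha trades accuracy vs sparsity

print("OLS nonzero coefficients:  ", np.sum(np.abs(ols.coef_) > 1e-6))
print("LASSO nonzero coefficients:", np.sum(np.abs(lasso.coef_) > 1e-6))
```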
High ROI Data Science 158 implied HN points 13 Oct 24
  1. AI is changing how we think about technology, moving beyond just improving what we have to creating entirely new ways to interact with it. This means businesses need to look for big, new opportunities, not just small tweaks.
  2. Having a strong data strategy is key for successful AI projects. This involves treating data as an important asset, gathering context, and making sure it's easy to access for training AI models.
  3. It's important to develop real, functional AI products that deliver clear value. Companies should focus on creating products that solve specific customer problems rather than just showing off cool technology.
Artificial Corner 119 implied HN points 16 Oct 24
  1. Reading is essential for understanding data science and machine learning. Books can help you learn these subjects from scratch or deepen your existing knowledge.
  2. One recommended book is 'Data Science from Scratch' by Joel Grus. It covers important math and statistics concepts that are crucial for data science.
  3. For beginners in Python, it's important to learn Python basics before diving into data science books. Supplement your reading with beginner-friendly Python books.
The Kaitchup – AI on a Budget 159 implied HN points 11 Oct 24
  1. Avoid treating gradient accumulation with small batch sizes as a drop-in substitute for large batches; in practice it often produces less accurate results (a minimal sketch of the technique follows this list).
  2. Creating better document embeddings is important for retrieving information effectively. Including neighboring documents in embeddings can really help improve the accuracy of results.
  3. Aria is a new model that processes multiple types of inputs. It's designed to be efficient, but its larger parameter count means it may take up more memory.
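For context on the first point, here is a minimal PyTorch sketch of what gradient accumulation does: losses from several small batches are scaled by the number of accumulation steps before backward(), so the summed gradients approximate one large-batch update. The model, data, and step counts are toy placeholders, and the post's argument is precisely that this is not always numerically equivalent to training with a genuinely larger batch.

```python
# Sketch: gradient accumulation over several micro-batches in PyTorch.
import torch
from torch import nn

model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

accum_steps = 4            # 4 micro-batches of 8 ~ one batch of 32
micro_batches = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(accum_steps)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = loss_fn(model(x), y) / accum_steps   # scale so summed gradients average correctly
    loss.backward()                             # gradients accumulate in .grad
    if step % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```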
Don't Worry About the Vase 1971 implied HN points 04 Dec 24
  1. Language models can be really useful in everyday tasks. They can help with things like writing, translating, and making charts easily.
  2. There are serious concerns about AI safety and misuse. It's important to understand and mitigate risks when using powerful AI tools.
  3. AI technology might change the job landscape, but it's also essential to consider how it can enhance human capabilities instead of just replacing jobs.
Marcus on AI 5968 implied HN points 05 Jan 25
  1. AI struggles with common sense. While humans easily understand everyday situations, AI often fails to make the same connections.
  2. Current AI models, like large language models, don't truly grasp the world. They may create text that seems correct but often make basic mistakes about reality.
  3. To improve AI's performance, researchers need to find better ways to teach machines commonsense reasoning, rather than relying on existing data and simulations.
Marcus on AI 5019 implied HN points 13 Jan 25
  1. We haven't reached Artificial General Intelligence (AGI) yet. People can still easily come up with problems that AI systems can't solve without training.
  2. Current AI systems, like large language models, are broad but not deep in understanding. They might seem smart, but they can make silly mistakes and often don't truly grasp the concepts they discuss.
  3. It's important to keep working on AI that isn't just broad and shallow. We need smarter systems that can reliably understand and solve different problems.
Don't Worry About the Vase 1164 implied HN points 19 Dec 24
  1. The release of o1 into the API is significant. It enables developers to build applications with its capabilities, making it more accessible for various uses.
  2. Anthropic released an important paper about alignment issues in AI. It highlights some worrying behaviors in large language models that need more awareness and attention.
  3. There are still questions about how effectively AI tools are being used. Many people might not fully understand what AI can do or how to use it to enhance their work.
The Kaitchup – AI on a Budget 139 implied HN points 10 Oct 24
  1. Creating a good training dataset is key to making AI chatbots work well. Without quality data, the chatbot might struggle to perform its tasks effectively.
  2. Generating your own dataset using large language models can save time instead of collecting data from many different sources. This way, the data is tailored to what your chatbot really needs.
  3. Using personas helps you create varied, focused question-and-answer pairs for the chatbot, making the training data more relevant across topics (a generation sketch follows this list).
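As a hedged sketch of the persona idea, the snippet below asks an LLM to produce one question-answer pair per persona. The OpenAI client, model name, and example personas are illustrative assumptions, not the post's actual pipeline.

```python
# Sketch: persona-driven synthetic Q&A generation for a fine-tuning dataset.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

personas = [
    "a junior data analyst learning SQL",
    "a site reliability engineer who cares about monitoring",
    "a product manager with no coding background",
]

def generate_pair(persona: str, topic: str) -> dict:
    prompt = (
        f"You are {persona}. Write one question you would ask about {topic}, "
        "then answer it concisely. Return JSON with keys 'question' and 'answer'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(response.choices[0].message.content)

dataset = [generate_pair(p, "data pipelines") for p in personas]
print(json.dumps(dataset, indent=2))
```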
Marcus on AI 4545 implied HN points 15 Jan 25
  1. AI agents are getting a lot of attention right now, but they still aren't reliable. Most of what we see this year are just demos that don't work well in real life.
  2. In the long run, we might have powerful AI agents doing many jobs, but that won't happen for a while. For now, we need to be careful about the hype.
  3. To build truly helpful AI agents, we need to solve big challenges like common sense and reasoning. If those issues aren't fixed, the agents will continue to give strange or wrong results.
From the New World 188 implied HN points 28 Jan 25
  1. DeepSeek has released a new AI model called R1, which can answer tough scientific questions. This model has quickly gained attention, competing with major players like OpenAI and Google.
  2. There's ongoing debate about the authenticity of DeepSeek's claimed training costs and performance. Many believe that its reported costs and results might not be completely accurate.
  3. DeepSeek has implemented several innovations to enhance its AI models. These optimizations have helped them improve performance while dealing with hardware limits and developing new training techniques.
Thái | Hacker | Kỹ sư tin tặc 2037 implied HN points 27 Jun 24
  1. The mathematical puzzles of Diophantus, the ancient Greek mathematician, have had a lasting impact on cryptography and internet security: elliptic curve cryptography grew out of the kinds of equations he studied.
  2. Diophantus's famous book 'Arithmetica' was lost for centuries before resurfacing and fueling major advances in mathematics, including the line of work that led to Fermat's Last Theorem.
  3. The study of elliptic curves, inspired by concepts like Kepler's study of ellipses, has become a central focus in mathematics, intersecting various branches like number theory, algebra, and geometry, and even impacting modern technology such as Bitcoin security.
arg min 158 implied HN points 07 Oct 24
  1. Convex optimization has benefits, like collecting various modeling tools and always finding a reliable solution. However, not every problem fits neatly into a convex framework.
  2. Some complex problems, like dictionary learning and nonlinear models, often require nonconvex optimization, which can be tricky to handle but might be necessary for accurate results.
  3. Machine learning methods can help solve inverse problems by learning the mapping from measurements back to states, which makes later solves fast to compute, though training the model up front can take a lot of time (a toy sketch follows this list).
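To illustrate the "learn the inverse map" idea from the last point, here is a toy sketch that simulates a known linear forward model and trains a regressor to recover states from measurements. The forward operator, network size, and noise level are all assumptions for illustration, not from the post.

```python
# Sketch: train a regressor that maps measurements y back to states x.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_train, state_dim, meas_dim = 5000, 10, 25

A = rng.normal(size=(meas_dim, state_dim))           # known forward operator
X_states = rng.normal(size=(n_train, state_dim))     # simulated training states
Y_meas = X_states @ A.T + 0.01 * rng.normal(size=(n_train, meas_dim))

# Training is the expensive part; once fitted, inversion is a single forward pass.
inverse_net = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=300)
inverse_net.fit(Y_meas, X_states)

x_true = rng.normal(size=state_dim)
y_new = A @ x_true + 0.01 * rng.normal(size=meas_dim)
x_hat = inverse_net.predict(y_new.reshape(1, -1))[0]
print("reconstruction error:", np.linalg.norm(x_hat - x_true))
```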
Artificial Corner 138 implied HN points 09 Oct 24
  1. Python is a key language for AI because it has many useful libraries for tasks like data collection, cleaning, and visualization. Learning these libraries can help you work effectively on AI projects.
  2. For data collection, libraries like Requests and Beautiful Soup are useful for web scraping. If you need to handle JavaScript-driven sites, Selenium and Scrapy are great options.
  3. To visualize data, Matplotlib and Seaborn cover standard plots, while Plotly and Bokeh enable interactive visualizations that make your data easier to explore (a combined scraping-and-plotting sketch follows this list).
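A short sketch combining two of the libraries mentioned above: Requests plus Beautiful Soup for collection, then Matplotlib for a quick plot. The URL is a placeholder, and a real scraper should respect the target site's robots.txt and terms of use.

```python
# Sketch: scrape article titles from a page and plot keyword frequencies.
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

url = "https://example.com/articles"          # hypothetical listing page
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Collect headline text and count a few keywords of interest.
titles = [h.get_text(strip=True) for h in soup.find_all("h2")]
keywords = ["python", "data", "ai"]
counts = {k: sum(k in t.lower() for t in titles) for k in keywords}

plt.bar(list(counts.keys()), list(counts.values()))
plt.title("Keyword frequency in scraped article titles")
plt.ylabel("count")
plt.show()
```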
Gonzo ML 126 implied HN points 08 Feb 25
  1. DeepSeek-V3 uses a lot of training data, with 14.8 trillion tokens, which helps it learn better and understand more languages. It's been improved with more math and programming examples for better performance.
  2. The training process has two main parts: pre-training and post-training. After learning the basics, it gets fine-tuned to enhance its ability to follow instructions and improve its reasoning skills.
  3. DeepSeek-V3 has shown impressive results in benchmarks, often performing better than other models despite having fewer parameters, making it a strong competitor in the AI field.
Faster, Please! 639 implied HN points 06 Jan 25
  1. In a few years, we might see AI agents start working alongside humans, which could really change how companies function.
  2. Tech leaders believe that powerful AI could lead to huge advances in science and medicine, speeding up progress significantly.
  3. While there is excitement about AI's potential, it's also important to manage the risks to make sure it benefits everyone.
Marcus on AI 4189 implied HN points 09 Jan 25
  1. AGI, or artificial general intelligence, is not expected to be developed by 2025. This means that machines won't be as smart as humans anytime soon.
  2. The release of GPT-5, a new AI model, is also uncertain. Even experts aren't sure if it will be out this year.
  3. There is a trend of people making overly optimistic predictions about AI. It's important to be realistic about what technology can achieve right now.
Big Technology 5129 implied HN points 22 Nov 24
  1. Universities are struggling to keep up with AI research due to a lack of resources like powerful GPUs and data centers. They can't compete with big tech companies who have millions of these resources.
  2. Most AI research breakthroughs are now coming from private industry, with universities lagging behind. This is causing talented researchers to prefer jobs in the private sector instead.
  3. Some universities are trying to address this issue by forming coalitions and advocating for government support to create shared AI research resources. This could help level the playing field and foster important academic advancements.