The hottest Data science Substack posts right now

And their main takeaways

Gambling with language models

Rain Clouds • 51 implied HN points • 31 Dec 24

🕹 Technology Data science

AI models such as ModernBERT can help predict which stocks might perform better based on financial reports and market data. This means you can get insights without being a finance expert.
The project combines cloud computing with machine learning, making it easier to process large amounts of financial data quickly. This is important for anyone looking to analyze stocks more efficiently.
While the model can make predictions, investing in stocks always carries risk. A model's suggestion that a stock might do well does not guarantee success.

Tech Talks Weekly #4

Tech Talks Weekly • 19 implied HN points • 19 Feb 24

🕹 Technology Data science

The newsletter summarizes recent tech talks from various conferences, making it easier for readers to find valuable content. It's a great resource for anyone interested in technology.
Each issue features a selection of must-watch talks, along with a list of new uploads categorized by conference. This helps viewers easily discover trending topics in tech.
Readers are encouraged to provide feedback on the newsletter format and share it with friends or colleagues to grow the community. It's all about connecting more people to interesting tech discussions.

OpenAI's o-1 and inference-time scaling laws

Tanay’s Newsletter • 63 implied HN points • 28 Oct 24

🕹 Technology Data science

OpenAI's o-1 model shows that giving AI more time to think can really improve its reasoning skills. This means that performance can go up just by allowing the model to process information longer during use.
The focus in AI development is shifting from just making models bigger to optimizing how they think at the time of use. This could save costs and make it easier to use AI in real-life situations.
With better reasoning abilities, AI can tackle more complex problems. This gives it a chance to solve tasks that were previously too difficult, which might open up many new opportunities.

Visualise & Discover RAG Data

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 23 Jan 24

🕹 Technology Data science

RAGxplorer is a tool that helps visualize and explore data chunks, making it easier to understand how they relate to different topics.
The process of Retrieval-Augmented Generation (RAG) involves breaking documents into smaller chunks to improve how data is retrieved and used with language models.
Visualizing data can help identify problems like missing information or unexpected results, allowing users to refine their questions or understand their data better.
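As a rough illustration of the chunking step described above, here is a minimal sketch that splits a document into overlapping character windows. The function name `chunk_text` and the `chunk_size` and `overlap` values are assumptions chosen for this example; they are not taken from RAGxplorer or from the post.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows (sizes are illustrative)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Placeholder document: 500 characters of filler text.
document = "word " * 100
chunks = chunk_text(document, chunk_size=200, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which is one common reason to use it. Real pipelines often split on sentences or tokens rather than raw characters.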

Catalog of Catalogs

davidj.substack • 59 implied HN points • 14 Nov 24

🕹 Technology Data science

Data tools create metadata, which is important for understanding what's happening in data management. Every tool involved in data processing generates information about itself, making it a catalog.
Not all catalogs are for people. Some are meant for systems to optimize data processing and querying. These system catalogs help improve efficiency behind the scenes.
To make data more accessible, catalogs should be integrated into the tools users already work with. This way, data engineers and analysts can easily find the information they need without getting overwhelmed by unnecessary data.



LangSmith by LangChain

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 22 Jan 24

🕹 Technology Data science

LangSmith helps organize and manage projects and data for applications built with LangChain. It allows you to see your tasks in a neat layout and check performance easily.
The platform offers tools for testing and improving agents, especially when handling multiple tasks at the same time. This helps ensure that applications run smoothly.
LangSmith allows users to create datasets that can improve agent performance. It also has features to evaluate how well agents are doing by comparing their outputs to expected results.

The Sequence Chat: The Transition that Changes Everything. From Pretraining to Post-Training in Foundation Models

TheSequence • 56 implied HN points • 04 Dec 24

🕹 Technology Data science

The transition from pretraining to post-training in AI models is a big deal. This change helps improve how AI can reason and learn from data.
New models like DeepSeek's R1 and Alibaba's QwQ are now using this transition to become smarter and more effective. They can solve complex problems better than before.
The shift is moving away from older methods like reinforcement learning from human feedback (RLHF). Instead, new approaches are being developed that promise to make AI work even better.

How do transformers work? + Design a Multi-class Sentiment Analysis for Customer Reviews

The ZenMode • 134 HN points • 04 Feb 24

🕹 Technology Data science

Transformers are crucial in AI for tasks like natural language processing.
The encoder dissects the input text and uncovers hidden connections, while the decoder crafts the output.
Transformers employ layers like self-attention, multi-head attention, and masked self-attention for processing text.

Popular LLM Datasets

The Beep • 19 implied HN points • 21 Jan 24

🕹 Technology Data science

Datasets are crucial for training machine learning models, including language models. They help the model learn patterns and make predictions.
Popular sources for datasets include Project Gutenberg and Common Crawl, which provide large amounts of text data for training language models.
Instruction tuning datasets are used to adapt pre-trained models for specific tasks. These help the model perform better in given situations or instructions.

PumpkinSeed Protein Sequencing

ASeq Newsletter • 21 implied HN points • 17 Jun 25

🕹 Technology Data science

PumpkinSeed is a startup focused on new protein sequencing technology. They use a method that analyzes light patterns to determine protein sequences without needing labels.
The technology involves measuring the Raman spectra of peptides and using AI to interpret the data. This helps to figure out the order of amino acids in a protein.
There's potential for the method, but questions remain about how easily it can be scaled for larger samples. The benefit and size of the market for this technology are still being evaluated.

Edge 451: Is One Teacher Enough? Understanding Multi-Teacher Distillation

TheSequence • 56 implied HN points • 26 Nov 24

🕹 Technology Data science

Using multiple teachers in distillation is better than just one. This method helps combine different areas of knowledge, making the student model more powerful.
Each teacher can focus on a specific type of knowledge, like understanding features or responses. This specialization leads to a more balanced learning process.
Although this approach might be more expensive to implement, it creates a stronger and less biased model overall.

OpenAI isn’t Exciting Any Longer

Sector 6 | The Newsletter of AIM • 39 implied HN points • 27 Jun 23

🕹 Technology Data science

OpenAI is losing talented employees to Google, indicating a shift in the competitive landscape of AI.
Some former OpenAI staff are unhappy with leadership, feeling that the company's vision is too focused on ChatGPT.
There are concerns about a lack of direction at OpenAI, along with rumors that the CEO's understanding of the business is superficial.

The Anatomy of the Least Squares Method, Part Four

The Palindrome • 5 implied HN points • 17 Nov 25

🕹 Technology Data science

You can use the least-squares method to understand and analyze regression models well. It's a handy tool for data scientists.
Large language models like GPT-2 aren't as complex as they seem. A basic understanding of math can help you learn how they work.
Using Python to model LLMs allows you to see how the math applies in real time. Following along with code can really boost your learning.
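To make the least-squares method mentioned above concrete, here is a minimal sketch of fitting a straight line with the closed-form solution for simple linear regression. The function name `fit_line` and the toy data are assumptions made for illustration, not code from the post.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data that lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The slope and intercept returned here are the values that minimise the sum of squared vertical distances between the line and the data points.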

Edge 364: About COSP and USP: Two New LLM Reasoning Methods Built by Google Research

TheSequence • 133 implied HN points • 25 Jan 24

🕹 Technology Data science

Two new LLM reasoning methods, COSP and USP, have been developed by Google Research to enhance common sense reasoning capabilities in language models.
Prompt generation is crucial for LLM-based applications, and techniques like few-shot setup have reduced the need for large amounts of data to fine-tune models.
Models with robust zero-shot performance can eliminate the need for manual prompt generation, but they may produce weaker results because they operate without specific guidance.

#150 - Back to our roots

davidj.substack • 59 implied HN points • 31 Oct 24

🕹 Technology Data science

Data Twitter was once a lively community for people interested in data, but it has changed significantly over time. People are looking for new platforms to connect and share ideas.
Bluesky is gaining popularity as a new home for data enthusiasts, offering features that help with discoverability and community building. This makes it easier for users to engage and find relevant content.
Writing regularly has been rewarding and helpful in personal growth. It's a great way to clarify thoughts and boost confidence in communication, so everyone should consider writing for themselves.

Weekly Top Picks #67

The Algorithmic Bridge • 116 implied HN points • 18 Mar 24

🕹 Technology Data science

The post discusses the Nvidia GTC keynote, BaaS in science, Apple's potential collaboration with Google Gemini, and more key AI topics of the week.
It features a conversation between Sam Altman and Lex Fridman, touches on jobs in the AI era, and examines the NYT's response to OpenAI.
There's a question about whether OpenAI's Sora model is trained using YouTube videos, among other intriguing topics.

Unearthing Datasets Preparation for LLM

The Beep • 19 implied HN points • 11 Jan 24

🕹 Technology Data science

Good datasets are really important for training large language models (LLMs). If the data isn't well prepared, the model won't perform well.
To prepare a dataset, you need to gather data, clean it up, and then convert it into a format the model can understand. Each step is crucial.
While training LLMs, it's important to think about issues like data bias and privacy. This can affect how well the model works and who it might unfairly impact.

Active Prompting with Chain-of-Thought for Large Language Models

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 05 Jan 24

🕹 Technology Data science

Active prompting improves a language model's answers through a four-step process: estimating uncertainty, selecting the most uncertain questions, annotating them, and making final inferences. This helps ensure better answers.
Using human annotations along with AI makes the training data clearer and reduces confusion. It allows us to focus on the most important information for the models.
Companies can benefit from this approach by streamlining how they handle data. It promotes a more organized way of discovering, designing, and developing data.
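The first two steps of the four-step process above (estimating uncertainty and selecting uncertain questions) can be sketched as follows. This sketch assumes uncertainty is measured as disagreement, meaning the share of distinct answers among several sampled answers to the same question. The function names and the hard-coded sample answers are hypothetical; in practice the answers would come from repeated model calls.

```python
def disagreement(answers):
    """Fraction of distinct answers among the samples (higher = more uncertain)."""
    return len(set(answers)) / len(answers)


def select_most_uncertain(sampled_answers, n):
    """Return the n question ids whose sampled answers disagree the most."""
    ranked = sorted(
        sampled_answers,
        key=lambda question: disagreement(sampled_answers[question]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical data: four sampled answers per question.
sampled_answers = {
    "q1": ["4", "4", "4", "4"],
    "q2": ["7", "9", "7", "12"],
    "q3": ["3", "5", "8", "2"],
}
to_annotate = select_most_uncertain(sampled_answers, n=2)
print(to_annotate)
```

The selected questions would then go to human annotators, and the annotated examples would be used when prompting the model for final inference.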

DataFrame

davidj.substack • 35 implied HN points • 20 Feb 25

🕹 Technology Data science

Polars Cloud allows for scaling across multiple machines, making it easier to handle large datasets than using just a single machine. This helps in processing data faster and more efficiently.
Polars is simpler to use compared to Pandas and often performs better, especially when transforming data for machine learning tasks. It supports familiar methods that many users already know.
Unlike SQL, which runs well on cloud services, using Pandas and R for large-scale transformations has been challenging. The new Polars Cloud aims to bridge this gap, providing more scalable solutions.

Looking back on AI in 2024

Generating Conversation • 46 implied HN points • 19 Dec 24

🕹 Technology Data science

AI companies need to show clear value to succeed. This means saving money or making profits, not just improving productivity.
Building customer trust is key for AI products. Letting customers test and experience the product firsthand is often more effective than complicated evaluation tools.
User experience with AI tools is really important. Good AI needs to be easy and enjoyable to use, which is a challenge that still needs solving.

How to Create Good Documentation in Software Engineering and Tech [Technique Tuesdays]

Technology Made Simple • 59 implied HN points • 19 Oct 22

🕹 Technology Data science

Good documentation in software engineering is crucial as it provides clarity to the team about goals and work done, enhancing productivity.
Key pillars of good documentation include having a vision for the company and products, outlining resource/situational constraints, detailing data sources and processing, tracking projects in progress, sharing actual code, and establishing ownership.
Benefits of good documentation in tech include aligning teams, clarifying vision and plans, reducing onboarding time, and promoting asynchronicity in an increasingly remote working environment.

Scalable Embedding based retrieval for target side value

Recommender systems • 23 implied HN points • 17 May 25

🕹 Technology Data science

Scalability is key for embedding-based recommendation systems, especially when dealing with billions of users. Finding effective ways to limit the search can help manage this challenge.
It’s important to deliver value not just to viewers but also to the recommended targets, as this can improve user retention. Balancing recommendations for both sides can create a better experience.
Using advanced algorithms can help ensure viewers don’t get overwhelmed with too many recommendations while also making sure that every target gets the attention they need. This balance is crucial for effective recommendations.

AI Roundup 097: Model Mayhem

Artificial Ignorance • 46 implied HN points • 13 Dec 24

🕹 Technology Data science

Google has launched new AI models such as Gemini 2.0, which can create text, images, and audio quickly. They also introduced tools to summarize video content and help users with web tasks.
OpenAI released several features, including a text-to-video model named Sora for paying users. They also improved ChatGPT's digital editing tool and added new voice capabilities for video interactions.
Meta and other companies are also advancing in AI with new models for cheaper yet effective performance and tools for watermarking AI-generated videos, showing that competition in AI is heating up.

Can one black box explain another?

The Counterfactual • 39 implied HN points • 29 May 23

🕹 Technology Data science

Large language models (LLMs) like GPT-4 are often referred to as 'black boxes' because they are difficult to understand, even for the experts who create them. This means that while they can perform tasks well, we might not fully grasp how they do it.
To make sense of LLMs, researchers are trying to use models like GPT-4 to explain the workings of earlier models like GPT-2. This involves one model generating explanations about the neuron activations of another model, aiming to uncover how they function.
Despite the efforts, current methods only explain a small fraction of neurons in these LLMs, which indicates that more research and new techniques are needed to better understand these complex systems and avoid potential failures.

Welcome to my Substack!

Amgad’s Substack • 19 implied HN points • 22 Dec 23

🕹 Technology Data science

The Substack focuses on machine learning, data science, and AI.
Expect in-depth articles, case studies, opinion pieces, and curated resources about the latest advancements in AI.
Readers are encouraged to subscribe, engage, and follow on social media for a more interactive experience.

Edge 447: Not All Model Distillations are Created Equal

TheSequence • 49 implied HN points • 12 Nov 24

🕹 Technology Data science

There are different types of model distillation that help create smaller, more efficient AI models. Understanding these types can help in choosing the right method for specific tasks.
The three main types of model distillation are response-based, feature-based, and relation-based. Each has its own strengths and can be used depending on what you need from the model.
Response-based distillation is usually the easiest to implement. It trains the student model to match the teacher model's outputs on the same inputs.
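As a rough sketch of response-based distillation, the snippet below compares a teacher's and a student's output distributions using temperature-softened softmax and KL divergence, which is one common way to define the loss. The logits, the temperature value, and the function names are illustrative assumptions, not details from the post.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the student's to the teacher's softened distribution."""
    p = softmax(teacher_logits, temperature)  # teacher (target)
    q = softmax(student_logits, temperature)  # student (prediction)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical logits over three classes.
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(teacher_logits, close_student)
loss_far = distillation_loss(teacher_logits, far_student)
print(loss_close, loss_far)
```

A student whose outputs resemble the teacher's gets a loss near zero, while a student that ranks the classes differently gets a larger loss. Minimising this quantity during training pushes the student toward the teacher's responses.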

How do we make sense of all of this?

Sunday Letters • 39 implied HN points • 18 Jun 23

🕹 Technology Data science

It's normal to feel overwhelmed with all the rapid changes in technology and AI. Many people are struggling to keep up, and that's okay.
Using first principles can help us find clarity in confusing situations. Focusing on what's truly important and how things work can guide our understanding.
Looking at data and history can help us make sense of current trends. By finding patterns and using math, we can better understand the complexities of new technologies.

Did OpenAI create the best weapon detection software available with ChatGPT-4o?

School Shooting Data Analysis and Reports • 4 HN points • 04 Jun 24

🕹 Technology Data science

AI weapon detection software struggles to differentiate between weapons and weapon-shaped objects like umbrellas or sticks, leading to issues in accuracy and efficiency.
OpenAI's ChatGPT-4o offers more advanced weapon detection capabilities from image analysis compared to current market options, recognizing context better.
ChatGPT-4o was successful in identifying guns and gun-like objects in various scenarios, showcasing a high level of performance in image classification and context understanding.

Practical Reinforcement Learning and Differential Privacy

Gradient Flow • 119 implied HN points • 17 Feb 22

🕹 Technology Data science

The ratio of data scientists to data engineers varies based on factors like tools, infrastructure, and use cases, with no set ideal ratio.
Interesting developments include a new podcast discussing machine learning infrastructure at Netflix, imperceptible NLP attacks, and evolving data science training programs.
The issue also highlights tools and updates in the data and machine learning space, such as practical reinforcement learning applications, scalable differential privacy for Python developers, and Orbit version 1.1 for Bayesian time-series analysis.

Edge 372: Learn About CALM, Google DeepMind's Method to Augment LLMs with Other LLMs

TheSequence • 98 implied HN points • 22 Feb 24

🕹 Technology Data science

Knowledge augmentation is crucial in LLM-based applications, and new techniques keep emerging that enhance LLMs by giving them access to external tools or data.
Exploring the concept of augmenting LLMs with other LLMs involves merging general-purpose anchor models with specialized ones to unlock new capabilities, such as combining code understanding with language generation.
The process of combining different LLMs might require additional training or fine-tuning of the models, but can be hindered by computational costs and data privacy concerns.

Friend Recommendation Retrieval in a social network

Recommender systems • 43 implied HN points • 24 Nov 24

🕹 Technology Data science

Friend recommendation systems use connections like 'friends of friends' to suggest new friends. This is a common way to make sure suggestions are relevant.
Two Tower models are a new approach that enhances friend recommendations by learning from user interactions and focusing on the most meaningful connections.
Using methods like weighted paths and embeddings can improve recommendation accuracy. These techniques help to understand user relationships better and avoid common pitfalls in recommendations.
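The 'friends of friends' idea above can be sketched as a two-hop walk over a friendship graph, ranking candidates by how many mutual friends they share with the user. The graph, the names in it, and the function name `friends_of_friends` are made up for this example; a production system would work at far larger scale and combine this with learned models.

```python
from collections import Counter


def friends_of_friends(graph, user):
    """Rank non-friends by the number of mutual friends they share with user."""
    direct = graph.get(user, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the user themselves and people they already know.
            if candidate != user and candidate not in direct:
                counts[candidate] += 1
    return counts.most_common()


# Hypothetical undirected friendship graph.
graph = {
    "ana": {"ben", "cal"},
    "ben": {"ana", "dee", "eli"},
    "cal": {"ana", "dee"},
    "dee": {"ben", "cal"},
    "eli": {"ben"},
}
suggestions = friends_of_friends(graph, "ana")
print(suggestions)
```

Counting mutual friends is a simple form of path weighting: a candidate reachable through more two-hop paths ranks higher.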

🚀 Keeping it Real (as usual)

Data Science at Home • 19 implied HN points • 03 Dec 23

🕹 Technology Data science

Data Science at Home focuses on real-world applications without hype or drama.
The podcast kept its original name to stay true to its core identity and values.
The website refresh includes a sleek design, fresh episodes, interactive features, and a community section.

Data Storytelling: 5 Simple Tips With Python Plotly to Spice Up Your Data

Data at Depth • 19 implied HN points • 03 Dec 23

🕹 Technology Data science

Data storytelling is crucial for making data accessible and interesting.
Python's Plotly library is a powerful tool for visualizing data sets in meaningful ways.
Consider supporting publications like Data at Depth for more insightful content.

What is RAG?

Technically • 50 implied HN points • 07 Oct 24

🕹 Technology Data science

RAG helps make AI models like GPT-4 more personal and accurate by using specific data from users.
By retrieving relevant user data and supplying it to the model alongside the question, RAG creates responses that are more tailored to individual needs.
RAG is becoming a common method to improve LLMs, alongside the traditional way of fine-tuning models.

Newsletter #1 - Welcome to Data at Depth!

Data at Depth • 19 implied HN points • 01 Dec 23

🕹 Technology Data science

The newsletter 'Data at Depth' aims to explore topics in computer science and data analytics, sharing insights from a professor with 20+ years of experience in the field.
The constant growth and exploration in the world of AI-generated data leaves many individuals curious and on a learning journey.
Readers can subscribe to Data at Depth for a 7-day free trial to access full post archives and continue learning about data and computer science topics.

The Sequence Knowledge #463: Wrapping Up our Series About Knowledge Distillation: Pros and Cons

TheSequence • 35 implied HN points • 07 Jan 25

🕹 Technology Data science

Knowledge distillation is a method where a smaller model learns from a larger, more complex model. This helps make the smaller model efficient while retaining essential features.
The series covered different techniques and challenges in knowledge distillation, highlighting its importance in machine learning and AI development. Understanding these can help when deciding if this approach is suitable for your projects.
It's useful to be aware of both the benefits and drawbacks of knowledge distillation. This helps in figuring out the best way to implement it in real-world applications.

Vesuvius Challenge Progress Prizes: December Edition

Vesuvius Challenge • 31 implied HN points • 24 Jan 25

🕹 Technology Data science

The community is focused on improving data quality, like using better labels and refining how they categorize information. This will help them create automated tools for analyzing scrolls more effectively.
Several contributors have made significant advancements in developing new segmentation models and tools, which will help in analyzing scroll data. These innovations are key for understanding ancient texts.
2024 has been a great year for teamwork and progress as everyone shares their findings. The hard work from many people is leading to quick improvements in technology for studying historical scrolls.

The Anatomy of the Least Squares Method, Part Three

The Palindrome • 4 implied HN points • 11 Nov 25

🕹 Technology Data science

Using real data helps you understand the real-world quirks and problems that simulations can't show. It's like learning to drive in a car instead of a video game.
Real data can reveal hidden patterns and insights about how things work, giving you a better chance to discover new information.
Cleaning and transforming your data is crucial for accurate analysis. You need to tackle issues like outliers and non-normal distributions to get reliable results.

The Data Hierarchy of Needs

The Orchestra Data Leadership Newsletter • 19 implied HN points • 26 Nov 23

🕹 Technology Data science

Data can be structured in a hierarchy similar to Maslow's Hierarchy of Needs, where each level is necessary for the enjoyment of the level above it. This concept applies to data engineering pipelines.
Data pipelines are crucial for deriving business value, even if they are complex and not directly visible. Architectural considerations and infrastructure choices play a significant role in making data a priority in a business.
When considering data infrastructure, such as data ingestion tools, cloud warehouses, BI tools, and others, it's important to plan the entire stack and not just jump to specific infrastructure. Consider aspects like version control, security, integration, and orchestration.

The Bias vs Variance Tradeoff [Math Mondays]

Technology Made Simple • 39 implied HN points • 06 Dec 22

🕹 Technology Data science

Understanding the Bias-Variance Tradeoff is crucial in Data Science and Machine Learning.
Bias in a Machine Learning Model refers to systematic prediction error caused by overly simple assumptions, while Variance measures how much predictions change across different training sets.
High Bias can lead to underfitting, where the model doesn't grasp the data pattern fully, while High Variance can result in overfitting, where the model learns noise in the data.
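The tradeoff described above is usually stated through the standard decomposition of expected squared error. The notation here is an assumption introduced for this note: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}$ is the model fitted on a random training set, so expectations are taken over training sets and noise.

```latex
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```

Underfitting corresponds to a large first term, overfitting to a large second term, and no choice of model can reduce the third.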