The hottest Data science Substack posts right now

And their main takeaways

How Would I Break Into Data Science If I Had To Do It All Over Again?

High ROI Data Science • 218 implied HN points • 20 Jun 23

🕹 Technology Data science

The author wouldn't change anything about their career path in data science.
Lessons learned from mistakes were valuable for the author's growth.
The author believes there was no other way to reach their current level of expertise.

#150 - Back to our roots

davidj.substack • 59 implied HN points • 31 Oct 24

🕹 Technology Social media Data science Community Building Content creation Personal Development

Data Twitter was once a lively community for people interested in data, but it has changed significantly over time. People are looking for new platforms to connect and share ideas.
Bluesky is gaining popularity as a new home for data enthusiasts, offering features that help with discoverability and community building. This makes it easier for users to engage and find relevant content.
Writing regularly has been rewarding and helpful in personal growth. It's a great way to clarify thoughts and boost confidence in communication, so everyone should consider writing for themselves.

Comparing Human, LLM & LLM-RAG Responses

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 59 implied HN points • 09 Feb 24

🕹 Technology Artificial Intelligence Machine Learning Healthcare Technology Research Studies Data science

The study compared answers from humans, a basic LLM, and an LLM that uses RAG to see which is most accurate in healthcare. The LLM with RAG performed the best.
Using RAG, the model was much quicker than humans, taking only about 15-20 seconds. Humans took around 10 minutes to respond.
GPT-4, especially with RAG, showed high accuracy and can support doctors by providing fast and reliable answers, but humans should still check the information.

How do transformers work?+Design a Multi-class Sentiment Analysis for Customer Reviews

The ZenMode • 134 HN points • 04 Feb 24

🕹 Technology AI NLP Machine Learning Coding Data science

Transformers are crucial in AI for tasks like natural language processing.
The encoder dissects the input text and uncovers hidden connections, while the decoder crafts the output.
Transformers employ layers like self-attention, multi-head attention, and masked self-attention for processing text.

A New Study Compares RAG & Fine-Tuning For Knowledge Base Use-Cases

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 39 implied HN points • 25 Mar 24

🕹 Technology AI Machine Learning Data science Software Development Tech Trends

Choosing technology depends on what you need to achieve. Focus on the specific requirements of the problem to find the right solution.
Retrieval-Augmented Generation (RAG) is often more effective than Fine-Tuning for knowledge base tasks. It allows for quick searches and better accuracy.
RAG systems are easier to update with new information compared to Fine-Tuned models. You can simply add new data without complex adjustments.

Data Science Weekly - Issue 480

Data Science Weekly Newsletter • 279 implied HN points • 02 Feb 23

🕹 Technology Data science Machine Learning AI Programming Big Data

The newsletter is now hosted on Substack and remains free for everyone. A paid option is available for more features and interactions.
Data teams need to build trust with stakeholders to effectively measure their value and justify their budgets. Having good relationships is more important than just metrics.
Understanding MLOps is crucial for the industry. It involves not only the tools but also the culture and practices around machine learning operations.

Hunyuan-Large, AI model for open-world games, X-Portrait 2 for realistic character animations, FLUX1.1 [pro] Ultra and Raw, Magentic-One, Hume AI App, action model for GUI agents and More

AI Brews • 15 implied HN points • 08 Nov 24

🕹 Technology AI Models Gaming Animation Software Data science

Tencent has released Hunyuan-Large, a powerful AI model with lots of parameters that can outperform some existing models. It's good news for open-source projects in AI.
Decart and Etched introduced Oasis, a unique AI that can generate open-world games in real-time. It uses keyboard and mouse inputs instead of just text to create gameplay.
Microsoft's Magentic-One is a new system that helps solve complex tasks online. It's aimed at improving how we manage jobs across different domains.

Scaling human feedback with fine-tuned open-source LLMs

LLMs for Engineers • 59 implied HN points • 30 Jan 24

🕹 Technology AI Machine Learning Open Source Data science

Fine-tuned open-source models like Llama and Mistral can produce accurate feedback, similar to high-performing custom models. They're a great option for companies needing control over their data.
Using tools like Axolotl and Modal makes it easier to fine-tune these models. They help create customized training jobs and simplify deploying models across multiple GPUs.
Fine-tuning significantly improves the clarity and structure of the model's output. It reduces irrelevant information, allowing for cleaner, more useful results.

Chain-of-Instructions (CoI) Fine-Tuning & Going Beyond Instruction Tuning

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 39 implied HN points • 21 Mar 24

🕹 Technology Artificial Intelligence Machine Learning Data science Natural Language Computing

Chain-of-Instructions (CoI) fine-tuning allows models to handle complex tasks by breaking them down into manageable steps. This means that a task can be solved one part at a time, making it easier to follow.
This new approach improves the model's ability to understand and complete instructions it hasn't encountered before. It's like teaching a student to tackle complex problems by showing them how to approach each smaller task.
Training with minimal human supervision leads to efficient dataset creation that can empower models to reason better. It's as if the model learns on its own, becoming smarter and more capable through well-designed training.

You Can Break A Predictive Model By Using It - How To Spot And Fix Performative Prediction

Mindful Modeler • 299 implied HN points • 27 Sep 22

🕹 Technology AI Data science Machine Learning Predictive Models Model Deployment

Predictions can change the outcome, leading to performative prediction. This can impact model performance.
Performative prediction is common but often overlooked, affecting tasks like rent prediction and churn modeling.
To deal with performative prediction, consider achieving performative stability, retraining models frequently, and reframing tasks as reinforcement learning.

Please Stop Saying Long Context Windows Will Replace RAG

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 39 implied HN points • 18 Mar 24

🕹 Technology AI Machine Learning Data science Natural Language Software Development

Long context windows (LCWs) and retrieval-augmented generation (RAG) serve different purposes and won’t replace each other. LCWs work well when asking multiple questions at once, while RAG is better for separate inquiries.
Using LCWs can get really expensive because they involve processing a lot of data at once. In contrast, RAG uses smaller, focused data chunks, which helps keep costs down.
Research shows that LLMs perform better when important information is at the start or end of a long context. So, relying only on LCWs can lead to problems since crucial details may get overlooked.

The AI Supernova

Perspective Agents • 24 implied HN points • 15 Jan 25

🕹 Technology Artificial Intelligence Digital innovation Data science Automation Future Trends

AI is changing how we work and learn. Jobs will focus more on things like emotional intelligence and problem-solving instead of routine tasks.
There is a big gap between those who understand and use AI effectively and those who don't. This gap can lead to businesses being left behind if they don't adapt.
Whether it's through simulations or understanding people's feelings, human touch will always matter. Genuine moments of connection can outshine machines, even if they seem perfect.

Awarding the amazing autosegmentation work from 2024

Vesuvius Challenge • 21 implied HN points • 24 Jan 25

🕹 Technology Software Innovation Artificial Intelligence Research Data science

Two teams were awarded for their amazing work on automating scroll segmentation. They worked really hard, using only a few hours of human help to get impressive results.
The new methods focus on breaking down the task into smaller parts, like surface prediction and fitting, making it easier and faster to recover lost texts from ancient scrolls.
Even though there are still challenges, the community is excited about the progress and future plans, like getting better at detecting ink on more scrolls.

Data Science Weekly - Issue 483

Data Science Weekly Newsletter • 239 implied HN points • 23 Feb 23

🕹 Technology Data science Artificial Intelligence Machine Learning Data Visualization Programming

The 2023 MAD landscape provides insights into machine learning and data trends. It has sections on the current market, infrastructure, and AI trends.
A new tool called PyGWalker turns Pandas dataframes into easy-to-explore visual interfaces. It's great for beginners wanting to visualize their data without technical hassle.
Cleaning data is essential for reliable research findings. New methods are being shared to improve and standardize the data cleaning process, making it more efficient.

Do Not Spend too Much Time "Getting Good" at Dealing with Current AML GPT LLMs

Brad DeLong's Grasping Reality • 169 implied HN points • 04 Mar 24

🕹 Technology Machine Learning Artificial Intelligence Data science Internet Software

It's uncertain how current AML GPT LLMs will be most useful in the future, so spending too much time trying to master them may not be the best approach.
Proper prompting is crucial when working with AML GPT LLMs, as they can be capable of more than is initially apparent. Good prompts can turn tasks that seem impossible into achievable ones.
Understanding the thought processes behind AML GPT LLMs and the effective ways to prompt them is essential, as their responses can vary with subtle changes or inadequate prompting.

Data Science Weekly - Issue 481

Data Science Weekly Newsletter • 239 implied HN points • 09 Feb 23

🕹 Technology Data science AI Research Big Data Machine Learning Software Development

Big Data is changing, and it's not as big a deal as we thought. Hardware is getting better faster than data sizes are growing.
Research in AI can be learned just like a sport. It's about practicing skills like designing experiments and writing papers.
Data Analytics can really help businesses understand their performance and make smarter decisions. It’s all about using data to solve problems and anticipate future issues.

3 Techniques to Make your Machine Learning more efficient[Technique Tuesdays]

Technology Made Simple • 99 implied HN points • 04 Apr 23

🕹 Technology Machine Learning AI Software Engineering Tech industry Data science

Reducing the number of features in your data can improve performance and keep costs down in machine learning processes.
Active learning focuses on prioritizing data points for efficient machine learning model training.
Using filters and simpler models for specific tasks can lead to better performance and cost savings compared to always using large, powerful models in AI.
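The active learning idea summarized above can be illustrated with uncertainty sampling, which is one common strategy and not necessarily the one the post describes. This is a minimal sketch that assumes a binary classifier has already produced a probability for each unlabeled point; the function name `most_uncertain` and the sample probabilities are hypothetical.

```python
def most_uncertain(probabilities, budget=2):
    """Return indices of the points whose predicted probability is closest to 0.5.

    Assumption: `probabilities` holds a binary classifier's positive-class
    probability for each unlabeled data point.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:budget]


# Hypothetical model outputs for five unlabeled points.
probs = [0.97, 0.52, 0.10, 0.45, 0.80]
to_label = most_uncertain(probs, budget=2)
```

Only the points in `to_label` would be sent for human labeling, so the labeling budget goes to the examples the model is least sure about.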

GPT-4 (sometimes) captures the wisdom of the crowd

The Counterfactual • 99 implied HN points • 25 Sep 23

🕹 Technology AI Data science Machine Learning

Researchers often use survey data to understand human behavior, but collecting reliable human responses can be complicated and expensive. Using large language models (LLMs) like GPT-4 could make this process easier and cheaper.
LLMs can sometimes produce responses that closely match the average opinions of many people. In some cases, their answers were actually more aligned with the average responses than individual human judgments.
While LLMs can be helpful in gathering data quickly and inexpensively, it's important to be careful. They might not always be accurate or representative of all viewpoints, so it's wise to compare LLM results with human responses to ensure quality.

Should Your Data Science Team be Centralized or Decentralized? Why Not Intentionally Rotate?

Mike Talks AI • 98 implied HN points • 19 May 23

🕹 Technology Data science Team Management Organizational Design

Consider a hybrid approach for data science teams to balance the strengths of both centralized and decentralized setups.
Some companies are experimenting with intentionally rotating between centralized and decentralized structures every few years.
Switching between centralization and decentralization periodically allows for exploration and scalability of diverse ideas within data science teams.

The Tech Buffet #14: A 3-Step Approach To Evaluate Your LLM-based Applications

The Tech Buffet • 79 implied HN points • 19 Nov 23

🕹 Technology Artificial Intelligence Machine Learning Data science Software Development Tech Trends

Creating a good dataset is important to evaluate your LLM-based applications. You can use LLMs to generate questions and answers from your data, which helps in building a reliable test set.
Running your application over this dataset helps you see how well it retrieves information and generates answers. Keeping track of the documents it finds will make your evaluation easier.
Finally, you should measure how well your application retrieves relevant documents and how good the answers are. This will help you understand what works best and where you can improve.
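The retrieval measurement in the last step could be sketched as a top-k hit rate. This is a minimal sketch, not the post's own code: it assumes each test question has exactly one known relevant document, and the name `hit_rate` and the document IDs are hypothetical.

```python
def hit_rate(retrieved_ids, relevant_ids, k=3):
    """Fraction of questions whose relevant document appears in the top-k results.

    Assumptions: retrieved_ids[i] is the ranked list of document IDs returned
    for question i, and relevant_ids[i] is the single correct document ID.
    """
    hits = 0
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        if relevant in retrieved[:k]:
            hits += 1
    return hits / len(relevant_ids)


# Hypothetical run over a three-question test set.
retrieved = [["d1", "d7", "d3"], ["d4", "d2", "d9"], ["d5", "d6", "d8"]]
relevant = ["d3", "d2", "d0"]
score = hit_rate(retrieved, relevant, k=3)
```

Tracking the retrieved document IDs per question, as the summary suggests, is what makes a metric like this cheap to compute. Answer quality needs a separate measure.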

Data Science Weekly - Issue 487

Data Science Weekly Newsletter • 199 implied HN points • 23 Mar 23

🕹 Technology Data science Machine Learning AI Software Development Visualization

This week's newsletter shares useful links in data science, machine learning, and AI. It's a great way to stay updated in these fields.
One highlighted article discusses the importance of prompt engineering in interacting with language models. It's about how to communicate effectively with AI for desired results.
There's also a report on how generative models like GPT might impact jobs. It shows that many workers could see changes in their tasks due to AI advancements.

Our Top Book Pick for 2022

Gradient Flow • 199 implied HN points • 15 Dec 22

🚌 Education Data science Machine Learning Books

The recommended book of the year is a comprehensive guide for data scientists and data teams, offering practical advice and real-world insights in using data science effectively and ethically.
ActivityPub is a W3C standard and decentralized social networking protocol, gaining traction as a viable alternative to centralized services for community building.
SkyPilot, a newly launched project, presents a unified interface for running machine learning workloads on any cloud, catering to the need for cost-effective cloud computing in the coming year.

Multimodal Search Using Vector DB

The Beep • 39 implied HN points • 25 Feb 24

🕹 Technology Artificial Intelligence Data science Machine Learning Software Development Information Retrieval

Multimodal search lets you look for information using different types of data like text, images, and audio at the same time. This makes finding what you need much easier and faster.
Embeddings are special numbers that represent words, images, or sounds so computers can understand them. They help machines learn about relationships and contexts in the data they process.
Using vector databases, we can store these embeddings efficiently. This technology enables smarter applications like image searches or recognizing songs quickly.
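To make the embedding lookup concrete, here is a minimal sketch of nearest-neighbor search by cosine similarity, using a plain dictionary in place of a real vector database. The three-dimensional vectors and file names are made up for illustration; real embeddings come from a trained model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(index, query, top_k=2):
    """Return the keys of the top_k stored vectors most similar to the query."""
    scored = [(cosine_similarity(vec, query), key) for key, vec in index.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:top_k]]


# Hypothetical embeddings for items of different modalities in one shared space.
index = {
    "photo_of_cat.jpg": [0.9, 0.1, 0.0],
    "article_about_dogs.txt": [0.1, 0.9, 0.0],
    "meow.wav": [0.8, 0.2, 0.1],
}
results = search(index, [1.0, 0.0, 0.0], top_k=2)
```

A vector database does the same ranking, but typically uses approximate indexes so that it does not have to compare the query against every stored vector.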

How to Go to BBQ Hell via the Use of Data

Think Future • 79 implied HN points • 02 Nov 23

🎭️ Culture Food & Drink Data science Critical Thinking Expertise Futurism

The importance of expertise in interpreting data findings - data can sometimes lead to nonsensical conclusions without proper expertise to guide the analysis.
Be cautious of drawing conclusions solely based on data - critical thinking is essential to avoid errors in analysis, like the case of Tripadvisor's BBQ city rankings.
Consulting with longtime experts is crucial before accepting data-driven findings as 'rock-solid' - having seasoned professionals review results can help prevent misinterpretations and errors.

Edge 453: Distillation Across Different Modalities

TheSequence • 28 implied HN points • 03 Dec 24

🕹 Technology AI Data science Machine Learning Web Development Research

Cross-modal distillation allows one model to teach another model that works with a different type of data. This means you can share knowledge even if the models are processing images, text, or something else entirely.
This method can be really helpful when there's not much paired data available. It helps improve the learning process in situations where gathering data might be difficult.
Hugging Face’s Gradio lets developers create AI applications for the web easily. It's a neat tool that helps bring AI to everyday use in a user-friendly way.

How RLHF actually works

Democratizing Automation • 306 implied HN points • 21 Jun 23

🕹 Technology AI Machine Learning Data science Open Source Scaling

RLHF works when there is a training signal that vanilla supervised learning alone cannot exploit, such as pairwise preference data.
Having a capable base model is crucial for successful RLHF implementation, as imitating models or using imperfect datasets can greatly affect performance.
Preferences play a key role in the RLHF process, and collecting preference data for harmful prompts is essential for model optimization.

How the CIA Writes Python

Luminotes • 28 implied HN points • 15 Dec 24

🕹 Technology Programming Software Development Cybersecurity Data science

The CIA has a unique Python style guide, focusing on clarity and readability, with special rules for exceptions, globals, and list comprehensions.
They use specific tools like PyCharm for development and have a custom setup for installing Python and managing packages within secure environments.
There are no strict rules governing coding practices; instead, individuals make choices based on their preferences and the limitations of their working conditions.

Data Science Weekly - Issue 482

Data Science Weekly Newsletter • 199 implied HN points • 16 Feb 23

🕹 Technology Data science AI Tools Machine Learning Data Visualization Deep Learning

Visual analytics can help make deep learning models easier to understand. Researchers are working to fill gaps and challenges in this area.
AI tools like ChatGPT might change how we visualize data in the future. They could make it easier to find and interpret information quickly.
A new method called Lion offers a better optimization algorithm for training deep neural networks. It uses less memory than existing methods like Adam.

Edge 445: A New Series About Knowledge Distillation

TheSequence • 35 implied HN points • 05 Nov 24

🕹 Technology AI Data science Machine Learning Computing Engineering

Knowledge distillation helps make large AI models smaller and cheaper. This is important for using AI on devices like smartphones.
A key goal of this process is to keep the accuracy of the original model while reducing its size.
The series will include reviews of research papers and discussions on frameworks like Google's Data Commons that support factual knowledge in AI.
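The goal of keeping a smaller model close to the original can be illustrated with the classic soft-target loss, in which the student is trained to match the teacher's temperature-softened output distribution. This is a minimal sketch of that standard formulation; the summary does not say which distillation methods the series covers, and the function names and logits here are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher means softer)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. In practice it is usually combined with the ordinary loss on ground-truth labels.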

Breaking into ML as a New Grad

Kndrej’s Substack • 3 HN points • 14 Aug 24

🕹 Technology AI Machine Learning Software Engineering Data science Job Market

Breaking into machine learning (ML) requires not just basic knowledge but also a deep understanding of the math and engineering behind models. Completing online courses is only a starting point.
Internships and real project experience are crucial for landing a job in ML. It's important to have skills that stand out, like publications or open-source contributions.
Interview preparation is key; practicing coding challenges and understanding ML concepts is necessary to succeed. Networking and applying quickly to job postings can improve your chances.

Data Design For Fine-Tuning LLM Long Context Windows

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 03 May 24

🕹 Technology AI Machine Learning Data science Natural Language Software Development

Fine-tuning large language models (LLMs) can help them better understand and use long pieces of text. This means they can make sense of information not just at the start and end but also in the middle.
The 'lost-in-the-middle' problem happens because LLMs often overlook important details in the middle of texts. Training them with more focused examples can help address this issue.
The IN2 training approach emphasizes that crucial information can be found anywhere in long texts. It uses specially created question-answer pairs to teach models to pay attention to all parts of the context.

MistralAI Reveals the Mystery

Sector 6 | The Newsletter of AIM • 59 implied HN points • 13 Dec 23

🕹 Technology AI Startups Software Data science Machine Learning

MistralAI has launched a new model called Mixtral 8x7B that is faster and more efficient than competitors like Llama 2 70B. It can provide great performance while being cost-effective.
Mixtral can handle a lot of information at once, processing up to 32,000 tokens and supporting multiple languages such as English, French, and German.
This model also shows strong abilities in generating code and can be fine-tuned to follow instructions well, which is helpful for various applications.

A Must for Indic Language Models

Sector 6 | The Newsletter of AIM • 39 implied HN points • 09 Feb 24

🕹 Technology AI Machine Learning Data science Language Models Benchmarks

There is a big need for benchmarks specifically for Indian languages. This helps assess how well language models perform in those languages.
Upcoming models like Tamil Llama and Odia Llama are pushing for the creation of these benchmarks. They could lead to better evaluations for these Indic language models.
Having a leaderboard for Indic language models is vital. It will spotlight advancements and improvements within India's language technology space.

I’m making an AI powered scraper

serious web3 analysis • 26 implied HN points • 15 Aug 24

🕹 Technology AI Web Tools Data science Software Chrome Extensions

FetchFox is an AI-powered Chrome extension that makes web scraping easy for everyone, even if you can't code. Just a few clicks allow you to gather useful data from any website.
Traditional web scraping requires programming skills and can be time-consuming. FetchFox simplifies the process, letting anyone scrape data in minutes rather than hours.
FetchFox is designed to work like a human visitor, which helps it avoid being blocked by websites. This means it can extract data more effectively than traditional methods.

The Tech Buffet #4: Turn Complex English Instructions Into Executable SQL With LLMs

The Tech Buffet • 79 implied HN points • 16 Sep 23

🕹 Technology AI Software Programming Data science SaaS

Vanna.AI is a tool that helps turn plain English questions into complex SQL queries quickly. This makes it easier for people who might not be familiar with coding to extract data from databases.
The tool uses a method called Retrieval Augmented Generation (RAG) to understand user queries better. It prepares the right context for the questions by using metadata before generating SQL.
Vanna allows users to continuously improve its performance by incorporating user feedback into the training process. This feature helps the tool learn and adapt over time, ensuring better results.

The Rise of the AI Dragon

Sector 6 | The Newsletter of AIM • 59 implied HN points • 04 Dec 23

🕹 Technology AI Open Source Machine Learning Data science Natural Language

There are new AI models based on LLaMA, like DeepSeek, that are showing great performance. These models are pushing the boundaries of what AI can do.
Chinese companies are making significant progress in open source AI models and many are now leading in popularity and performance.
DeepSeek and other models are being developed with the goal of exploring artificial general intelligence, which aims to create more advanced AI systems.

E5 - Roles and Responsibilities of an AI Product Manager

The Product Channel By Sid Saladi • 6 implied HN points • 08 Dec 24

🕹 Technology Artificial Intelligence Product Management Data science Tech industry User Experience

AI product managers play a key role in creating and managing AI-powered products. They need to combine technical knowledge with an understanding of user needs.
Their responsibilities include researching AI applications, creating product strategies, and leading development teams. They ensure that products are both viable in the market and valuable to users.
To succeed, AI product managers should have skills in AI, business, and user experience. A mix of education in tech, business, and design helps prepare them for this role.

Generating Insights from Research with AI

Addition • 78 implied HN points • 28 Jun 23

🕹 Technology AI Research Insights Machine Learning Data science

AI can synthesize vast amounts of information to generate insights faster than humans.
AI can complement human strategists, giving them superpowers to transform the art of strategy.
The tool shared in the post helps improve human strategists' AI superpowers by synthesizing research, generating insights, and providing creative interpretations.

Introduction to R for Data Science (Part Three)

The Software & Data Spectrum • 78 implied HN points • 23 Mar 23

🚌 Education Data science Programming Lists

Data frames in R help organize and mix data types effectively.
Data frame indexing allows for precise data extraction and selection.
R programming involves logical operators for combining comparison statements.

What's happening at the intersection of ML and Engineering.

Arkid’s Newsletter • 17 HN points • 30 Sep 24

🕹 Technology Machine Learning Software Engineering Data science Artificial Intelligence Infrastructure

AI and machine learning are creating a lot of hype, but it's important to separate the noise from the real value. Just like in the dot-com boom, there will be winners, but it won't be easy to find them.
Many companies are wasting money on consultants who offer little help and deliver no real results. To succeed in AI, businesses need to focus on building intelligent products that can learn and iterate based on user feedback.
There's concern about AI taking over jobs in software and machine learning, but skilled professionals will still be needed. It’s crucial for entry-level workers to build solid expertise in their field and adapt to new developments in AI.