The hottest data science Substack posts right now

And their main takeaways
Category: Top Technology Topics
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 26 Apr 24
  1. RoNID helps identify user intents more accurately, allowing chatbots to understand what users really want to talk about. This means better conversations and less frustration.
  2. The framework uses two main steps: generating reliable labels and organizing data into clear groups. This makes it easier to see which intents are similar and which are different.
  3. RoNID outperforms older methods, improving the chatbot’s understanding by creating clearer and more accurate intent classifications. This leads to a smoother user experience.
The Counterfactual 219 implied HN points 18 Oct 22
  1. There's a big debate about whether large language models truly understand language or if they're just mimicking patterns from the data they were trained on. Some people think they can repeat words without really grasping their meaning.
  2. Two main views exist: One says LLMs can't understand language because they lack deeper meaning and intent, while the other argues that if they behave like they understand, then they might actually understand.
  3. As LLMs become more advanced, we need to create better ways to test their understanding. This will help us figure out what it really means for a machine to 'understand' language.
HackerPulse Dispatch 8 implied HN points 15 Nov 24
  1. Backdoors can be secretly added to machine learning models. These backdoors let bad actors change how the model makes decisions without being noticed.
  2. Large Language Models (LLMs) are helpful for tuning model settings to make them work better. They can suggest and adjust configurations based on past performance.
  3. Understanding spurious patterns in data is important. These patterns can confuse models and lead to mistakes, so recognizing them is crucial for developing responsible AI systems.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 30 Jan 24
  1. UniMS-RAG is a new system that helps improve conversations by breaking tasks into three parts: choosing the right information source, retrieving information, and generating a response.
  2. It uses a self-refinement method that makes responses better over time by checking if the answers match the information found.
  3. The system aims to make interactions feel more personalized and helpful, leading to smarter and more relevant conversations.
Inside Data by Mikkel Dengsøe 16 implied HN points 16 Jan 25
  1. Start by clearly defining how you will use data. This helps set the purpose for your data products.
  2. It's important to have clear ownership of data and understand what needs testing. This makes accountability easier.
  3. Continuously monitor and improve your data quality. Regular reviews help catch issues early and keep trust in your data.
TheSequence 133 implied HN points 25 Jan 24
  1. Two new LLM reasoning methods, COSP and USP, have been developed by Google Research to enhance common sense reasoning capabilities in language models.
  2. Prompt generation is crucial for LLM-based applications, and techniques like few-shot setup have reduced the need for large amounts of data to fine-tune models.
  3. Models with robust zero-shot performance can eliminate the need for manual prompt generation, but may produce weaker results because they operate without task-specific guidance.
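As a concrete illustration of the few-shot setup mentioned above, here is a minimal prompt-assembly sketch in Python. The template and names are illustrative only, not taken from COSP or USP:

```python
def few_shot_prompt(examples, query):
    """Build a few-shot prompt: a handful of worked Q/A pairs
    followed by the new question, so the model can infer the task
    from demonstrations instead of from fine-tuning data."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {query}\nA:"

examples = [
    ("2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]
prompt = few_shot_prompt(examples, "Capital of Japan?")
```

The point of COSP/USP-style methods is to produce such demonstrations automatically rather than writing them by hand.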
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 2 HN points 21 Aug 24
  1. OpenAI's GPT-4o Mini allows for fine-tuning, which can help customize the model to better suit specific tasks or questions. Even with just 10 examples, users can see changes in the model's responses.
  2. Small Language Models (SLMs) are advantageous because they are cost-effective, can run locally for better privacy, and support a range of tasks like advanced reasoning and data processing. Open-sourced options provide users more control.
  3. GPT-4o Mini stands out because it supports multiple input types like text and images, has a large context window, and offers multilingual support. It's ideal for applications that need fast responses at a low cost.
Mindful Modeler 139 implied HN points 10 Jan 23
  1. Conformal prediction is a versatile approach applicable to various machine learning tasks beyond just regression and classification.
  2. When learning about a new conformal prediction method, it's important to consider the machine learning task, non-conformity score used, and how the method deviates from the standard recipe.
  3. Staying up to date with new research in conformal prediction can be facilitated by resources like the 'Awesome Conformal Prediction' repository and following experts in the field on platforms like Twitter.
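For reference, the "standard recipe" the takeaways allude to can be sketched in a few lines. This is a generic split-conformal example for regression using absolute residuals as the non-conformity score, not code from the post:

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_new, alpha=0.1):
    """Split conformal prediction for regression.

    residuals_cal: |y - y_hat| on a held-out calibration set
    y_pred_new:    point predictions for new inputs
    Returns (lower, upper) bounds with roughly (1 - alpha) coverage."""
    n = len(residuals_cal)
    # Finite-sample-corrected quantile level of the non-conformity scores
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(residuals_cal, min(q_level, 1.0), method="higher")
    return y_pred_new - q, y_pred_new + q

# Example: calibration residuals from any fitted regressor
rng = np.random.default_rng(0)
cal_residuals = np.abs(rng.normal(0, 1.0, size=500))
lower, upper = split_conformal_interval(cal_residuals, np.array([2.0, 5.0]), alpha=0.1)
```

Changing the non-conformity score (and how the quantile is applied) is exactly where specific methods deviate from this recipe.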
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 17 Apr 24
  1. Small Language Models can be improved by designing their training data to help them reason and self-correct. This means creating special ways to present information that guide the model in making better decisions.
  2. Two methods, Prompt Erasure and Partial Answer Masking (PAM), help models learn how to think critically and correct mistakes on their own. They are trained in a way that shows them how to approach problems without being given the exact prompts.
  3. The focus is shifting from just updating a model's knowledge to enhancing its behavior and reasoning skills. This means training models not just to recall information, but to understand and apply it effectively.
Vesuvius Challenge 14 implied HN points 23 Jan 25
  1. Community members contributed a lot to the Vesuvius Challenge, earning prizes for their work. This shows how teamwork can lead to great progress!
  2. Some projects focused on improving how we visualize 3D scrolls and extracting data from images. These tools could really help researchers understand ancient texts better.
  3. Awards are given for various types of contributions, encouraging creativity and technical skills. It’s exciting to see different approaches being recognized in the community.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 19 Jan 24
  1. Retrieval-Augmented Generation (RAG) is great for adding specific context and making models easier to use. It's a good first step if you're starting with language models.
  2. Fine-tuning a model provides more accurate and concise answers, but it requires more upfront work and data preparation. It can handle large datasets efficiently once set up.
  3. Using RAG and fine-tuning together can boost accuracy even more. You can gather information with RAG and then fine-tune the models for better performance.
LatchBio 9 implied HN points 06 Nov 24
  1. Bioinformatics is moving towards using GPUs to speed up data processing. This change can save a lot of time and money for researchers.
  2. New molecular techniques generate massive amounts of data that take too long to analyze without faster systems. Using GPUs can make these processes much quicker, especially for large datasets.
  3. There are now cloud platforms that make it easier to use GPU technology without needing special expertise or expensive hardware. This helps more teams access advanced analysis tools.
Sector 6 | The Newsletter of AIM 19 implied HN points 15 Apr 24
  1. OpenAI's GPT-4 Turbo is currently leading the chatbot rankings, but there are strong competitors like Anthropic's Claude 3 Opus and Gemini Pro from Google.
  2. Cohere's Command R+ has also made its mark among the top models, showing that it can compete with big-name AI.
  3. Exciting new models like Llama 3 and GPT-5 are set to launch soon, which could shake things up even more in the AI race.
The Beep 39 implied HN points 14 Jan 24
  1. You can fine-tune the Mistral-7B model using the Alpaca dataset, which helps the model understand and follow instructions better.
  2. The tutorial shows you how to set up your environment with Google Colab and install necessary libraries for training and tracking the model's performance.
  3. Once you prepare your data and configure the model, training it involves monitoring progress and adjusting settings to get the best results.
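The tutorial itself covers the Colab setup; as a small illustration of the data-preparation step, here is the prompt template commonly used with the public Alpaca dataset (field names `instruction`, `input`, `output` come from that dataset; the exact template in the tutorial may differ):

```python
def format_alpaca(example):
    """Render one Alpaca record into an instruction-following prompt.
    Records with a non-empty 'input' field get an extra context block."""
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

record = {"instruction": "Name a primary color.", "input": "", "output": "Red."}
prompt = format_alpaca(record)
```

Each rendered string then becomes one training example for the fine-tuning run.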
Three Data Point Thursday 39 implied HN points 11 Jan 24
  1. Synthetic data is fake data that is becoming increasingly practical and valuable.
  2. Generative AI and the growing gap between data demand and availability are driving forces for the usefulness of synthetic data.
  3. Synthetic data is beneficial in various fields beyond just machine learning, offering opportunities for innovation and improvement.
RSS DS+AI Section 17 implied HN points 01 Jan 25
  1. Data science and AI are rapidly evolving fields, with 2024 being a particularly exciting year for advancements. As we move into 2025, the trends and stories from last year will continue to shape the future.
  2. Ethics in AI is a crucial topic that remains relevant, especially around issues like bias and safety. The way AI is developed and used needs careful consideration to align with human interests.
  3. There are many practical applications and resources available for learning about data science and AI. From tutorials to real-world examples, there are plenty of opportunities to get involved and apply AI technologies.
The Future Does Not Fit In The Containers Of The Past 20 implied HN points 15 Dec 24
  1. Data is important, but focusing too much on it can harm the long-term success of both businesses and people. It's crucial to balance numbers with human emotions and culture.
  2. Leaders should encourage open discussions about tough topics and avoid wasting time in unnecessary meetings. This helps create a culture where everyone feels comfortable sharing their thoughts.
  3. Successful companies need to remember that their employees are not just numbers. Investing in their development and well-being leads to a more motivated and productive workforce.
RSS DS+AI Section 29 implied HN points 01 Nov 24
  1. Data science and AI are constantly evolving, with new research and developments being released regularly. It's important to stay updated on these changes to understand their implications.
  2. Ethics, bias, and regulation in AI continue to be hot topics. Discussions around how to handle these challenges are crucial for the responsible use of AI technologies.
  3. There are many practical applications and resources available for those interested in implementing AI. Tips and how-to guides can help individuals and organizations make better use of these technologies.
inexactscience 59 implied HN points 27 Oct 23
  1. Leadership style should change based on each team member's skills and motivation. It's important to adjust how you lead as people grow and face new challenges.
  2. Focusing only on problems can lead to neglecting high performers. Instead of constantly putting out fires, you should aim to create overall value in the team.
  3. Using data to measure success in a team is crucial. Setting clear metrics helps you understand progress and ensure your efforts are effective.
Intuitive AI 19 implied HN points 22 Aug 24
  1. Tech companies are paying a lot for training data because it helps them improve their AI models. As AI use grows, high-quality data has become very valuable.
  2. Having diverse and rich training data is crucial for AI to learn well. Just like a student needs various books to understand different subjects, AI needs various data to perform better.
  3. Quality of the data matters even more than quantity. Rich, informative data leads to better AI outcomes, which is why companies are willing to spend big bucks on it.
Data Plumbers 19 implied HN points 04 Apr 24
  1. Language models like DBRX are crucial in AI, changing how we use technology from chatbots to code generation.
  2. DBRX is an open-source alternative to closed models, providing high performance and accessibility to developers.
  3. DBRX stands out for its top performance, versatility in specialized domains, efficiency in training, and integration capabilities.
TheSequence 98 implied HN points 22 Feb 24
  1. Knowledge augmentation is crucial in LLM-based applications with new techniques constantly evolving to enhance LLMs by providing access to external tools or data.
  2. Exploring the concept of augmenting LLMs with other LLMs involves merging general-purpose anchor models with specialized ones to unlock new capabilities, such as combining code understanding with language generation.
  3. The process of combining different LLMs might require additional training or fine-tuning of the models, but can be hindered by computational costs and data privacy concerns.
Sector 6 | The Newsletter of AIM 19 implied HN points 31 Mar 24
  1. Databricks has released a new powerful open-source language model called DBRX. It aims to outperform existing models in areas like reasoning, coding, and math.
  2. DBRX has shown better performance than other popular models, including Meta’s LLaMA and Google's Gemini Pro. This showcases Databricks' advancements in AI technology.
  3. The release is generating excitement in the AI community, highlighting the competitive landscape of language models and their capabilities.
The Tech Buffet 59 implied HN points 06 Sep 23
  1. You can use LangChain to build a question-answering system that works with documents. It helps you fetch answers from documents effortlessly.
  2. The process involves loading a document, splitting it into manageable chunks, and then using these chunks to find answers. This way, you have context to support the answers generated.
  3. It's important to keep experimenting and refining your system for better answers. Check out more details in the LangChain documentation for tips and improvements.
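LangChain wraps these steps in document loaders, text splitters, and retrievers; the underlying load-split-retrieve pattern can be sketched without any dependencies (here simple word overlap stands in for the embedding similarity a real pipeline would use):

```python
def chunk_text(text, size=200, overlap=50):
    """Split a document into overlapping character chunks,
    mirroring what a text splitter does."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def retrieve(question, chunks, k=1):
    """Rank chunks by word overlap with the question and
    return the top k as context for answer generation."""
    q_words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]

doc = ("The Nile is the longest river in Africa. " * 5
       + "Photosynthesis converts light into chemical energy.")
context = retrieve("What does photosynthesis convert?",
                   chunk_text(doc, size=80, overlap=20))
```

The retrieved chunks are what gets passed to the model as context, which is why chunk size and overlap are worth experimenting with.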
Mike Talks AI 58 implied HN points 13 Jun 23
  1. Supply chain professionals can use ChatGPT as a 'loss leader' to educate leaders about AI's potential for supply chains.
  2. ChatGPT can help supply chain teams build more AI algorithms by breaking down syntax barriers and expanding team capabilities.
  3. Exploring how ChatGPT can turn vast supply chain data into valuable insights is an important research opportunity.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 26 Mar 24
  1. Dynamic Retrieval Augmented Generation (RAG) improves the way information is retrieved and used in large language models during text generation. It focuses on knowing exactly when and what to look up.
  2. Traditional RAG methods often use fixed rules and may only look at the most recent parts of a conversation. This can lead to missed information and unnecessary searches.
  3. The new framework called DRAGIN aims to make data retrieval smarter and faster without needing further training of the language models, making it easy to use.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 20 Mar 24
  1. Prompt-RAG is a new method that improves language models without using complex vector embeddings. It simplifies how we retrieve information to answer questions.
  2. The process involves creating a Table of Contents from documents, selecting relevant headings, and generating responses by injecting context into prompts. It makes handling data easier.
  3. While this method is great for smaller projects and specific needs, it still requires careful planning when constructing the documents and managing costs related to token usage.
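The heading-selection-plus-injection flow can be sketched as follows. In Prompt-RAG the model itself picks the relevant headings; this illustrative stand-in uses simple word overlap instead, and all names and data are made up:

```python
def select_headings(question, toc, k=2):
    """Pick the Table-of-Contents headings that best match the
    question by word overlap (Prompt-RAG asks the LLM to choose)."""
    q = set(question.lower().split())
    ranked = sorted(toc, key=lambda h: len(q & set(h.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, toc, sections, k=2):
    """Inject the sections under the selected headings into the prompt."""
    chosen = select_headings(question, toc, k)
    context = "\n\n".join(f"{h}:\n{sections[h]}" for h in chosen)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

toc = ["Dosage guidelines", "Herbal ingredients", "Storage instructions"]
sections = {
    "Dosage guidelines": "Take twice daily after meals.",
    "Herbal ingredients": "Contains ginseng and licorice root.",
    "Storage instructions": "Keep in a cool, dry place.",
}
prompt = build_prompt("What are the dosage guidelines?", toc, sections, k=1)
```

Because whole sections are injected verbatim, document structure and token costs matter here, which is the trade-off the third takeaway points at.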
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 19 Mar 24
  1. Making more calls to Large Language Models (LLMs) can help with simple questions but may actually make it harder to answer tough ones.
  2. Finding the right number of calls to use is crucial for getting the best results from LLMs in different tasks.
  3. It's important to design AI systems carefully, as just increasing the number of calls doesn't always mean better performance.
Data Thoughts 119 implied HN points 19 Feb 23
  1. dbt Labs has bought Transform, and more companies in the data field might be sold or closed soon. This could lead to big changes in the industry.
  2. Data teams are seen as a second-order need for businesses, meaning they aren't absolutely necessary. Companies may cut these teams first when they need to save money.
  3. To get the best value from tools, data practitioners should focus on essential needs rather than extra features. This means keeping an eye on what really matters in the data ecosystem.
G. Elliott Morris's Newsletter 119 implied HN points 10 Apr 23
  1. Artificial intelligence and big data cannot fully replace public opinion polls, as they rely on polls for calibration and may not be as reliable for all groups.
  2. Changes in polling methods, like switching from phone to online surveys, can impact results, highlighting the importance of consistency over time.
  3. Studies show genuine change in attitudes, like increasing racial liberalism, but also caution against biases affecting survey responses.
Sector 6 | The Newsletter of AIM 39 implied HN points 05 Dec 23
  1. AIM has been ranking graduate programs for eight years, focusing on Data Science programs in India for 2023. They use surveys and research to create these rankings.
  2. This year's rankings include both on-campus and online/hybrid postgraduate programs. This helps students find options that fit their learning style.
  3. A strong program is one that scores well across various areas, showing its quality and value to students.
Gradient Ascendant 1 implied HN point 20 Jan 25
  1. There are many definitions of AGI, but they can be quite different from each other. It's important to recognize that people might be talking about different things when they mention AGI.
  2. AGI isn't just about intelligence; it's also about capabilities and outcomes. The effectiveness of AI solutions can be more important than how closely they mimic human thinking.
  3. A practical way to define AGI is by comparing the economic performance of AI to human workers. This approach focuses on measurable results rather than vague qualities of intelligence.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 12 Mar 24
  1. Orca-2 is designed to be a small language model that can think and reason by breaking down problems step-by-step. This makes it easier to understand and explain its thought process.
  2. The training data for Orca-2 is created by a larger language model, focusing on specific strategies for different tasks. This helps the model learn to choose the best approach for various challenges.
  3. A technique called Prompt Erasure helps Orca-2 not just mimic larger models but also develop its own reasoning strategies. This way, it learns to think cautiously without relying on direct instructions.