The hottest Data science Substack posts right now

And their main takeaways

The Tech Buffet #23: What Nobody Tells You About RAGs

The Tech Buffet • 1 HN point • 22 Aug 24

It's important to understand the business needs before jumping into building a Retrieval-Augmented Generation (RAG) system. Knowing the user's context and how they will use the system will save time and improve outcomes.
Different types of data need to be indexed in specific ways for a RAG to work effectively. This means treating text, images, tables, and code differently to maximize the system's performance.
The quality of the data chunks you use significantly affects the answers generated by a RAG. Taking the time to create clear, relevant chunks will lead to better responses from the system.

MultiHop-RAG

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 31 Jan 24

🕹 Technology AI Machine Learning Data science Software Development Language Models

Multi-hop retrieval-augmented generation (RAG) helps answer complex questions by pulling information from multiple sources. It connects different pieces of data to create a clear and complete answer.
Using a data-centric approach is becoming more important for improving large language models (LLMs). This means focusing on the quality and relevance of the data to enhance how models learn and generate responses.
The development of prompt pipelines in RAG systems is gaining attention. These pipelines help organize the process of retrieving and combining information, making it easier for models to handle text-related tasks.

Tech Talks Weekly #4

Tech Talks Weekly • 19 implied HN points • 19 Feb 24

🕹 Technology Software Development Web Development Data science Cloud Computing Programming Languages

The newsletter summarizes recent tech talks from various conferences, making it easier for readers to find valuable content. It's a great resource for anyone interested in technology.
Each issue features a selection of must-watch talks, along with a list of new uploads categorized by conference. This helps viewers easily discover trending topics in tech.
Readers are encouraged to provide feedback on the newsletter format and share it with friends or colleagues to grow the community. It's all about connecting more people to interesting tech discussions.

Visualise & Discover RAG Data

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 23 Jan 24

🕹 Technology AI Data science Software Development Machine Learning Data Visualization

RAGxplorer is a tool that helps visualize and explore data chunks, making it easier to understand how they relate to different topics.
The process of Retrieval-Augmented Generation (RAG) involves breaking documents into smaller chunks to improve how data is retrieved and used with language models.
Visualizing data can help identify problems like missing information or unexpected results, allowing users to refine their questions or understand their data better.
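
The chunking step mentioned above can be sketched in a few lines. This is a minimal illustration, not RAGxplorer's own code: the function name `chunk_text`, the fixed-size character windows, and the `overlap` parameter are all assumptions chosen for the example.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters.

    Hypothetical helper for illustration; real RAG pipelines often split on
    sentences or tokens instead of raw characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.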

LangSmith by LangChain

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 22 Jan 24

🕹 Technology AI Tools Software Development Data science Machine Learning Programming Languages

LangSmith helps organize and manage projects and data for applications built with LangChain. It allows you to see your tasks in a neat layout and check performance easily.
The platform offers tools for testing and improving agents, especially when handling multiple tasks at the same time. This helps ensure that applications run smoothly.
LangSmith allows users to create datasets that can improve agent performance. It also has features to evaluate how well agents are doing by comparing their outputs to expected results.

Popular LLM Datasets

The Beep • 19 implied HN points • 21 Jan 24

🕹 Technology Machine Learning Data science Artificial Intelligence Natural Language Processing Software Development

Datasets are crucial for training machine learning models, including language models. They help the model learn patterns and make predictions.
Popular sources for datasets include Project Gutenberg and Common Crawl, which provide large amounts of text data for training language models.
Instruction tuning datasets are used to adapt pre-trained models for specific tasks. These help the model perform better in given situations or instructions.

OpenAI isn’t Exciting Any Longer

Sector 6 | The Newsletter of AIM • 39 implied HN points • 27 Jun 23

🕹 Technology AI Machine Learning Data science Software Development Tech industry

OpenAI is losing talented employees to Google, indicating a shift in the competitive landscape of AI.
Some former OpenAI staff are unhappy with leadership, feeling that the company's vision is too focused on ChatGPT.
There are concerns about a lack of direction at OpenAI, along with rumors that the CEO's understanding of the business is superficial.

Unearthing Datasets Preparation for LLM

The Beep • 19 implied HN points • 11 Jan 24

🕹 Technology AI Machine Learning Data science Natural Language

Good datasets are really important for training large language models (LLMs). If the data isn't well prepared, the model won't perform well.
To prepare a dataset, you need to gather data, clean it up, and then convert it into a format the model can understand. Each step is crucial.
While training LLMs, it's important to think about issues like data bias and privacy. This can affect how well the model works and who it might unfairly impact.

December Newsletter

RSS DS+AI Section • 11 implied HN points • 01 Dec 24

🕹 Technology AI Data science Machine Learning Ethics Research

There are ongoing discussions about the ethical use of AI, especially in healthcare and the military. It’s important to think about privacy and the implications of these technologies.
New developments in data science and AI research are exciting, such as improved models for training and reasoning. It's a fast-paced field with many recent breakthroughs.
Generative AI is evolving quickly, with many companies working on new models and applications. This includes features like AI-generated summaries of content you're watching.

Active Prompting with Chain-of-Thought for Large Language Models

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 05 Jan 24

🕹 Technology AI Machine Learning Data science Natural Language Automation

AI can help improve language models by using a four-step process: estimating uncertainty, selecting uncertain questions, annotating them, and making final inferences. This helps ensure better answers.
Using human annotations along with AI makes the training data clearer and reduces confusion. It allows us to focus on the most important information for the models.
Companies can benefit from this approach by streamlining how they handle data. It promotes a more organized way of discovering, designing, and developing data.
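
The first two steps (estimating uncertainty and selecting uncertain questions) can be sketched as below. This assumes uncertainty is measured as disagreement among several sampled answers per question, which is one common choice; the post may use a different metric. The names `disagreement` and `select_for_annotation` are hypothetical, and the sampled answers are passed in rather than generated by a model.

```python
def disagreement(answers: list[str]) -> float:
    """Uncertainty proxy: fraction of distinct answers among the sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return len(set(answers)) / len(answers)


def select_for_annotation(sampled: dict[str, list[str]], n: int) -> list[str]:
    """Return the n questions whose sampled answers disagree the most.

    `sampled` maps each question to the answers a model produced for it
    over several runs (assumed to be collected elsewhere).
    """
    ranked = sorted(sampled, key=lambda q: disagreement(sampled[q]), reverse=True)
    return ranked[:n]
```

The selected questions would then go to human annotators, and the annotated examples would be used as prompts for the final inference step.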

How to Create Good Documentation in Software Engineering and Tech [Technique Tuesdays]

Technology Made Simple • 59 implied HN points • 19 Oct 22

🕹 Technology Software Engineering Data science Machine Learning Documentation Tech

Good documentation in software engineering is crucial as it provides clarity to the team about goals and work done, enhancing productivity.
Key pillars of good documentation include having a vision for the company and products, outlining resource/situational constraints, detailing data sources and processing, tracking projects in progress, sharing actual code, and establishing ownership.
Benefits of good documentation in tech include aligning teams, clarifying vision and plans, reducing onboarding time, and promoting asynchronicity in an increasingly remote working environment.

Can one black box explain another?

The Counterfactual • 39 implied HN points • 29 May 23

🕹 Technology AI Machine Learning Neural Networks Data science Computational Models

Large language models (LLMs) like GPT-4 are often referred to as 'black boxes' because they are difficult to understand, even for the experts who create them. This means that while they can perform tasks well, we might not fully grasp how they do it.
To make sense of LLMs, researchers are trying to use models like GPT-4 to explain the workings of earlier models like GPT-2. This involves one model generating explanations about the neuron activations of another model, aiming to uncover how they function.
Despite the efforts, current methods only explain a small fraction of neurons in these LLMs, which indicates that more research and new techniques are needed to better understand these complex systems and avoid potential failures.

Welcome to my Substack!

Amgad’s Substack • 19 implied HN points • 22 Dec 23

🕹 Technology AI Machine Learning Data science Large Language Models

The Substack focuses on machine learning, data science, and AI.
Expect in-depth articles, case studies, opinion pieces, and curated resources about the latest advancements in AI.
Readers are encouraged to subscribe, engage, and follow on social media for a more interactive experience.

New tools to use with new scroll

Vesuvius Challenge • 10 implied HN points • 27 Nov 24

🕹 Technology Open Source Data science Machine Learning Community projects Innovation

The Vesuvius Challenge has introduced new tools to help with studying ancient scrolls. These tools are meant to improve our understanding of scrolls found in Herculaneum.
There is a total of $18,500 available as prizes for community contributions. The rewards are aimed at motivating open-source work that supports the reading and analysis of the new scroll dataset.
Several contributors have developed techniques and tools for better image segmentation and data analysis of scrolls. These advancements help make the process of interpreting ancient texts easier and more accurate.

How do we make sense of all of this?

Sunday Letters • 39 implied HN points • 18 Jun 23

🕹 Technology AI Data science Workplace Innovation Complex Systems

It's normal to feel overwhelmed with all the rapid changes in technology and AI. Many people are struggling to keep up, and that's okay.
Using first principles can help us find clarity in confusing situations. Focusing on what's truly important and how things work can guide our understanding.
Looking at data and history can help us make sense of current trends. By finding patterns and using math, we can better understand the complexities of new technologies.

February Newsletter

RSS DS+AI Section • 5 implied HN points • 01 Feb 25

🕹 Technology AI Data science Research Applications Ethics

AI and Data Science are rapidly evolving fields with new projects and innovations popping up all the time. It's important to stay updated with the latest research and applications.
Ethics in AI is a huge concern, with ongoing discussions about bias, privacy, and the regulation of AI technology. People are looking for ways to use AI responsibly.
There's a growing demand for skilled professionals in AI, particularly in areas like AI Product Management, which is becoming a hot job opportunity.

Did OpenAI create the best weapon detection software available with ChatGPT-4o?

School Shooting Data Analysis and Reports • 4 HN points • 04 Jun 24

🕹 Technology AI Security Data science Software Development

AI weapon detection software struggles to differentiate between weapons and weapon-shaped objects like umbrellas or sticks, leading to issues in accuracy and efficiency.
OpenAI's ChatGPT-4o offers more advanced weapon detection capabilities from image analysis compared to current market options, recognizing context better.
ChatGPT-4o was successful in identifying guns and gun-like objects in various scenarios, showcasing a high level of performance in image classification and context understanding.

2023 Wrap up

RSS DS+AI Section • 53 implied HN points • 31 Dec 23

🕹 Technology Data science Artificial Intelligence Events Newsletter

The focus for the year was 'Effective and Efficient Data Science' to highlight the critical aspects of the field beyond hype.
Various events and discussions were held throughout the year to promote best practices in Data Science.
Engagement with the community through events, surveys, and articles was emphasized to ensure diverse voices are heard in influencing policy.

Practical Reinforcement Learning and Differential Privacy

Gradient Flow • 119 implied HN points • 17 Feb 22

🕹 Technology Machine Learning Data science Infrastructure Privacy

The ratio of data scientists to data engineers varies based on factors like tools, infrastructure, and use cases, with no set ideal ratio.
Interesting developments include a new podcast discussing machine learning infrastructure at Netflix, imperceptible NLP attacks, and evolving data science training programs.
Notable tools and updates in the data and machine learning space include practical reinforcement learning applications, scalable differential privacy for Python developers, and Orbit version 1.1 for Bayesian time-series analysis.

🚀 Keeping it Real (as usual)

Data Science at Home • 19 implied HN points • 03 Dec 23

🕹 Technology Data science Podcasts Artificial Intelligence Website Design Community Building

Data Science at Home focuses on real-world applications without hype or drama.
The podcast kept its original name to stay true to its core identity and values.
The website refresh includes a sleek design, fresh episodes, interactive features, and a community section.

Data Storytelling: 5 Simple Tips With Python Plotly to Spice Up Your Data

Data at Depth • 19 implied HN points • 03 Dec 23

🕹 Technology Data science Python Visualization

Data storytelling is crucial for making data accessible and interesting.
Python's Plotly library is a powerful tool for visualizing data sets in meaningful ways.
Consider supporting publications like Data at Depth for more insightful content.

Newsletter #1 - Welcome to Data at Depth!

Data at Depth • 19 implied HN points • 01 Dec 23

🕹 Technology Data science Computer Science Artificial Intelligence Machine Learning Newsletter

The newsletter 'Data at Depth' aims to explore topics in computer science and data analytics, sharing insights from a professor with 20+ years of experience in the field.
Constant growth and exploration in the world of AI-generated data leave many people curious and on a learning journey.
Readers can subscribe to Data at Depth for a 7-day free trial to access full post archives and continue learning about data and computer science topics.

The Data Hierarchy of Needs

The Orchestra Data Leadership Newsletter • 19 implied HN points • 26 Nov 23

🕹 Technology Data Engineering Data science Data Transformation

Data can be structured in a hierarchy similar to Maslow's Hierarchy of Needs, where each level is necessary for the enjoyment of the level above it. This concept applies to data engineering pipelines.
Data pipelines are crucial for deriving business value, even if they are complex and not directly visible. Architectural considerations and infrastructure choices play a significant role in making data a priority in a business.
When considering data infrastructure, such as data ingestion tools, cloud warehouses, BI tools, and others, it's important to plan the entire stack and not just jump to specific infrastructure. Consider aspects like version control, security, integration, and orchestration.

How to implement a decision tree

The Palindrome • 3 implied HN points • 08 Nov 24

🕹 Technology Machine Learning Data science Programming Artificial Intelligence Software Development

A decision tree splits data based on features and thresholds, which helps in making predictions by creating branches. Each split leads to two outcomes based on whether the condition is met or not.
Gini impurity is a key measure for evaluating how 'pure' the labels are in each leaf of the tree. A lower Gini impurity means better predictability for a leaf's classification.
You can create both classification and regression trees by changing how you score the splits and define the predictions in the leaves. This flexibility allows for various applications in data analysis.
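
The takeaways above can be made concrete with a small sketch of Gini impurity and a single-feature threshold search. This is an illustrative version, not the post's implementation: the helper names `gini` and `best_threshold` are assumptions, and it handles only one numeric feature.

```python
from collections import Counter


def gini(labels: list) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values: list[float], labels: list) -> tuple:
    """Find the threshold t minimizing the size-weighted Gini of the two sides.

    Samples with value <= t go left, the rest go right.
    Returns (threshold, weighted_gini), or (None, inf) if no split is possible.
    """
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # a split must put samples on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best
```

Swapping `gini` for a variance-based score, and predicting the mean of a leaf instead of its majority class, would turn the same search into a regression split.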

The Bias vs Variance Tradeoff [Math Mondays]

Technology Made Simple • 39 implied HN points • 06 Dec 22

🕹 Technology Data science Machine Learning Statistics Deep Learning Tech Education

Understanding the Bias-Variance Tradeoff is crucial in Data Science and Machine Learning.
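Bias in a Machine Learning Model refers to systematic prediction error (how far the model's average prediction is from the true value), while Variance measures how much predictions spread when the model is trained on different data.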
High Bias can lead to underfitting, where the model doesn't grasp the data pattern fully, while High Variance can result in overfitting, where the model learns noise in the data.
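
A small numeric sketch can make the two terms concrete. It assumes the usual definitions (bias as the gap between the average prediction and the true value, variance as the spread of predictions across models trained on different samples). The function name `bias_and_variance` and the input format are hypothetical.

```python
from statistics import fmean, pvariance


def bias_and_variance(predictions: list[float], true_value: float) -> tuple[float, float]:
    """Estimate bias and variance at a single input point.

    `predictions` holds the outputs of the same model type trained on
    different training samples, all evaluated at the same input.
    """
    bias = fmean(predictions) - true_value  # systematic offset from the truth
    variance = pvariance(predictions)       # spread around the average prediction
    return bias, variance
```

Predictions that cluster tightly around the wrong value indicate high bias (underfitting), while predictions centered on the right value but widely scattered indicate high variance (overfitting).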

Chain-Of-Knowledge Prompting

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 22 Nov 23

🕹 Technology AI NLP Machine Learning Data science Natural Language

Chain-Of-Knowledge (CoK) prompting is a useful technique for complex reasoning tasks. It helps make AI responses more accurate by using structured facts.
Creating effective prompts using CoK requires careful construction of evidence and may involve human input. This is important for ensuring the quality and reliability of the information AI generates.
The CoK approach aims to reduce errors or 'hallucinations' in AI responses. It offers a more transparent way to build prompts and enhances the overall reasoning ability of AI systems.

Why Talking Models are not going to take your jobs [Math Mondays]

Technology Made Simple • 39 implied HN points • 29 Nov 22

🕹 Technology AI Data science Engineering Programming Machine Learning

Models that process inputs represent features as vectors; this is not the same as replacing people.
Comparing the similarity between data points helps models generate answers efficiently.
Big models have limitations in working with new inputs and face engineering challenges at scale.

AI Week That Was

Sector 6 | The Newsletter of AIM • 39 implied HN points • 19 Mar 23

🕹 Technology Artificial Intelligence Software Development Data science Machine Learning Tech Trends

Alpaca 7B is a new AI model introduced by Stanford that performs well, similar to OpenAI's models, but is smaller and cheaper to use.
The AI landscape is buzzing with exciting developments and new models, making it an interesting time for AI enthusiasts.
The week highlights a range of impressive AI technologies, signaling that there's much more innovation to come in this field.

How Should Large Language Models Be Evaluated?

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 06 Nov 23

🕹 Technology AI Machine Learning Data science Natural Language Processing Evaluation

When evaluating large language models (LLMs), it's important to define what you're trying to achieve. Know the problems you're solving so you can measure success and failure.
Choosing the right data is crucial for evaluating LLMs. You'll need to think about what data to use and how it will be delivered in your application.
The process of evaluation can be automated or involve human input. Deciding how to implement this process is key to building effective LLM applications.

The Most Common Data Science Interview Mistake

inexactscience • 39 implied HN points • 14 Mar 23

🚌 Education Interview Tips Data science Problem Solving Professional development

One big mistake in data science interviews is jumping to solutions too quickly. It's important to first understand the problem before trying to solve it.
Asking questions during the interview can show your insight and help you gather essential information. It helps to clarify the business context and what needs to be addressed.
Finding a balance is key. You want to ask enough questions to understand the issue without getting stuck in overthinking. A good candidate knows when to seek clarification and when to respond directly.

AI, is it Logic or Magic?

The Novice • 19 implied HN points • 26 Oct 23

🕹 Technology AI Data science Machine Learning Algorithms Artificial Intelligence

AI is based on statistics and massive data processing, not magic.
AI mimics human-like thought processes through algorithms and machine learning techniques.
Understanding AI involves complex details and processes beyond human perception.

The dbt meta tag

Data Thoughts • 59 implied HN points • 25 Nov 22

🕹 Technology Data science Software Development Open Source Funding

The dbt meta tag helps document important info about data models. It's a simple way to keep track of data governance like ownership and sensitivity.
Many companies have used the dbt meta tag to enhance their products. Some of these companies have received significant venture capital funding because of these improvements.
Documenting tools and their funding related to the dbt meta tag can inspire others. It shows how small features can lead to big opportunities.

99% of people just get AI wrong...

do clouds feel vertigo? • 39 implied HN points • 25 Mar 23

🕹 Technology Artificial Intelligence Machine Learning Data science Programming Innovation

Microsoft claims that GPT-4 shows potential for Artificial General Intelligence, but some critics doubt its transparency and reliability, feeling it's more of a marketing claim than factual science.
Generative AI models can produce creative outputs but shouldn't be judged like traditional knowledge tools. They often generate believable yet false information, showcasing a need for a different evaluation standard.
As AI technology evolves, the cost to create content is decreasing, which raises questions about who will really profit from it and how existing knowledge can be effectively leveraged in this new landscape.

On TikTok, The Algorithm Optimizes YOU

Never Met a Science • 55 implied HN points • 31 May 23

🕹 Technology Social media Algorithm Data science Simulation Machine Learning

TikTok's algorithm shapes content creators' behavior based on feedback and viral success.
The algorithm aims to keep both creators and consumers engaged, but risks leading to repetitive content.
Data science and algorithms in platforms like TikTok create simplified simulations of reality for optimization, focusing on subjective metrics.

"Algorithmic entombment", explore-exploit trade-offs, and serendipity

The Counterfactual • 59 implied HN points • 04 Oct 22

🕹 Technology Algorithms Media Recommendations User Experience Data science

Recommendation systems can help us find new favorites but also risk making our choices repetitive. If we're only shown what we already like, we might miss out on discovering exciting new things.
There's a balance between exploring new options and sticking to what we know. Too much of either can lead to boredom or discomfort, so it’s important to mix both approaches in our choices.
Serendipity, or those happy accidents that lead to great moments, can be lost with strict recommendation systems. Sometimes the best experiences come from unexpected encounters, not just from things we already enjoy.
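
The explore-exploit balance described above is often illustrated with an epsilon-greedy rule. The sketch below is a generic example of that rule, not something taken from the post; the function name `recommend` and the rating dictionary are assumptions.

```python
import random


def recommend(avg_rating: dict[str, float], epsilon: float, rng: random.Random) -> str:
    """Pick an item: explore at random with probability epsilon, else exploit.

    `avg_rating` maps each item to the user's average rating so far.
    """
    items = list(avg_rating)
    if rng.random() < epsilon:
        return rng.choice(items)          # explore: leave room for serendipity
    return max(items, key=avg_rating.get)  # exploit: the current favorite
```

With `epsilon=0` the user only ever sees the current favorite, and raising it trades some short-term satisfaction for the chance of discovering something new.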

Updated: Emerging RAG & Prompt Engineering Architectures for LLMs

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 18 Oct 23

🕹 Technology Artificial Intelligence Machine Learning Natural Language Processing Data science Software Development

Large Language Models (LLMs) rely on both input and output data that are unstructured and conversational. This means they process language in a natural, free-flowing manner.
Fine-tuning LLMs has become less popular because it requires a lot of specific training and can get outdated. Using contextual prompts at the right time is a better way to improve their accuracy.
New tools are emerging that test different LLMs against prompts instead of just tweaking prompts for one LLM. This helps in finding the best model suited for different tasks.

Temporal degradation framework and other ideas

Santiago and the ML Models • 19 implied HN points • 05 Jun 23

🔬 Science Data science Machine Learning Model Evaluation

The author is working on a Temporal Model Degradation Framework for AI models.
They have implemented an experiment with early results showing model performance degradation over time.
The author plans to conduct a Continuous Retraining Experiment to test if continuous retraining can prevent model degradation.

Premium JC update

The Jolly Contrarian • 19 implied HN points • 14 Aug 23

🚌 Education Legal Documentation Essays Data science Organizational Change

The Premium JC update includes progress on premiumizing the ISDA and Equity Derivatives Definitions.
A consolidated anatomy of emissions trading documentation under ISDA, EFET, and IETA is in the works.
JC Essays explore themes such as form versus substance, system redundancy, and pace layering.

The Birth of Baby Llama

Sector 6 | The Newsletter of AIM • 19 implied HN points • 25 Jul 23

🕹 Technology AI Deep Learning Software Cloud Computing Data science

Andrej Karpathy worked on a fun project to create a smaller version of the Llama 2 model called Baby Llama. It's designed to run on a single computer.
The Baby Llama can load and use the models released by Meta, making it more accessible for users.
Karpathy shared that the performance is promising, with potential for faster processing speeds on a cloud setup.

Bayesian Thinking for Software Engineering [Math Mondays]

Technology Made Simple • 59 implied HN points • 03 May 22

🕹 Technology Software Engineering Data science Deep Learning Mathematics

Bayes' theorem lets us update beliefs based on evidence, which is crucial for software developers making decisions.
Bayesian thinking is implicit in many of the decisions we make, and recognizing this can help prevent fallacies.
Learning Bayesian thinking involves understanding the intuition behind the math, using resources like StatQuest and 3Blue1Brown.
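
As a worked example of updating a belief on evidence, the sketch below applies Bayes' theorem to a made-up debugging scenario. The function name `posterior` and all the probabilities are illustrative assumptions, not figures from the post.

```python
def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E).

    P(E) is expanded over the two cases, H true and H false.
    """
    evidence = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("the evidence has zero probability under both hypotheses")
    return p_e_given_h * prior / evidence


# Hypothetical numbers: 10% of commits introduce a bug; the test fails for 90%
# of buggy commits but also for 20% of clean ones (a flaky test).
p_bug_given_failure = posterior(prior=0.10, p_e_given_h=0.90, p_e_given_not_h=0.20)
```

With these numbers a failing test raises the probability of a bug from 10% to about 33%, which is why a single red run of a flaky test is weaker evidence than it first appears.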