The hottest data science Substack posts right now

And their main takeaways
Category: Top Technology Topics
Data Science Weekly Newsletter 419 implied HN points 22 Dec 23
  1. Generative AI is changing how we work with tools, improving the Human-Tool Interface. This can help us use technology in ways we never could before.
  2. Support Vector Machines (SVMs) can be very effective for prediction tasks, often achieving lower error rates than other models. However, they aren't as commonly used, possibly due to their complexity.
  3. Deep multimodal fusion is useful in surgical training. It helps classify feedback from experienced surgeons to trainees by combining different types of data like text, audio, and video.
Mindful Modeler 339 implied HN points 23 Jan 24
  1. Quantile regression can be used for robust modeling to handle outliers and predict tail behavior, helping in scenarios where underestimation or overestimation leads to loss.
  2. It is important to choose quantile regression when predicting specific quantiles, such as upper quantiles, in scenarios like bread sales where under- or overestimating has financial consequences (see the sketch after this list).
  3. Quantile regression can also be utilized for uncertainty quantification, and combining it with conformal prediction can improve coverage, making it useful for understanding and managing uncertainty in predictions.
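To make the bread-sales example concrete, here is a minimal sketch using scikit-learn on synthetic data (an illustration only, not the post's own code). Fitting an upper quantile with the pinball loss penalizes underestimation more heavily than overestimation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic, right-skewed "sales" data (an assumption made for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))           # e.g. a demand signal
y = 20 + 3 * X.ravel() + rng.gumbel(0, 5, 500)  # noisy, skewed outcomes

# loss="quantile" fits a conditional quantile via the pinball loss.
median = GradientBoostingRegressor(loss="quantile", alpha=0.5).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, y)

x_new = [[7.0]]
print(median.predict(x_new), upper.predict(x_new))  # the 0.9-quantile sits above the median
```

Conformalized quantile regression, mentioned in the third point, would then widen or narrow such intervals on a held-out calibration set to reach the desired coverage.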
Interconnected 246 implied HN points 18 Nov 24
  1. The scaling law for AI models might be losing effectiveness, meaning that simply using more data and compute power may not lead to significant improvements like it did before.
  2. US export controls on AI technology may become less impactful over time, as diminishing returns on AI model scaling could lessen the advantages of having the most advanced hardware.
  3. If AI development slows down, the urgency for a potential 'AI doomsday' scenario may decrease, allowing for a more balanced competition between the US and China in AI advancements.
John Ball inside AI 39 implied HN points 24 Jul 24
  1. You don't need many words to communicate in a new language. Just a small vocabulary can help you get by in everyday conversations.
  2. For understanding most spoken and written text, around 2000 words are usually enough. This covers about 80% of regular communication.
  3. Machine learning and AI can benefit from understanding language like humans do, by learning new words in context rather than just relying on a large vocabulary.
Tech Talks Weekly 59 implied HN points 26 Jul 24
  1. Tech Talks Weekly is a free email newsletter that shares recent talks from dozens of tech conferences. It's a great way to catch up on what you missed!
  2. Readers can participate by filling out a short form to help improve the content. This makes it a community-driven resource.
  3. The newsletter highlights popular talks each week, making it easier for people to discover valuable insights from experts in tech.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 119 implied HN points 16 May 24
  1. AI agents can make decisions and take actions based on their environment. They operate at different levels of complexity, with level one being simple rule-based systems.
  2. Currently, AI agents are improving rapidly, sitting at levels two and three, where they can automate tasks and manage sequences of actions effectively.
  3. The future of AI agents is bright, as they will be more integrated into various industries, but we need to consider issues like accountability and ethics when designing and implementing them.
Data Science Weekly Newsletter 139 implied HN points 03 May 24
  1. Reusing data analysis work can save time and help teams focus on building new capabilities instead of just repeating old ones.
  2. Open-source models can be a better choice than proprietary ones for developing AI applications, often making development cheaper and faster.
  3. Causal machine learning helps predict treatment outcomes by personalizing clinical decisions based on individual patient data.
Year 2049 13 implied HN points 17 Jan 25
  1. AI systems learn from data, so the quality of that data is really important. Better data means smarter machines.
  2. Machines can become biased if they are trained on biased data. It's important to watch out for this when developing AI.
  3. This is just one part of a series explaining AI. More episodes will cover different aspects of how machines learn and behave.
TheSequence 175 implied HN points 09 Dec 24
  1. RAG techniques combine the power of language models with external data to improve accuracy. This means AI can give better answers by using real-world information.
  2. Advanced methods like Small to Slide RAG make it easier for AI to work with visual data, like slides and images. This helps AI understand complex information that is not just text.
  3. ColPali is a new approach that focuses on visuals directly, avoiding mistakes from converting images to text. It's useful for areas like design and technical documents, ensuring important details are not missed.
The Counterfactual 599 implied HN points 28 Jul 23
  1. Large language models, like ChatGPT, work by predicting the next word based on patterns they learn from tons of text. They don’t just use letters like we do; they convert words into numbers to understand their meanings better.
  2. These models handle the many meanings of words by changing their representation based on context. This means that the same word could have different meanings depending on how it's used in a sentence.
  3. The training of these models does not require labeled data. Instead, they learn by guessing the next word in a sentence and adjusting their processes based on whether they are right or wrong, which helps them improve over time.
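To make the next-word idea concrete, here is a toy bigram counter in plain Python (purely illustrative; real models learn numeric representations with neural networks rather than counting word pairs):

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Map words to integer IDs, mirroring the idea that models work with numbers, not letters.
vocab = {w: i for i, w in enumerate(dict.fromkeys(corpus))}
id_to_word = {i: w for w, i in vocab.items()}

# Count which word follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[vocab[prev]][vocab[nxt]] += 1

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in the corpus."""
    next_id, _ = following[vocab[word]].most_common(1)[0]
    return id_to_word[next_id]

print(predict_next("the"))  # 'cat' -- the most common continuation in this toy corpus
```

An LLM replaces the frequency table with a neural network whose predictions depend on the whole preceding context, which is how the same word can take on different meanings in different sentences.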
Data Science Weekly Newsletter 119 implied HN points 10 May 24
  1. Time-series analysis and Gaussian processes are powerful tools for interpreting data. They allow for flexibility and control in modeling data, making them essential for data practitioners.
  2. Understanding A/B testing is crucial for making informed business decisions. Using a reliable experimentation system can save time and lead to better results.
  3. New advancements in AI and data science are enhancing applications in various fields, like biomedical research and recommendation systems. These innovations help combine human creativity with machine learning capabilities.
The AI Frontier 119 implied HN points 09 May 24
  1. Open LLMs, like Llama 3, are getting really good and can perform well in many tasks. This improvement makes them a strong option for various applications.
  2. Fine-tuning open LLMs is becoming more attractive because of their improved quality and lower costs. This means smaller, specialized models can be more easily developed and used.
  3. However, open models likely won't surpass OpenAI's offerings. The proprietary models have a big advantage, but open LLMs can still thrive by focusing on efficiency and specific use cases.
Data Science Weekly Newsletter 179 implied HN points 29 Mar 24
  1. SQL is seen as an easier way to write relational algebra, but it's not ideal for building new query tools. Understanding its limits can help in learning and using SQL better.
  2. Many successful companies have developed their own AI models, showing a trend in the tech industry. Knowing about these companies can give insights into future developments in AI.
  3. Binary vector search methods can save a lot of memory compared to traditional methods. However, it's important to balance memory savings with maintaining accuracy.
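As a rough illustration of that memory trade-off (NumPy on random vectors, not the newsletter's code), each float32 dimension can be collapsed to a single bit and searched with Hamming distance:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 768)).astype(np.float32)   # ~30 MB as float32
binary_docs = np.packbits(docs > 0, axis=1)                 # ~1 MB as bits (~32x smaller)

def hamming_search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k nearest docs by Hamming distance on the binarized vectors."""
    binary_query = np.packbits(query > 0)
    distances = np.unpackbits(binary_docs ^ binary_query, axis=1).sum(axis=1)
    return np.argsort(distances)[:k]

candidates = hamming_search(rng.normal(size=768).astype(np.float32))
# Accuracy is typically recovered by re-ranking these candidates with the full-precision vectors.
```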
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 18 Jul 24
  1. Large Language Models (LLMs) can create useful text but often struggle with specific knowledge-based questions. They need better ways to understand the question's intent.
  2. Retrieval-augmented generation (RAG) systems try to solve this by using extra knowledge from sources like knowledge graphs, but they still make many mistakes.
  3. The Mindful-RAG approach focuses on understanding the question's intent more clearly and finding the right context in knowledge graphs to improve answers.
Owen’s Substack 59 implied HN points 19 Jul 24
  1. Triplex is a new tool that helps create knowledge graphs quickly and at much lower cost than older methods, putting knowledge-graph construction within reach of more people.
  2. This tool is small enough to run on regular laptops, which means you don't need powerful computers to build knowledge graphs. This makes technology more accessible to everyone.
  3. Triplex is open-source, allowing anyone to use and improve it. The community can experiment with it freely and innovate new ways to organize and understand information.
Data Science Weekly Newsletter 199 implied HN points 14 Mar 24
  1. Serverless computing can scale to handle large tasks, but it also brings challenges, such as managing large uploads effectively.
  2. Art careers can be influenced by the reputation of institutions, with established artists facing less access to elite spaces early on compared to newcomers.
  3. Learning about LLM evaluation metrics can help improve understanding and performance when working with large language models.
Steve Kirsch's newsletter 6 implied HN points 18 May 25
  1. The KCOR method is a new, simple technique to analyze how different interventions, like vaccines, affect outcomes such as mortality. It uses basic data like date of birth, date of death, and vaccination date to provide clear results.
  2. The analysis suggests that COVID vaccines may have increased mortality rates, indicating the vaccines could be more harmful than helpful. This counters many previous claims about the vaccines saving lives.
  3. KCOR is designed to be objective and straightforward, allowing for accurate comparisons without needing complex data adjustments, making it a powerful tool for understanding health interventions.
Interconnected 138 implied HN points 03 Jan 25
  1. DeepSeek-V3 is an AI model that is performing as well or better than other top models while costing much less to train. This means they're getting great results without spending a lot of money.
  2. The AI community is buzzing about DeepSeek's advancements, but there seems to be less excitement about it in China compared to outside countries. This might show a difference in how AI news is perceived globally.
  3. DeepSeek has a few unique advantages that set it apart from other AI labs. Understanding these can help clarify what their success means for the broader AI competition between the US and China.
HackerPulse Dispatch 5 implied HN points 31 Jan 25
  1. LLM-AutoDiff can make AI workflows more efficient by automatically optimizing prompts, leading to better performance without the need for manual work.
  2. Racing for superintelligence might cause more problems than it solves, making cooperation between nations a better option.
  3. Combining reinforcement learning with transformers can create AI that adapts and solves new problems effectively over time.
The AI Frontier 159 implied HN points 04 Apr 24
  1. Current methods for evaluating language models (LLMs) are not effective because they try to give one-size-fits-all answers. Each LLM is better suited for different tasks, so we need evaluations that reflect that.
  2. It’s important to look at specific skills of LLMs, like how well they follow instructions or retrieve information. This will help users understand which model works best for their needs.
  3. We need more detailed benchmarks that assess individual capabilities rather than general performance scores. This way, developers can make smarter choices when selecting LLMs for their projects.
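A toy example of what capability-level scoring could look like, using made-up result records (an assumption for illustration, not a benchmark from the post):

```python
from collections import defaultdict

# Hypothetical evaluation records, each tagged with the capability it probes.
results = [
    {"capability": "instruction_following", "correct": True},
    {"capability": "instruction_following", "correct": False},
    {"capability": "retrieval", "correct": True},
    {"capability": "retrieval", "correct": True},
]

per_capability = defaultdict(list)
for r in results:
    per_capability[r["capability"]].append(r["correct"])

# One score per capability instead of a single aggregate number.
scores = {cap: sum(v) / len(v) for cap, v in per_capability.items()}
print(scores)  # e.g. {'instruction_following': 0.5, 'retrieval': 1.0}
```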
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 13 Aug 24
  1. RAG Foundry is an open-source framework that helps make the use of Retrieval-Augmented Generation systems easier. It brings together data creation, model training, and evaluation into one workflow.
  2. This framework allows for the fine-tuning of large language models like Llama-3 and Phi-3, improving their performance with better, task-specific data.
  3. There is a growing trend in using synthetic data for training models, which helps create tailored datasets that match specific needs or tasks better.
TheSequence 98 implied HN points 21 Jan 25
  1. RAG stands for Retrieval-Augmented Generation. It's a way for machines to pull in outside information, helping them give better and more accurate answers.
  2. There are many kinds of RAG, like Standard RAG and Fusion RAG. Each type helps machines deal with different problems and has its special strengths.
  3. Understanding these RAG types is important for anyone working in AI. It helps them choose the right approach for different challenges.
Data Science Weekly Newsletter 359 implied HN points 15 Dec 23
  1. Learning about causal models is important in data analysis because it helps explain what caused the data. This understanding can improve how we interpret results using Bayesian methods.
  2. There's growing concern over data privacy in AI tools like Dropbox. Users are worried their private files could be used for AI training, even though companies deny this.
  3. Netflix recently held a Data Engineering Forum to share best practices. They discussed ways to improve data pipelines and processing, which could benefit many in the data engineering community.
TheSequence 77 implied HN points 04 Feb 25
  1. Corrective RAG is a smarter way of using AI that makes it more accurate by checking its work. It helps prevent mistakes or errors in the information it gives.
  2. This method goes beyond basic retrieval-augmented generation (RAG) by adding feedback loops that refine and improve the output as it learns.
  3. The goal of Corrective RAG is to provide answers that are factually accurate and coherent, reducing confusion or incorrect information.
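A minimal sketch of that feedback loop, with stand-in functions (retrieve, grade_relevance, rewrite_query, and generate are placeholders invented for illustration, not TheSequence's implementation):

```python
def retrieve(query):                    # stand-in retriever
    return ["a passage about " + query]

def grade_relevance(query, passages):   # stand-in check on the retrieved evidence
    return [p for p in passages if query.lower() in p.lower()]

def rewrite_query(query):               # stand-in query refinement
    return query + " (rephrased)"

def generate(query, passages):          # stand-in generator
    return f"Answer to '{query}' grounded in {len(passages)} passage(s)."

def corrective_rag(query: str, max_rounds: int = 2) -> str:
    """Retrieve, grade the evidence, and retry with a refined query if it looks weak."""
    for _ in range(max_rounds):
        passages = retrieve(query)
        good = grade_relevance(query, passages)
        if good:                          # evidence passed the check: answer from it
            return generate(query, good)
        query = rewrite_query(query)      # otherwise refine the query and retry
    return generate(query, [])            # fall back rather than answer from bad evidence

print(corrective_rag("vector databases"))
```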
Data Science Weekly Newsletter 139 implied HN points 12 Apr 24
  1. This newsletter provides links and updates about data science, AI, and machine learning. It's a helpful resource for anyone wanting to stay informed in this field.
  2. One article teaches how to handle real questions using Python, which is great for people wanting practical coding skills. Another discusses techniques to make sure AI outputs stay on task.
  3. The newsletter also features resources and courses to help people learn and improve their skills in data science and related areas. It's a good place to find learning opportunities.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 10 Jul 24
  1. Using Chain-of-Thought prompting helps large language models think through problems step by step, which makes them more accurate in their answers.
  2. Smaller language models struggle with Chain-of-Thought prompting and often get confused, because they lack the knowledge and reasoning ability of the bigger models.
  3. Google Research has a method to teach smaller models by learning from larger ones. This involves using the bigger models to create helpful examples that the smaller models can then learn from.
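A rough sketch of that teaching idea, with a stub standing in for the larger model (the helper names are assumptions for illustration, not Google Research's actual code):

```python
def call_large_model(prompt: str) -> str:
    """Stand-in for a call to a large model; returns a canned step-by-step rationale."""
    return ("Step 1: the distance is 60 km and the time is 1.5 hours. "
            "Step 2: speed = 60 / 1.5 = 40. Answer: 40 km/h.")

def make_cot_example(question: str) -> dict:
    """Ask the large model to reason step by step; the (question, rationale) pair
    becomes a training example for a smaller model."""
    prompt = f"Question: {question}\nLet's think step by step, then give the final answer."
    return {"input": question, "target": call_large_model(prompt)}

examples = [make_cot_example("If a train travels 60 km in 1.5 hours, what is its speed?")]
print(examples[0]["target"])
# A smaller model fine-tuned on many such examples learns to imitate the
# larger model's reasoning steps, not just its final answers.
```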
The Data Ecosystem 119 implied HN points 21 Apr 24
  1. Data can be really complicated, and it's easy to miss how everything connects. People often focus on their own area and forget about the bigger picture of the data ecosystem.
  2. Chief Data Officers (CDOs) are important but can only do so much to fix data issues. They deal with many challenges, including limited power, lack of experience, and politics within the organization.
  3. To improve in the data field, we need to recognize the gaps in our knowledge, prioritize what to focus on, and continuously educate ourselves in both our own areas and related data domains.
ppdispatch 5 implied HN points 16 May 25
  1. The 'Leaderboard Illusion' highlights how some AI models get unfair rankings because of selective information sharing. This can make it hard to know which models are truly the best.
  2. Large Language Models (LLMs) struggle a lot in long conversations, with a big drop in their performance. They often lose track of conversations and can make mistakes early on that affect the whole chat.
  3. MiniMax-Speech is a new tech for turning text into speech that can imitate voices in multiple languages. It also allows for cool features like expressing emotions in the voice.
Data Analysis Journal 452 implied HN points 26 Jul 23
  1. The author reflects on three years of writing a newsletter about analytics, thanking supporters and subscribers.
  2. The author's newsletter aims to document their journey, bridge the gap between academia and industry, and encourage classic data analysis.
  3. The author shares insights on their writing strategy, the power of being small and independent, and future plans for the newsletter.
Data Science Weekly Newsletter 339 implied HN points 01 Dec 23
  1. Data science is evolving quickly, and it's important to stay updated with new advances and tools. Courses and reading lists can help you catch up and enhance your skills.
  2. Using machine learning to solve real-world problems, like correctly attributing quotes, shows the practical applications of data science. Collaboration between universities and organizations can lead to innovative solutions.
  3. The job market for data scientists is challenging right now. Many applicants are competing for limited positions, so if you're looking for a job, patience is key.
Data Science Weekly Newsletter 179 implied HN points 01 Mar 24
  1. The DSPy framework makes working with large language models easier by focusing on programming instead of complex prompting techniques. This helps reduce errors and improves usability.
  2. A new sequence model approach shows better performance than traditional Transformers, especially for long data sequences. It also works faster, making it a promising development in the field.
  3. Learning resources like online courses and free books on deep learning and causal ML can help deepen understanding of data science. They provide structured material that is great for both beginners and advanced learners.
VuTrinh. 59 implied HN points 11 Jun 24
  1. Meta has developed a serverless Jupyter Notebook platform that runs directly in web browsers, making data analysis more accessible.
  2. Airflow is being used to manage over 2000 DBT models, which helps teams create and maintain their own data models effectively.
  3. Building a data platform from scratch can be a valuable learning experience, revealing important lessons about data structure and management.
Mindful Modeler 419 implied HN points 19 Sep 23
  1. For imbalanced classification tasks, 'Do Nothing' should be the default approach, especially when dealing with calibration, strong classifiers, and class-based metrics.
  2. Addressing imbalanced data should be considered in scenarios where misclassification costs vary, metrics are impacted by imbalance, or weaker classifiers are used.
  3. Instead of using oversampling methods like SMOTE, adjusting data weighting, using cost-sensitive machine learning, and threshold tuning are more effective ways to handle class imbalance.
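A minimal sketch of those alternatives, using scikit-learn on synthetic data (an illustration, not Mindful Modeler's code): cost-sensitive class weighting plus an explicit decision threshold, with no resampling at all.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a 95/5 class split (an assumption for illustration).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Cost-sensitive learning: weight the minority class more heavily instead of oversampling it.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Threshold tuning: move the decision threshold to match the actual misclassification costs.
proba = clf.predict_proba(X_te)[:, 1]
preds = (proba >= 0.3).astype(int)   # 0.3 is a placeholder; pick it from the cost trade-off
```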
Data Science Weekly Newsletter 339 implied HN points 17 Nov 23
  1. JAX is becoming popular for its speed and capabilities, and learning it may be essential for those familiar with PyTorch. It does have a steeper learning curve, but there are resources to help ease the transition.
  2. The demand for GPUs is skyrocketing, driven by various market factors. Understanding these dynamics can help anticipate the future of technology and resource availability in industries reliant on powerful computing.
  3. Freelancing in data science can lead to an overwhelming number of job offers. Tips on finding clients on platforms like Upwork and LinkedIn can help navigate this new freelance landscape.
Data Science Weekly Newsletter 379 implied HN points 27 Oct 23
  1. Web development is evolving with the use of local models and technologies for building applications, moving beyond just Python-based machine learning.
  2. It's becoming increasingly important for developers to understand GPUs since they're widely used in deep learning and can greatly enhance performance.
  3. Companies are exploring various use cases for generative AI that provide real value, focusing on practical implementations that drive return on investment.
Data Science Weekly Newsletter 219 implied HN points 26 Jan 24
  1. AI often gets criticized for the quality of its output, but that might not be the real issue people have with it. If quality is fixed, the conversation about AI could change significantly.
  2. Common sense is tricky to define and measure, but researchers are developing ways to quantify it both individually and collectively. This could help clarify how we understand common sense in different contexts.
  3. Large language models (LLMs) can transform education by encouraging hands-on learning. They offer opportunities for more interactive and engaging learning experiences.