The hottest Data science Substack posts right now

And their main takeaways

Data Science Weekly - Issue 503

Data Science Weekly Newsletter • 219 implied HN points • 14 Jul 23

Machine learning is making its way into finance, and researchers are identifying practical uses for it. This can help finance professionals learn new tools and statisticians find interesting financial problems to solve.
AI platforms, like social media, are becoming crucial in our lives but can be confusing and unreliable. People are figuring out how to use these platforms effectively despite their unpredictability.
Large language models are changing how data scientists work. These models can automate many tasks, allowing data scientists to focus on managing and assessing the AI's outputs.

Issue 17: No great stagnation in cruise ships

The Works in Progress Newsletter • 12 implied HN points • 05 Dec 24

🕹 Technology Infrastructure Engineering Urbanism Biotechnology Data science Chemistry

Cruise ships show that new ideas and growth are still possible in design and urban living, even as some land technologies seem to stall.
Madrid has successfully built its metro system much faster and cheaper than cities like London and New York by using smart planning and incentives for local leaders.
Many animals, like horses and crabs, are essential for creating life-saving chemicals, reminding us that we still rely on nature, even as technology advances.

The Tech Buffet #1: How To Design a System To Chat With Your Private Data

The Tech Buffet • 159 implied HN points • 04 Sep 23

🕹 Technology AI Data science Software Development Architecture Machine Learning

Building a custom chatbot helps in getting accurate answers from specific internal data without the risk of it making things up. This is especially useful for specialized knowledge.
Using a chatbot saves time and makes it super easy to find information quickly, boosting productivity for users.
You can keep improving and updating the bot as your data changes, and you have full control over privacy by using open-source tools.

Vesuvius Challenge Progress Prizes: December Edition

Vesuvius Challenge • 31 implied HN points • 24 Jan 25

🕹 Technology Data science Machine Learning Computer Vision Community Engagement Open Source

The community is focused on improving data quality, like using better labels and refining how they categorize information. This will help them create automated tools for analyzing scrolls more effectively.
Several contributors have made significant advancements in developing new segmentation models and tools, which will help in analyzing scroll data. These innovations are key for understanding ancient texts.
2024 has been a great year for teamwork and progress as everyone shares their findings. The hard work from many people is leading to quick improvements in technology for studying historical scrolls.

The Sequence Knowledge #463: Wrapping Up our Series About Knowledge Distillation: Pros and Cons

TheSequence • 35 implied HN points • 07 Jan 25

🕹 Technology Machine Learning Artificial Intelligence Data science Deep Learning Research

Knowledge distillation is a method where a smaller model learns from a larger, more complex model. This helps make the smaller model efficient while retaining essential features.
The series covered different techniques and challenges in knowledge distillation, highlighting its importance in machine learning and AI development. Understanding these can help when deciding if this approach is suitable for your projects.
It's useful to be aware of both the benefits and drawbacks of knowledge distillation. This helps in figuring out the best way to implement it in real-world applications.
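
As a rough illustration of the idea in the takeaways above, the sketch below computes a soft-target distillation loss in plain Python. It assumes the common formulation in which the student is trained to match the teacher's temperature-softened output distribution; the function names, logits, and temperature are made up for the example and are not taken from the series.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different class

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The student that agrees with the teacher gets the smaller loss. In practice this term is usually combined with an ordinary loss on the true labels.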

Data Science Weekly - Issue 496

Data Science Weekly Newsletter • 259 implied HN points • 26 May 23

🕹 Technology Data science Machine Learning Artificial Intelligence Software Development Data Visualization

AI has great potential to improve our lives but also comes with risks if misused. It's important to balance optimism and caution.
Tools like Copilot in Power BI make it easier for users to analyze and visualize data by allowing them to communicate their needs in plain language.
The 'Curse of Dimensionality' shows that having too many features (dimensions) can confuse models instead of helping them make better predictions.
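
One way to see the curse of dimensionality for yourself: as the number of dimensions grows, the nearest and farthest random points end up at almost the same distance, so "closeness" carries less signal. The sketch below is a small experiment on uniform random data; the function name, point count, and seed are arbitrary choices for the illustration.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=0):
    """Relative gap between the farthest and nearest point from a random query."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 500):
    print(dim, round(distance_contrast(dim), 3))
```

The printed contrast shrinks as the dimension goes up, which is why distance-based methods tend to struggle on very wide datasets.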

Exploring the Purpose, Power & Potential of Small Language Models (SLMs)

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 59 implied HN points • 11 Mar 24

🕹 Technology AI Language Models Open Source Machine Learning Data science

Small Language Models (SLMs) can effectively handle specific tasks without needing to be large. They are more focused on doing certain jobs well rather than trying to be everything at once.
The Orca 2 model aims to enhance the reasoning abilities of smaller models, helping them outperform even bigger models when reasoning tasks are involved. This shows that size isn't everything.
Training with tailored synthetic data helps smaller models learn better strategies for different tasks. This makes them more efficient and useful in various applications.

Data Science Weekly - Issue 505

Data Science Weekly Newsletter • 199 implied HN points • 28 Jul 23

🕹 Technology Data science Machine Learning Artificial Intelligence Data Engineering Data Visualization

Large language models use complex methods like word vectors and transformers to understand language, but this can be explained simply without heavy math. They need a lot of data to perform well.
Using AI tools like ChatGPT for real-world programming tasks can streamline the coding process, as it allows for a more focused workflow without switching between different resources.
Building effective data storage systems, like Amazon S3, involves overcoming interesting challenges and nuances, demonstrating the amazing technology behind big data management.

The Tech Buffet #22: Why You Should Consider Weaviate As Your Ultimate Vector Database

The Tech Buffet • 39 implied HN points • 23 Apr 24

🕹 Technology Machine Learning Data science Software Development Database Management AI Applications

Weaviate is a powerful vector database that helps in creating advanced AI applications. It's useful for managing large amounts of data and performing semantic searches efficiently.
When working with Weaviate, you can easily load and index data, allowing for quick access to information. This makes it easier to build systems that need to handle a lot of data quickly.
Weaviate supports different search methods like vector search, keyword search, and hybrid search. This way, you can find the most relevant results based on your needs.

Data Science Weekly - Issue 489

Data Science Weekly Newsletter • 299 implied HN points • 06 Apr 23

🕹 Technology Data science Machine Learning Artificial Intelligence Data Visualization Data Engineering

Understanding linear programming can help solve complex problems using Python. It's useful in various fields and can optimize outcomes.
MLOps is closely related to data engineering, showing that managing data for machine learning involves more engineering than initially thought.
The new pandas 2.0 version has exciting features like the Apache Arrow backend, which will enhance its performance and capabilities.

Data Science Weekly - Issue 485

Data Science Weekly Newsletter • 319 implied HN points • 09 Mar 23

🕹 Technology Data science Machine Learning AI Data Engineering Data Visualization

The newsletter shares interesting links about data science, machine learning, and AI each week. It’s a good way to keep up with new trends and knowledge in the field.
There's a discussion on what databases should do but often don’t. Understanding these gaps can help you improve your data projects by knowing what to build yourself.
AI's impact on jobs and industries is being researched, especially how language models like ChatGPT could change certain occupations. It's important to understand how AI can affect your career choices.

Data Science Weekly - Issue 500

Data Science Weekly Newsletter • 219 implied HN points • 23 Jun 23

🕹 Technology Data science Artificial Intelligence Machine Learning Data Visualization Data Engineering

AI technology is advancing quickly and can even cover public meetings, but we need to think carefully about its readiness for everyday use.
Engineers can improve their people skills and interactions by applying the same problem-solving mindset they use in their technical work.
Generative AI is becoming important in data science for creating synthetic data, which helps in privacy and enhances analysis without losing useful information.

DR-RAG: Applying Dynamic Document Relevance To Question-Answering RAG

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 14 Jun 24

🕹 Technology AI Machine Learning NLP Data science

DR-RAG improves how we find information for question-answering by focusing on both highly relevant and less obvious documents. This helps to ensure we get accurate answers.
The process uses a two-step method: first, it retrieves the most relevant documents, then it connects those with other documents that might not be directly related but still help in forming the answer.
This method shows that we often need to look at many documents together to answer complex questions, instead of relying on just one document for all the needed information.
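
The two-step idea above can be sketched with a toy retriever. This is not the DR-RAG implementation: it swaps in plain word overlap for whatever relevance scoring the method actually uses, and the documents, query, stopword list, and function names are all invented for the illustration.

```python
STOPWORDS = {"the", "was", "of", "in", "a", "by", "where", "did", "her", "to"}

def tokens(text):
    """Lowercased content words of a text."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def overlap(a, b):
    """Number of content words two texts share."""
    return len(tokens(a) & tokens(b))

def two_stage_retrieve(query, docs, k1=1, k2=1):
    # Stage 1: documents that match the query directly.
    ranked = sorted(docs, key=lambda d: overlap(query, d), reverse=True)
    first, rest = ranked[:k1], ranked[k1:]
    # Stage 2: re-score the remaining documents against the query
    # combined with the stage-1 hits, to surface indirectly related ones.
    expanded = query + " " + " ".join(first)
    second = sorted(rest, key=lambda d: overlap(expanded, d), reverse=True)[:k2]
    return first + second

# Fictional corpus: answering the question needs two documents together.
docs = [
    "the novel silver harbor was written by mara quill",
    "bread dough needs time to rise",
    "mara quill grew up in eastvale",
]
query = "where did the author of silver harbor spend her childhood"

for doc in two_stage_retrieve(query, docs):
    print(doc)
```

The third document shares no content words with the query, so a single pass would not rank it above the unrelated one. It only surfaces once the first hit contributes the author's name.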

Revolutionizing Data Science: The Latest Trends in Automation, Experimentation, and Language Model Evaluation

Gradient Flow • 259 implied HN points • 26 Jan 23

🕹 Technology Data science Automation Experimentation Language Models

Developers need tools that help them pick models that fit their needs and understand model limitations, now that general-purpose models are widely used.
Data science teams are tackling automation, and early examples target aspects of projects like modeling and coding assistance, but further advancements are needed.
There's a shortage of research and tools for experimentation and optimization in data science, creating opportunities for entrepreneurs to deliver innovative solutions.

The Tech Buffet #16: Quickly Evaluate your RAG Without Manually Labeling Test Data

The Tech Buffet • 99 implied HN points • 18 Dec 23

🕹 Technology AI Programming Data science Automation Machine Learning

You can automate the testing of Retrieval-Augmented Generation (RAG) systems without needing to label data yourself. This makes it faster and easier to evaluate their performance.
Generating synthetic datasets with questions and answers allows you to test how well your RAG performs. This method helps you understand the effectiveness of your application and provides useful insights.
Using various metrics is key to evaluating your RAG accurately. This way, you assess different aspects of performance, ensuring you get a well-rounded view of how your system is doing.
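
A minimal sketch of the evaluation loop described above, under some loud assumptions: the question/source pairs are hand-written stand-ins for what an LLM would generate from each document, the retriever is a toy keyword matcher, and hit rate is just one of the metrics you might track. None of the names come from the post.

```python
# Hypothetical corpus, keyed by document id.
corpus = {
    "doc_a": "refunds are processed within five business days",
    "doc_b": "passwords must contain at least twelve characters",
    "doc_c": "support is available on weekdays from nine to five",
}

def keyword_retriever(question, k):
    """Return the ids of the k documents sharing the most words with the question."""
    q_words = set(question.lower().split())
    def score(doc_id):
        return len(q_words & set(corpus[doc_id].split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def hit_rate(retriever, qa_pairs, k=1):
    """Fraction of questions whose source document appears in the top k results."""
    hits = sum(1 for question, source_id in qa_pairs
               if source_id in retriever(question, k))
    return hits / len(qa_pairs)

# Stand-ins for synthetic questions, each tagged with the document it came from.
qa_pairs = [
    ("how long do refunds take", "doc_a"),
    ("how many characters must passwords contain", "doc_b"),
    ("when is support available", "doc_c"),
    ("what is the turnaround for getting money back", "doc_a"),
]

print(hit_rate(keyword_retriever, qa_pairs, k=1))
```

The last question paraphrases the refund policy without reusing its wording, so the keyword retriever misses it. Surfacing that kind of weakness is the point of running such a test set.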

Data Science Weekly - Issue 499

Data Science Weekly Newsletter • 219 implied HN points • 16 Jun 23

🕹 Technology Data science Machine Learning Artificial Intelligence Data Engineering NLP

Using large language models can help kids learn to ask curious questions by automating the teaching process.
New techniques for 3D space reconstruction can make indoor views on platforms like Google Maps look more realistic and interactive.
There's a growing need to understand the value of personal data in online shopping, especially as new regulations come into play.

E2- Basics of Large Language Models for Product Managers 🤖

The Product Channel By Sid Saladi • 16 implied HN points • 17 Nov 24

🕹 Technology AI Product Management Innovation Natural Language Processing Data science

Large language models (LLMs) are special AI systems that understand and generate human language. They can do things like summarize texts, translate languages, and even write code.
LLMs are changing many industries by powering chatbots, helping create content, and giving personalized product recommendations. This makes services smarter and more helpful.
Building custom LLMs requires a lot of money and data. Companies must invest millions and gather vast amounts of information to develop effective models.

The Tech Buffet #6: Why Your RAG is Not Reliable in Production

The Tech Buffet • 139 implied HN points • 10 Oct 23

🕹 Technology Machine Learning Software Development Artificial Intelligence Data science System Design

RAG systems can produce impressive results but require careful tuning to be reliable in real-world applications. Just copying and pasting code won't necessarily work for complex use cases.
Understanding the RAG framework is important, as it involves various components like data loaders, splitters, and embedding models. Each part plays a crucial role in generating accurate answers.
Using frameworks like LangChain can simplify the process of prototyping RAG systems, but they still need thoughtful configuration to function effectively in production.

Data Science Weekly - Issue 495

Data Science Weekly Newsletter • 239 implied HN points • 19 May 23

🕹 Technology Data science Machine Learning Artificial Intelligence Natural Language Software Development

Absence of evidence can often serve as strong evidence of absence, and this idea can be explored with Bayesian methods.
Natural language processing is being used to analyze global supply chains, helping create networks from news articles.
It's crucial to understand the unique challenges and opportunities in personalizing search results, as seen with Netflix's approach.
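
The first takeaway above follows directly from Bayes' rule. The sketch below, with made-up probabilities, shows how failing to see evidence lowers belief in a hypothesis, and that the drop is large only when the evidence would very likely have shown up if the hypothesis were true. The function and parameter names are illustrative.

```python
def posterior_after_no_evidence(prior, p_evidence_if_true, p_evidence_if_false):
    """P(hypothesis | evidence NOT observed), by Bayes' rule."""
    miss_if_true = 1 - p_evidence_if_true
    miss_if_false = 1 - p_evidence_if_false
    numerator = miss_if_true * prior
    return numerator / (numerator + miss_if_false * (1 - prior))

# A sensitive check: evidence appears 95% of the time when the hypothesis holds.
strong = posterior_after_no_evidence(0.5, 0.95, 0.05)
# A weak check: evidence appears only 10% of the time even when it holds.
weak = posterior_after_no_evidence(0.5, 0.10, 0.05)

print(f"{strong:.3f} {weak:.3f}")
```

Starting from a 50% prior, a miss on the sensitive check drops the belief to about 5%, while a miss on the weak check barely moves it (about 49%).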

What To Do When You're Stuck At A Business That Doesn't Care About Data Science

High ROI Data Science • 357 implied HN points • 27 Feb 23

💼 Business Data science Leadership Technical Debt Strategic Planning

Many data scientists in companies that don't prioritize data science end up doing basic reporting and analytics.
Technical management in such companies often lacks the understanding and incentives to support data initiatives.
Navigating a lack of data culture and strategy in a company requires significant effort but can lead to valuable career opportunities.

Data Science Weekly - Issue 498

Data Science Weekly Newsletter • 219 implied HN points • 09 Jun 23

🕹 Technology Data science Machine Learning AI Big Data Data Visualization

Data modeling in data science is complex and often messy, making it hard to get reliable answers. This issue highlights the need for better practices and understanding in this area.
There are ongoing discussions about the realities of working in data science. Sharing these experiences can help others prepare for the challenges they may face.
Generative AI is a big topic right now, and there are frameworks being developed to help organizations strategize its use effectively. Exploring these can guide businesses in adopting AI responsibly.

Smarter Agents, Self-Aware LLMs, and Knowledge from Videos

HackerPulse Dispatch • 2 implied HN points • 24 Jan 25

🕹 Technology AI Machine Learning Data science Video Processing Software Engineering

New techniques can shrink the size of data storage without losing accuracy, which helps in finding information faster.
Language models are getting better at learning from their own mistakes, making them smarter and more self-aware.
AI can now learn complex skills just by watching videos, which shows that reading text isn't always necessary for advanced learning.

Is Step Back Prompting The Best Prompting Strategy?

Aziz et al. Paper Summaries • 59 implied HN points • 20 Mar 24

🕹 Technology AI Machine Learning Data science Computing Engineering

Step Back Prompting helps models think about big ideas before answering questions. This method shows better results than other prompting techniques.
Even with Step Back Prompting, models still find it tricky to put all their reasoning together. Many errors come from the final reasoning step, which can be complicated.
Not every question works well with Step Back Prompting. Some questions need quick, specific answers instead of a longer thought process.

Data Science Weekly - Issue 488

Data Science Weekly Newsletter • 279 implied HN points • 30 Mar 23

🕹 Technology Data science Machine Learning Artificial Intelligence Software Development Data Visualization

This week's newsletter features discussions on AI and its potential risks, highlighting different viewpoints on the future of technology.
Career development in data science is important. There are resources and talks from experts that focus on skills that help you succeed in this field.
New updates in the Tidyverse can improve your coding experience in data science, making it easier and more efficient to work with data.

Tree Of Thoughts Prompting (ToT)

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 11 Jun 24

🕹 Technology AI Machine Learning Natural Language Processing Data science Programming

Tree of Thoughts (ToT) is a new way to solve complex problems with language models by exploring multiple ideas instead of just one.
It breaks down problems into smaller 'thoughts' and evaluates different paths, similar to how humans think through problems.
ToT allows models to understand not just the solution but also the reasoning behind it, making decision-making more deliberate.

Alignment in AI: Key to Safe and Beneficial Systems

Gradient Flow • 199 implied HN points • 23 Mar 23

🕹 Technology AI Machine Learning Data science Ethics Research

Alignment in AI is crucial to ensure that AI systems behave in beneficial and secure ways by aligning goals with human values and objectives.
To start aligning AI systems effectively, teams can use methodologies like human-in-the-loop testing, adversarial training, model interpretability, and value alignment algorithms.
Emphasizing alignment early on in AI development can help teams avoid ethical and legal issues and build trust with stakeholders and users by formalizing existing practices and expanding alignment tools.

Catalog of Catalogs

davidj.substack • 59 implied HN points • 14 Nov 24

🕹 Technology Data science Software Development Information Systems Data Engineering Cloud Computing

Data tools create metadata, which is important for understanding what's happening in data management. Every tool involved in data processing generates information about itself, making it a catalog.
Not all catalogs are for people. Some are meant for systems to optimize data processing and querying. These system catalogs help improve efficiency behind the scenes.
To make data more accessible, catalogs should be integrated into the tools users already work with. This way, data engineers and analysts can easily find the information they need without getting overwhelmed by unnecessary data.

OpenAI Deep Research Explains Itself

From the New World • 26 implied HN points • 06 Feb 25

🕹 Technology AI Hardware Software Data science Machine Learning

AI hardware has evolved significantly, from early specialized chips to powerful GPUs and TPUs. These advancements make training AI models much faster and more efficient.
The design of algorithms, especially with transformers, has greatly improved AI's ability to understand and generate language. These models can now learn complex patterns that were hard to capture before.
Building and maintaining large AI systems requires careful planning and practices. Companies need efficient workflows and monitoring systems to manage data, hardware, and software effectively.

OpenAI's o-1 and inference-time scaling laws

Tanay’s Newsletter • 63 implied HN points • 28 Oct 24

🕹 Technology AI Models Machine Learning Computing Data science Tech Trends

OpenAI's o-1 model shows that giving AI more time to think can really improve its reasoning skills. This means that performance can go up just by allowing the model to process information longer during use.
The focus in AI development is shifting from just making models bigger to optimizing how they think at the time of use. This could save costs and make it easier to use AI in real-life situations.
With better reasoning abilities, AI can tackle more complex problems. This gives it a chance to solve tasks that were previously too difficult, which might open up many new opportunities.

Latent Reasoning, 3D Colorization, and the Limits of RL

HackerPulse Dispatch • 8 implied HN points • 13 Dec 24

🕹 Technology AI Machine Learning Computer Vision Reinforcement Learning Data science

COCONUT is a new method that lets language models think in flexible ways, making them better at solving complex problems. It does this by using continuous latent spaces instead of just words.
ChromaDistill offers a smart way to add color to 3D images efficiently. It lets you view these scenes consistently from different angles without slowing things down.
Recent research shows that top AI models can be deceptive and plan strategically, which raises important safety concerns. There’s also a new approach to testing AI limits in a friendly, curiosity-driven way.

Google Correlate alternative: Similarity search of Wikipedia Pageview Statistics in Python

Franz likes to code • 1 HN point • 16 Sep 24

🕹 Technology Programming Data science Machine Learning Web Development

Google Correlate was a tool for finding related search patterns, similar to Google Trends, but it was shut down in 2019.
You can create a personal alternative using publicly available data, like Wikipedia page views, by scraping and analyzing it with Python.
Using methods like similarity searches and cosine distance, you can identify articles that have similar view patterns to a given topic.
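
The similarity step above can be sketched in a few lines. The weekly view counts below are invented (no scraping is shown), and the helper names are hypothetical; the post's own pipeline may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric series."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(target, series_by_title):
    """Rank every other article by how closely its view pattern matches the target's."""
    scores = [(title, cosine_similarity(series_by_title[target], series))
              for title, series in series_by_title.items() if title != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up weekly pageview counts for three articles.
views = {
    "Sunscreen": [10, 20, 80, 90, 30, 10],
    "Ice cream": [12, 25, 70, 95, 35, 8],
    "Snow shovel": [90, 60, 10, 5, 40, 85],
}

for title, score in most_similar("Sunscreen", views):
    print(title, round(score, 3))
```

Cosine distance is simply one minus this similarity. With real data you would likely normalize or center each series first so that overall popularity does not dominate the comparison.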

AI is Racing Forward – on a Very Long Road

Am I Stronger Yet? • 15 implied HN points • 12 Nov 24

🕹 Technology Artificial Intelligence Machine Learning Automation Software Development Data science

AI is making rapid progress, but it is not close to achieving artificial general intelligence (AGI). Many tasks still require human capabilities, showing that there is still a long way to go.
Current AIs excel at specific tasks but struggle with complex, nuanced tasks that require extensive context or emotional intelligence, like managing a classroom or writing a novel.
While there are exciting advancements happening with AI, the journey towards true intelligence is more like crossing a vast ocean than a quick sprint, suggesting that there are many challenges ahead.

Edge 447: Not All Model Distillations are Created Equal

TheSequence • 49 implied HN points • 12 Nov 24

🕹 Technology Machine Learning Artificial Intelligence Data science Software Development Algorithms

There are different types of model distillation that help create smaller, more efficient AI models. Understanding these types can help in choosing the right method for specific tasks.
The three main types of model distillation are response-based, feature-based, and relation-based. Each has its own strengths and can be used depending on what you need from the model.
Response-based distillation is usually the easiest to implement. It trains the student model to respond to the same inputs the way the teacher model does.

💥 Tech Talks Weekly #20 (React Summit, JSNation, PyCon Sweden, PyData London, Voxxed Days Trieste, Spring I/O, CITYJS, ...)

Tech Talks Weekly • 19 implied HN points • 28 Jun 24

🕹 Technology Software Development Conferences Programming Languages Web Development Data science

The Tech Talks Weekly shares new tech conference talks each week, so you can catch up on the latest ideas without scrolling through messy video lists.
This week features talks from major events like the React Summit and PyCon, covering a variety of topics in programming and tech.
You can help grow the Tech Talks community by sharing it with friends and filling out a short form to provide feedback.

E1- Introduction to General AI for Product Managers 🤖

The Product Channel By Sid Saladi • 16 implied HN points • 10 Nov 24

🕹 Technology AI Product Management Data science Machine Learning Software Development

AI is changing how products are made and used. Product managers need to understand AI to stay ahead in their industry.
There are many AI applications, like chatbots and recommendation systems, that can improve user experience. Learning about these tools can help product managers create better products.
While AI has benefits, it also brings risks like bias and job losses. It's important for product managers to think about these issues and apply AI responsibly.

FIT-RAG: Are RAG Architectures Settling On A Standardised Approach?

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 39 implied HN points • 02 Apr 24

🕹 Technology AI Architecture Data science Language Models Machine Learning

As RAG systems evolve, they are integrating more smart features to enhance their effectiveness. This means they are not just providing basic responses but are becoming more advanced and adaptable.
The challenges with RAG include static rules for retrieving data and the problem of excessive tokens during processing. These issues can slow down performance and reduce efficiency.
FIT-RAG is addressing these challenges with new tools, like a special document scorer and token reduction strategies, to improve how information is retrieved and used. This helps RAG systems provide better answers while using fewer resources.

Friend Recommendation Retrieval in a social network

Recommender systems • 43 implied HN points • 24 Nov 24

🕹 Technology Machine Learning AI Models Social Networks Data science

Friend recommendation systems use connections like 'friends of friends' to suggest new friends. This is a common way to make sure suggestions are relevant.
Two Tower models are a new approach that enhances friend recommendations by learning from user interactions and focusing on the most meaningful connections.
Using methods like weighted paths and embeddings can improve recommendation accuracy. These techniques help to understand user relationships better and avoid common pitfalls in recommendations.
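
The 'friends of friends' baseline from the first takeaway can be sketched as a count of mutual friends. The graph, names, and function below are invented for the illustration; production systems like the Two Tower approach mentioned above are considerably more involved.

```python
from collections import Counter

def recommend_friends(graph, user, top_n=3):
    """Suggest non-friends of `user`, ranked by number of mutual friends."""
    direct = graph[user]
    mutual_counts = Counter()
    for friend in direct:
        for candidate in graph[friend]:
            # Skip the user themselves and people they already know.
            if candidate != user and candidate not in direct:
                mutual_counts[candidate] += 1
    return mutual_counts.most_common(top_n)

# Hypothetical undirected friendship graph as adjacency sets.
graph = {
    "ana": {"ben", "cy"},
    "ben": {"ana", "dee", "eli"},
    "cy": {"ana", "dee"},
    "dee": {"ben", "cy"},
    "eli": {"ben"},
}

print(recommend_friends(graph, "ana"))
```

Here "dee" ranks first because she shares two friends with "ana", while "eli" shares one. Weighting each path, for example by how active the connecting friendship is, is one way to refine this count.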

Weekly Top Picks #67

The Algorithmic Bridge • 116 implied HN points • 18 Mar 24

🕹 Technology AI Tech news Artificial Intelligence Software Engineering Data science

The post discusses the Nvidia GTC keynote, BaaS in science, Apple's potential collaboration with Google Gemini, and more key AI topics of the week.
It features a conversation between Sam Altman and Lex Fridman, touches on jobs in the AI era, and examines the NYT's response to OpenAI.
There's a question about whether OpenAI's Sora model is trained using YouTube videos, among other intriguing topics.

Data Science Weekly - Issue 501

Data Science Weekly Newsletter • 179 implied HN points • 30 Jun 23

🕹 Technology Data science Machine Learning Artificial Intelligence Data Engineering Data Visualization

Data scientists are sharing tips on how to make their scientific data more accessible and useful. This helps others to understand and use the data better.
There are many discussions happening about the benefits and drawbacks of large language models (LLMs) like ChatGPT. Some people believe they are amazing, while others think they aren't very helpful.
Naming things in programming can be tough, but there are resources and books that can help. Learning the right naming conventions can improve coding practices.

Data Science Weekly - Issue 497

Data Science Weekly Newsletter • 199 implied HN points • 02 Jun 23

🕹 Technology Data science Machine Learning Artificial Intelligence Software Engineering Statistical Analysis

Data drift doesn't always hurt model performance, so it's important to analyze the context before reacting to it.
Work on solving bigger problems as you grow in your career, instead of waiting for difficult tasks to be handed to you.
To improve a model's reasoning skills, reward it for each correct step in problem-solving, not just the final answer.