The hottest Data science Substack posts right now

And their main takeaways

Data Science Weekly - Issue 491

Data Science Weekly Newsletter • 419 implied HN points • 21 Apr 23

AI academics are facing challenges keeping up with private sector investments. It's important for them to find survival strategies to remain competitive.
There are ongoing discussions about the rapid progress in machine learning and how it can be overwhelming for developers. Many are sharing thoughts on adapting to this fast-paced change.
Visualizing neural networks properly can help clarify concepts. There is a push for better diagrams to avoid confusion in understanding how these networks function.

RAG, Hallucination & Structure: Research By ServiceNow

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 59 implied HN points • 18 Apr 24

🕹 Technology AI Machine Learning Data science Natural Language Software Development

ServiceNow is using a method called Retrieval-Augmented Generation (RAG) to help transform user requests in natural language into structured workflows. This aims to improve how easily users can create workflows without needing deep technical knowledge.
By using RAG, they want to reduce 'hallucination', which is when AI generates wrong or irrelevant info, and make the AI more reliable. This is important for gaining user trust in AI systems.
The study also suggests future improvements, like changing output formats for efficiency and streamlining processes so that users can see steps one at a time, making it easier to follow along.

The Data-Conscious Software Engineer

Data Products • 3 implied HN points • 28 Jan 25

🕹 Technology Data science Software Engineering Data Management Data products Machine Learning

Data teams need to learn best practices from software engineering, but that's not enough. They also need engineers who understand how data works and can work well with them.
Collaboration between data teams and software engineers is really important for success. If they don't communicate well, they can struggle to implement necessary changes and solve issues together.
The idea of a 'data-conscious software engineer' is becoming essential. These engineers understand the value of data and can help improve how both teams work together, making both sides more efficient.

DataFrame

davidj.substack • 35 implied HN points • 20 Feb 25

🕹 Technology Data science Machine Learning Programming Cloud Computing Open Source

Polars Cloud allows for scaling across multiple machines, making it easier to handle large datasets than using just a single machine. This helps in processing data faster and more efficiently.
Polars is simpler to use compared to Pandas and often performs better, especially when transforming data for machine learning tasks. It supports familiar methods that many users already know.
Unlike SQL, which runs well on cloud services, using Pandas and R for large-scale transformations has been challenging. The new Polars Cloud aims to bridge this gap, providing more scalable solutions.

Evaluating The Quality Of RAG & Long-Context LLM Output

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 08 Jul 24

🕹 Technology AI NLP Machine Learning Data science Automation

Evaluating the performance of RAG and long-context LLMs is tough because there isn't a common task to compare them on. This makes it hard to know which system works better.
Salesforce created a new way to test these models called SummHay, where they summarize information from large text collections. The results show that even the best models struggle to match human performance.
RAG systems generally do better at citing sources, while long-context LLMs might capture insights more thoroughly but have citation issues. Choosing between them involves trade-offs.

Results from poll #4

The Counterfactual • 39 implied HN points • 21 May 24

🕹 Technology AI Research Machine Learning Data science Language Models Tech Trends

The recent poll found that two topics, an explainer on interpretability and a guide to becoming an LLM-ologist, were equally popular among voters.
The plan is to write about both topics in the coming months, keeping the content varied as usual.
Two new papers were published this month, one on multimodal LLMs and another on Korean language models, highlighting ongoing research in these areas.

Singh & Sins of AI 💦

Sector 6 | The Newsletter of AIM • 99 implied HN points • 13 Feb 24

🕹 Technology AI Open Source Software Data science Machine Learning

The Indian AI scene is growing, with many new language models being developed based on Meta's Llama 2. This shows a collaborative spirit in the open-source community.
There are specific models being made for different Indian languages like Kannada, Telugu, Odia, and Tamil. These models help in making AI more accessible to people speaking these languages.
There is a strong need for India to create its own unique open-source AI model. This would allow other developers to build on it rather than relying on external sources.

BI is not ready for AI

HyperArc • 3 HN points • 06 Sep 24

🕹 Technology AI Data science Machine Learning Business Intelligence Software Development

Business Intelligence (BI) needs both good models and great data to be effective with AI. Without quality data, AI can't really show its true power.
Many BI tools only focus on successful outcomes, like specific metrics, while ignoring the complete journey of discovery. This limited data can lead to missing important insights.
To improve AI's effectiveness in BI, we should include a wider range of experiences and exploration paths, not just successful queries. This fuller picture can help create better AI training sets.

Data Science Weekly - Issue 492

Data Science Weekly Newsletter • 379 implied HN points • 28 Apr 23

🕹 Technology Data science Machine Learning Artificial Intelligence Data Visualization Software Engineering

There is a new Slack community for paid subscribers focused on learning new tools and techniques in data science and career growth. It's a good place for support and sharing information.
A/B testing is important for experiments and there are recommended resources to help design and run successful tests. Proper planning and communication are key to making A/B testing effective.
Large Language Models (LLMs) are becoming more useful, and several resources are available for learning how to work with them. Understanding how they operate can help create valuable applications.

Complete Summary of Absolute, Relative and Rotary Position Embeddings!

Aziz et al. Paper Summaries • 79 implied HN points • 31 Mar 24

🕹 Technology Artificial Intelligence Machine Learning Natural Language Processing Computing Data science

Transformers can't understand the order of words, so position embeddings are used to give them that context.
Absolute embeddings assign a unique value to each word position, but they struggle with positions beyond those seen during training.
Relative embeddings encode the distance between words, which makes the model aware of how words relate to each other, but they can slow down training and inference.
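
To make the two ideas above concrete, here is a minimal sketch. It assumes the sinusoidal scheme as one common form of absolute embedding and clipped signed offsets as one common form of relative embedding; neither choice is confirmed by the summarized post, and the function names are hypothetical.

```python
import math

def sinusoidal_position(pos, dim):
    """Absolute scheme (assumed sinusoidal): one fixed vector per position index."""
    vec = []
    for i in range(dim):
        # Pairs of dimensions share a frequency; even slots use sin, odd slots use cos.
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def relative_offsets(seq_len, max_distance):
    """Relative scheme: entry [i][j] is the signed distance j - i, clipped to a window."""
    return [
        [max(-max_distance, min(max_distance, j - i)) for j in range(seq_len)]
        for i in range(seq_len)
    ]

if __name__ == "__main__":
    print(sinusoidal_position(0, 4))
    print(relative_offsets(4, 2))
```

The relative table only depends on how far apart two tokens are, so the same entries are reused at any sequence length, whereas the absolute vector is tied to a specific index.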

Robotics is Inching Towards its ChatGPT Moment

TheSequence • 84 implied HN points • 03 Nov 24

🕹 Technology AI Robotics Innovation Data science Machine Learning

Robots are getting smarter with new tech, especially using large language models, which help them learn and do tasks better.
MIT's new technique helps robots understand different types of data, making them more capable and efficient in their work.
There’s a big push for robots to interact more naturally with humans, like being able to feel and handle objects carefully, which can improve everyday tasks.

How AI generates images, visually explained 🎨

Year 2049 • 4 implied HN points • 20 Jan 25

🕹 Technology AI Computing Innovation Digital Media Data science

AI creates images using a process called diffusion. This means it starts with random noise and turns it into a clear image step by step.
Understanding how AI generates images helps demystify some of the technology behind AI and art. It's cool to see how computers can make creative expressions!
Learning about AI can open up more conversations about its impact on our everyday lives and the future of creativity. It's important to think about both the benefits and challenges.
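
The step-by-step move from noise to image can be sketched on a tiny vector instead of a picture. This is only an illustration under loud assumptions: it uses a DDPM-style linear noise schedule and a deterministic DDIM-style update, and it replaces the trained noise-prediction network with an oracle that already knows the noise. None of these details come from the summarized post.

```python
import math
import random

def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction per step for a linear schedule (num_steps >= 2)."""
    bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars

def add_noise(x0, eps, a_bar):
    """Forward process: blend the clean signal with Gaussian noise."""
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]

def denoise(x_t, predict_noise, bars):
    """Reverse process: walk from the noisiest step back to step 0."""
    x = list(x_t)
    for t in range(len(bars) - 1, -1, -1):
        eps_hat = predict_noise(x, t)
        a_t = bars[t]
        a_prev = bars[t - 1] if t > 0 else 1.0
        # Estimate the clean signal, then re-noise it to the previous, lighter level.
        x0_hat = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps_hat)]
        x = [math.sqrt(a_prev) * c + math.sqrt(1.0 - a_prev) * e for c, e in zip(x0_hat, eps_hat)]
    return x

if __name__ == "__main__":
    rng = random.Random(0)
    clean = [0.5, -1.0, 0.25, 2.0]          # stand-in for pixel values
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    bars = alpha_bars(1000)
    noisy = add_noise(clean, noise, bars[-1])
    # Oracle predictor (assumption): a real system would use a trained network here.
    restored = denoise(noisy, lambda x, t: noise, bars)
    print(noisy, restored)
```

In a real image model the oracle is the part that has to be learned, which is where nearly all of the training cost goes.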

What's a vector database?

Technically • 34 implied HN points • 21 Oct 24

🕹 Technology AI Data science Machine Learning Software Development Databases

A vector database is specialized storage for data used in AI. It stores embeddings, which are lists of numbers that represent information such as text or images.
To make AI models smarter, they need to use unique data from companies. This helps tailor responses and improve accuracy.
There are ways to enhance AI models with unique data, like fine-tuning them or using a method called Retrieval Augmented Generation (RAG) to include important information in prompts.
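
A toy version of the lookup a vector database performs, followed by the RAG step of placing the retrieved text in a prompt, might look like the sketch below. The class, the function names, and the hand-written three-dimensional vectors are all hypothetical; a real system would get its vectors from an embedding model and would use an approximate index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors by angle, ignoring their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class TinyVectorStore:
    """In-memory stand-in for a vector database (brute-force search)."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self.items.append((text, vector))

    def search(self, query_vector, k=1):
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    """RAG step: prepend retrieved passages to the user's question."""
    context = "\n".join("- " + p for p in passages)
    return "Context:\n" + context + "\n\nQuestion: " + question

if __name__ == "__main__":
    store = TinyVectorStore()
    store.add("Refunds are processed within 5 days.", [1.0, 0.0, 0.0])
    store.add("Refund requests need an order number.", [0.9, 0.1, 0.0])
    store.add("Our office is closed on Sundays.", [0.0, 0.0, 1.0])
    hits = store.search([1.0, 0.0, 0.05], k=2)
    print(build_prompt("How long do refunds take?", hits))
```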

Improve Conversational UIs Using Social Intelligence

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 59 implied HN points • 09 Apr 24

🕹 Technology Artificial Intelligence Natural Language User Interface Machine Learning Data science

Social intelligence is important for conversational AIs to feel more human-like. It helps them understand emotions and social cues better.
A good conversational UI needs to consider cognitive, situational, and behavioral intelligence. This means the AI should know what you mean, the context of your words, and how to interact appropriately.
Using more data and different types of information beyond just words can help improve how AIs communicate. This could include things like images and gestures to understand conversations better.

Gambling with language models

Rain Clouds • 51 implied HN points • 31 Dec 24

🕹 Technology Machine Learning Cloud Computing Data science Financial Analysis Investing

Using AI models such as ModernBERT can help predict which stocks might perform better based on financial reports and market data. This means you can get insights without needing to be a finance expert.
The project combines cloud computing with machine learning, making it easier to process large amounts of financial data quickly. This is important for anyone looking to analyze stocks more efficiently.
While the model can make predictions, it's important to remember that investing in stocks always carries risks. Just because a model suggests a stock might do well, it doesn't guarantee success.

OpenAI Agent Query Planning Using LlamaIndex

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 99 implied HN points • 05 Feb 24

🕹 Technology AI Software Machine Learning Data science Programming

An OpenAI agent can analyze information from multiple documents at once. This helps create detailed answers to queries based on several sources.
Using the LlamaIndex framework, you can easily set up a system to manage and query PDF documents. This makes finding specific data more efficient.
The agent can summarize financial data, showing how companies like Uber grow revenue over time. This is helpful for understanding trends in business performance.

Data Science Weekly - Issue 484

Data Science Weekly Newsletter • 439 implied HN points • 02 Mar 23

🕹 Technology Data science Machine Learning Artificial Intelligence Software Development Cloud Computing

Data scientists need the right tools and environment to do their jobs effectively. Organizations can help by improving their data science infrastructure.
Understanding how to choose and advocate for important metrics is vital for product teams. This can lead to significant growth in user engagement.
A/B testing is crucial in fraud detection to compare models and determine their effectiveness. It can provide valuable insights that improve model performance.

Data Science Weekly - Issue 490

Data Science Weekly Newsletter • 379 implied HN points • 13 Apr 23

🕹 Technology Data science Machine Learning Artificial Intelligence Data Engineering Software Development

Data science is evolving quickly, and many new tools and techniques are being developed. This opens up exciting job opportunities in various fields like AI and machine learning.
Using programming languages like R and SQL can extend beyond traditional data analysis. They can be powerful tools for creative applications in data science.
Learning and implementing good practices in software development, such as automating tests and improving code efficiency, can save time and resources in data science projects.

2023 Kaggle AI Report

Bojan’s Newsletter • 196 implied HN points • 10 Oct 23

🕹 Technology Data science Machine Learning AI Research Competitions

Kaggle is a valuable platform for data science and ML career development
Kaggle solutions often offer innovative insights ahead of research and industry trends
Tabular data ML remains an important area in the field of machine learning

Who fires a gun at a school? Data science provides eight archetypes of shooters

School Shooting Data Analysis and Reports • 39 implied HN points • 13 May 24

🇺🇸 U.S. Politics Data science School Shootings

Data science can create archetypes to understand different behaviors, like predicting customer preferences or identifying school shooter profiles.
Using data analysis, it's possible to categorize and plan for different scenarios of school shooters based on past incidents.
The first school shooter archetype is 'The Adolescent Insider,' comprising attributes like age, gender, victim count, typical outcomes, and likely circumstances.

The Counterfactual's poll #3

The Counterfactual • 59 implied HN points • 04 Apr 24

🕹 Technology Artificial Intelligence Language Models Data science Human-computer interaction

In April, readers can vote on research topics for the next article, making it a collaborative effort. This way, subscribers influence the content that gets created.
Past topics have focused on empirical studies involving large language models and the readability of texts. This shows a trend toward practical investigations in the field.
One of the proposed topics is about how language models might respond differently based on the month, which can lead to fun and insightful experiments.

sqlmesh init duckdb

davidj.substack • 71 implied HN points • 03 Dec 24

🕹 Technology Data science Software Development APIs Analytics Cloud Computing

There's a new public repository called bluesky-data where people can collaborate and follow along with its development. It's easy to get started by setting it up on your local machine.
Using sqlmesh with the Bluesky data can provide real-time data availability, while also allowing for a more complete view of the data in a batch processing style. This means you can get both immediate updates and historical data.
It's better to start with dlt and then initialize sqlmesh within that project. This way, you can efficiently manage large datasets without needing to compute everything each time.

The Sequence Radar #486 : The Amazing AlphaGeometry2 Now Achieved Gold Medalist in Math Olympiads

TheSequence • 28 implied HN points • 09 Feb 25

🕹 Technology AI Machine Learning Data science Research Software Development

AlphaGeometry2 has become a top performer in solving geometry problems, even surpassing human math Olympiad gold medalists. It can handle tough geometry concepts and has a better understanding of different math problems compared to its predecessor.
The latest improvements in AlphaGeometry2 include an enhanced symbolic engine and a wider range of mathematical language features. This allows it to solve more complex geometry problems efficiently.
AI is getting closer to matching or even exceeding human capabilities in competitive mathematics. This success in geometry could lead to similar advancements in other scientific fields like physics and chemistry.

Week #2: Intuition Behind Conformal Prediction

Mindful Modeler • 379 implied HN points • 27 Dec 22

🔬 Science Data science Machine Learning Statistics Training Model Evaluation

Conformal prediction for classification works by ordering predictions from certain to uncertain, dividing them based on a user-defined confidence level.
Conformal prediction consists of three main steps: training, calibration, and prediction, following a similar recipe across different algorithms.
Different resampling strategies like k-fold cross-splitting and jackknife are used in conformal prediction, offering a balance between computation cost and prediction accuracy.
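
The calibration and prediction steps can be sketched as follows, assuming the simplest split-conformal recipe with the score "1 minus the probability of the true class". The training step is taken as already done, so the probabilities below are made-up outputs of a hypothetical classifier, and the function names are illustrative only.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibration: pick the score cutoff that covers about 1 - alpha of held-out cases."""
    # Nonconformity score: low when the model was confident in the true class.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: every class must be included
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """Prediction: keep every class whose score falls under the cutoff."""
    return [cls for cls, p in enumerate(probs) if 1.0 - p <= threshold]

if __name__ == "__main__":
    cal_probs = [
        [0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7], [0.6, 0.3, 0.1],
        [0.25, 0.5, 0.25], [0.3, 0.3, 0.4], [0.3, 0.5, 0.2],
    ]
    cal_labels = [0, 1, 2, 0, 1, 2, 0]
    cutoff = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
    print(prediction_set([0.85, 0.10, 0.05], cutoff))  # confident case
    print(prediction_set([0.45, 0.45, 0.10], cutoff))  # ambiguous case
```

A confident prediction yields a set with one class, while an ambiguous one yields several, which is how the method expresses uncertainty.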

Proxy Fine-Tuning LLMs

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 79 implied HN points • 26 Feb 24

🕹 Technology AI Machine Learning Data science Software Development

Proxy fine-tuning lets you improve a language model's performance without changing its internal settings. It only uses the model's output to make adjustments.
Combining different approaches, like retrieval and fine-tuning, can lead to better results with language models. It's about using the best methods together instead of relying on just one.
Using proxy fine-tuning can help organizations better understand and organize their data. It encourages them to explore their information needs more deeply.
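
One way to adjust a model using only its outputs is to shift its next-token logits by the difference between a small tuned model and the same small model untuned. The sketch below assumes that logit-offset formulation is what the post describes; the function names and the three-token vocabulary are hypothetical, and all three models are assumed to share a vocabulary.

```python
import math

def proxy_tuned_logits(base_large, tuned_small, base_small):
    """Apply the small model's tuning shift to the large model's logits."""
    return [b + (t - s) for b, t, s in zip(base_large, tuned_small, base_small)]

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    large = [2.0, 1.0, 0.0]        # large model, weights untouched
    small_tuned = [0.0, 3.0, 0.0]  # small model after fine-tuning
    small_base = [0.0, 1.0, 0.0]   # same small model before fine-tuning
    steered = proxy_tuned_logits(large, small_tuned, small_base)
    print(softmax(large))
    print(softmax(steered))
```

Only the two small models need to be trained or stored in tuned form; the large model is queried for logits and never modified.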

Data Science Weekly - Issue 494

Data Science Weekly Newsletter • 319 implied HN points • 12 May 23

🕹 Technology Data science Machine Learning Artificial Intelligence Data Visualization Engineering

Open source AI is rapidly advancing, but may always lag behind the best quality models. It's great for innovation but has its limits.
Many academic papers promise data sharing but often fail to deliver, which can hinder scientific research and verification.
Understanding how to craft effective prompts is essential when using generative AI tools. This skill can greatly enhance the results you get from those tools.

Data Science Weekly - Issue 504

Data Science Weekly Newsletter • 239 implied HN points • 21 Jul 23

🕹 Technology Data science Artificial Intelligence Machine Learning Data Engineering Data Visualization

AI companies are complicated and must consider many factors like research, funding, and competition. Understanding these can help predict how they might evolve in the future.
Debriefs, or team discussions after projects, can greatly boost team performance. They help everyone learn from experiences and improve future collaboration.
New research shows that specific ingredient pairings in food can be explained by flavor networks. This indicates there are universal patterns in how different foods complement each other.

Data Science Weekly - Issue 493

Data Science Weekly Newsletter • 319 implied HN points • 05 May 23

🕹 Technology Data science Machine Learning Artificial Intelligence Data Visualization Engineering

Data scientists often lack key skills needed for the job, which can be frustrating for those hiring. It's important for data scientists to continually improve their skills and adapt to job requirements.
There's a significant increase in data downtime and resolution times, signaling that overall data quality management needs improvement. Companies should focus on better data practices to enhance their operations.
New programming languages, like Mojo, are emerging that aim to simplify coding and enhance user experience. These advancements can make programming more accessible and enjoyable for everyone.

sqlmesh cube_generate

davidj.substack • 59 implied HN points • 16 Dec 24

🕹 Technology Software Development Data science

Building integrations can seem tough, but understanding the metadata available can simplify the process. It's important to leverage existing tools to create new functionalities efficiently.
Trying out new ideas, even if they might fail, is essential for learning and discovering possibilities. Taking small steps can help you manage potential setbacks.
Creating a command to generate projects based on existing data models can streamline workflows. It allows for easier implementation of complex data relationships when set up correctly.

The Sequence Chat: The Transition that Changes Everything. From Pretraining to Post-Training in Foundation Models

TheSequence • 56 implied HN points • 04 Dec 24

🕹 Technology AI Machine Learning Computing Innovation Data science

The transition from pretraining to post-training in AI models is a big deal. This change helps improve how AI can reason and learn from data.
New models like DeepSeek's R1 and Alibaba's QwQ are now using this transition to become smarter and more effective. They can solve complex problems better than before.
The shift is moving away from old methods like reinforcement learning with human feedback. Instead, there are new ways being developed that promise to make AI work even better.

Enterprises Need RAG, Not Fine-Tuning.

Sector 6 | The Newsletter of AIM • 19 implied HN points • 26 Jun 24

🕹 Technology AI Machine Learning Data science Software Development Information Systems

Retrieval Augmented Generation (RAG) is more effective than fine-tuning for enterprises. It connects to external data sources, making it easier to get accurate information.
Using RAG helps reduce hallucinations in language models, which means the outputs are more reliable and trustworthy.
Enterprises can maintain better control over their information by using RAG, ensuring relevant and precise responses.

Edge 449: Getting Into Adversarial Distillation

TheSequence • 63 implied HN points • 19 Nov 24

🕹 Technology Artificial Intelligence Machine Learning Data science Software Development

Adversarial distillation is a new model training method inspired by generative adversarial networks (GANs). It uses a setup where one part generates data and another part tries to tell if it's real or fake.
This method helps improve knowledge transfer in models by combining typical distillation techniques with adversarial training. It's like guiding a student while testing their understanding.
The process involves a generator that creates synthetic samples and a discriminator that distinguishes these samples from real ones, making learning more effective.

Unlocking Agricultural Potential: The Impact of Satellite Data in Modern Farming

Space Ambition • 199 implied HN points • 14 Jul 23

🕹 Technology Space Tech Data science Sustainability

Satellite data can greatly help farmers by improving crop yields and monitoring crop health. This information allows for better planning and decision-making in farming.
Using space data can lead to more sustainable farming practices. Farmers can track things like carbon storage and soil health, which helps protect the environment.
The use of satellite imagery is still new in agriculture, but it has a lot of potential. However, challenges such as regional differences and competition from traditional farming methods can slow its adoption.

FlowMind Is An Automatic Workflow Generator

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 25 Jun 24

🕹 Technology AI Automation Workflow Data science Programming

FlowMind is a new tool that helps create automatic workflows using advanced AI. It takes user requests and generates code to complete tasks quickly.
The system uses APIs to gather information and provides real-time feedback, allowing users to adjust the workflows as needed. This makes the process more interactive.
FlowMind aims to improve the reliability of AI by reducing errors and making sure there is no direct connection to sensitive data. It focuses on keeping user data safe while handling requests.

Data Science Weekly - Issue 486

Data Science Weekly Newsletter • 359 implied HN points • 17 Mar 23

🕹 Technology Data science Machine Learning Artificial Intelligence Data Engineering Analytics

AI and data science are evolving rapidly, making it challenging for many to keep up. It's common for professionals to feel overwhelmed as they try to understand new advancements.
There's a growing discussion about whether we should slow down AI development. Some people believe we need to pause and figure out the implications of current technologies before moving forward.
Many professionals are exploring career shifts between data science and data engineering. It's important to consider personal interests and skills when deciding which path to take.

AI Agents & Marketing: It's Weird.

The Digital Anthropologist • 19 implied HN points • 24 Jun 24

🕹 Technology AI Marketing Ethics Data science Automation

In the future, marketers might need to create separate campaigns for humans and AI agents, requiring unique approaches for each audience.
Marketing teams are facing the challenge of designing campaigns that cater to both human and AI customers, necessitating the development of dual marketing strategies and content.
The integration of AI agents in marketing campaigns has led to increased costs and complexities, requiring specialized roles, technologies, and strategies to navigate successfully.

Data Science Weekly - Issue 565

Data Science Weekly Newsletter • 1 HN point • 19 Sep 24

🕹 Technology Data science Artificial Intelligence Machine Learning Data Engineering Data Visualization

Reading The Data Science Weekly is a great way to stay updated on AI and machine learning topics. It shares links, news, and resources that can help anyone interested in these fields.
There are many useful techniques in data science, like the Hampel Filter for outlier detection, which can help improve data quality. Exploring these methods can really enhance your understanding and skills.
Effective communication is crucial in data science. How you explain your findings can significantly impact your career, so it's important to work on your communication skills.
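
For readers who have not met the Hampel filter mentioned above, a minimal sketch follows. It uses the usual formulation: a sliding window, the window median, and the median absolute deviation (MAD) scaled by 1.4826 so that it approximates a standard deviation for normally distributed data. The window size and the threshold of three are conventional defaults chosen here, not values taken from the newsletter.

```python
from statistics import median

def hampel_filter(values, half_window=3, n_sigmas=3.0):
    """Replace points far from their local median; return (cleaned, outlier_indices)."""
    scale = 1.4826  # makes MAD comparable to a standard deviation under normality
    cleaned = list(values)
    outliers = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        local_median = median(window)
        mad = median([abs(v - local_median) for v in window])
        if abs(values[i] - local_median) > n_sigmas * scale * mad:
            cleaned[i] = local_median
            outliers.append(i)
    return cleaned, outliers

if __name__ == "__main__":
    series = [1.0, 1.1, 0.9, 1.0, 10.0, 1.05, 0.95, 1.0, 1.1]
    cleaned, flagged = hampel_filter(series)
    print(flagged, cleaned)
```

Because it relies on medians rather than means, a single spike does not distort the statistics used to detect it.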

Edge 451: Is One Teacher Enough? Understanding Multi-Teacher Distillation

TheSequence • 56 implied HN points • 26 Nov 24

🕹 Technology Machine Learning Artificial Intelligence Data science Computing Software Development

Using multiple teachers in distillation is better than just one. This method helps combine different areas of knowledge, making the student model more powerful.
Each teacher can focus on a specific type of knowledge, like understanding features or responses. This specialization leads to a more balanced learning process.
Although this approach might be more expensive to implement, it creates a stronger and less biased model overall.
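
A response-based variant of this idea can be sketched by averaging the teachers' softened output distributions and measuring how far the student is from that average. This is one common formulation chosen for illustration; the uniform-or-custom weights, the temperature, and the function names are assumptions, and the feature-based teachers mentioned above are not covered.

```python
import math

def softmax(logits, temperature=1.0):
    """Probabilities from logits; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_targets(teacher_logits, weights, temperature):
    """Weighted average of the teachers' softened distributions."""
    dists = [softmax(logits, temperature) for logits in teacher_logits]
    num_classes = len(dists[0])
    return [sum(w * d[c] for w, d in zip(weights, dists)) for c in range(num_classes)]

def distillation_loss(student_logits, teacher_logits, weights, temperature=2.0):
    """KL divergence from the averaged teacher distribution to the student."""
    target = ensemble_targets(teacher_logits, weights, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0)

if __name__ == "__main__":
    teachers = [[3.0, 1.0, 0.0], [2.0, 2.0, 0.0]]  # two teachers, three classes
    student = [1.0, 1.0, 1.0]
    print(distillation_loss(student, teachers, weights=[0.5, 0.5]))
```

Training would minimize this loss over many examples, usually alongside the ordinary loss on the true labels.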

Saturday morning open tabs

Scott's Substack • 78 implied HN points • 10 Feb 24

🕹 Technology AI Tech news Virtual reality Data science

The post discusses the experience of switching phone carriers and the challenges faced, emphasizing the impact of not having a phone for a few days.
The post touches on upcoming summer plans including workshops in Madrid, Scotland, and potential travel to Vietnam, highlighting the diversity of travel experiences planned.
The author explores the new Apple Vision Pro product, contemplating its potential usage for work, entertainment, and travel, showcasing a mix of curiosity and skepticism.

DoRA is The New LoRA!

Aziz et al. Paper Summaries • 59 implied HN points • 07 Apr 24

🕹 Technology Artificial Intelligence Machine Learning Data science Programming Software Development

LoRA helps fine-tune large language models without changing all their parameters. It trains two small low-rank matrices that can be merged back into the original weights, so the model is no slower at inference.
LoRA's updates to weights can miss valuable details you'd get from full fine-tuning, because it treats magnitude and direction together.
DoRA improves on LoRA by separating magnitude and direction, leading to better performance on reasoning tasks and other applications. It holds up well at lower ranks, which keeps it efficient.
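
The difference between the two updates can be sketched with plain lists. The code assumes the formulation in which LoRA adds the product of two small matrices to the frozen weight, and DoRA rescales each column of that result to unit length before multiplying by a separately learned magnitude. Function names and the tiny matrices are hypothetical, and no training loop is shown.

```python
import math

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_weight(w, b, a):
    """LoRA: frozen weight plus a low-rank update B @ A."""
    delta = matmul(b, a)
    return [[w[i][j] + delta[i][j] for j in range(len(w[0]))] for i in range(len(w))]

def column_norms(w):
    """Euclidean length of each column."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]

def dora_weight(w, b, a, magnitude):
    """DoRA: direction from the LoRA-updated weight, length from a separate vector."""
    directed = lora_weight(w, b, a)
    norms = column_norms(directed)
    return [
        [magnitude[j] * directed[i][j] / norms[j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

if __name__ == "__main__":
    w = [[3.0, 0.0], [4.0, 2.0]]  # frozen pretrained weight (2 x 2)
    b = [[1.0], [0.0]]            # low-rank factor B (2 x 1)
    a = [[0.0, 2.0]]              # low-rank factor A (1 x 2)
    print(lora_weight(w, b, a))
    print(dora_weight(w, b, a, magnitude=column_norms(w)))
```

With the magnitude vector set to the original column norms and B set to zero, the DoRA weight equals the pretrained weight, so fine-tuning starts from the unmodified model.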