The hottest Data science Substack posts right now

And their main takeaways

Run Large Language Model On Your Own Computer

The Beep • 19 implied HN points • 10 Mar 24

🕹 Technology AI Software Hardware Programming Data science

You can run large language models, like Llama2, on your own computer using a tool called Ollama. This allows you to use powerful AI without needing super high-tech hardware.
Setting up Ollama is simple. You just need to download it and run a couple of commands in your terminal to get started.
Once it's running, you can interact with the model like you would with any chatbot. This means you can type prompts and get responses directly from your own machine.
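
As a rough illustration of talking to the model from your own machine, here is a minimal sketch in Python. It assumes Ollama is already running locally on its default port (11434) and that the llama2 model has been pulled, for example with `ollama run llama2`. The helper names `build_request` and `ask` are made up for this example and are not part of Ollama.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama2"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="llama2"):
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `print(ask("Why is the sky blue?"))` should print the model's reply. Setting `stream` to false asks for one JSON object instead of a stream of partial responses.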

Do you want to do a project with some great Northwestern students?

Mike Talks AI • 39 implied HN points • 22 Nov 23

🚌 Education Machine Learning Data science

A class at Northwestern offers projects with companies and non-profits for student teams.
Students have worked with organizations like UPS, Ferrara Candy, and Lurie Children's Hospital.
Students undergo rigorous training in probability, statistics, machine learning, and optimization before working on projects.

Mistakes from my Failed Startup in Scalping Concert Tickets

Jay's Data Stream • 23 implied HN points • 30 Oct 24

💼 Business Entrepreneurship Market Trends Data science

The concert ticket market is built on false pricing, where tickets are sold for less than their actual value. This means people often pay much more on resale markets.
Making money by reselling tickets is much harder than it seems. Success requires understanding a lot about the market and using technology to navigate tough ticketing systems.
Creating a startup in this space is complicated and needs more than just good ideas. It's about having the right infrastructure to turn those ideas into profitable actions.

Microsoft’s New Love

Sector 6 | The Newsletter of AIM • 39 implied HN points • 17 Nov 23

🕹 Technology AI Development Machine Learning Natural Language Data science Software Engineering

Large language models (LLMs) like ChatGPT are powerful but costly to run and customize. They require a lot of resources and can be tricky to adapt for specific tasks.
Small language models (SLMs) are emerging as a better option because they are cheaper to train and can give more accurate results. They also don't need heavy hardware to operate.
Many companies are starting to focus on developing small language models due to their efficiency and effectiveness, marking a shift in the industry.

Self-Reflective Retrieval-Augmented Generation (SELF-RAG)

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 04 Mar 24

🕹 Technology AI Machine Learning Generative AI Language Models Data science

SELF-RAG is designed to improve the quality and accuracy of responses from generative AI by allowing the AI to reflect on its own outputs and decide if it needs to retrieve additional information.
The process involves generating special tokens that help the AI evaluate its answers and determine whether to get more information or stick with its original response.
Balancing efficiency and accuracy is crucial; too much focus on speed can lead to wrong answers, while aiming for perfect accuracy can slow down the system.

Everybody Likes RAG

Sector 6 | The Newsletter of AIM • 39 implied HN points • 15 Nov 23

🕹 Technology AI Software Data science Innovation Digital Tools

RAG stands for retrieval-augmented generation, which is becoming really popular in the tech world. People are eager to use it for their work.
It offers benefits such as better access to current information and easier source verification. It's also efficient and cost-effective.
Some see RAG as just a fancy version of prompt engineering, but others think it's essential for growing business applications.

Scale, Scale, Scale

Gradient Flow • 179 implied HN points • 05 May 22

🕹 Technology AI Podcasts Data science Distributed Computing Summit

Scale matters for AI startups: proficiency in distributed systems can count for more than expertise in ML and AI.
The post uses metrics to explore the impact of distributed computing on machine learning and AI.
Insights from the Data Exchange podcast on topics like scaling language models, applying ML to optimization, and blending data science with domain expertise.

The Tech Buffet #13: Getting a RAG To Work Well Is Hard - 5 Blog Posts To Become a RAG Master

The Tech Buffet • 39 implied HN points • 13 Nov 23

🕹 Technology Machine Learning Artificial Intelligence Software Development Data science Information Retrieval

RAG systems have limitations, like difficulties in effectively retrieving complex information from text. It's vital to understand these limits to use RAGs successfully.
Improving RAG performance involves strategies like cleaning your data and adjusting chunk sizes. These tweaks can help make RAG systems work a lot better.
RAGs may not meet all needs in specialized fields, like insurance, since they sometimes miss important details in lengthy documents. Other methods might be needed for these complex queries.
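
To make the chunk-size point concrete, here is a minimal sketch of word-based chunking with overlap. The function name `chunk_words` and the default sizes are assumptions chosen for illustration, not values taken from the post. Real pipelines often chunk by tokens or sentences instead.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Smaller chunks tend to make retrieval more precise but can strip away context. Larger chunks keep context but may bury the relevant detail, so the size is worth tuning per dataset.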

Why Making a Non-Woke AI Is Actually Very Hard

The Future of Life • 19 implied HN points • 26 Feb 24

🕹 Technology AI Machine Learning Bias Programming Data science Ethics

Language models learn from the data they are trained on, which often includes a lot of left-leaning content, making them reflect that bias.
Adjusting a model's political views is complicated because it involves changing an entire worldview, which can mess up the quality of the responses.
Creating a balanced AI requires new training methods, as current models can’t easily switch perspectives without losing their effectiveness.

When Statistics Lie. Anscombe's Quartet [Math Mondays]

Technology Made Simple • 79 implied HN points • 14 Nov 22

🔬 Science Statistics Data Analysis Visualization Mathematics Data science

Data exploration is crucial in data analysis for gaining useful insights.
Anscombe's quartet showcases how data sets with similar simple stats can have very different distributions.
Visualization is key in spotting patterns, trends, and outliers in data analysis.
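
The quartet is easy to check by hand. The sketch below uses the standard published values of Anscombe's four datasets and computes the mean of x, the mean of y, and the Pearson correlation for each. The helper names are illustrative.

```python
from math import sqrt

# Anscombe's quartet: datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def summarize(xs, ys):
    """Return (mean of x, mean of y, correlation) for one dataset."""
    return mean(xs), mean(ys), pearson(xs, ys)


for name, (xs, ys) in QUARTET.items():
    mx, my, r = summarize(xs, ys)
    print(f"{name}: mean_x={mx:.2f} mean_y={my:.2f} r={r:.3f}")
```

All four datasets come out with a mean x of 9, a mean y of about 7.5, and a correlation of about 0.816, yet a scatterplot of each looks completely different. That is the case for plotting before trusting summary statistics.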

How I networked with one of the most prominent Data Scientists online [Storytime Saturdays]

Technology Made Simple • 59 implied HN points • 29 Jan 23

💼 Business Networking Data science Career development

Networking is a valuable skill to add to your toolbox for personal growth, career progression, or assisting others.
Even without an established online presence, you can stand out and network effectively with prominent individuals in your field.
Effective networking can lead to opportunities and contracts that showcase your skills and expertise to the right people.

Vesuvius Challenge is hiring!

Vesuvius Challenge • 9 implied HN points • 21 Jan 25

🕹 Technology Computer Vision Data science Software Engineering Machine Learning Infrastructure

The Vesuvius Challenge is looking for team members to help recover texts from ancient scrolls. They need people for two key roles: research in computer vision and platform engineering.
The computer vision role focuses on using advanced tech to read the scrolls, which involves solving complex problems with CT scan data.
The platform engineering role is about creating tools and systems to manage and share large datasets, making research easier for the community.

Synthetic Data: How to Use LLM to Improve the Performance of LLM (WizardLM)

DataSyn’s Substack • 1 HN point • 27 Aug 24

🕹 Technology Artificial Intelligence Data science Machine Learning Software Development Privacy issues

Synthetic data can help solve problems with real-world data, like data scarcity and privacy issues. By using artificial data, we can create large sets that are safe and more accessible.
The Evol-Instruct method creates complex commands from simpler ones, which leads to richer training data for models. This process helps develop a variety of tasks for AI to learn from.
Training models like WizardLM with synthetic data has been shown to improve their performance significantly. It produces better responses compared to many other models, helping AI handle tougher challenges.

How to Invent Math [Technique Tuesdays]

Technology Made Simple • 59 implied HN points • 25 Jan 23

🕹 Technology Math Data science Innovation Problem Solving Metrics

Understand the problem you're dealing with before inventing your own math.
Basic operations like division, multiplication, addition, and subtraction are key for creating new metrics.
Consider existing metrics but be ready to modify or create new ones to better suit your specific needs.

Catastrophic Forgetting In LLMs

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 22 Feb 24

🕹 Technology Artificial Intelligence Machine Learning Data science Software Development Natural Language Processing

Catastrophic forgetting happens when language models forget things they learned before as they learn new information. It's like a student who forgets old lessons when they study new subjects.
Language models can change their performance over time, sometimes getting worse instead of better. This means they can produce different answers for the same question at different times.
Continuous training can make models forget important knowledge, especially in understanding complex topics. Researchers suggest that special training techniques might help reduce this forgetting.

Tech Talks Weekly #6

Tech Talks Weekly • 19 implied HN points • 14 Mar 24

🕹 Technology Software Conferences Data science Engineering Development

Tech Talks Weekly shares recent tech talks from major conferences like Devoxx and NDC. It's a great way to keep updated on the latest in tech.
There's a special edition featuring over 550 talks from Kubernetes conferences. This provides a huge resource for anyone interested in cloud technology.
The newsletter encourages sharing with friends and colleagues to build a community. Spreading the word helps more people connect with the tech talk content.

GroupBy #23: Meta loves Python, How Uber Serves Over 40 Million Reads Per Second from Online Storage Using an Integrated Cache

VuTrinh. • 19 implied HN points • 20 Feb 24

🕹 Technology Data Engineering Software Development Data science Artificial Intelligence Cloud Computing

Meta is heavily invested in Python, and they're working on improvements to enhance its performance and usability.
Uber has developed a powerful database called Docstore that can handle over 40 million reads per second, demonstrating their capability in data management.
Data, while useful, doesn't capture the complete reality, and it's important to recognize its limitations in understanding complex scenarios.

The Tech Buffet #9: Let's talk about LLM Hallucinations

The Tech Buffet • 39 implied HN points • 24 Oct 23

🕹 Technology AI Machine Learning Data science Natural Language Processing Software Development

LLMs, or Large Language Models, often produce incorrect or misleading information, known as hallucinations. This happens because they generate text based on probabilities, not actual understanding.
To measure how factually accurate LLM responses are, a tool called FActScore can break down answers into simple facts and check if these facts are true. This helps in gauging the accuracy of the information given by LLMs.
To reduce hallucinations, it's important to implement strategies such as allowing users to edit AI-generated content, providing citations, and encouraging detailed prompts. These methods can help improve the trustworthiness and reliability of the information LLMs produce.
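
The scoring idea behind this kind of check can be sketched in a few lines. This is not the FActScore tool or its API. It only illustrates the final step, assuming the answer has already been split into atomic facts and that some verifier exists. The toy `KNOWN_FACTS` set and the `fact_precision` name are hypothetical stand-ins.

```python
def fact_precision(atomic_facts, is_supported):
    """Return the fraction of atomic facts that the verifier supports."""
    if not atomic_facts:
        return 0.0
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)


# Hypothetical knowledge source: a real system would consult a reference
# corpus or retrieval system rather than an exact-match set.
KNOWN_FACTS = {
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
}

answer_facts = [
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
    "Paris has a population of 50 million.",
]

score = fact_precision(answer_facts, lambda fact: fact in KNOWN_FACTS)
print(f"supported fraction: {score:.2f}")
```

Here two of the three facts are supported, so the score is 2/3. Breaking the answer into atomic facts and verifying each one are the hard parts, and both are left out of this sketch.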

Friday Finds - Bonus Links

Data Science Weekly Newsletter • 19 implied HN points • 16 Feb 24

🕹 Technology AI Data science Programming Ecology Tutorials

There are new tutorials available for those interested in AI and humanities. These tutorials aim to help people learn how to use AI tools effectively.
The Leverhulme Programme is offering opportunities in ecological data science. This program is designed for doctoral training and focuses on important ecological research.
A team is looking to hire a remote R programmer. They want someone to create an easy-to-use package for analyzing complex models in R.

Demonstrate, Search, Predict (DSP) for LLMs

Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 19 implied HN points • 16 Feb 24

🕹 Technology AI NLP Machine Learning Data science Software Development

The Demonstrate, Search, Predict (DSP) approach is a method for answering questions using large language models by breaking it down into three stages: demonstration, searching for information, and predicting an answer.
This method improves efficiency by allowing for complex systems to be built using pre-trained parts and straightforward language instructions. It simplifies AI development and speeds up the creation of new systems.
Decomposing queries, known as Multi-Hop or Chain-of-Thought, helps the model reason through questions step by step to arrive at accurate answers.
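
The three stages can be sketched as a small pipeline. This is a loose illustration, not the DSP library's API: `lm` and `retrieve` are hypothetical callables standing in for a language model and a search system, and the prompt wording is invented for the example.

```python
def dsp_answer(question, demos, retrieve, lm, hops=2):
    """Answer a question in three stages: demonstrate, search, predict.

    demos    -- list of (question, answer) pairs used as few-shot examples
    retrieve -- hypothetical callable: query string -> list of passages
    lm       -- hypothetical callable: prompt string -> completion string
    """
    # Demonstrate: format worked examples as few-shot context.
    demo_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in demos)

    # Search: retrieve passages over several hops, letting the model
    # propose a follow-up query after each hop except the last.
    passages = []
    query = question
    for hop in range(hops):
        passages.extend(retrieve(query))
        if hop < hops - 1:
            context = " ".join(passages)
            query = lm(
                f"{demo_text}\nContext: {context}\n"
                f"Next search query for: {question}"
            )

    # Predict: answer the question grounded in the retrieved passages.
    context = " ".join(passages)
    return lm(f"{demo_text}\nContext: {context}\nQ: {question}\nA:")
```

Passing the model and retriever in as plain callables reflects the idea of composing pre-trained parts without retraining them.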

Data Science Weekly - Issue 479

Data Science Weekly Newsletter • 99 implied HN points • 27 Jan 23

🕹 Technology Data science Artificial Intelligence Machine Learning Programming Statistics

Exploratory programming is important for data teams. It helps them find insights rather than just building software.
Most datasets are not normally distributed. Many tests can check for normality, but they can be tricky to use.
AI is gaining a lot of attention, similar to what crypto once had. People are questioning if it can keep that interest alive.

More nuggets on BYD ADAS

TP’s Substack • 6 implied HN points • 24 Feb 25

🕹 Technology Automotive Artificial Intelligence Data science Hardware Software

BYD chose a specific chip setup for its DiPilot-100 platform that supports advanced technology better than other options. They prioritized overall performance and future needs rather than just the highest computing power.
The company collects a large amount of driving data daily, which helps it continually improve its ADAS technology. While it's still behind Tesla’s FSD, BYD's hardware is getting better and offers a good detection range.
BYD is focusing on reducing costs by developing its own chips and increasing production efficiency. This strategy will help them expand smart car technology to more vehicles and compete effectively in the market.

Smarter Learning Rates, Medical AI, and Compiler Innovations

HackerPulse Dispatch • 2 implied HN points • 20 Dec 24

🕹 Technology AI Machine Learning Medical AI Data science

New learning rate techniques, like SGD-SaI, make AI training more efficient and use less memory. This means large models can learn better and faster.
AI is showing amazing skills in medical tasks, sometimes even better than doctors, but it still has some limitations in certain areas.
There are advancements in AI compilers that help optimize how programs run, making them more efficient. This is important for developing smarter AI systems.

Who Cares if Big Data Is Dead!

Machine Learning for Developers • 39 implied HN points • 23 Feb 23

🕹 Technology Data Analytics Data Quality Data science Machine Learning Big Data

Data quality and data analytics motives matter more than the size of data.
Big data may not be as prevalent as believed, with most workloads processing only a small amount of data.
Too much data can lead to legal and privacy issues, making data quality paramount.

BYOK

Aipreneur • 39 implied HN points • 08 Mar 23

🕹 Technology AI Data science Machine Learning Cloud Computing Mobile technology

BYOD (Bring Your Own Device) became popular in corporations due to the iPhone's rise and employee preferences.
BYOD benefits companies through cost savings, convenience, and increased mobility, and it suits changing workforce demographics.
The emerging trend of BYOK (Bring Your Own Keys) is starting in AI platforms, where users need to pay for keys to access and use data responsibly.

Concordium and 2021.ai Team Up to Make AI More Trustworthy

Concordium Monthly Updates • 39 implied HN points • 20 Jul 23

🕹 Technology AI Blockchain Data science Compliance Trust

The partnership between Concordium and 2021.ai enhances trust in AI through data validation and audit trails.
Integrating Concordium's blockchain into 2021.ai's platform enables new use cases like ESG Validation and MiCA compliance.
The collaboration aims to promote responsible and ethical use of AI, driving innovation and building trust in the AI industry.

Twitter open-sourced their recommendation algorithm

MLOps Newsletter • 39 implied HN points • 09 Apr 23

🕹 Technology Algorithms Machine Learning Open Source Neural Networks Data science

Twitter has open-sourced their recommendation algorithm for both training and serving layers.
The algorithm involves candidate generation for in-network and out-network tweets, ranking models, and filtering based on different metrics.
Twitter's recommendation algorithm is user-centric, focusing on user-to-user relationships before recommending tweets.

Introduction to R for Data Science (Part Seven Final)

The Software & Data Spectrum • 39 implied HN points • 06 Apr 23

🕹 Technology Data science Visualization Themes

Boxplots are common for visualizing data like stock pricing, and you can customize them with colors and flips.
Variable plotting can include heat maps to show occurrences, and you can adjust the appearance with features like scale_fill_gradient().
Coordinate your graphs using functions like coord_cartesian() and facet them based on specific variables for more detailed insights.

Introduction to R for Data Science (Part Five)

The Software & Data Spectrum • 39 implied HN points • 30 Mar 23

🕹 Technology Data science Data Manipulation Programming

Apply-family functions in R, such as lapply() and sapply(), call a function on each element of a vector or list.
Math functions in R like abs(), sum(), mean(), and round() are useful for basic calculations and rounding numbers.
Data manipulation in R using dplyr involves functions like filter(), arrange(), select(), and mutate() to filter, sort, and create new columns in datasets.

Introduction to R for Data Science (Part Six)

The Software & Data Spectrum • 39 implied HN points • 02 Apr 23

🚌 Education Data science Data Visualization

The pipe operator is a helpful tool in R for chaining operations together neatly.
Tidyr is a package in R that helps clean and gather data efficiently.
Data visualization in R can be enhanced with ggplot2 for creating visually appealing histograms, scatterplots, and barplots.

How Hugging Face and Kaggle Bolster the Open Source Machine Learning Community

The Strategy Deck • 39 implied HN points • 26 Jul 23

🕹 Technology Machine Learning Open Source Data science ML models Model Deployment

Open source ML hubs like Hugging Face and Kaggle provide platforms for managing, sharing, and deploying ML models.
Hugging Face focuses on models, datasets, deployment infrastructure, and community engagement.
Kaggle empowers learners, developers, and researchers with educational resources, open source models, and a competitive platform.

🥟 Chao-Down #70 ChatGPT reads financial headlines and Federal Reserve speeches, Google uses generative AI for ads, IBM and Moderna develop vaccines with AI and quantum computing

Chaos Theory • 39 implied HN points • 24 Apr 23

🕹 Technology AI Quantum Computing Data science Wearable Tech Machine Learning

ChatGPT reads financial headlines and Federal Reserve speeches for prediction.
Google employs generative AI for advanced ad campaigns.
IBM and Moderna collaborate on AI and quantum computing for vaccine development.

Python Powers Excel

Sector 6 | The Newsletter of AIM • 39 implied HN points • 24 Aug 23

🕹 Technology Software Data science Analytics Machine Learning Programming

Python is now integrated into Excel, making it easier for users to blend Excel's tools with Python's capabilities.
This allows users to perform advanced tasks like data visualization and machine learning directly in Excel.
The integration works well with existing Excel features, so users can still use familiar functions like formulas and charts.

For the Love of PyTorch

Sector 6 | The Newsletter of AIM • 39 implied HN points • 04 Sep 23

🕹 Technology AI Software Neural Networks Deep Learning Data science

PyTorch is a key player in the development of AI, particularly large language models (LLMs). Its flexibility makes it great for deep learning experiments.
The framework supports GPUs really well and allows for easy updates to computation graphs during programming.
In 2022, PyTorch had a significant edge on platforms like Hugging Face, with 92% of models being PyTorch-exclusive compared to just 8% for TensorFlow.

XGBoost is the Secret of ML Energy

Sector 6 | The Newsletter of AIM • 39 implied HN points • 06 Sep 23

🕹 Technology Artificial Intelligence Machine Learning Data science Algorithms Computing

XGBoost, or Extreme Gradient Boosting, helps improve the performance and speed of machine learning models that deal with tabular data. It's known for being really good at finding patterns and making predictions.
This algorithm works best for supervised learning when you have lots of training examples, especially when you have both categorical and numeric data. It can handle a mix of different data types well.
If you're working with a dataset that has many features, XGBoost is a strong choice to enhance the capabilities of your machine learning model. It makes it easier to get accurate results.

Google’s Missed AI Opportunity

Sector 6 | The Newsletter of AIM • 39 implied HN points • 31 Aug 23

🕹 Technology AI Machine Learning Data science Innovation Startups

Google missed a huge chance by overlooking the Transformer paper in 2017, which cost them around $6.2 billion. This mistake allowed others to build successful AI startups.
The authors of the Transformer paper have moved on to create their own companies, showing the impact of their work and how they’ve found success after leaving Google.
Such missed opportunities highlight the importance of recognizing and supporting innovative research within companies like Google.

The Counterfactual's poll #2

The Counterfactual • 19 implied HN points • 05 Feb 24

🕹 Technology AI Data science Research Communication Language Models

Subscribers can vote each month on research topics. This helps decide what the writer will explore next based on community interest.
The upcoming projects mostly focus on how Large Language Models (LLMs) can measure or modify readability. Some topics might take more than a month to research thoroughly.
One of the suggested studies looks at whether AI responses vary by month, testing if it seems 'lazier' in December compared to other months.

BizML: a framework for success in applied machine learning

TechTalks • 19 implied HN points • 05 Feb 24

🕹 Technology AI Machine Learning Data science Business Framework

Most machine learning projects fail due to a gap in understanding between data scientists and business professionals.
Eric Siegel introduces bizML, a six-step framework for successful machine learning projects that emphasizes starting with the end business goal.
Improving human understanding and leadership is crucial for the success of advanced technologies like machine learning.

Understanding The Role of Vector DB in AI Application

The Beep • 19 implied HN points • 04 Feb 24

🕹 Technology AI Databases Software Applications Data science

Vector databases are designed to handle complex and unstructured data, making them great for AI applications like semantic search and face recognition. They convert information into high-dimensional vectors that are easy to work with.
Unlike traditional databases, vector databases can manage different types of data such as text, images, and audio, which makes them very versatile. They're like a Swiss Army knife for managing data.
Vector databases play a crucial role in enhancing AI capabilities, providing better access and analysis of data, which leads to smarter applications, including smart assistants and more.
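
At its core, the semantic search described above is a nearest-neighbor lookup over vectors. The toy sketch below does this with cosine similarity in plain Python. The class name `TinyVectorStore` and the two-dimensional vectors are invented for illustration. A real vector database uses learned embeddings with hundreds of dimensions and approximate indexes instead of a full scan.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """In-memory store that ranks items by cosine similarity to a query."""

    def __init__(self):
        self._items = []  # list of (key, vector) pairs

    def add(self, key, vector):
        self._items.append((key, list(vector)))

    def search(self, query, k=3):
        """Return the k most similar (key, score) pairs, best first."""
        scored = [(key, cosine(query, vec)) for key, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add("cat", [1.0, 0.0])
store.add("dog", [0.9, 0.1])
store.add("car", [0.0, 1.0])

for key, score in store.search([1.0, 0.0], k=2):
    print(key, round(score, 3))
```

The query vector points in the same direction as "cat" and close to "dog", so those two rank above "car". The same ranking logic applies whether the vectors come from text, images, or audio.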

Friday Finds - Bonus Links

Data Science Weekly Newsletter • 19 implied HN points • 02 Feb 24

🕹 Technology AI Machine Learning Data science Software Innovation

Paid subscribers get extra links and content. It's a nice way to say thank you for their support.
There are interesting discussions on topics like AI and machine learning. These conversations help people learn more about the field.
Links to simulations and insights about reality powered by AI are shared. They could spark curiosity and understanding about modern technology.