The hottest data science Substack posts right now

And their main takeaways
Category: Top Technology Topics
Data Science Weekly Newsletter 419 implied HN points 22 Dec 23
  1. Generative AI is changing how we work with tools, improving the Human-Tool Interface. This can help us use technology in ways we never could before.
  2. Support Vector Machines (SVMs) can be very effective for prediction tasks, often achieving lower error rates than other models. However, they aren't as commonly used, possibly due to their complexity.
  3. Deep multimodal fusion is useful in surgical training. It helps classify feedback from experienced surgeons to trainees by combining different types of data like text, audio, and video.
Mindful Modeler 339 implied HN points 23 Jan 24
  1. Quantile regression can be used for robust modeling to handle outliers and predict tail behavior, helping in scenarios where underestimation or overestimation leads to loss.
  2. It is important to choose quantile regression when predicting specific quantiles, such as upper quantiles, in scenarios like bread sales where under- or overestimating has financial consequences (see the sketch after this list).
  3. Quantile regression can also be utilized for uncertainty quantification, and combining it with conformal prediction can improve coverage, making it useful for understanding and managing uncertainty in predictions.
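To make the bread-sales example concrete, here is a minimal sketch using scikit-learn on synthetic data (an illustration only, not the post's own code). Fitting an upper quantile with the pinball loss penalizes underestimation more heavily than overestimation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic, right-skewed "sales" data (an assumption made for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))           # e.g. a demand signal
y = 20 + 3 * X.ravel() + rng.gumbel(0, 5, 500)  # noisy, skewed outcomes

# loss="quantile" fits a conditional quantile via the pinball loss.
median = GradientBoostingRegressor(loss="quantile", alpha=0.5).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, y)

x_new = [[7.0]]
print(median.predict(x_new), upper.predict(x_new))  # the 0.9-quantile sits above the median
```

Conformalized quantile regression, mentioned in the third point, would then widen or narrow such intervals on a held-out calibration set to reach the desired coverage.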
Interconnected 246 implied HN points 18 Nov 24
  1. The scaling law for AI models might be losing effectiveness, meaning that simply using more data and compute power may not lead to significant improvements like it did before.
  2. US export controls on AI technology may become less impactful over time, as diminishing returns on AI model scaling could lessen the advantages of having the most advanced hardware.
  3. If AI development slows down, the urgency for a potential 'AI doomsday' scenario may decrease, allowing for a more balanced competition between the US and China in AI advancements.
John Ball inside AI 39 implied HN points 24 Jul 24
  1. You don't need many words to communicate in a new language. Just a small vocabulary can help you get by in everyday conversations.
  2. For understanding most spoken and written text, around 2000 words are usually enough. This covers about 80% of regular communication.
  3. Machine learning and AI can benefit from understanding language like humans do, by learning new words in context rather than just relying on a large vocabulary.
Tech Talks Weekly 59 implied HN points 26 Jul 24
  1. Tech Talks Weekly is a free email newsletter that shares recent talks from dozens of tech conferences. It's a great way to catch up on what you missed!
  2. Readers can participate by filling out a short form to help improve the content. This makes it a community-driven resource.
  3. The newsletter highlights popular talks each week, making it easier for people to discover valuable insights from experts in tech.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 119 implied HN points 16 May 24
  1. AI agents can make decisions and take actions based on their environment. They operate at different levels of complexity, with level one being simple rule-based systems.
  2. Currently, AI agents are improving rapidly, sitting at levels two and three, where they can automate tasks and manage sequences of actions effectively.
  3. The future of AI agents is bright, as they will be more integrated into various industries, but we need to consider issues like accountability and ethics when designing and implementing them.
Data Science Weekly Newsletter 139 implied HN points 03 May 24
  1. Reusing data analysis work can save time and help teams focus on building new capabilities instead of just repeating old ones.
  2. Open-source models can be a better choice than proprietary ones for developing AI applications, often making development cheaper and faster.
  3. Causal machine learning helps predict treatment outcomes by personalizing clinical decisions based on individual patient data.
Year 2049 13 implied HN points 17 Jan 25
  1. AI systems learn from data, so the quality of that data is really important. Better data means smarter machines.
  2. Machines can become biased if they are trained on biased data. It's important to watch out for this when developing AI.
  3. This is just one part of a series explaining AI. More episodes will cover different aspects of how machines learn and behave.
TheSequence 175 implied HN points 09 Dec 24
  1. RAG techniques combine the power of language models with external data to improve accuracy. This means AI can give better answers by using real-world information.
  2. Advanced methods like Small to Slide RAG make it easier for AI to work with visual data, like slides and images. This helps AI understand complex information that is not just text.
  3. ColPali is a new approach that focuses on visuals directly, avoiding mistakes from converting images to text. It's useful for areas like design and technical documents, ensuring important details are not missed.
The Counterfactual 599 implied HN points 28 Jul 23
  1. Large language models, like ChatGPT, work by predicting the next word based on patterns they learn from tons of text. They don’t just use letters like we do; they convert words into numbers to understand their meanings better.
  2. These models handle the many meanings of words by changing their representation based on context. This means that the same word could have different meanings depending on how it's used in a sentence.
  3. The training of these models does not require labeled data. Instead, they learn by guessing the next word in a sentence and adjusting their processes based on whether they are right or wrong, which helps them improve over time.
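To make the next-word idea concrete, here is a toy bigram counter in plain Python (purely illustrative; real models learn numeric representations with neural networks rather than counting word pairs):

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Map words to integer IDs, mirroring the idea that models work with numbers, not letters.
vocab = {w: i for i, w in enumerate(dict.fromkeys(corpus))}
id_to_word = {i: w for w, i in vocab.items()}

# Count which word follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[vocab[prev]][vocab[nxt]] += 1

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in the corpus."""
    next_id, _ = following[vocab[word]].most_common(1)[0]
    return id_to_word[next_id]

print(predict_next("the"))  # 'cat' -- the most common continuation in this toy corpus
```

An LLM replaces the frequency table with a neural network whose predictions depend on the whole preceding context, which is how the same word can take on different meanings in different sentences.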
Data Science Weekly Newsletter 119 implied HN points 10 May 24
  1. Time-series analysis and Gaussian processes are powerful tools for interpreting data. They allow for flexibility and control in modeling data, making them essential for data practitioners.
  2. Understanding A/B testing is crucial for making informed business decisions. Using a reliable experimentation system can save time and lead to better results.
  3. New advancements in AI and data science are enhancing applications in various fields, like biomedical research and recommendation systems. These innovations help combine human creativity with machine learning capabilities.
The AI Frontier 119 implied HN points 09 May 24
  1. Open LLMs, like Llama 3, are getting really good and can perform well in many tasks. This improvement makes them a strong option for various applications.
  2. Fine-tuning open LLMs is becoming more attractive because of their improved quality and lower costs. This means smaller, specialized models can be more easily developed and used.
  3. However, open models likely won't surpass OpenAI's offerings. The proprietary models have a big advantage, but open LLMs can still thrive by focusing on efficiency and specific use cases.
Data Science Weekly Newsletter 179 implied HN points 29 Mar 24
  1. SQL is seen as an easier way to write relational algebra, but it's not ideal for building new query tools. Understanding its limits can help in learning and using SQL better.
  2. Many successful companies have developed their own AI models, showing a trend in the tech industry. Knowing about these companies can give insights into future developments in AI.
  3. Binary vector search methods can save a lot of memory compared to traditional methods. However, it's important to balance memory savings with maintaining accuracy.
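As a rough illustration of that memory trade-off (NumPy on random vectors, not the newsletter's code), each float32 dimension can be collapsed to a single bit and searched with Hamming distance:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 768)).astype(np.float32)   # ~30 MB as float32
binary_docs = np.packbits(docs > 0, axis=1)                 # ~1 MB as bits (~32x smaller)

def hamming_search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k nearest docs by Hamming distance on the binarized vectors."""
    binary_query = np.packbits(query > 0)
    distances = np.unpackbits(binary_docs ^ binary_query, axis=1).sum(axis=1)
    return np.argsort(distances)[:k]

candidates = hamming_search(rng.normal(size=768).astype(np.float32))
# Accuracy is typically recovered by re-ranking these candidates with the full-precision vectors.
```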
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 18 Jul 24
  1. Large Language Models (LLMs) can create useful text but often struggle with specific knowledge-based questions. They need better ways to understand the question's intent.
  2. Retrieval-augmented generation (RAG) systems try to solve this by using extra knowledge from sources like knowledge graphs, but they still make many mistakes.
  3. The Mindful-RAG approach focuses on understanding the question's intent more clearly and finding the right context in knowledge graphs to improve answers.
Owen’s Substack 59 implied HN points 19 Jul 24
  1. Triplex is a new tool that helps create knowledge graphs quickly and at much lower cost than older methods, putting knowledge-graph construction within reach of more people.
  2. This tool is small enough to run on regular laptops, which means you don't need powerful computers to build knowledge graphs. This makes technology more accessible to everyone.
  3. Triplex is open-source, allowing anyone to use and improve it. The community can experiment with it freely and innovate new ways to organize and understand information.
Data Science Weekly Newsletter 199 implied HN points 14 Mar 24
  1. Serverless computing can scale to handle large tasks, but it also brings challenges, such as managing large uploads effectively.
  2. Art careers can be influenced by the reputation of institutions, with established artists facing less access to elite spaces early on compared to newcomers.
  3. Learning about LLM evaluation metrics can help improve understanding and performance when working with large language models.
Steve Kirsch's newsletter 6 implied HN points 18 May 25
  1. The KCOR method is a new, simple technique to analyze how different interventions, like vaccines, affect outcomes such as mortality. It uses basic data like date of birth, date of death, and vaccination date to provide clear results.
  2. The analysis suggests that COVID vaccines may have increased mortality rates, indicating the vaccines could be more harmful than helpful. This counters many previous claims about the vaccines saving lives.
  3. KCOR is designed to be objective and straightforward, allowing for accurate comparisons without needing complex data adjustments, making it a powerful tool for understanding health interventions.
Interconnected 138 implied HN points 03 Jan 25
  1. DeepSeek-V3 is an AI model that is performing as well or better than other top models while costing much less to train. This means they're getting great results without spending a lot of money.
  2. The AI community is buzzing about DeepSeek's advancements, but there seems to be less excitement about it in China compared to outside countries. This might show a difference in how AI news is perceived globally.
  3. DeepSeek has a few unique advantages that set it apart from other AI labs. Understanding these can help clarify what their success means for the broader AI competition between the US and China.
HackerPulse Dispatch 5 implied HN points 31 Jan 25
  1. LLM-AutoDiff can make AI workflows more efficient by automatically optimizing prompts, leading to better performance without the need for manual work.
  2. Racing for superintelligence might cause more problems than it solves, making cooperation between nations a better option.
  3. Combining reinforcement learning with transformers can create AI that adapts and solves new problems effectively over time.
The AI Frontier 159 implied HN points 04 Apr 24
  1. Current methods for evaluating language models (LLMs) are not effective because they try to give one-size-fits-all answers. Each LLM is better suited for different tasks, so we need evaluations that reflect that.
  2. It’s important to look at specific skills of LLMs, like how well they follow instructions or retrieve information. This will help users understand which model works best for their needs.
  3. We need more detailed benchmarks that assess individual capabilities rather than general performance scores. This way, developers can make smarter choices when selecting LLMs for their projects.
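A toy example of what capability-level scoring could look like, using made-up result records (an assumption for illustration, not a benchmark from the post):

```python
from collections import defaultdict

# Hypothetical evaluation records, each tagged with the capability it probes.
results = [
    {"capability": "instruction_following", "correct": True},
    {"capability": "instruction_following", "correct": False},
    {"capability": "retrieval", "correct": True},
    {"capability": "retrieval", "correct": True},
]

per_capability = defaultdict(list)
for r in results:
    per_capability[r["capability"]].append(r["correct"])

# One score per capability instead of a single aggregate number.
scores = {cap: sum(v) / len(v) for cap, v in per_capability.items()}
print(scores)  # e.g. {'instruction_following': 0.5, 'retrieval': 1.0}
```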
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 19 implied HN points 13 Aug 24
  1. RAG Foundry is an open-source framework that helps make the use of Retrieval-Augmented Generation systems easier. It brings together data creation, model training, and evaluation into one workflow.
  2. This framework allows for the fine-tuning of large language models like Llama-3 and Phi-3, improving their performance with better, task-specific data.
  3. There is a growing trend in using synthetic data for training models, which helps create tailored datasets that match specific needs or tasks better.
TheSequence 98 implied HN points 21 Jan 25
  1. RAG stands for Retrieval-Augmented Generation. It's a way for machines to pull in outside information, helping them give better and more accurate answers.
  2. There are many kinds of RAG, like Standard RAG and Fusion RAG. Each type helps machines deal with different problems and has its special strengths.
  3. Understanding these RAG types is important for anyone working in AI. It helps them choose the right approach for different challenges.
Data Science Weekly Newsletter 359 implied HN points 15 Dec 23
  1. Learning about causal models is important in data analysis because it helps explain what caused the data. This understanding can improve how we interpret results using Bayesian methods.
  2. There's growing concern over data privacy in AI tools like Dropbox. Users are worried their private files could be used for AI training, even though companies deny this.
  3. Netflix recently held a Data Engineering Forum to share best practices. They discussed ways to improve data pipelines and processing, which could benefit many in the data engineering community.
TheSequence 77 implied HN points 04 Feb 25
  1. Corrective RAG is a smarter way of using AI that makes it more accurate by checking its work. It helps prevent mistakes or errors in the information it gives.
  2. This method goes beyond basic retrieval-augmented generation (RAG) by adding feedback loops that refine and improve the output as it learns.
  3. The goal of Corrective RAG is to provide answers that are factually accurate and coherent, reducing confusion or incorrect information.
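A minimal sketch of that feedback loop, with stand-in functions (retrieve, grade_relevance, rewrite_query, and generate are placeholders invented for illustration, not TheSequence's implementation):

```python
def retrieve(query):                    # stand-in retriever
    return ["a passage about " + query]

def grade_relevance(query, passages):   # stand-in check on the retrieved evidence
    return [p for p in passages if query.lower() in p.lower()]

def rewrite_query(query):               # stand-in query refinement
    return query + " (rephrased)"

def generate(query, passages):          # stand-in generator
    return f"Answer to '{query}' grounded in {len(passages)} passage(s)."

def corrective_rag(query: str, max_rounds: int = 2) -> str:
    """Retrieve, grade the evidence, and retry with a refined query if it looks weak."""
    for _ in range(max_rounds):
        passages = retrieve(query)
        good = grade_relevance(query, passages)
        if good:                          # evidence passed the check: answer from it
            return generate(query, good)
        query = rewrite_query(query)      # otherwise refine the query and retry
    return generate(query, [])            # fall back rather than answer from bad evidence

print(corrective_rag("vector databases"))
```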
Data Science Weekly Newsletter 139 implied HN points 12 Apr 24
  1. This newsletter provides links and updates about data science, AI, and machine learning. It's a helpful resource for anyone wanting to stay informed in this field.
  2. One article teaches how to handle real questions using Python, which is great for people wanting practical coding skills. Another discusses techniques to make sure AI outputs stay on task.
  3. The newsletter also features resources and courses to help people learn and improve their skills in data science and related areas. It's a good place to find learning opportunities.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots 39 implied HN points 10 Jul 24
  1. Using Chain-of-Thought prompting helps large language models think through problems step by step, which makes them more accurate in their answers.
  2. Smaller language models struggle with Chain-of-Thought prompting and often get confused, because they lack the knowledge and reasoning ability of the bigger models.
  3. Google Research has a method to teach smaller models by learning from larger ones. This involves using the bigger models to create helpful examples that the smaller models can then learn from.
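A rough sketch of that teaching idea, with a stub standing in for the larger model (the helper names are assumptions for illustration, not Google Research's actual code):

```python
def call_large_model(prompt: str) -> str:
    """Stand-in for a call to a large model; returns a canned step-by-step rationale."""
    return ("Step 1: the distance is 60 km and the time is 1.5 hours. "
            "Step 2: speed = 60 / 1.5 = 40. Answer: 40 km/h.")

def make_cot_example(question: str) -> dict:
    """Ask the large model to reason step by step; the (question, rationale) pair
    becomes a training example for a smaller model."""
    prompt = f"Question: {question}\nLet's think step by step, then give the final answer."
    return {"input": question, "target": call_large_model(prompt)}

examples = [make_cot_example("If a train travels 60 km in 1.5 hours, what is its speed?")]
print(examples[0]["target"])
# A smaller model fine-tuned on many such examples learns to imitate the
# larger model's reasoning steps, not just its final answers.
```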
The Data Ecosystem 119 implied HN points 21 Apr 24
  1. Data can be really complicated, and it's easy to miss how everything connects. People often focus on their own area and forget about the bigger picture of the data ecosystem.
  2. Chief Data Officers (CDOs) are important but can only do so much to fix data issues. They deal with many challenges, including limited power, lack of experience, and politics within the organization.
  3. To improve in the data field, we need to recognize the gaps in our knowledge, prioritize what to focus on, and continuously educate ourselves in both our own areas and related data domains.
ppdispatch 5 implied HN points 16 May 25
  1. The 'Leaderboard Illusion' highlights how some AI models get unfair rankings because of selective information sharing. This can make it hard to know which models are truly the best.
  2. Large Language Models (LLMs) struggle a lot in long conversations, with a big drop in their performance. They often lose track of conversations and can make mistakes early on that affect the whole chat.
  3. MiniMax-Speech is a new tech for turning text into speech that can imitate voices in multiple languages. It also allows for cool features like expressing emotions in the voice.
Data Analysis Journal 452 implied HN points 26 Jul 23
  1. The author reflects on three years of writing a newsletter about analytics, thanking supporters and subscribers.
  2. The author's newsletter aims to document their journey, bridge the gap between academia and industry, and encourage classic data analysis.
  3. The author shares insights on their writing strategy, the power of being small and independent, and future plans for the newsletter.
Data Science Weekly Newsletter 339 implied HN points 01 Dec 23
  1. Data science is evolving quickly, and it's important to stay updated with new advances and tools. Courses and reading lists can help you catch up and enhance your skills.
  2. Using machine learning to solve real-world problems, like correctly attributing quotes, shows the practical applications of data science. Collaboration between universities and organizations can lead to innovative solutions.
  3. The job market for data scientists is challenging right now. Many applicants are competing for limited positions, so if you're looking for a job, patience is key.
Data Science Weekly Newsletter 179 implied HN points 01 Mar 24
  1. The DSPy framework makes working with large language models easier by focusing on programming instead of complex prompting techniques. This helps reduce errors and improves usability.
  2. A new sequence model approach shows better performance than traditional Transformers, especially for long data sequences. It also works faster, making it a promising development in the field.
  3. Learning resources like online courses and free books on deep learning and causal ML can help deepen understanding of data science. They provide structured material that is great for both beginners and advanced learners.
VuTrinh. 59 implied HN points 11 Jun 24
  1. Meta has developed a serverless Jupyter Notebook platform that runs directly in web browsers, making data analysis more accessible.
  2. Airflow is being used to manage over 2000 DBT models, which helps teams create and maintain their own data models effectively.
  3. Building a data platform from scratch can be a valuable learning experience, revealing important lessons about data structure and management.
Mindful Modeler 419 implied HN points 19 Sep 23
  1. For imbalanced classification tasks, 'Do Nothing' should be the default approach, especially when dealing with calibration, strong classifiers, and class-based metrics.
  2. Addressing imbalanced data should be considered in scenarios where misclassification costs vary, metrics are impacted by imbalance, or weaker classifiers are used.
  3. Instead of using oversampling methods like SMOTE, adjusting data weighting, using cost-sensitive machine learning, and threshold tuning are more effective ways to handle class imbalance.
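A minimal sketch of those alternatives, using scikit-learn on synthetic data (an illustration, not Mindful Modeler's code): cost-sensitive class weighting plus an explicit decision threshold, with no resampling at all.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a 95/5 class split (an assumption for illustration).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Cost-sensitive learning: weight the minority class more heavily instead of oversampling it.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Threshold tuning: move the decision threshold to match the actual misclassification costs.
proba = clf.predict_proba(X_te)[:, 1]
preds = (proba >= 0.3).astype(int)   # 0.3 is a placeholder; pick it from the cost trade-off
```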
Data Science Weekly Newsletter 339 implied HN points 17 Nov 23
  1. JAX is becoming popular for its speed and capabilities, and learning it may be essential for those familiar with PyTorch. It does have a steeper learning curve, but there are resources to help ease the transition.
  2. The demand for GPUs is skyrocketing, driven by various market factors. Understanding these dynamics can help anticipate the future of technology and resource availability in industries reliant on powerful computing.
  3. Freelancing in data science can lead to an overwhelming number of job offers. Tips on finding clients on platforms like Upwork and LinkedIn can help navigate this new freelance landscape.
Data Science Weekly Newsletter 379 implied HN points 27 Oct 23
  1. Web development is evolving with the use of local models and technologies for building applications, moving beyond just Python-based machine learning.
  2. It's becoming increasingly important for developers to understand GPUs since they're widely used in deep learning and can greatly enhance performance.
  3. Companies are exploring various use cases for generative AI that provide real value, focusing on practical implementations that drive return on investment.
Data Science Weekly Newsletter 219 implied HN points 26 Jan 24
  1. AI often gets criticized for the quality of its output, but that might not be the real issue people have with it. If quality is fixed, the conversation about AI could change significantly.
  2. Common sense is tricky to define and measure, but researchers are developing ways to quantify it both individually and collectively. This could help clarify how we understand common sense in different contexts.
  3. Large language models (LLMs) can transform education by encouraging hands-on learning. They offer opportunities for more interactive and engaging learning experiences.