The hottest Data science Substack posts right now

And their main takeaways

HN blogs - 30/10/24

HackerNews blogs newsletter • 0 implied HN points • 30 Oct 24

🕹 Technology Data science

Upgrading tech can be simpler than it seems. One person managed to upgrade their project from Rails 7 to Rails 8 in just 30 minutes.
Project management practices like Scrum can be improved. It's possible to adopt better methods that actually make the process easier for everyone involved.
There are many useful tools and techniques in web development. Learning about things like PostgreSQL pagination or certificate authentication can really enhance your skills.

HN blogs - 22/10/24

HackerNews blogs newsletter • 0 implied HN points • 22 Oct 24

🕹 Technology Data science

Passkeys are seen as a potential improvement over passwords for logging in, but they may come with their own set of problems.
The latest trends in CSS3 animations show exciting developments for web design, keeping it fresh and engaging.
There's continuous innovation in speech-to-text technology, making it more efficient and user-friendly.

HN blogs - 17/10/24

HackerNews blogs newsletter • 0 implied HN points • 17 Oct 24

🕹 Technology Data science

Teaching kids to code may not be necessary for everyone. It's important to focus on what interests them instead.
Cognitive load is crucial in learning and productivity. We should manage it well to maximize our effectiveness.
Self-hosting can provide valuable lessons about control and independence in managing technology.

HN blogs - 13/10/24

HackerNews blogs newsletter • 0 implied HN points • 13 Oct 24

🕹 Technology Data science

Understanding how beauty influences our lives can help us appreciate its role in society. It’s about recognizing beauty as a meaningful aspect of our existence.
Learning how to use LLMs effectively can streamline the development process. Pairing them with test-driven development (TDD), where tests are written before the code, helps ensure that the generated code is reliable and efficient.
Exploring ways to block unwanted content in web browsers can improve user experience. This is particularly important as technology evolves and new challenges arise.

HN blogs - 7/10/24

HackerNews blogs newsletter • 0 implied HN points • 07 Oct 24

🕹 Technology Data science

Founder mode is a mindset some entrepreneurs adopt to stay focused and motivated. It helps prioritize tasks and manage time effectively while building a business.
Private equity can harm tech companies by pushing for quick profits rather than long-term growth. This can lead to a decline in innovation and company culture.
Mental fitness workshops aim to improve mental well-being and resilience. They often include practical exercises to help participants handle stress better.

HN blogs - 4/10/24

HackerNews blogs newsletter • 0 implied HN points • 04 Oct 24

🕹 Technology Data science

Staying motivated can be tough, but there are ways to break free from a rut and find inspiration again.
Exploring financial advice using AI like ChatGPT can provide new perspectives and ideas for managing money.
Understanding the importance of hiring highly skilled engineers can significantly impact the success of a project or business.

Mastering DataFrame Joins in Spark: A Comprehensive Guide with Examples

DataSketch’s Substack • 0 implied HN points • 23 Jul 24

🕹 Technology Data science

DataFrames in Spark are like tables for big data. They help people work with large datasets efficiently across different computers.
There are several types of joins in Spark, such as inner, left, right, and full outer joins. Each type has a specific way of combining data from two DataFrames.
Setting up Spark is easy. You can install it, write a few lines of code to create DataFrames, and start joining data for analysis.
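
The four join types listed above can be illustrated without a Spark cluster. The following is a minimal plain-Python sketch of the join semantics only, not Spark code; the `join` helper, the sample tables, and the column names are all hypothetical. In PySpark itself the join type is chosen with the `how` argument of `DataFrame.join`.

```python
# Plain-Python illustration of join semantics; this is NOT Spark code.
# The helper, tables, and column names ("id", "name", "dept") are hypothetical.

def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key` using inner/left/right/full semantics."""
    left_cols = {c for row in left for c in row}
    right_cols = {c for row in right for c in row}
    rows = []
    matched_right = set()
    for l in left:
        hits = [(i, r) for i, r in enumerate(right) if r[key] == l[key]]
        for i, r in hits:
            matched_right.add(i)
            rows.append({**l, **r})
        if not hits and how in ("left", "full"):
            # Keep the unmatched left row; right-only columns become None.
            rows.append({**{c: None for c in right_cols}, **l})
    if how in ("right", "full"):
        for i, r in enumerate(right):
            if i not in matched_right:
                # Keep the unmatched right row; left-only columns become None.
                rows.append({**{c: None for c in left_cols}, **r})
    return rows


employees = [
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Bo"},
    {"id": 3, "name": "Cy"},
]
departments = [
    {"id": 1, "dept": "Data"},
    {"id": 2, "dept": "ML"},
    {"id": 4, "dept": "Ops"},
]

for how in ("inner", "left", "right", "full"):
    print(how, len(join(employees, departments, "id", how)))
```

An inner join keeps only ids present on both sides, left and right joins keep every row from one side, and a full outer join keeps every row from both.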

Choosing the Right SQL Technique to Transform Your Data Analysis

DataSketch’s Substack • 0 implied HN points • 24 Jun 24

🕹 Technology Data science

CTEs help make complex queries easier to read and are good for breaking down hierarchical data. But be careful not to use them too much, as they can slow things down.
Subqueries are useful for filtering and aggregating data, but they can be hard to read and slow if used in a complicated way. They work best for specific tasks in a query.
Temporary views are great for creating reusable logic that only lasts for the session. However, they can't be used outside of that session, so plan accordingly.
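
The three techniques above can be compared side by side in a small sketch. It uses SQLite through Python's standard `sqlite3` module as a stand-in engine, which is an assumption and may differ from the engine the post targets; the `orders` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical `orders` table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('ana', 120.0), ('ana', 30.0), ('bo', 45.0), ('cy', 200.0);
""")

# 1) CTE: name an intermediate result, then query it like a table.
cte_rows = conn.execute("""
    WITH totals AS (
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT customer, total FROM totals WHERE total > 100 ORDER BY customer
""").fetchall()

# 2) Subquery: filter each row against an aggregate computed inline.
sub_rows = conn.execute("""
    SELECT customer, amount
    FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY customer
""").fetchall()

# 3) Temporary view: reusable logic that exists only for this connection.
conn.execute("""
    CREATE TEMP VIEW big_orders AS
    SELECT customer, amount FROM orders WHERE amount >= 100
""")
view_rows = conn.execute(
    "SELECT customer FROM big_orders ORDER BY customer"
).fetchall()

print(cte_rows)
print(sub_rows)
print(view_rows)
```

The temporary view disappears when the connection closes, which mirrors the session-only lifetime described above.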

Data Modeling for Data Engineering

DataSketch’s Substack • 0 implied HN points • 18 Mar 24

🕹 Technology Data science

Data modeling is like creating a map for organizing and finding data easily. It helps keep everything tidy and accessible.
There are three types of data models: conceptual, logical, and physical, each serving different levels of detail in planning data structure.
A practical example is organizing a library, where the models help define books, authors, and loans, ensuring everything links and works smoothly.
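
One way to picture the library example is as a logical model written with Python dataclasses. This is only a sketch under assumptions: the field names, the sample records, and the `open_loans_with_author` helper are hypothetical and not taken from the post.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Author:
    author_id: int
    name: str


@dataclass(frozen=True)
class Book:
    book_id: int
    title: str
    author_id: int  # references Author.author_id


@dataclass(frozen=True)
class Loan:
    loan_id: int
    book_id: int  # references Book.book_id
    borrower: str
    due: date
    returned: Optional[date] = None  # None while the book is still out


# Hypothetical sample records, keyed by id so links can be followed.
authors = {1: Author(1, "Example Author")}
books = {10: Book(10, "Example Book", 1)}
loans = [Loan(100, 10, "sam", date(2024, 4, 1))]


def open_loans_with_author(loans, books, authors):
    """Follow Loan -> Book -> Author links for loans not yet returned."""
    result = []
    for loan in loans:
        if loan.returned is None:
            book = books[loan.book_id]
            result.append((book.title, authors[book.author_id].name, loan.borrower))
    return result


print(open_loans_with_author(loans, books, authors))
```

A physical model would go one step further and fix storage details such as table definitions, column types, and indexes.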

Explaining ResNets in simple, plain language

Jon’s Substack • 0 implied HN points • 25 Mar 24

🕹 Technology Data science

ResNets help make deep neural networks easier to train by smoothing the loss landscape. This makes it simpler for optimization algorithms to find the best solutions.
The main idea behind ResNets is to add 'skip connections' between layers, allowing the network to learn identity functions. This means that if a layer isn’t helpful, it won't negatively impact learning.
As networks get deeper, ResNets adjust their weights to limit changes in representations. This keeps the performance consistent, preventing problems like overfitting and improving accuracy.
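
The skip connection can be shown in a few lines. This is a minimal pure-Python sketch, assuming a residual block of the form y = x + F(x) where F is a single linear layer followed by ReLU; real ResNet blocks use convolutions and normalization, and the helper names here are hypothetical.

```python
def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, weights, bias):
    """Compute y = x + F(x), where F is a linear layer followed by ReLU."""
    fx = relu(linear(x, weights, bias))
    return [xi + fi for xi, fi in zip(x, fx)]


x = [1.0, -2.0, 3.0]

# With F's weights and bias at zero, F(x) is zero and the block is the identity.
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3
print(residual_block(x, zero_w, zero_b))

# With identity weights, F(x) = relu(x) is added on top of the input.
eye_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(residual_block(x, eye_w, zero_b))
```

Because the input is always added back, a block whose learned part contributes nothing still passes its input through unchanged.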

Generating a dataset of queries for training and fine-tuning ColPali models on a UFO dataset

machinelearninglibrarian • 0 implied HN points • 23 Sep 24

🕹 Technology Data science

ColPali is a new model that combines text and images to improve how we find documents. It looks at both the words and the visual parts of a page, making it smarter than older text-only methods.
To train ColPali, we need a dataset that pairs document images with questions about what those documents contain. This helps the model learn how to match questions with the right visual information.
Using a special model called Qwen2-VL, we can create specific and relevant queries from images. This can help refine the dataset even more by making sure the questions are useful for retrieving information.

Synthetic dataset generation techniques: generating custom sentence similarity data

machinelearninglibrarian • 0 implied HN points • 23 May 24

🕹 Technology Data science

Large Language Models (LLMs) can help create synthetic datasets for training models, especially where there's a lack of real data. This approach makes it easier to gather specific information needed for tasks like text classification.
Generating sentence similarity data helps in comparing how alike two sentences are. This is useful in areas like information retrieval and clustering.
A structured approach to generating data can improve the quality and relevance of the data produced. Using prompts to control the output can help generate more accurate results for specific training needs.

Tracing Text Generation Inference calls

machinelearninglibrarian • 0 implied HN points • 05 Apr 24

🕹 Technology Data science

To trace text generation calls, you can use Langfuse with OpenAI integration in your code. This allows you to monitor how your text generation model is performing.
You'll need to set up your secret keys and environment variables to connect to the Langfuse service. Make sure to store your sensitive keys securely.
The example provided shows how to make a chat completion call and receive responses from a model. It's a handy way to see how AI can generate text based on user input.

Extracting Insights from Model Cards Using Open Large Language Models

machinelearninglibrarian • 0 implied HN points • 27 Nov 23

🕹 Technology Data science

Model Cards are important for sharing details about machine learning models, but they can vary greatly in format and focus. This makes it hard to know how to find or categorize the information they contain.
There are over 400,000 models on the Hugging Face Hub, and extracting specific details, like the datasets used or evaluation metrics mentioned, could help create clearer guidelines and metadata.
Using open large language models can help annotate and discover key concepts from the diverse data in Model Cards, making it easier to analyze and understand various models and their attributes.

How to do groupby for Hugging Face datasets

machinelearninglibrarian • 0 implied HN points • 18 Sep 23

🕹 Technology Data science

Hugging Face's datasets don't have built-in groupby features, but you can use Polars to handle this. You can load datasets with Polars and perform group operations easily.
Polars allows you to work with large datasets efficiently using lazy evaluation. This means you can process data without needing to load everything into memory all at once.
You can visualize data comparisons after grouping by specific columns, making it easier to understand patterns or insights from the data.

Exploring language metadata for datasets on the Hugging Face Hub

machinelearninglibrarian • 0 implied HN points • 07 Jun 23

🕹 Technology Data science

The Hugging Face Hub provides datasets that can be filtered based on available language metadata. It helps identify which datasets contain specific language information.
There are many languages represented in the datasets, with a total of 1719 unique languages noted. This diversity is important for developing models that support different languages.
Visual tools like bar charts and word clouds can effectively represent language frequencies in datasets. These visuals make it easier to understand the distribution and popularity of different languages.

Dynamically updating a Hugging Face hub organization README

machinelearninglibrarian • 0 implied HN points • 07 Mar 23

🕹 Technology Data science

You can use the huggingface_hub library to automatically create and update a README for your Hugging Face organization. This helps keep your information organized without needing to make manual changes.
By listing and grouping datasets by tasks, it makes it easy to see what datasets are available for different activities. This organization helps others find the resources they need quickly.
Using a templating engine like Jinja2 allows you to create a polished and updated README format. It makes the information visually appealing and easier to understand.

Using Hugging Face AutoTrain to train an image classifier without writing any code.

machinelearninglibrarian • 0 implied HN points • 22 Feb 23

🕹 Technology Data science

You can train an image classifier with Hugging Face AutoTrain without needing to write any code. This makes it easier for people who aren't programmers to use machine learning.
Image classification is useful for organizing images into categories, like sorting book covers into 'useful' or 'not useful'.
The success of your model often depends more on having good training data than on the model itself. Adjusting and improving your training data can lead to better results.

A (very brief) intro to exploring metadata on the Hugging Face Hub

machinelearninglibrarian • 0 implied HN points • 16 Jan 23

🕹 Technology Data science

The Hugging Face Hub is a key place for sharing machine learning models and datasets. Finding the right model or dataset can be tough as the number grows, but using metadata can help make the search easier.
You can interact with the Hugging Face Hub programmatically using the `huggingface_hub` library. This library allows you to list datasets and models easily, and it has various features that can help developers.
Exploring tags associated with models and datasets on the Hub is important. Tags provide additional information about the purpose and compatibility of models, but counting them can be misleading without considering their context.

Label Studio x Hugging Face datasets hub

machinelearninglibrarian • 0 implied HN points • 07 Sep 22

🕹 Technology Data science

Using Label Studio and Hugging Face datasets helps in annotating data more efficiently for machine learning tasks. This makes it easier to move back and forth between annotating, training a model, and refining the process.
The Hugging Face hub allows for easier management of large datasets due to its Git-based structure, which also supports versioning. This means you can track changes and update your dataset as you annotate more data.
Creating a loading script for your dataset helps integrate the data into your machine learning pipeline. You can share the dataset easily while ensuring you only load the necessary data based on your annotations.

Training an object detection model using Hugging Face

machinelearninglibrarian • 0 implied HN points • 16 Aug 22

🕹 Technology Data science

Object detection helps identify and locate objects in images. It goes beyond just knowing if something is present; it tells us where and how many of those things are there.
Hugging Face offers tools for training object detection models easily, especially using the DETR architecture. This lets users leverage pre-trained models and datasets for better performance.
Using the datasets library simplifies the data handling process during training. It allows for quick loading and preparation of data, which is very helpful when tweaking and iterating on models.

Searching for machine learning models using semantic search

machinelearninglibrarian • 0 implied HN points • 26 Jul 22

🕹 Technology Data science

There are a lot of machine learning models available on platforms like Hugging Face, but finding the right one can be tricky. You may need to search through different tags and descriptions to find what fits your need.
Using semantic search can help you find models based on what they can do rather than just their names. This way, you can discover models that are similar even if they use different terms.
Documenting models in README files is important because it helps others understand how to use them. However, not all models have detailed documentation, which can make finding the right one harder.

Combining Hugging Face datasets with dask

machinelearninglibrarian • 0 implied HN points • 20 Jun 22

🕹 Technology Data science

Hugging Face datasets help you load, process, and share data easily, but they can be tricky for exploring data. Using Dask together with Hugging Face makes data analysis smoother, especially for larger datasets.
Dask allows you to run operations in parallel, which is useful if your data can't fit into memory. You can use Dask's different collection types, like dask bag, to process data efficiently by breaking it into smaller chunks.
Dask dataframes work like pandas dataframes, making it easier to perform complex operations. This includes grouping data and calculating averages, which you can visualize just like you would with pandas.

Using 🤗 datasets for image search

machinelearninglibrarian • 0 implied HN points • 13 Jan 22

🕹 Technology Data science

You can use the Hugging Face datasets library to create an image search application easily, allowing you to search images effectively.
The library supports different ways to handle images, like reading from file paths or NumPy arrays, which makes it flexible for usage.
It's important to consider potential biases and performance variability when deploying models for image searches, especially with varied datasets.

TTW Extra #8 🔥: PyCon deep dives a.k.a. "Tutorials". Getting started with Polars, Building your first API with Django, NLP in Python from scratch, Python concurrency 101, ...

Tech Talks Weekly • 0 implied HN points • 30 Oct 24

🕹 Technology Data science

PyCon has offered longer-format talks called 'Tutorials' since 2020, which allow for in-depth learning on various subjects.
There are many great tutorials available on topics like starting with Polars, building APIs with Django, and learning NLP in Python.
The talks are categorized by year and popularity, making it easy to find the most watched ones or specific topics that interest you.

💥 Tech Talks Weekly #21

Tech Talks Weekly • 0 implied HN points • 04 Jul 24

🕹 Technology Data science

This weekly newsletter shares new tech talks from various conferences to keep you updated. It's a great way to discover fresh content on technology topics.
You can subscribe for free and join a community of over 1400 readers. It's easy to unsubscribe if you want, and there's no spam.
Featured talks include important topics like legacy code migrations and deep learning. Watching these talks can help enhance your understanding and skills in tech.

Tech Talks Weekly #16 ☀️☀️☀️

Tech Talks Weekly • 0 implied HN points • 30 May 24

🕹 Technology Data science

Tech Talks Weekly shares recent uploads from multiple tech conferences to help you stay updated.
You can support this initiative by telling others about it and participating in a survey to improve the content.
There are many interesting talks available, covering diverse topics in tech that you can watch to learn more.

TTW Extra #4 🔥: All Python conference talks from 2023 ordered by the number of views

Tech Talks Weekly • 0 implied HN points • 09 Apr 24

🕹 Technology Data science

There are a lot of Python conference talks available from 2023, with many options to choose from. You can find talks on different topics and technologies.
The engagement with these talks is high, with some having over 12,000 views. This shows a strong interest in learning and sharing knowledge within the Python community.
Tech Talks Weekly is building a community around tech talks and encourages sharing with others to help spread the word. Following them on social media can keep you updated on the best talks to watch.

AI Weekly Update - October 18

Handy AI • 0 implied HN points • 18 Oct 24

🕹 Technology Data science

OpenAI has launched a new ChatGPT app for Windows, which allows users to upload files but does not yet have voice features.
NVIDIA has introduced a new AI model called Llama-3.1-Nemotron, which is said to be more powerful and accurate than previous models.
Google’s NotebookLM has added new features for creating personalized AI-generated podcasts, allowing users to customize topics and expertise levels.

Using Genetic Algorithms to solve any problem

The Halfway Point • 0 implied HN points • 26 Apr 24

🕹 Technology Data science

Genetic algorithms are useful tools for solving various problems because they adapt well and can be implemented easily. They help find good solutions, even if those solutions aren't always the absolute best.
When using genetic algorithms, it's important to define three key elements: the system, the cost function, and how the system should change to minimize costs. This helps organize and optimize the problem-solving process.
The DEAP library for Python makes it simple to create and manage genetic algorithms. It provides tools to easily track progress and make the necessary adjustments during the optimization process.
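
The three elements above (system, cost function, and how the system changes) can be sketched with the standard library alone. This example does not use DEAP; it is a hand-rolled illustration under assumptions, where the system is a bit string, the cost is the number of zeros, and all parameter values are arbitrary.

```python
import random


def cost(bits):
    """Cost function: number of zeros (0 means the all-ones target is reached)."""
    return bits.count(0)


def evolve(n_bits=20, pop_size=30, generations=60, mutation_rate=0.05, seed=0):
    """Minimize `cost` over bit strings; returns the best individual and a
    per-generation history of the best cost."""
    rng = random.Random(seed)
    # System: a population of random bit strings.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=cost)
        history.append(cost(pop[0]))
        # Selection: the better half survives unchanged (elitism).
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)
            child = a[:cut] + b[cut:]  # one-point crossover
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = parents + children
    return min(pop, key=cost), history


best, history = evolve()
print(cost(best), history[0], history[-1])
```

Because the surviving parents are never modified, the best cost in `history` cannot get worse from one generation to the next, although reaching the true optimum is not guaranteed.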

Algorithms for Optimization (Explained Simply): Part 2 - Line Search and the Trust Region Method

Photon-Lines Substack • 0 implied HN points • 25 Oct 24

🕹 Technology Data science

The Line Search method helps find a minimum by choosing a direction to step and adjusting step size until a local minimum is reached. It's like walking downhill one small step at a time.
Approximate line search is quicker and doesn’t require finding the perfect step size. Instead, it focuses on taking good enough steps to keep moving closer to the minimum without wasting time.
The Trust Region method keeps steps within a 'trust zone' where the function behaves predictably. If the prediction is accurate, the zone expands; if not, it shrinks, helping to avoid large, risky moves.
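
Both ideas can be sketched in one dimension. The code below is a minimal illustration, assuming steepest descent with a backtracking (Armijo) line search and a simplified trust-region radius update; the constants, thresholds, and function names are conventional choices made for this example and are not taken from the post.

```python
def backtracking_line_search(f, grad, x, alpha=1.0, shrink=0.5, c=1e-4):
    """Approximate line search: shrink the step until the Armijo
    sufficient-decrease condition holds."""
    g = grad(x)
    direction = -g  # steepest-descent direction in 1-D
    while f(x + alpha * direction) > f(x) + c * alpha * g * direction:
        alpha *= shrink
    return alpha


def minimize(f, grad, x0, tol=1e-8, max_iter=100):
    """Gradient descent where each step size comes from the line search."""
    x = x0
    for _ in range(max_iter):
        if abs(grad(x)) < tol:
            break
        step = backtracking_line_search(f, grad, x)
        x = x - step * grad(x)
    return x


def update_trust_radius(radius, actual_reduction, predicted_reduction,
                        shrink=0.25, grow=2.0, max_radius=10.0):
    """Simplified trust-region update: compare the real decrease in f with
    the decrease the local model predicted."""
    rho = actual_reduction / predicted_reduction
    if rho < 0.25:   # model was a poor predictor: shrink the zone
        return radius * shrink
    if rho > 0.75:   # model was accurate: expand the zone
        return min(radius * grow, max_radius)
    return radius


def f(x):
    return (x - 3.0) ** 2 + 1.0


def grad(x):
    return 2.0 * (x - 3.0)


x_min = minimize(f, grad, x0=0.0)
print(x_min)
```

The line search only needs a step that is good enough, which is why it halves the step instead of solving for the exact minimizer along the direction.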

AI Grandmaster, Pattern Recognition, Pruning, and Open-Source Innovation

ppdispatch • 0 implied HN points • 25 Oct 24

🕹 Technology Data science

A new chess AI shows that it can play at a grandmaster level just by recognizing patterns, not by searching for moves like traditional methods.
Transformers are now helping computers understand charts better, but there are still some challenges to overcome, like reading text in images.
An open-source video generation tool called Allegro competes with commercial options, offering good quality and revealing how it was made so anyone can understand or use it.

Quantization In Depth [DRAFT]

Zela Labs • 0 implied HN points • 11 Jul 24

🕹 Technology Data science

Quantization helps in converting complex data into simpler 'tokens' that are easier to work with. These tokens can be used in models just like words in language models.
There are different quantization approaches, like Vector Quantization and Group Vector Quantization, which can improve how data is represented and processed. Each method has its own way of managing and encoding the data.
Some newer strategies, like Lookup-Free Quantization (LFQ) and Finite Scalar Quantization (FSQ), use fixed values or unique arrangements to make the quantization process more efficient and effective. They simplify how data is processed without losing important information.
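
The basic vector quantization step, turning a continuous vector into a discrete token, can be sketched as a nearest-neighbour lookup. This is a minimal illustration with a hypothetical hand-written codebook; in practice the codebook is learned, and the variants named above differ in how they structure or avoid it.

```python
def quantize(vector, codebook):
    """Return the index (token) of the nearest codebook entry,
    measured by squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


# Hypothetical 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
vectors = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.9]]

# Encode: each vector becomes a single integer token.
tokens = [quantize(v, codebook) for v in vectors]

# Decode: a token maps back to its codebook entry (a lossy reconstruction).
reconstructed = [codebook[t] for t in tokens]

print(tokens)
print(reconstructed)
```

The tokens can then be fed to a sequence model in the same way word ids are.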

Retrieval Augmented Generation in Practice: Building Search for Connected Notes

Nick Savage • 0 implied HN points • 21 Nov 24

🕹 Technology Data science

Retrieval Augmented Generation (RAG) helps software retrieve relevant information and generate new ideas from it, using numeric vector representations of text called embeddings. This makes searching for connected notes easier and more powerful.
Chunking and reranking improve the quality of search results. By breaking down text into smaller pieces and reassessing them, users can find more relevant information quickly.
Zettelgarden's graph structure has potential for creating deeper connections between notes. This could lead to more meaningful insights, not just basic search results.
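
The retrieval half of this pipeline, chunking followed by ranking against a query, can be sketched with the standard library. This is a toy illustration under assumptions: the "embedding" is just a word-count vector, whereas a real system would use a learned embedding model and often a separate reranking step; the notes and function names are hypothetical.

```python
import math
from collections import Counter


def chunk(text, size=8):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: lowercase word counts. Real systems use a learned model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query, notes, top_k=2):
    """Chunk every note, then rank chunks by similarity to the query."""
    chunks = [c for note in notes for c in chunk(note)]
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical notes.
notes = [
    "Spaced repetition helps memory by reviewing notes at growing intervals",
    "Graph structure links notes so related ideas can be found together",
]

results = search("graph links between notes", notes)
print(results[0])
```

Smaller chunks make the match more precise, at the cost of losing some of the surrounding context in each result.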

Black boxes, common sense, and the cost of mistakes

Expand Mapping with Mike Morrow • 0 implied HN points • 13 Nov 24

🕹 Technology Data science

Machines today excel at specific tasks but lack general intelligence. They often produce outcomes that seem strange or unexpected even though they are based on data.
Black-box machine learning models can provide great results, but they are hard to understand. In contrast, rules-based systems are easier to explain but often perform worse.
Mistakes in AI can lead to serious issues, especially in safety-critical applications. There's an ongoing challenge in balancing the performance of machine learning with the clarity of rules-based systems.

The key concepts in web scraping

serious web3 analysis • 0 implied HN points • 16 Oct 24

🕹 Technology Data science

Every web scraping job starts with one or more URLs, called parent URLs, where the scraper begins to look for data.
Crawling helps the scraper find additional pages with the actual information needed, going beyond just the starting page.
After crawling, the data is extracted into a structured format, and filtering can be applied to narrow down the results based on specific criteria.
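
The three stages above (start from parent URLs, crawl, then extract and filter) can be sketched without any network access. The example below is a minimal illustration using Python's standard `html.parser`; the in-memory `SITE` dictionary stands in for real HTTP fetching, and all page contents and names are hypothetical.

```python
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collect outgoing links and the page title from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical in-memory "site"; a real scraper would fetch pages over HTTP.
SITE = {
    "/": '<title>Home</title><a href="/posts/1">one</a><a href="/posts/2">two</a>',
    "/posts/1": '<title>Spark joins</title><a href="/">home</a>',
    "/posts/2": '<title>Cooking tips</title><a href="/posts/1">prev</a>',
}


def crawl(parent_urls, fetch):
    """Breadth-first crawl from the parent URLs, extracting one record per page."""
    seen, queue, records = set(), list(parent_urls), []
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        parser = LinkAndTitleParser()
        parser.feed(fetch(url))
        records.append({"url": url, "title": parser.title})  # structured output
        queue.extend(parser.links)  # newly discovered pages to visit
    return records


records = crawl(["/"], SITE.__getitem__)

# Filtering: keep only the records that match a criterion.
spark_posts = [r for r in records if "spark" in r["title"].lower()]
print(spark_posts)
```

The `seen` set keeps the crawler from revisiting pages that link back to each other.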

Generative A-Eye #12 - 4th Oct,2024

Martin’s Newsletter • 0 implied HN points • 04 Oct 24

🕹 Technology Data science

Generative avatars in AI are expected to struggle with expressing complex emotions. Most current models depend on limited emotional recognition methods, which may not capture the full range of human feelings.
The field of human image synthesis needs better data to improve how emotions are generated in avatars. Recent research introduced a new metric to help assess 3D facial expressions based on emotional descriptions.
New methods are being developed to enhance the quality of AI-generated images. A recent innovation can increase the accuracy of image prompts without sacrificing the visual quality of the output.

Generative A-Eye #4 - 19th Sept,2024

Martin’s Newsletter • 0 implied HN points • 19 Sep 24

🕹 Technology Data science

A new method called GaussianHeads can create realistic and dynamic 3D models of human heads using video inputs. This helps capture facial expressions and head movements in real-time.
The research uses a system that combines CGI techniques to enhance the quality of deepfake and human avatar production. It aims to improve how we animate faces based on video footage.
Another interesting paper evaluated AI models by collecting 2 million votes to gauge their effectiveness. This shows the growing need for thorough testing in AI development.

The Future of AI, Product, and Agentic Software

domsteil • 0 implied HN points • 23 Nov 24

🕹 Technology Data science

AI evaluations need to go beyond just accuracy. They should focus on how helpful the AI is to users and if it meets their needs effectively.
High-performance teams thrive on collaboration and quick feedback. Effective product managers should remove barriers and encourage teamwork to create innovative solutions.
Agentic software is changing how businesses operate by using smart pricing models that reflect the value AI delivers. Companies must start with smaller clients to build a strong foundation for growth.

Scaling AI in the public sector and Xmas AI chats in the pub

RSS DS+AI Section • 0 implied HN points • 29 Nov 24

🕹 Technology Data science

There is an important event happening on December 5th in London about scaling AI in the public sector.
Joe Hill will talk about funding, building, and evaluating AI in public services.
After the talk, attendees can join for discussions and drinks at a nearby pub.