The hottest Datasets Substack posts right now

Import AI 354: Distributed LLM inference; CCP-approved dataset; AI scientists

Import AI • 1278 implied HN points • 25 Dec 23

Distributed inference is becoming easier with AI collectives, allowing small groups to work with large language models more efficiently and effectively.
Automation in scientific experimentation is advancing with large language models like Coscientist, showcasing the potential for LLMs to automate parts of the scientific process.
The Chinese government's creation of a CCP-approved dataset for training large language models reflects a move toward LLMs aligned with politically approved ideology, a distinct approach to LLM training.

The latest open artifacts (#6): Reasoning models, China's lead in open-source, and a growing multimodal space

Democratizing Automation • 261 implied HN points • 27 Jan 25

🕹 Technology AI Models Open Source Datasets Reasoning Geopolitics

Chinese AI labs are now leading the way in open-source models, surpassing their American counterparts. This shift could have significant impacts on global technology and geopolitics.
A variety of new AI models and datasets are emerging, particularly focused on reasoning and long-context capabilities. These innovations are making it easier to tackle complex tasks in coding and math.
Companies like IBM and Microsoft are quietly making strides with their AI models, showing that many players in the market are developing competitive technology that might not get as much attention.

Import AI 360: Guessing emotions; drone targeting dataset; frameworks for AI alignment

Import AI • 379 implied HN points • 12 Feb 24

🕹 Technology AI Datasets Security Language Models Ethics

Teaching AI to understand complex human emotions like joy, surprise, and anger can help in applications like surveillance and advertising.
AI systems, like other software, are vulnerable to attacks, as shown by a demonstration breaking MoE models with a buffer overflow attack.
Frameworks are being developed to ensure AI systems align with diverse human values, considering various perspectives and how to measure alignment.
Taken together, these items show AI research advancing on emotion recognition, system security, and value alignment.

Import AI 336: Financialized AI; public and elite AI opinion; one million insects.

Import AI • 519 implied HN points • 14 Aug 23

🕹 Technology AI Finance Research Datasets Digital Life

The financialization of AI is increasing, with companies finding new ways to fund AI projects through unconventional means like debt collateralized against GPUs.
AI benchmarks are being solved faster, which indicates either accelerating AI progress or the growing difficulty of building good benchmarks.
Public opinion, reflected in a poll, shows significant concerns about AI development and regulation, contrasting with elite opinions that emphasize rapid AI advancement.

RLHF learning resources in 2024

Democratizing Automation • 435 implied HN points • 12 Jan 24

🕹 Technology Research Code Models Datasets

The post shares a categorized list of resources for learning about Reinforcement Learning from Human Feedback (RLHF) in 2024.
The resources include videos, research talks, code, models, datasets, evaluations, blog posts, and other related materials.
The aim is to provide a variety of learning tools for individuals with different learning styles interested in going deeper into RLHF.

Artifacts 5: Mini RLHF book underway, Qwen 2.5, video datasets, audio models, and more

Democratizing Automation • 63 implied HN points • 24 Oct 24

🕹 Technology AI Models Datasets Machine Learning Speech Recognition

A new textbook on RLHF is being written; readers can learn from it and help improve the content through feedback.
Qwen 2.5 models are showing strong performance, competing well with models like Llama 3.1, but have less visibility in the community.
Several new models and datasets have been released, including some interesting multimodal options that can handle both text and images.

Multimodal LM roundup: Unified IO 2, inputs and outputs, Gemini, LLaVA-RLHF, and RLHF questions

Democratizing Automation • 126 implied HN points • 10 Jan 24

🕹 Technology AI Multimodal models Datasets

Multimodal models are advancing, extending information-processing capabilities by incorporating diverse inputs and outputs.
Unified IO 2 introduces a novel autoregressive multimodal model capable of generating and understanding images, text, audio, and action through shared semantic space processing.
LLaVA-RLHF explores new factually augmented RLHF techniques and datasets to bridge misalignment between different modalities and enhance multimodal models.

Why reward models are key for alignment

Democratizing Automation • 110 implied HN points • 14 Feb 24

🕹 Technology AI Machine Learning Models Datasets Tools

Reward models provide a unique way to assess language models without relying on traditional prompting and computation limits.
Constructing comparisons with reward models helps identify biases and viewpoints, aiding in understanding language model representations.
Generative reward models offer a simple way to classify preferences in tasks like LLM evaluation, providing clarity and performance benefits in the RL setting.

Because there's not enough fun datasets

Jacobo’s Substack • 1 HN point • 23 Jun 24

🕹 Technology Datasets Open Source Data Collection GitHub Data Analysis

The shared dataset tracks the evolution of PSG ticket prices over the 2023-2024 season, collected by scraping the Ticketplace marketplace.
The data format is simple, featuring columns for timestamp, fixture, category, quantity, and price, providing a basis for analyzing ticket pricing trends and making predictions.
The release of this dataset is aimed at facilitating student projects and filling the gap for attractive, open-source datasets for data analysis.
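
As a rough illustration of the kind of analysis the column layout above allows, here is a minimal Python sketch that computes a quantity-weighted mean price per fixture. The lowercase column names, the timestamp format, the reading of price as a per-ticket price, and every sample row are assumptions made up for this example; none of it is taken from the actual dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows, invented only to illustrate the column layout;
# they are NOT taken from the actual dataset.
SAMPLE_CSV = """timestamp,fixture,category,quantity,price
2023-09-01T10:00:00,PSG vs Team A,Cat 1,2,100.0
2023-09-02T10:00:00,PSG vs Team A,Cat 1,1,120.0
2023-09-01T10:00:00,PSG vs Team B,Cat 2,4,50.0
"""


def mean_price_by_fixture(csv_text):
    """Return the quantity-weighted mean price for each fixture."""
    totals = defaultdict(float)  # sum of price * quantity per fixture
    counts = defaultdict(int)    # sum of quantity per fixture
    for row in csv.DictReader(io.StringIO(csv_text)):
        qty = int(row["quantity"])
        # Assumption: "price" is the price of a single ticket.
        totals[row["fixture"]] += float(row["price"]) * qty
        counts[row["fixture"]] += qty
    return {fixture: totals[fixture] / counts[fixture] for fixture in totals}


prices = mean_price_by_fixture(SAMPLE_CSV)
for fixture, price in prices.items():
    print(f"{fixture}: {price:.2f}")
```

Weighting by quantity keeps a listing of many tickets from counting the same as a single-ticket listing; an unweighted mean per listing would be an equally reasonable starting point for a student project.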

Datasets as Imagination

Reboot • 16 implied HN points • 02 Sep 23

🎨 Art & Illustration Datasets AI Art Representation

Well-made datasets are considered works of art.
Current datasets often exploit artists' creative work.
Artist datasets provide new opportunities for creativity and income, while offering a more intentional and ethical approach to AI art

Can LLaMA approve credit card applications? (Part 1)

followfox.ai’s Newsletter • 4 HN points • 03 May 23

🕹 Technology AI Fintech Machine Learning Datasets APIs

LLaMA models of size 13B or above might be better than random chance at evaluating credit card approvals.
Smaller LLaMA models (7B) didn't show improvement over random chance.
Instruction-finetuning didn't significantly enhance model performance.

A Large Language Model for Healthcare | NHS-LLM and OpenGPT

AI for Healthcare • 2 HN points • 10 May 23

🕹 Technology Artificial Intelligence Machine Learning Healthcare Datasets Model Training

OpenGPT is a framework for producing domain-specific language models.
NHS-LLM is a conversational model for healthcare created using OpenGPT.
Creating instruction-based datasets and fine-tuning models are crucial steps in building large language models for healthcare.

Performance of Domain-Wall Encoding for Quantum Annealing

Quantum Formalism • 0 implied HN points • 08 Mar 21

🕹 Technology Quantum Computing Datasets Code

The talk on the performance of Domain-Wall Encoding for Quantum Annealing provides free access to datasets and code via a download link.
Those interested in course certification are encouraged to explore the paper and possibly challenge themselves by using it for practical work.
Attendees of the session have the opportunity to ask questions directly to Nick, enhancing the learning experience and understanding of the topic.

Benchmark datasets for learning-to-rank

Simplicity is SOTA • 0 implied HN points • 11 Mar 24

🕹 Technology Machine Learning Datasets Research Algorithms

Benchmark datasets are crucial in ML literature, providing a standard for evaluating new methods and influencing research directions.
In learning-to-rank, the Yahoo and Microsoft datasets are prominent, with the Yahoo dataset being widely used in notable papers.
When writing a paper using benchmark datasets, researchers must choose ML algorithms, consider user behavior, generate initial rankings, and evaluate performance with metrics like NDCG.
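
For readers unfamiliar with NDCG, the sketch below shows one common formulation in Python, using exponential gain (2^rel - 1) and a log2 position discount. This is an assumption about the variant; the post may use a different gain or cutoff, and the function names here are hypothetical.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain over the first k graded relevance labels."""
    rels = relevances[:k] if k is not None else relevances
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels))


def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    # By convention here, a query with no relevant documents scores 0.
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in the order the ranker returned the documents.
print(ndcg([3, 2, 1, 0]))  # already ideally ordered
print(ndcg([0, 1]))        # the only relevant document is ranked second
```

A ranking that is already sorted by relevance scores 1.0, and the score drops as relevant documents are pushed further down the list.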