The hottest Embeddings Substack posts right now

And their main takeaways
Recommender systems • 76 implied HN points • 23 Feb 26
  1. Bluesky builds Discover personalization from fixed post embeddings (BLIP2) plus broad topic labels and finer HDBSCAN clusters to track user interests, after an initial two‑tower retrieval approach didn’t work out.
  2. PinnerSage captures diverse short‑ and long‑term interests by clustering a user’s recent interactions into many medoids, scoring each cluster with a time‑decay importance, and using those medoids as weighted seeds for ANN candidate retrieval.
  3. Multiple per‑user medoids ease retrieval but complicate ranking, so the plan is to use PinnerSage for candidate generation and then adopt a transformer (PinnerFormer) to create a single user embedding for efficient, accurate ranking.
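The PinnerSage takeaways above (cluster medoids weighted by a time-decay importance) can be sketched in a few lines. This is a minimal illustration, not Pinterest's implementation: the function names, the squared Euclidean distance, and the `decay` value are assumptions chosen for the example, and the clusters are taken as already computed.

```python
import math

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def medoid(points):
    # The medoid is the cluster member with the smallest total
    # distance to all other members (an actual item, not an average).
    return min(points, key=lambda p: sum(sq_dist(p, q) for q in points))

def importance(timestamps, now, decay):
    # Each interaction contributes exp(-decay * age), so recent
    # activity counts for more than old activity.
    return sum(math.exp(-decay * (now - t)) for t in timestamps)

def summarize(clusters, now, decay=0.01):
    # clusters: list of clusters, each a list of (embedding, timestamp).
    # Returns (medoid, importance) pairs, most important first.
    summaries = []
    for members in clusters:
        embeddings = [e for e, _ in members]
        timestamps = [t for _, t in members]
        summaries.append((medoid(embeddings), importance(timestamps, now, decay)))
    return sorted(summaries, key=lambda s: s[1], reverse=True)

# Hypothetical data: one old cluster with three items, one fresh cluster.
clusters = [
    [([0.0, 0.0], 0), ([1.0, 0.0], 0), ([10.0, 0.0], 0)],
    [([5.0, 5.0], 100)],
]
seeds = summarize(clusters, now=100, decay=0.1)
```

Each returned medoid would then serve as a weighted seed for an approximate nearest neighbor lookup, with the importance score deciding how many candidates to draw from it.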
Vasu’s Newsletter • 13 implied HN points • 11 Jan 26
  1. Large language models process tokens in parallel and need positional encoding to know word order; without it, reordered sentences look the same to the model.
  2. Positional encodings (like sinusoidal functions or methods such as RoPE and ALiBi) give each position a unique vector that’s combined with token embeddings, so the same word at different positions produces different vectors and relative distances can be inferred.
  3. Positional encoding only makes order visible — it doesn’t compute relationships or context; deciding which words matter to each other is handled next by self-attention.
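The sinusoidal variant mentioned above can be sketched as follows, assuming the standard formulation in which even dimensions use a sine and odd dimensions a cosine of `position / 10000^(2i/d_model)`. The function names and the toy embedding are illustrative, not taken from the post.

```python
import math

def sinusoidal_encoding(position, d_model):
    # Build one positional vector: sin on even indices, cos on odd ones,
    # with wavelengths that grow geometrically across dimensions.
    vec = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

def add_position(token_embedding, position):
    # Combine a token embedding with its positional vector by addition.
    pe = sinusoidal_encoding(position, len(token_embedding))
    return [t + p for t, p in zip(token_embedding, pe)]

# Hypothetical 4-dimensional embedding for a single word.
dog = [0.2, -0.1, 0.4, 0.3]
dog_at_0 = add_position(dog, 0)
dog_at_3 = add_position(dog, 3)
```

The same word yields different vectors at positions 0 and 3, which is what lets later layers tell "dog bites man" from "man bites dog".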
Things I Think Are Awesome • 216 implied HN points • 15 Oct 23
  1. The post discusses using an IKEA-diagram LoRA for SDXL for fun, generating diagrams of impossible things like 'happiness' and 'poetry.'
  2. The diagrams in the post show steps to make a robot, angel, and golem, each with unique and interesting instructions.
  3. The post also touches on AI tools for code and reinforcement learning from an AI perspective.
Technology Made Simple • 159 implied HN points • 10 Oct 23
  1. Multi-modal AI integrates multiple types of data in the same training process, allowing models to represent data in a common n-dimensional space.
  2. Multi-modality adds an extra dimension to data, expanding the search space exponentially, enabling more diverse and powerful AI applications.
  3. While multi-modality enhances model performance, it does not solve fundamental issues with AI models like GPT, and simpler technologies may be more effective for certain use-cases.
TheSequence • 182 implied HN points • 03 Apr 23
  1. Vector similarity search is essential for recommendation systems, image search, and natural language processing.
  2. Vector search finds the vectors most similar to a query vector, using distance or similarity measures such as L1, L2, and cosine similarity.
  3. Common vector search strategies include linear search, space partitioning, quantization, and hierarchical navigable small world (HNSW) graphs.
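The three measures and the linear-search baseline named above can be sketched in plain Python. This is a minimal illustration with hypothetical function names and data; a production system would typically use one of the approximate strategies listed (partitioning, quantization, or HNSW) instead of scanning every vector.

```python
import math

def l1(a, b):
    # Manhattan distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    # Euclidean distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine of the angle between a and b; higher means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def linear_search(query, vectors, k=1):
    # Brute force: compare the query against every stored vector and
    # return the indices of the k closest by L2 distance.
    order = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    return order[:k]

# Hypothetical toy index of three 2-dimensional vectors.
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
nearest = linear_search([0.9, 1.2], vectors, k=2)
```

Linear search is exact but costs one distance computation per stored vector, which is the cost the approximate strategies trade a little accuracy to avoid.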
Simplicity is SOTA • 2 HN points • 27 Mar 23
  1. The concept of 'embedding' in machine learning has evolved and become widely used, replacing terms like vectors and representations.
  2. Embeddings can be applied to various types of data, come from different layers in a neural network, and are not always about reducing dimensions.
  3. Defining 'embedding' has become challenging due to its widespread use, but the essence is about learned transformations that make data more useful.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots • 0 implied HN points • 03 Jan 24
  1. Synthetic data can be used to create high-quality text embeddings without needing human-labeled data. This means you can generate lots of useful training data more easily.
  2. This study shows that it's possible to create diverse synthetic data by applying different techniques to various language and task categories. This helps improve the quality of text understanding across many languages.
  3. Using large language models like GPT-4 for generating synthetic data can save time and effort. However, it’s also important to understand the limitations and ensure data quality for the best results.