The hottest Data Analysis Substack posts right now

And their main takeaways

Speed to Search Success: Synonyms

Talking to Computers: The Email • 0 implied HN points • 14 Jun 24

🕹 Technology Data Analysis

Using synonyms in search helps users find what they need faster. It allows them to use their own words instead of worrying about exact terms.
Creating synonyms can be tricky, but observing how users search can help build a better list. Watching what terms people actually use is more effective than guessing.
While synonyms cover many cases, they struggle with specific long terms. For more complex searches, vector search technology might be a better solution.
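
The synonym approach summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the post's implementation: the `SYNONYMS` table, the sample documents, and the helper names `expand_query` and `matches` are all hypothetical, and a real list would be built from observed user queries.

```python
# Hypothetical synonym table; in practice it would come from watching real searches.
SYNONYMS = {
    "sofa": {"couch", "settee"},
    "couch": {"sofa", "settee"},
    "tv": {"television"},
}


def expand_query(query: str) -> list[set[str]]:
    """Return one set of acceptable terms per query token."""
    return [{tok} | SYNONYMS.get(tok, set()) for tok in query.lower().split()]


def matches(query: str, document: str) -> bool:
    """True when every query token, or one of its synonyms, appears in the document."""
    doc_tokens = set(document.lower().split())
    return all(group & doc_tokens for group in expand_query(query))


# Made-up catalogue entries, for illustration only.
docs = ["Grey three-seat couch", "Wall-mounted television bracket", "Oak coffee table"]
hits = [d for d in docs if matches("sofa", d)]
print(hits)
```

Because the lookup works one token at a time, a long multi-word phrase has no entry to match against, which is the kind of gap the summary suggests vector search might cover.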

The Optimal Amount of Irrelevant Search Results is Non-Zero

Talking to Computers: The Email • 0 implied HN points • 22 Apr 24

🕹 Technology Data Analysis

Sometimes, it's okay to have a few irrelevant search results mixed in with the good ones. This balance can help show more options, even if some aren't what you wanted.
Businesses often choose to include a small number of unrelated items in search results. This helps them find a middle ground between showing only perfect matches and potentially missing out on useful items.
In AI systems, occasional mistakes or 'hallucinations' can spark creativity. It's about finding the right balance for the situation.

How to do groupby for Hugging Face datasets

machinelearninglibrarian • 0 implied HN points • 18 Sep 23

🕹 Technology Data Analysis

Hugging Face datasets have no built-in groupby, but Polars can fill the gap. You can load a dataset with Polars and run group operations on it easily.
Polars allows you to work with large datasets efficiently using lazy evaluation. This means you can process data without needing to load everything into memory all at once.
You can visualize data comparisons after grouping by specific columns, making it easier to understand patterns or insights from the data.

Inflation data disappoints, notes on Crunchbase's AI strategy, and the latest OpenAI math

Alex's Personal Blog • 0 implied HN points • 10 Oct 24

💰 Finance Data Analysis

September's inflation data showed a 0.2% rise, with the yearly change at 2.4%. This suggests some ongoing economic pressure.
Crunchbase is focusing on AI by enhancing its data tools. They introduced AI-powered search features to improve access to their extensive data.
OpenAI is projected to have significant cash losses but could still become profitable by 2029 with a strong revenue base. The risks of high spending in this sector are considerable.

Introducing Exchange Flow Metrics

Coin Metrics' State of the Network • 0 implied HN points • 22 Oct 24

🔮 Crypto Data Analysis

New metrics help track Bitcoin and Ethereum flows to and from exchanges. This data can show how much people are buying or selling and help understand the market.
Miners have recently been sending more Bitcoin to exchanges. They may want to secure profits before changes in Bitcoin rewards.
Crypto.com is gaining a larger share of the Bitcoin market lately. By looking at trading volumes and flow data, we can tell if market activity is genuine or just fake trades.

The PacBio Vega Chips

ASeq Newsletter • 0 implied HN points • 12 Nov 24

🕹 Technology Data Analysis

The PacBio Vega Chips are similar to the Revio chips, but they provide much less data. This means they might not be as powerful for certain tasks.
The data from the Vega chips is available for analysis, and people can check it out for deeper understanding.
This information is part of a subscription service, which means you can get more insights if you become a paid member.

Generative A-Eye #2 - 17th Sept, 2024

Martin’s Newsletter • 0 implied HN points • 17 Sep 24

🕹 Technology Data Analysis

The best day for submitting new AI research papers tends to be Tuesday. This timing is likely chosen to catch attention after the weekend.
This year has seen fewer exciting advancements in AI-based human synthesis, with technologies being reused rather than creating entirely new concepts.
New research is focusing on better facial expression recognition and human reconstruction from single images, showing promise in areas like understanding micro-emotions.

Analyzing PG essays in the semantic space

Experiments with NLP and GPT-3 • 0 implied HN points • 09 Nov 24

🕹 Technology Data Analysis

The writing style of the PG essays has shifted from a smooth, flowing approach to a more structured, geometric style in 2024.
There are sharper transitions between ideas now, making it clear when topics change.
The points made in the writing are more organized into distinct clusters, suggesting a more deliberate way of presenting ideas.

sqlmesh cube_generate build part 1

davidj.substack • 0 implied HN points • 17 Dec 24

🕹 Technology Data Analysis

There's a new command called `sqlmesh cube_generate` that helps build models for data analysis. It's designed to make working with data easier for users.
The tool outputs useful information in a structured format, which includes joins and fields for data analysis. This makes it simple to understand how the data connects.
Even if there are challenges with complex data types, the output is still effective and can be enhanced using AI, showing there's room for creativity in data modeling.

Thoughts on Oddly Specific Versioning in Observability

A Small, Good Thing • 0 implied HN points • 30 Dec 24

🕹 Technology Data Analysis

Many people just want basic monitoring tools that are easy to use and affordable. They care more about practical solutions than getting into complex observability concepts.
There's a balance between reliability, shipping speed, and team well-being that needs to be carefully managed. It's important not to sacrifice too much reliability just to be fast.
The focus should be on delivering a cost-effective way to monitor systems, rather than just aiming for the latest version of observability. It's essential to figure out who will handle the work involved.

New Features: Trading Journal & Bid/Ask Spread

Theory A : Visualize Value Investing • 0 implied HN points • 14 Jan 25

💰 Finance Data Analysis

A new trading journal feature helps you see all your open positions in one place. This makes it easier to keep track of different option contracts and their expiration dates.
There's improved bid-ask data with a new system that's more accurate. You can now see where the current price is in relation to your contracts with a color-coded line.
The free access to options data has been extended from 30 days to 180 days. This gives you more time to analyze market trends without needing a paid subscription.

Statistical Concepts Useful in Life

Kartick’s Blog • 0 implied HN points • 21 Jan 25

🚌 Education Data Analysis

Variance helps us understand risk in different jobs. A steady job is low risk, while a startup can be very unpredictable.
The median is a strong way to find a typical value because it's not easily affected by extreme numbers. So, when data is messy, the median usually gives a better answer than the mean.
To get better estimates, look at a lot of data over time. More data usually means less error, helping you make smarter decisions.
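
The point about the median resisting extreme values can be checked with Python's standard `statistics` module. The salary figures below are invented for illustration and do not come from the post.

```python
import statistics

# Invented salaries; the last value added is a deliberate extreme outlier.
salaries = [42_000, 45_000, 47_000, 50_000, 52_000]
with_outlier = salaries + [2_000_000]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"median: {median_before:,.0f} -> {median_after:,.0f}")
```

One extreme value drags the mean far from every typical salary, while the median barely moves.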

Not All Zeros Are Equal

Nano Thoughts • 0 implied HN points • 20 Jan 25

🔬 Science Data Analysis

Not all zeros in data mean the same thing. Sometimes a zero indicates that something was never there; other times it means something was simply missed.
Zero inflation happens when there's lots of data and many readings come back as zero. This can make it hard to understand what's really going on behind those zeros.
There are different methods to deal with zeros in data, like checking if they are real or just unnoticed signals. Choosing the right method is important to get accurate insights.

Randomness, dodging phone bans & astrological strategies

The Strategy Toolkit • 0 implied HN points • 27 Jan 25

🕹 Technology Data Analysis

People expect randomness to seem chaotic, but true randomness can appear ordered. This misunderstanding affects how we perceive things like music playlists.
Users often complain about problems with shuffle algorithms, thinking they should never see clusters of songs from the same artist. But statistically, that can happen and is actually normal.
Our brains are wired to look for patterns, making us think randomness should behave in a way that fits our expectations, rather than how it actually works.
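
A quick simulation shows how often a fair shuffle puts two tracks by the same artist back to back. The playlist shape (5 artists with 4 tracks each) and the function names are assumptions made for this sketch, not details from the post.

```python
import random


def has_artist_cluster(playlist: list[str]) -> bool:
    """True when two consecutive tracks share an artist."""
    return any(a == b for a, b in zip(playlist, playlist[1:]))


def cluster_rate(tracks_per_artist: int, artists: int, trials: int, seed: int = 0) -> float:
    """Fraction of uniformly random shuffles that contain at least one cluster."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    # Each entry stands for one track, labelled only by its artist.
    playlist = [f"artist_{i}" for i in range(artists) for _ in range(tracks_per_artist)]
    clustered = 0
    for _ in range(trials):
        rng.shuffle(playlist)
        clustered += has_artist_cluster(playlist)
    return clustered / trials


rate = cluster_rate(tracks_per_artist=4, artists=5, trials=2_000)
print(f"shuffles with a same-artist pair: {rate:.0%}")
```

With a playlist of this shape, most fair shuffles contain at least one same-artist pair, so seeing a cluster is the expected outcome rather than evidence of a broken algorithm.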

Roche Nanopore: Instrument Cost

ASeq Newsletter • 0 implied HN points • 27 Feb 25

🔬 Science Data Analysis

Roche is working on new nanopore sequencing technology, focusing on how much the instruments will cost to produce. Understanding these costs is important for the technology's success.
The nanopore sequencing process involves collecting a large amount of data quickly, which means the data rates are extremely high. This could lead to challenges in storing and processing such vast amounts of information.
Since the raw data volume is so large, it's unlikely that most users will store it all. Instead, they will probably need to focus on analyzing only the most crucial information collected.

PromethION 84Gb/Flowcell

ASeq Newsletter • 0 implied HN points • 09 Jun 25

🕹 Technology Data Analysis

The PromethION flowcell has an average output of about 84Gb per run. This is important for understanding how much data you can expect.
In comparison, the PacBio flowcell seems to produce higher-quality data, with an output of around 120-150Gb. This could make it a better option for some users.
Cost per gigabyte is lower for PacBio, making it potentially more affordable when analyzing large amounts of data.

SQL in Hex

Expand Mapping with Mike Morrow • 0 implied HN points • 14 Jul 25

🕹 Technology Data Analysis

You can choose how SQL query results are stored in Hex, either in memory or in the database. This affects how quickly you can run follow-up queries.
There are two types of SQL commands in Hex: one that queries directly from the database and another that queries from a local in-memory dataframe. This choice can impact how your data is used.
Hex allows you to chain SQL queries, which makes handling complex tasks easier. However, you need to be aware of where each query pulls data from to avoid surprises.

ML and questions of why (causation)

Expand Mapping with Mike Morrow • 0 implied HN points • 14 Aug 25

🕹 Technology Data Analysis

Supervised machine learning helps us understand how inputs relate to outputs, but just because two things move together doesn't mean one causes the other.
Experiments are the best way to prove that one thing causes another, but we can also make educated guesses using causal diagrams, which work like trees showing how different factors connect.
Machine learning models are great at predictions but aren't designed to show cause and effect; we can use them to help create clearer models for understanding these relationships.
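
The gap between correlation and causation can be illustrated with a small simulation. The scenario is invented for this sketch: a hidden variable (`temperature`) drives both `ice_cream` and `sunburn`, and neither of those two affects the other.

```python
import random


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(42)
n = 5_000

# Hypothetical common cause: it influences both outcomes below.
temperature = [rng.gauss(0, 1) for _ in range(n)]
ice_cream = [t + rng.gauss(0, 0.5) for t in temperature]
sunburn = [t + rng.gauss(0, 0.5) for t in temperature]

observed = pearson(ice_cream, sunburn)

# Subtract the common cause. This works here only because the simulation
# gives temperature a coefficient of exactly 1 in both outcomes.
ice_resid = [i - t for i, t in zip(ice_cream, temperature)]
sun_resid = [s - t for s, t in zip(sunburn, temperature)]
adjusted = pearson(ice_resid, sun_resid)

print(f"raw correlation:       {observed:.2f}")
print(f"after removing cause:  {adjusted:.2f}")
```

A predictive model trained on this data would happily use `ice_cream` to predict `sunburn`. Only the causal diagram, with `temperature` pointing at both, explains why intervening on one would not change the other.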

The New Year's Resolution Cliff

The Healthtech Initiative • 0 implied HN points • 05 Jan 26

🏥 Health & Wellness Data Analysis

Most people quit their new‑year sleep resolutions almost immediately — 60% stop within 48 hours and the median streak is one day, with under 3% lasting beyond five days.
People who kept up the changes at first actually slept worse in the short term: they went to bed earlier and tracked routines more, yet their time to fall asleep rose to roughly 26 minutes or more.
Trying harder often makes sleep worse, so the common New Year’s resolution approach to ‘optimize’ sleep is counterproductive and needs a different framework.

Why is the discovery of the Alaknanda galaxy so important?

FutureIQ • 0 implied HN points • 07 Jan 26

🔬 Science Data Analysis

A well-formed two-armed spiral galaxy called Alaknanda was observed at redshift z≈4, meaning we see it as it was about 12 billion years ago — only ~1.5 billion years after the Big Bang.
The galaxy’s mature disk and clear spiral arms so early in cosmic history conflict with current models that predict such structures need about 3–4 billion years to form, so our theories of galaxy formation need revision or expansion.
The discovery relied on deep JWST infrared data, gravitational lensing, and advanced analysis of public datasets, highlighting how modern instruments and open data can enable unexpected breakthroughs.