The hottest Audio Processing Substack posts right now

And their main takeaways
lcamtuf’s thing 6938 implied HN points 10 Jan 26
  1. Images and audio are both sampled data so you can apply similar transforms to both, but ears and eyes perceive artifacts very differently so the same operation can look fine and sound awful.
  2. Pixelating audio or reducing its bit depth creates stair-step or high-frequency errors that are heard as metallic squeals or hiss; these artifacts are typically removed with lowpass (rolling-average) filtering or a proper DAC reconstruction filter.
  3. Frequency-domain editing works well if you process short, overlapping windows with a Hann (sin^2) weighting and 50% overlap so the attenuations cancel out, avoiding clicks and enabling effects like pitch shifting and vocoding.
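The cancellation described in takeaway 3 can be checked with a small sketch. It assumes a periodic Hann window, w[k] = sin^2(pi*k/n), and a hop of half the window size; the names `hann`, `overlap_add`, and `process` are illustrative and not taken from the post. The `process` callback stands in for the FFT, edit, inverse-FFT step and defaults to doing nothing, so the output should match the input wherever two windows overlap.

```python
import math

def hann(n):
    """Periodic Hann window: w[k] = sin^2(pi * k / n)."""
    return [math.sin(math.pi * k / n) ** 2 for k in range(n)]

def overlap_add(signal, size, process=lambda frame: frame):
    """Cut the signal into Hann-weighted frames with 50% overlap,
    run `process` on each frame, and sum the frames back in place."""
    hop = size // 2
    window = hann(size)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - size + 1, hop):
        frame = [signal[start + k] * window[k] for k in range(size)]
        frame = process(frame)  # placeholder for frequency-domain editing
        for k in range(size):
            out[start + k] += frame[k]
    return out

signal = [math.sin(0.3 * i) for i in range(64)]
rebuilt = overlap_add(signal, size=16)

# Largest error in the region covered by two overlapping windows.
worst = max(abs(rebuilt[i] - signal[i]) for i in range(8, 56))
print(f"max interior reconstruction error: {worst:.2e}")
```

Two windows offset by half their length sum to sin^2 + cos^2 = 1, which is why the interior reconstructs cleanly. The first and last half-windows are covered by only one frame, so they stay attenuated unless the signal is padded.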
TheSequence 28 implied HN points 18 Dec 25
  1. Audio is a major next frontier in AI, with models now able to hear, understand, and generate speech, music, and environmental sounds at near-human levels.
  2. Audio is fundamentally different from text and images because it's a continuous, high-frequency time-series that requires modeling very long sequences and both short-term details (like phonemes or notes) and long-term structure (like phrases or whole melodies).
  3. Development is happening across open-source and commercial players, and a central debate is whether to build general multimodal systems that include audio or to focus on specialized audio models tuned for sound-specific challenges.
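To put the long-sequence point in takeaway 2 into rough numbers, here is a small sketch. The sample rates (16 kHz for speech, 44.1 kHz for CD-quality audio) are common conventions assumed for illustration, not figures from the post, and `num_samples` is a hypothetical helper.

```python
def num_samples(sample_rate_hz, seconds):
    """Number of raw samples in a mono clip of the given length."""
    return int(sample_rate_hz * seconds)

# Assumed, commonly used sample rates (not taken from the post).
for label, rate in [("16 kHz speech", 16_000), ("44.1 kHz CD audio", 44_100)]:
    print(f"{label}: {num_samples(rate, 60):,} samples per minute")
```

One minute of 16 kHz audio is already 960,000 raw samples, far more positions than a minute of spoken words would occupy as text, which is one way to see why sequence length is a central concern.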
Dubverse Black 117 implied HN points 19 Apr 23
  1. OpenAI's Whisper model, while impressive, still has limitations and failures in speech-to-text accuracy.
  2. Whisper's problems include repeated segments, mixing up voice and non-voice activity, and inaccurate timestamps.
  3. The drawbacks of Whisper 1.0 are opportunities to learn, adapt, and improve speech-to-text technology.
TheSequence 119 implied HN points 22 Oct 24
  1. State space models (SSMs) can be used in areas beyond language, such as audio processing, which makes them useful for handling complex and irregular data.
  2. Meta AI is researching how SSMs can improve speech recognition, showing their potential in understanding spoken language better.
  3. The Llama-Factory framework helps in pretraining large language models, making them more efficient and powerful.
CodeFaster 108 implied HN points 25 Jul 23
  1. The Unix one-liner chains find, grep, xargs, and math-sum to compute the total duration, in minutes, of a set of audio files.
  2. The find command recursively lists all files and directories under the current directory.
  3. xargs -L 1 mp3-minutes runs mp3-minutes once per input line to get each mp3 file's duration in minutes, and math-sum then adds those durations into a total.
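The shape of that pipeline can be sketched in Python. This is an analogue under stated assumptions, not the post's one-liner: `total_minutes` and the `minutes_of` callback are hypothetical names, the `.mp3` suffix filter is a guess at what the grep stage selects, and the duration lookup is left to the caller because the Python standard library has no mp3 duration reader.

```python
import os

def total_minutes(root, minutes_of, suffix=".mp3"):
    """Walk `root`, keep files ending in `suffix`, and sum their durations.

    `minutes_of` is a caller-supplied function (hypothetical stand-in for
    mp3-minutes) that returns one file's duration in minutes.
    """
    total = 0.0
    for dirpath, _dirnames, filenames in os.walk(root):  # like `find`
        for name in filenames:
            if name.lower().endswith(suffix):            # like the `grep` filter
                path = os.path.join(dirpath, name)
                total += minutes_of(path)                # like `xargs -L 1 mp3-minutes`
    return total                                         # like `math-sum`
```

Passing the duration function in keeps the sketch runnable without committing to a particular audio library; in practice `minutes_of` would wrap whatever tool reads the file's length.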
Kiernan 0 implied HN points 09 Aug 23
  1. The system is being stabilized by fixing shaky foundations, for a more robust design.
  2. Siev is relaunching with new features such as a cleaned-up topic page, rich transcripts, and a speaker identification page.
  3. Siev is shaping up to be an advanced audio processing pipeline that can provide insights without needing to listen to entire streams.