The hottest Audio Processing Substack posts right now

Images and audio are both sampled data, so similar transforms apply to both, but ears and eyes perceive artifacts very differently, so the same operation can look fine and sound awful.
Pixelating audio or reducing its bit depth creates stair-step or high-frequency errors that sound like metallic squeals or hiss; those artifacts are typically removed with lowpass (rolling-average) filtering or a proper reconstruction filter at the DAC.
Frequency-domain editing works well if you process short, overlapping windows with a Hann (sin^2) weighting and 50% overlap, so that the weights of overlapping windows sum to one; this avoids clicks at window boundaries and enables effects like pitch shifting and vocoding.
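The two ideas in this summary can be sketched in a few lines of NumPy. This is a minimal illustration, not the original post's code: the function names, the 8 kHz sample rate, the 220 Hz test tone, the 3-bit depth, and the window sizes are all assumptions chosen for the example. The first part reduces bit depth and then smooths the result with a rolling average; the second part checks that Hann (sin^2) windows spaced half a window apart add up to a constant weight.

```python
import numpy as np

def reduce_bit_depth(signal, bits):
    """Quantize samples in [-1, 1) to 2**bits levels with a simple rounding quantizer."""
    levels = 2 ** bits
    step = 2.0 / levels
    # Round to the nearest level and clamp to the representable range of codes.
    codes = np.clip(np.round(signal / step), -levels // 2, levels // 2 - 1)
    return codes * step

def rolling_average(signal, width):
    """Crude lowpass: replace each sample with the mean of `width` neighbours."""
    return np.convolve(signal, np.ones(width) / width, mode="same")

def hann(n):
    """Periodic Hann window, sin^2(pi * k / n) for k = 0..n-1."""
    return np.sin(np.pi * np.arange(n) / n) ** 2

def summed_window_weights(n_window, n_frames):
    """Add up the Hann weights of n_frames frames spaced half a window apart."""
    hop = n_window // 2  # 50% overlap
    total = np.zeros(hop * (n_frames + 1))
    for i in range(n_frames):
        total[i * hop : i * hop + n_window] += hann(n_window)
    return total

# Part 1: bit-depth reduction followed by smoothing (all values are assumptions).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
clean = 0.8 * np.sin(2 * np.pi * 220 * t)
crushed = reduce_bit_depth(clean, bits=3)      # stair-step version of the tone
smoothed = rolling_average(crushed, width=5)   # softens the stair-step edges

# Part 2: away from the first and last half-window, the weights sum to one.
weights = summed_window_weights(n_window=1024, n_frames=8)
interior = weights[512:-512]
print(np.allclose(interior, 1.0))
```

The constant sum follows from sin^2(x) + cos^2(x) = 1: shifting a sin^2 window by half its length turns it into cos^2, so each interior sample receives a total weight of one and no gain ripple appears at the window boundaries. The rolling average is only a rough lowpass; it is used here because the summary names it, not because it is the best filter for the job.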

Audio is a major next frontier in AI, with models now able to hear, understand, and generate speech, music, and environmental sounds at near-human levels.
Audio is fundamentally different from text and images because it's a continuous, high-frequency time-series that requires modeling very long sequences and both short-term details (like phonemes or notes) and long-term structure (like phrases or whole melodies).
Development is happening across open-source and commercial players, and a central debate is whether to build general multimodal systems that include audio or to focus on specialized audio models tuned for sound-specific challenges.

OpenAI's Whisper model, while impressive, still has limitations and outright failures in speech-to-text accuracy.
Its failure modes include repeated segments, confusion between speech and non-speech audio, and inaccurate timestamps.
These shortcomings in Whisper 1.0 leave room for further development of speech-to-text technology.

State space models (SSMs) can be used in areas beyond language, such as audio processing, and are well suited to complex and irregular data.
Meta AI is researching how SSMs can improve speech recognition, which points to their potential for understanding spoken language.
The Llama-Factory framework helps in pretraining large language models, making them more efficient and powerful.

The Unix one-liner combines find, grep, xargs, mp3-minutes, and math-sum to get the total minutes of audio across a set of files.
The find command lists all files and directories under the current directory, including subdirectories.
xargs -L 1 mp3-minutes runs mp3-minutes once per input line to get each MP3 file's duration in minutes, and math-sum then adds those durations into a total.
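The same pipeline shape can be sketched in Python. This is a rough equivalent under stated assumptions, not the post's one-liner: `total_minutes` and its `minutes_of` parameter are hypothetical names, the `.mp3` suffix check stands in for whatever the grep step actually filters on, and `minutes_of` is supplied by the caller because the summary does not say how mp3-minutes measures duration.

```python
import os
from typing import Callable

def total_minutes(root: str, minutes_of: Callable[[str], float]) -> float:
    """Sum the durations (in minutes) of all MP3 files under `root`.

    `minutes_of` is a caller-supplied function mapping a file path to its
    duration in minutes; it plays the role of the mp3-minutes command.
    """
    total = 0.0
    # os.walk visits the directory tree recursively, like `find`.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Assumed filter: keep only names ending in .mp3 (the grep step).
            if name.lower().endswith(".mp3"):
                # One call per file, like `xargs -L 1`; adding plays the role of math-sum.
                total += minutes_of(os.path.join(dirpath, name))
    return total
```

A call such as `total_minutes(".", my_duration_function)` would mirror running the one-liner from the current directory, where `my_duration_function` is whatever duration reader is available (for example, one built on an audio metadata library).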


Stabilizing the system by fixing shaky foundations for a more robust design.
Relaunching Siev with new features such as a cleaned-up topic page, rich transcripts, and a speaker identification page.
Siev is shaping up to be an advanced audio processing pipeline that can provide insights without requiring anyone to listen to entire streams.