Amgad’s Substack • 79 implied HN points • 21 Jan 24
- Whisper's focus was on scaling training to massive amounts of data, deliberately reusing a proven encoder-decoder Transformer so that architectural novelties would not confound the findings.
- The architecture consists of an encoder built from a convolutional stem followed by Transformer blocks, a decoder whose blocks add cross-attention over the encoder output, and an audio processor that turns fixed-length 30-second audio segments into log-Mel spectrogram input features.
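The fixed-window preprocessing can be sketched in a few lines. This is an illustrative NumPy reimplementation, not Whisper's own code: the constants (16 kHz sample rate, 30-second window, 160-sample hop) match the published model, but `pad_or_trim` here is a hypothetical stand-in for the library's preprocessing step.

```python
import numpy as np

SAMPLE_RATE = 16_000                      # Whisper resamples all audio to 16 kHz
CHUNK_SECONDS = 30                        # fixed-length input window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480_000 samples per segment
HOP_LENGTH = 160                          # 10 ms hop for the log-Mel spectrogram

def pad_or_trim(audio: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Force every segment to exactly 30 s: trim long clips, zero-pad short ones."""
    if audio.shape[-1] > length:
        return audio[..., :length]
    return np.pad(audio, (0, length - audio.shape[-1]))

# A 7-second clip gets zero-padded out to the full window.
clip = np.random.randn(7 * SAMPLE_RATE).astype(np.float32)
padded = pad_or_trim(clip)
print(padded.shape)  # (480000,)

# The spectrogram then has N_SAMPLES // HOP_LENGTH = 3000 frames; the
# encoder's strided conv stem halves that to 1500 positions for the blocks.
n_frames = N_SAMPLES // HOP_LENGTH
print(n_frames, n_frames // 2)  # 3000 1500
```

The fixed window is what lets every training example share one input shape, trading some padding waste for a much simpler large-scale training pipeline.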
- Whisper's gains in accuracy and robustness came primarily from the scale and quality of the training data, underscoring that careful data processing mattered more than novel architectural choices.