Democratizing Automation • 24 Nov 23
- The Q* hypothesis combines tree-of-thoughts reasoning with process reward models to supercharge synthetic data generation
- The hypothesized method pairs self-play with look-ahead planning, letting language models search over their own generations
- Process Reward Models (PRMs) score each intermediate step of a reasoning chain, rather than only the final completed answer
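The ideas above can be sketched together: a beam search over reasoning steps, where a PRM scores each partial chain and guides which branches to expand. This is a minimal toy illustration, not the actual Q* method; `prm_score` and `propose_steps` are hypothetical stand-ins for a learned process reward model and a language model's step proposals.

```python
import heapq

def prm_score(steps):
    # Stand-in for a learned process reward model: here a toy heuristic
    # that rewards steps containing a concrete computation (digits).
    return sum(1.0 if any(c.isdigit() for c in s) else 0.1 for s in steps) / len(steps)

def propose_steps(chain):
    # Stand-in for an LM sampling candidate next reasoning steps.
    depth = len(chain)
    return [f"compute 2*{depth} = {2 * depth}",  # a "useful" step
            "restate the question"]              # a low-value step

def tree_search(question, max_depth=3, beam_width=2):
    # Look-ahead planning as beam search: at each depth, expand every
    # chain in the beam, then keep the beam_width chains the PRM
    # scores highest. Per-step scoring is what lets the PRM prune
    # weak branches early, before a full answer exists.
    beam = [[]]
    for _ in range(max_depth):
        candidates = [chain + [step]
                      for chain in beam
                      for step in propose_steps(chain)]
        beam = heapq.nlargest(beam_width, candidates, key=prm_score)
    return max(beam, key=prm_score)
```

Because every partial chain gets a score, the highest-scoring completed chains can be kept as synthetic training data, which is the "supercharging synthetic data" angle.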