Democratizing Automation • 213 implied HN points • 22 Nov 23
- Reinforcement learning from human feedback (RLHF) is a technology that remains poorly understood and sparsely documented.
- Scaling DPO to 70B parameters yielded strong performance, achieved by integrating the data directly and using a lower learning rate.
- DPO and PPO differ in their approaches. DPO shows potential for improving chat evaluation scores, and users of the Tulu and Zephyr models are happy with the results.
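
For readers unfamiliar with how DPO differs from PPO, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the post or from the Tulu or Zephyr training code. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions. The inputs are summed log-probabilities of the chosen and rejected completions under the policy being trained and under a frozen reference model. Unlike PPO, no separate reward model or sampling loop appears here; the loss is computed directly from preference data.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair; all names are illustrative.

    Each argument is the total log-probability a model assigns to a
    completion. beta scales how strongly the policy is pushed away from
    the reference model (0.1 is an assumed default, not a recommendation).
    """
    # Implicit reward of each completion: log-ratio of policy to reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin by which the policy prefers the chosen completion.
    z = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# When the policy favors the chosen completion more than the reference does,
# the loss drops below that baseline.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

In practice this quantity is averaged over a batch and minimized with a gradient-based optimizer; the sketch only shows the per-pair arithmetic.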