In this episode, Brandon Cui, Research Scientist at MosaicML and Databricks, dives into cutting-edge advancements in AI model optimization, focusing on Reward Models and Reinforcement Learning from Human Feedback (RLHF).
Highlights include:
- How synthetic data and RLHF enable fine-tuning models to generate preferred outcomes.
- Techniques like Policy Proximal Optimization (PPO) and Direct Preference
Optimization (DPO) for enhancing response quality.
- The role of reward models in improving coding, math, reasoning, and other NLP tasks.
Connect with Brandon Cui:
https://www.linkedin.com/in/bcui19/
Fler avsnitt av Data Brew by Databricks
Visa alla avsnitt av Data Brew by DatabricksData Brew by Databricks med Databricks finns tillgänglig på flera plattformar. Informationen på denna sida kommer från offentliga podd-flöden.
