DeepSeek R1 Theory Tutorial – Architecture, GRPO, KL Divergence

Learn about DeepSeek R1's architecture from @deeplearningexplained. The course explores how R1 achieves strong reasoning through reinforcement learning, focusing on Group Relative Policy Optimization (GRPO) and how it improves on the traditional PPO approach. You'll also learn the role KL divergence plays in keeping training stable, with practical code examples and clear mathematical explanations.
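
As a rough illustration of the group-relative idea behind GRPO (a minimal sketch, not the course's code): instead of a learned value model as in PPO, each sampled completion's advantage is computed from the reward statistics of its own group of completions for the same prompt. The reward values and group size below are hypothetical placeholders.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages: normalize each completion's reward by the
    mean and std of its group (one group of sampled completions per prompt).

    rewards: tensor of shape (num_groups, group_size), one scalar reward
    per sampled completion.
    """
    mean = rewards.mean(dim=1, keepdim=True)   # per-group mean
    std = rewards.std(dim=1, keepdim=True)     # per-group std
    return (rewards - mean) / (std + eps)      # no value network needed

# Hypothetical example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```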
Contents
⌨️ (0:00:00) Introduction
⌨️ (0:01:49) R1 Overview - Overview
⌨️ (0:03:52) R1 Overview - DeepSeek R1-zero path
⌨️ (0:05:32) R1 Overview - Reinforcement learning setup
⌨️ (0:08:36) R1 Overview - Group Relative Policy Optimization (GRPO)
⌨️ (0:13:04) R1 Overview - DeepSeek R1-zero result
⌨️ (0:16:53) R1 Overview - Cold start supervised fine-tuning
⌨️ (0:17:44) R1 Overview - Consistency reward for CoT
⌨️ (0:18:35) R1 Overview - Supervised fine-tuning data generation
⌨️ (0:21:06) R1 Overview - Reinforcement learning with neural reward model
⌨️ (0:22:53) R1 Overview - Distillation
⌨️ (0:26:16) GRPO - Overview
⌨️ (0:26:55) GRPO - PPO vs GRPO
⌨️ (0:30:25) GRPO - PPO formula overview
⌨️ (0:33:25) GRPO - GRPO formula overview
⌨️ (0:36:48) GRPO - GRPO pseudo code
⌨️ (0:38:56) GRPO - GRPO Trainer code
⌨️ (0:49:24) KL Divergence - Overview
⌨️ (0:49:55) KL Divergence - KL Divergence in GRPO vs PPO
⌨️ (0:51:20) KL Divergence - KL Divergence refresher
⌨️ (0:55:32) KL Divergence - Monte Carlo estimation of KL divergence
⌨️ (0:56:43) KL Divergence - Schulman blog
⌨️ (0:57:38) KL Divergence - k1 = log(q/p)
⌨️ (1:00:01) KL Divergence - k2 = 0.5*log(p/q)^2
⌨️ (1:02:19) KL Divergence - k3 = (p/q - 1) - log(p/q)
⌨️ (1:04:44) KL Divergence - benchmarking
⌨️ (1:07:28) Conclusion
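
For reference, the three KL estimators from Schulman's blog covered in the KL Divergence chapters (k1, k2, k3) can be sketched as a short Monte Carlo experiment. This is a minimal sketch; the two Normal distributions are arbitrary placeholders, not taken from the course.

```python
import torch
import torch.distributions as dist

# Estimate KL(q || p) from samples x ~ q.
q = dist.Normal(0.0, 1.0)
p = dist.Normal(0.1, 1.0)

x = q.sample((100_000,))
logr = p.log_prob(x) - q.log_prob(x)   # log(p(x)/q(x))

k1 = -logr                             # log(q/p): unbiased, high variance
k2 = 0.5 * logr ** 2                   # 0.5*log(p/q)^2: biased, lower variance
k3 = (logr.exp() - 1) - logr           # (p/q - 1) - log(p/q): unbiased, low variance

true_kl = dist.kl_divergence(q, p)
for name, k in [("k1", k1), ("k2", k2), ("k3", k3)]:
    print(name, k.mean().item(), "vs true", true_kl.item())
```

The k3 form is the estimator that appears in the GRPO objective, which is why the course benchmarks all three.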
🎉 Thanks to our Champion and Sponsor supporters:
👾 Drake Milly
👾 Ulises Moralez
👾 Goddard Tan
👾 David MG
👾 Matthew Springman
👾 Claudio
👾 Oscar R.
👾 jedi-or-sith
👾 Nattira Maneerat
👾 Justin Hual
--