1/30/2025
Mini-R1: Reproduce the DeepSeek R1 "aha moment", an RL tutorial
A tutorial on reproducing DeepSeek R1's RL 'aha moment' using Group Relative Policy Optimization (GRPO) to train a model on the Countdown numbers game.