Philipp Schmid • 1/30/2025

Mini-R1: Reproduce the DeepSeek R1 "aha moment", an RL tutorial

This technical tutorial shows how to replicate the reinforcement learning "aha moment" from the DeepSeek R1 paper, in which the model learned to verify its own answers. It walks readers through using Group Relative Policy Optimization (GRPO) and Q-LoRA to train an open model on the Countdown numbers puzzle, covering setup, sample generation, and distributed training with DeepSpeed and vLLM.
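To make the two central ideas concrete, here is a minimal sketch and not the tutorial's own code. It assumes the usual Countdown rule that every given number is used exactly once with `+`, `-`, `*`, `/`, and it assumes the common GRPO formulation in which each sampled completion's reward is normalized against the mean and standard deviation of its own group. The function names, the `eps` value, and the choice of population standard deviation are illustrative assumptions.

```python
import ast
import operator
import statistics

# Arithmetic operators allowed in a Countdown equation (assumed rule set).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node, used):
    """Evaluate a parsed arithmetic expression, recording each integer literal."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _eval(node.left, used)
        right = _eval(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def countdown_reward(equation, numbers, target):
    """Hypothetical reward: 1.0 if the equation uses every number exactly once
    and evaluates to the target, otherwise 0.0."""
    try:
        used = []
        value = _eval(ast.parse(equation, mode="eval").body, used)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0
    if sorted(used) != sorted(numbers):
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions sampled for the same
    prompt. eps (an assumed value) avoids division by zero when all rewards match."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    completions = ["(25 - 5) * 3", "25 + 5 + 3", "25 * 3 - 5", "5 * 3 + 25"]
    rewards = [countdown_reward(c, [3, 5, 25], 60) for c in completions]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the advantage is computed relative to the group, a correct completion is pushed up only when other samples for the same prompt did worse, and a group with identical rewards contributes no learning signal.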

#Reasoning #Reinforcement Learning #Grpo
