Reinforcement Learning articles

3/29/2025 • EN

First Look at Reasoning From Scratch: Chapter 1

An introduction to reasoning in Large Language Models, covering concepts like chain-of-thought and methods to improve LLM reasoning abilities.

ai large language models LLM Reasoning Machine Learning Reinforcement Learning

Sebastian Raschka

2/5/2025 • EN

Understanding Reasoning LLMs

Explores four main approaches to building and enhancing reasoning capabilities in Large Language Models (LLMs) for complex tasks.

Deepseek LLM Reasoning Model Specialization Reinforcement Learning Supervised Finetuning

Sebastian Raschka

2/1/2025 • EN

Finetune Granite3.1 for Reasoning

A technical guide on fine-tuning IBM's Granite3.1 AI model using Group Relative Policy Optimization (GRPO) to enhance its reasoning capabilities.

Finetuning Granite31 Grpo Reasoning Reinforcement Learning

Ruslan Magana Vsevolodovna

1/30/2025 • EN

Mini-R1: Reproduce Deepseek R1 „aha moment“ a RL tutorial

A tutorial on reproducing DeepSeek R1's RL 'aha moment' using Group Relative Policy Optimization (GRPO) to train a model on the Countdown numbers game.

Deepseek R1 Group Relative Policy Optimization Grpo Reasoning Reinforcement Learning

Philipp Schmid

1/17/2025 • EN

Bite: How Deepseek R1 was trained

Explains the training of DeepSeek-R1, focusing on the Group Relative Policy Optimization (GRPO) reinforcement learning method.

Deepseek Grpo LLM Training Proximal Policy Optimization Reinforcement Learning

Philipp Schmid

11/28/2024 • EN

Reward Hacking in Reinforcement Learning

Explores reward hacking in reinforcement learning, where AI agents exploit reward function flaws, and its critical impact on RLHF and language model alignment.

Alignment Language Models Reinforcement Learning Reward Hacking Rlhf

Lilian Weng

5/12/2024 • EN

How Good Are the Latest Open LLMs? And Is DPO Better Than PPO?

A technical review of April 2024's major open LLM releases (Mixtral, Llama 3, Phi-3, OpenELM) and a comparison of DPO vs PPO for LLM alignment.

Dpo llm Ppo Reinforcement Learning Transformer

Sebastian Raschka

5/12/2024 • EN

How Good Are the Latest Open LLMs? And Is DPO Better Than PPO?

A review and comparison of the latest open LLMs (Mixtral, Llama 3, Phi-3, OpenELM) and a study on DPO vs. PPO for LLM alignment.

llm Mixture Of Experts Ppo Reinforcement Learning Transformer

Sebastian Raschka

3/31/2024 • EN

Tips for LLM Pretraining and Evaluating Reward Models

Discusses strategies for continual pretraining of LLMs and evaluating reward models for RLHF, based on recent research papers.

AI Research LLM Pretraining Model Alignment Reinforcement Learning Reward Modeling

Sebastian Raschka

3/31/2024 • EN

Tips for LLM Pretraining and Evaluating Reward Models

Analysis of recent AI research papers on continued pretraining for LLMs and reward modeling for RLHF, with insights into model updates and alignment.

Continued Pretraining LLM Pretraining Model Alignment Reinforcement Learning Reward Modeling

Sebastian Raschka

4/5/2023 • EN

An AI Miracle Malcontent

A critical analysis of GPT-4's capabilities, questioning the 'miracle' narrative and exploring the technical foundations behind its success.

artificial intelligence Gpt4 Machine Learning Openai Reinforcement Learning

John Langford

6/30/2022 • EN

The Data Scientist Show - Reinforcement Learning, Productivity, and more (Podcast)

A podcast interview discussing reinforcement learning applications, data science career paths, and productivity insights for tech professionals.

Data Science Machine Learning podcast productivity Reinforcement Learning

Susan Shu Chang

5/8/2022 • EN

Bandits for Recommender Systems

Explores bandit algorithms like ε-greedy, UCB, and Thompson Sampling to improve recommender systems by balancing exploration and exploitation.

Bandits Exploration Exploitation Machine Learning Recommender Systems Reinforcement Learning

Eugene Yan

11/18/2021 • EN

Permutation-Invariant Neural Networks for Reinforcement Learning

Introduces permutation-invariant neural networks for RL agents, enabling robustness to shuffled, noisy, or incomplete sensory inputs.

Neural Networks Permutation Invariance Reinforcement Learning Robustness Sensory Substitution

David Ha

9/5/2021 • EN

Reinforcement Learning for Recommendations and Search

Explores how reinforcement learning methods like bandits and policy-based approaches can improve recommendation systems by optimizing for long-term rewards.

Contextual Bandits Multi Armed Bandits recommendation systems Reinforcement Learning Search

Eugene Yan

4/23/2021 • EN

ALT Highlights – An Interview with Joelle Pineau

An interview with AI researcher Joelle Pineau discussing her work in reinforcement learning, its applications, and advice for newcomers to the field.

Academia artificial intelligence Learning Theory Machine Learning Reinforcement Learning

John Langford

11/12/2020 • EN

Notes on Causally Correct Partial Models

Explains the concept of causally correct partial models for reinforcement learning in POMDPs, focusing on counterfactual policy evaluation.

Causal Inference Machine Learning Partial Observability Pomdp Reinforcement Learning

Ferenc Huszár

8/5/2020 • EN

Chapter 1: Introduction to Machine Learning and Deep Learning

An introductory chapter on machine learning and deep learning, covering core concepts, categories, and terminology from a university course.

Deep Learning Machine Learning Reinforcement Learning Supervised Learning Unsupervised Learning

Sebastian Raschka

8/5/2020 • EN

Chapter 1: Introduction to Machine Learning and Deep Learning

An introductory chapter on machine learning and deep learning, covering core concepts, categories, and the shift from traditional programming.

Deep Learning Machine Learning Reinforcement Learning Supervised Learning Unsupervised Learning

Sebastian Raschka

7/21/2020 • EN

HOMER: Provable Exploration in Reinforcement Learning

Introduces HOMER, a new reinforcement learning algorithm that solves key problems like global exploration and decoding latent dynamics with provable guarantees.

Exploration Icml 2020 Latent Dynamics Provable Algorithms Reinforcement Learning

John Langford
