Transformer articles

10/1/2025 • EN

A History of Large Language Models

A detailed academic history tracing the core ideas behind large language models, from distributed representations to the transformer architecture.

Attention Mechanism, Generative Pre-Training, Large Language Models, Neural Networks, Transformer

Gregory Gundersen

9/6/2025 • EN

Understanding and Implementing Qwen3 From Scratch

A hands-on tutorial implementing the Qwen3 large language model architecture from scratch using pure PyTorch, explaining its core components.

LLM, Mixture of Experts, PyTorch, Qwen3, Transformer

Sebastian Raschka

9/6/2025 • EN

Understanding and Implementing Qwen3 From Scratch

A hands-on guide to understanding and implementing the Qwen3 large language model architecture from scratch using pure PyTorch.

LLM, Mixture of Experts, PyTorch, Qwen3, Transformer

Sebastian Raschka

8/9/2025 • EN

From GPT-2 to gpt-oss: Analyzing the Architectural Advances

Analyzes the architectural advancements in OpenAI's new open-weight gpt-oss models, comparing them to GPT-2 and other modern LLMs.

gpt-oss, LLM, Model Architecture, OpenAI, Transformer

Sebastian Raschka

7/9/2025 • EN

TabICL: Pretraining the best tabular learner

Introduces TabICL, a state-of-the-art tabular foundation model that uses in-context learning and an improved architecture for fast, scalable prediction on tabular data.

Foundation Model, In-Context Learning, Pretraining, Tabular Learning, Transformer

Gael Varoquaux

7/11/2024 • EN

Questions about ARC Prize

An analysis of the ARC Prize AI benchmark, questioning whether human-level intelligence can be achieved solely through deep learning and transformers.

Artificial Intelligence, Benchmark, Deep Learning, Neural Networks, Transformer

Eric Jang

5/12/2024 • EN

How Good Are the Latest Open LLMs? And Is DPO Better Than PPO?

A technical review of April 2024's major open LLM releases (Mixtral, Llama 3, Phi-3, OpenELM) and a comparison of DPO vs PPO for LLM alignment.

DPO, LLM, PPO, Reinforcement Learning, Transformer

Sebastian Raschka

5/12/2024 • EN

How Good Are the Latest Open LLMs? And Is DPO Better Than PPO?

A review and comparison of the latest open LLMs (Mixtral, Llama 3, Phi-3, OpenELM) and a study on DPO vs. PPO for LLM alignment.

LLM, Mixture of Experts, PPO, Reinforcement Learning, Transformer

Sebastian Raschka

3/23/2024 • EN

Generative transformer from first principles in Julia

A tutorial on building a generative transformer model from scratch in Julia, trained on Shakespeare to create GPT-like text.

Flux, Generative AI, GPT, Julia, Transformer

Lior Sinai

1/7/2024 • EN

Language Modeling Reading List (to Start Your Paper Club)

A curated reading list of fundamental language modeling papers with summaries, designed to help start a weekly paper club for learning and discussion.

Language Modeling, LLM, Paper Club, Research, Transformer

Eugene Yan

10/14/2023 • EN

Ollama - running large language models on your machine

A guide to using Ollama, an open-source CLI tool for running and customizing large language models like Llama 2 locally on your own machine.

Command Line, LLM, Local AI, Ollama, Transformer

Unmesh Gundecha

5/21/2023 • EN

Some Intuition on Attention and the Transformer

Explains the intuition behind the Attention mechanism and Transformer architecture, focusing on solving issues in machine translation and language modeling.

Attention Mechanism, Deep Learning, LLM, NLP, Transformer

Eugene Yan

2/23/2023 • EN

Some Techniques To Make Your PyTorch Models Train (Much) Faster

Techniques to accelerate PyTorch model training by 8x using PyTorch Lightning, with a DistilBERT fine-tuning example.

Lightning, Model Training, Performance Optimization, PyTorch, Transformer

Sebastian Raschka

2/23/2023 • EN

Some Techniques To Make Your PyTorch Models Train (Much) Faster

Learn techniques to speed up PyTorch model training by 8x using PyTorch Lightning, maintaining accuracy while reducing training time.

Model Training, Performance Optimization, PyTorch, PyTorch Lightning, Transformer

Sebastian Raschka

2/9/2023 • EN

Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch

A technical guide to coding the self-attention mechanism from scratch, as used in transformers and large language models.

Deep Learning, Natural Language Processing, Neural Networks, Self-Attention, Transformer

Sebastian Raschka

2/9/2023 • EN

Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch

A technical guide to coding the self-attention mechanism from scratch, as used in transformers and large language models.

Natural Language Processing, Neural Networks, Python, Self-Attention, Transformer

Sebastian Raschka

1/30/2023 • EN

GPT in 60 Lines of NumPy

A technical guide to implementing a GPT model from scratch using only 60 lines of NumPy code, including loading pre-trained GPT-2 weights.

GPT, Implementation, Neural Networks, NumPy, Transformer

Jay Mody

12/20/2022 • EN

Managed Transcription with OpenAI Whisper and Hugging Face Inference Endpoints

A tutorial on deploying OpenAI's Whisper speech recognition model using Hugging Face Inference Endpoints for scalable transcription APIs.

Automatic Speech Recognition, Hugging Face, Inference Endpoints, OpenAI Whisper, Transformer

Philipp Schmid

10/25/2022 • EN

Deploy T5 11B for inference for less than $500

A tutorial on deploying the T5 11B language model for inference using Hugging Face Inference Endpoints on a budget.

Hugging Face, Inference Endpoints, Model Deployment, T5 Model, Transformer

Philipp Schmid

10/4/2022 • EN

The Illustrated Stable Diffusion

A gentle introduction to how Stable Diffusion works, explaining its components and the process of generating images from text.

AI Image Generation, CLIP, Stable Diffusion, Text2img, Transformer

Jay Alammar
