Understanding and Coding the KV Cache in LLMs from Scratch
Explains the KV cache technique for efficient LLM inference with a from-scratch code implementation.
SebastianRaschka.com is the personal blog of Sebastian Raschka, PhD, an LLM research engineer whose work bridges academia and industry in AI and machine learning. On the blog and in its notes section, he publishes in-depth, well-documented articles on topics such as large language models (LLMs), reasoning models, machine learning in Python, neural networks, data science workflows, and deep learning architectures. Recent posts explore advanced themes such as “reasoning LLMs”, compare modern open-weight transformer architectures, and offer guides to building, training, and analyzing neural networks and model internals.
110 articles from this blog
A course teaching how to code Large Language Models (LLMs) from scratch to deeply understand their inner workings and fundamentals.
Analyzes the use of reinforcement learning to enhance reasoning capabilities in large language models (LLMs) like GPT-4.5 and o3.
An introduction to reasoning in Large Language Models, covering concepts like chain-of-thought and methods to improve LLM reasoning abilities.
Explores inference-time compute scaling methods to enhance the reasoning capabilities of large language models (LLMs) for complex problem-solving.
Explores four main approaches to building and enhancing reasoning capabilities in Large Language Models (LLMs) for complex tasks.
A curated list of 12 influential LLM research papers from 2024, highlighting key advancements in AI and machine learning.
A step-by-step guide to implementing the Byte Pair Encoding (BPE) tokenizer from scratch, used in models like GPT and Llama.
A curated list of notable LLM and AI research papers published in 2024, providing a resource for those interested in the latest developments.
Explains how multimodal LLMs work, compares recent models like Llama 3.2, and outlines two main architectural approaches for building them.
A 3-hour coding workshop teaching how to implement, train, and use Large Language Models (LLMs) from scratch with practical examples.
Analyzes the latest pre-training and post-training methodologies used in state-of-the-art LLMs like Qwen 2, Apple's models, Gemma 2, and Llama 3.1.
Explores recent research on instruction finetuning for LLMs, including a cost-effective method for generating synthetic training data from scratch.
A 1-hour presentation on the LLM development cycle, covering architecture, training, finetuning, and evaluation methods.
Analysis of new LLM research on instruction masking and LoRA finetuning methods, with practical insights for developers.
A technical review of April 2024's major open LLM releases (Mixtral, Llama 3, Phi-3, OpenELM) and a comparison of DPO vs PPO for LLM alignment.
Explores methods for using and finetuning pretrained large language models, including feature-based approaches and parameter updates.
Analysis of recent AI research papers on continued pretraining for LLMs and reward modeling for RLHF, with insights into model updates and alignment.
A summary of February 2024 AI research, covering new open-source LLMs like OLMo and Gemma, and a study on small, fine-tuned models for text summarization.
A guide to implementing LoRA and the new DoRA method for efficient model finetuning in PyTorch from scratch.