Sebastian Raschka • 5/16/2026

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

This article by Sebastian Raschka surveys recent open-weight LLM architectures designed to improve long-context efficiency. It covers KV sharing and per-layer embeddings in Gemma 4, compressed convolutional attention in ZAYA1-8B, layer-wise attention budgeting in Laguna XS.2, and mHC combined with compressed attention in DeepSeek V4. The author explains how these design changes address the constraints that KV-cache size, memory traffic, and attention cost place on reasoning models and agent workflows. The article includes architecture diagrams and detailed discussion of transformer block modifications, residual stream changes, and innovations in attention computation, and it intentionally skips training details, benchmarks, and product comparisons.

#Deep Learning #Transformer #KV Cache