Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention
This article by Sebastian Raschka explores recent developments in open-weight LLM architectures focused on improving long-context efficiency. It covers key techniques such as KV sharing and per-layer embeddings in Gemma 4, compressed convolutional attention in ZAYA1-8B, layer-wise attention budgeting in Laguna XS.2, and mHC plus compressed attention in DeepSeek V4. The author explains how these design changes address KV-cache size, memory traffic, and attention cost constraints in reasoning models and agent workflows. The article provides architecture diagrams and detailed discussion of transformer block modifications, residual stream changes, and attention computation innovations, while intentionally skipping training details, benchmarks, and product comparisons.