Sebastian Raschka • 5/16/2026

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

This article by Sebastian Raschka examines recent open-weight LLM architectures that aim to reduce long-context costs. It covers KV sharing and per-layer embeddings in Gemma 4, compressed convolutional attention in ZAYA1, attention budgeting in Laguna XS.2, and mHC with compressed attention in DeepSeek V4. The analysis focuses on design changes to transformer blocks, residual streams, KV caches, and attention computation, giving a technical overview of current LLM efficiency techniques.

#KV Cache #Transformer Optimization #Attention Mechanisms