Most teams put low-level architecture in the wrong place
How to place low-level architecture documentation in the codebase for better clarity and AI agent support.
Benjamin Cane shares insights on distributed systems, reliability patterns, performance testing, and engineering leadership, focusing on practical lessons for building resilient software.
40 articles from this blog
Explains why coding agents need architectural context via Architecture Decision Records (ADRs) and how to make them accessible.
Learn why health-checking the wrong listener can break gRPC services in production and how to properly monitor gRPC traffic.
Explains weighted load balancing, its importance for safe migrations, and real-world use cases in canary deployments and traffic shifting.
Explores why YOLO is a bad strategy for validating production changes and offers better methods like canary releases, shadow traffic, and smoke tests.
Explains deterministic routing as a key technique for reducing consistency problems in distributed systems at scale.
Explores an alternative microservices pattern: deploying the same service codebase across multiple platforms for local ownership and resilience.
Explains traffic mirroring in Istio/Envoy for safe production testing and observation.
Explains how Agent Skills can capture institutional knowledge for coding agents, ensuring consistent adherence to internal frameworks and practices.
Explores the evolution from saved prompts to Agent Skills, a new way to codify workflows for AI agents with metadata, scripts, and tools.
Explains why fast code generation requires robust testing, detailing three-level pull request validation and nightly testing to build confidence.
Explains the critical load balancing challenges when moving from HTTP/1 to gRPC/HTTP/2 in production and offers solutions.
Explains the difference between high availability and high resiliency in system design, and why both are crucial.
Explains how to improve AI coding agent results by providing project context via an AGENTS.md file.
Explains why Infrastructure-as-Code's primary benefit is correctness and consistency, not just speed, leading to stable production environments.
Explains why optimizing team workflows and fixing inefficiencies can have a greater long-term impact than just shipping new business features.
A software architect introduces 'The Law of Collective Amnesia' to explain how system design intent fades over time and offers strategies to defend architecture.
Explains the critical importance of defining clear performance targets and monitoring production metrics for effective software performance testing.
Explains the difference between benchmark and endurance performance testing, and why both are needed for real-world system reliability.
Explores pre-populating caches as a performance optimization, discussing its benefits, implementation trade-offs, and added complexity.