Skill Eval
The article discusses the importance of testing AI agent skills (procedural instructions for tools such as Gemini and Claude) to prevent silent failures. It introduces Skill Eval, a TypeScript framework that runs agents in Docker containers and benchmarks skill performance with deterministic and LLM-based graders. It also covers integrating these tests into CI/CD pipelines such as GitHub Actions.
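Skill Eval's actual API is not shown in this summary; as a rough illustration of what a deterministic grader in such a framework might look like, here is a minimal TypeScript sketch. All names here (`GradeResult`, `gradeExact`) are hypothetical and do not come from Skill Eval itself.

```typescript
// Hypothetical sketch of a deterministic grader: compares an agent's
// captured output against an expected string and returns a pass/fail
// score. None of these names are Skill Eval's real API.

interface GradeResult {
  pass: boolean;
  score: number; // 0 or 1 for a deterministic grader
  detail: string;
}

// Exact-match grading on whitespace-normalized output, so trailing
// newlines or doubled spaces in the agent transcript don't cause
// spurious failures.
function gradeExact(expected: string, actual: string): GradeResult {
  const norm = (s: string) => s.trim().replace(/\s+/g, " ");
  const pass = norm(expected) === norm(actual);
  return {
    pass,
    score: pass ? 1 : 0,
    detail: pass ? "output matched" : "output diverged from expected",
  };
}

// Usage: grade one skill run's captured output.
const result = gradeExact("files moved: 3", "files moved:  3\n");
console.log(result.pass); // true: whitespace is normalized before comparison
```

An LLM-based grader would replace the string comparison with a model call that scores the output against a rubric; deterministic graders like the one above are cheaper and reproducible, which is why frameworks typically offer both.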