Spec-Driven Agent Testing
Spec-driven agent testing is the practice of defining an agent's intended role, task boundaries, rules, domain vocabulary, permissions, and robustness expectations before judging whether the implementation behaves acceptably [src-088].
Key points
- Steven Willmott argues that agent quality cannot be defined only by a dataset of examples. A deployed agent also needs explicit rules, role limits, rights, domain terms, allowed substitutions, and robustness requirements [src-088].
- The central question is implementation-independent: what should this agent do, what should it never do, and under what variations or stress should those expectations still hold [src-088].
- Larger models can be riskier in narrow automated roles because greater capability expands the surface for jailbreaks, tool misuse, and unintended actions [src-088].
- Good specs become inputs to security testing, robustness testing, and integration-style regression suites that can survive a change in model, framework, or agent runtime [src-088].
- The pattern complements Continuous Agent Evaluation: eval datasets measure observed behavior, while specs explain the task envelope, policies, roles, and edge cases that should generate future tests [src-088].
Related entities
Related concepts
- Continuous Agent Evaluation
- Agent Security Boundaries
- Test Oracle Driven Agents
- Harness Engineering
- AI Engineering Discipline
Source references
- [src-088] AI Engineer late-May 2026 channel update (48 transcripts, 2026-05-15 to 2026-05-31)
Recommended next
Keep reading from this thread
From 494 indexed pages and articles.
- Wiki concept SafeIntelligence An ML validation company represented in this wiki by Steven Willmott's AI Engineer talk on spec-driven testing for deployed agents [src-088]. Related by spec
- Wiki concept Braintrust An agent quality company represented in the wiki by several AI Engineer talks on agent evals, observability, benchmark design, and evaluation maturity [src-088] Related by spec
- Insight AI Measurement and Experimentation How to measure AI product impact with evals, adoption metrics, online experiments, guardrails, and cost tracking Readers have engaged with this next