Spec-Driven Agent Testing

Spec-Driven Agent Testing

Spec-driven agent testing is the practice of defining an agent's intended role, task boundaries, rules, domain vocabulary, permissions, and robustness expectations before judging whether the implementation behaves acceptably [src-088].

Key points

  • Steven Willmott argues that agent quality cannot be defined only by a dataset of examples. A deployed agent also needs explicit rules, role limits, rights, domain terms, allowed substitutions, and robustness requirements [src-088].
  • The central question is implementation-independent: what should this agent do, what should it never do, and under what variations or stress should those expectations still hold [src-088].
  • Larger models can be riskier in narrow automated roles because greater capability expands the surface for jailbreaks, tool misuse, and unintended actions [src-088].
  • Good specs become inputs to security testing, robustness testing, and integration-style regression suites that can survive a change in model, framework, or agent runtime [src-088].
  • The pattern complements Continuous Agent Evaluation: eval datasets measure observed behavior, while specs explain the task envelope, policies, roles, and edge cases that should generate future tests [src-088].

Related entities

Related concepts

Source references

  • [src-088] AI Engineer late-May 2026 channel update (48 transcripts, 2026-05-15 to 2026-05-31)

Robin Cartier perspective

This page is part of Robin Cartier's working AI knowledge graph: a practical research layer for production AI, recommendation systems, experimentation, GEO, and agentic web readiness.

The useful next step is to connect this concept back to applied product leadership and operating models.

Recommended next

Keep reading from this thread

From 494 indexed pages and articles.

  1. Wiki concept SafeIntelligence An ML validation company represented in this wiki by Steven Willmott's AI Engineer talk on spec-driven testing for deployed agents [src-088]. Related by spec
  2. Wiki concept Braintrust An agent quality company represented in the wiki by several AI Engineer talks on agent evals, observability, benchmark design, and evaluation maturity [src-088] Related by spec
  3. Insight AI Measurement and Experimentation How to measure AI product impact with evals, adoption metrics, online experiments, guardrails, and cost tracking Readers have engaged with this next