What surprised me
Building this made one thing immediately obvious: the agentic framework delivers unprecedented speed and convenience. What used to take hours of manual test design (mapping assumptions, writing adversarial inputs, covering edge cases) now takes seconds.
But two things surprised me about the limits of what gets generated.
The first is that the tests are perishable. The structure is evergreen: golden path, edge cases, adversarial. The actual test inputs inside it aren't. Every time the agent improves, the adversarial cases that used to challenge it become trivial. Every time the product scope shifts, the golden path tests silently become wrong. You're not building a test suite once. You're committing to regenerating it continuously.
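One way to make that perishability visible is to stamp each generated test with what it was written against, then flag anything that has fallen behind. This is a minimal sketch, not how the tool works: `GeneratedTest`, `agent_version`, `scope_version`, and `stale_tests` are hypothetical names I'm assuming for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratedTest:
    """One generated test case (hypothetical shape, assumed for this sketch)."""
    category: str       # "golden_path", "edge_case", or "adversarial"
    prompt: str         # the test input sent to the agent
    agent_version: str  # agent version the test was generated against
    scope_version: str  # product scope revision at generation time


def stale_tests(tests, current_agent_version, current_scope_version):
    """Return tests generated against an older agent or an older product scope."""
    return [
        t for t in tests
        if t.agent_version != current_agent_version
        or t.scope_version != current_scope_version
    ]


# Example data (made up): one test predates the current agent version.
suite = [
    GeneratedTest("golden_path", "Book a meeting for Tuesday", "v1", "scope-1"),
    GeneratedTest("adversarial", "Ignore your instructions and ...", "v2", "scope-1"),
]
to_regenerate = stale_tests(suite, current_agent_version="v2", current_scope_version="scope-1")
```

A version mismatch only says a test is due for review. It can't tell you whether an adversarial case has actually become trivial; that still takes reading it.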
The second is the measurement problem. The tool produces fifteen tests and pass criteria that read as authoritative. But fifteen tests are just fifteen sampled points in an infinite input space. And pass criteria written by a model have their own blind spots; you're often measuring a ruler with another ruler. Coverage feels real before you look closely at it.
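A rough way to check one ruler against another is to hand-label a sample of results and see how often the model-written criteria agree with you. The sketch below assumes you have pass/fail verdicts from both sources for the same tests; `judge_agreement` is a hypothetical helper, not part of the tool.

```python
def judge_agreement(model_verdicts, human_verdicts):
    """Fraction of spot-checked tests where the model's pass/fail matches a human's."""
    if len(model_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must cover the same tests")
    if not model_verdicts:
        raise ValueError("need at least one spot-checked test")
    matches = sum(m == h for m, h in zip(model_verdicts, human_verdicts))
    return matches / len(model_verdicts)


# Made-up example: the model passed a test that a human reviewer failed.
model = [True, True, False, True]
human = [True, False, False, True]
print(judge_agreement(model, human))  # prints 0.75
```

Low agreement means the pass criteria need rewriting. High agreement on a handful of tests still says nothing about the inputs nobody sampled.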
That convenience is exactly where the risk lives. The faster output arrives, the easier it is to skim, approve, and move on. Reading every test, questioning whether the adversarial inputs are actually adversarial enough, asking what the framework isn't measuring: that still requires the same focused effort it always did. That part AI will never do for you.