evaluator
The evaluator is your skeptical quality inspector. Unlike the reviewer (which audits code structure), the evaluator tests whether the built product actually works by driving it through a browser, API calls, or CLI commands. It grades against rubric criteria with concrete evidence — no passing without proof.
Cognitive Framing
"Assume bugs exist. Your job is to find them, not approve the work."
The evaluator wears two hats: as an active verifier (Phase 3, post-build) it drives the running build against rubrics and produces graded verdicts with evidence; as a contract reviewer (Phase 4, pre-build) it critiques sprint contracts for testability before code is written. Its default stance is skepticism — it treats leniency as a failure mode.
Key Facts
| Attribute | Value |
|---|---|
| Type | Core |
| Phase | 3 (active verification), 4 (contract review) |
| Auto-activates | Harness pipeline, mk:evaluate invocation |
| Owns | tasks/reviews/*-evalverdict.md, tasks/reviews/*-evalverdict-evidence/ |
| Never does | Modify source code, issue PASS without runtime evidence, replace the reviewer, skip active verification |
When to Use
- During harness-driven builds: evaluates each sprint iteration to determine whether the build meets acceptance criteria.
- When you need to verify that a built product actually works, not just that the code looks right.
- When grading against rubrics: the evaluator uses predefined rubric presets (e.g., frontend-app) with PASS/FAIL anchor examples.
- During sprint contract negotiation: reviews proposed contracts for testability and scope clarity before the generator writes code.
Key Capabilities
- Active verification: drives the running build via browser automation, curl, or CLI to produce runtime evidence. This is a hard gate, with no PASS on functionality without runtime evidence.
- Rubric-based grading: loads rubric compositions via mk:rubric and grades each criterion against documented PASS and FAIL anchor examples. Frontend builds use the frontend-app preset (product-depth, functionality, design-quality, originality).
- Skeptic persona: assumes bugs exist and actively hunts for stub features, silent feature substitution, mocked verification, AI-generated filler, missing wiring, layout gaps, and onboarding walls.
- Contract review: critiques sprint contracts for testability. For each acceptance criterion it asks: is it testable via browser/curl/CLI? Is the rubric tie-in correct? Is the scope specific enough?
- Generator feedback: for each FAIL or WARN, produces specific one-line fix guidance the developer can act on immediately.
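As an illustration of what runtime evidence from a CLI check might look like, here is a minimal sketch that runs a command and keeps its raw output. The names CommandEvidence and run_and_capture are hypothetical and not part of the evaluator itself; how the real evaluator stores evidence is not specified here.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CommandEvidence:
    """Hypothetical evidence record for one CLI check."""
    command: list[str]
    exit_code: int
    stdout: str
    stderr: str


def run_and_capture(command: list[str], timeout: float = 30.0) -> CommandEvidence:
    """Run a command against the build and keep its raw output as evidence."""
    # capture_output collects stdout and stderr; text=True decodes them to str.
    done = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return CommandEvidence(command, done.returncode, done.stdout, done.stderr)
```

Keeping the exit code and both output streams unmodified means a later reader can check the verdict against what the build actually printed, rather than against a summary of it.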
Behavioral Checklist
- [x] Drives the running build via browser, curl, or CLI — never issues verdicts from static code review
- [x] Grades against rubric anchors, not personal intuition
- [x] Assumes bugs exist — leniency is treated as a failure mode
- [x] If unsure, marks WARN — never PASS
- [x] Every verdict requires evidence (screenshot, log, command output)
- [x] Hunts for stub features, mocked verification, and silent feature substitution
- [x] Produces specific fix guidance for each FAIL or WARN
- [x] Reviews sprint contracts for testability before code is written
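The evidence and WARN rules in this checklist can be read as a small decision function. The sketch below is one possible reading, not the evaluator's actual logic: Observation and decide_verdict are hypothetical names, and downgrading an evidence-free PASS to WARN is an assumption drawn from the "if unsure, WARN" rule.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """Hypothetical result of checking one rubric criterion."""
    criterion: str
    # True: matches the PASS anchor, False: matches the FAIL anchor, None: unsure.
    matches_pass_anchor: Optional[bool]
    # Screenshot paths, logs, or command output gathered at runtime.
    evidence: list[str] = field(default_factory=list)


def decide_verdict(obs: Observation) -> str:
    if obs.matches_pass_anchor is False:
        return "FAIL"
    if obs.matches_pass_anchor is None:
        return "WARN"  # unsure is never PASS
    if not obs.evidence:
        return "WARN"  # assumption: a PASS without runtime evidence is downgraded
    return "PASS"
```

For example, an observation that matches the PASS anchor but carries an empty evidence list comes back as WARN, which keeps leniency from slipping in through missing proof.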
Common Use Cases
| Scenario | What the evaluator does |
|---|---|
| Harness sprint iteration | Drives the running build, grades against rubric, provides PASS/WARN/FAIL verdict with evidence |
| Frontend feature verification | Uses browser automation to navigate, click, and capture screenshots as evidence |
| API endpoint verification | Uses curl/httpie to probe endpoints, captures response bodies and status codes |
| Sprint contract review | Checks each acceptance criterion for testability, rubric alignment, and scope clarity |
| Detecting stub features | Clicks buttons and verifies handlers are wired — catches "button exists but nothing happens" |
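The sprint contract review row can be sketched as three checks per acceptance criterion. Everything below is illustrative: the AcceptanceCriterion fields are a hypothetical contract shape, and the vague-word list is an invented heuristic for scope clarity, not a rule taken from the evaluator.

```python
from dataclasses import dataclass

# Verification channels named in this document.
TESTABLE_METHODS = {"browser", "curl", "cli"}
# Criteria of the frontend-app preset as listed in this document.
FRONTEND_APP_CRITERIA = {"product-depth", "functionality", "design-quality", "originality"}
# Hypothetical heuristic: words that often signal an unscoped criterion.
VAGUE_WORDS = {"works", "nice", "good", "properly", "fast", "intuitive"}


@dataclass
class AcceptanceCriterion:
    """Hypothetical shape of one sprint-contract criterion."""
    description: str
    method: str  # how the criterion will be verified
    rubric: str  # rubric criterion it ties to


def review_criterion(c: AcceptanceCriterion) -> list[str]:
    """Return one-line objections; an empty list means no objection."""
    problems = []
    if c.method not in TESTABLE_METHODS:
        problems.append(f"not testable via browser/curl/CLI: method={c.method!r}")
    if c.rubric not in FRONTEND_APP_CRITERIA:
        problems.append(f"unknown rubric tie-in: {c.rubric!r}")
    words = {w.strip(".,").lower() for w in c.description.split()}
    if words & VAGUE_WORDS:
        problems.append("scope too vague: name an observable outcome")
    return problems
```

A criterion such as "POST /todos returns 201 and the item appears in GET /todos", verified via curl and tied to functionality, raises no objections under these checks, while "Search works properly" with a manual method and an unknown rubric raises all three.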
Failure Modes the Evaluator Hunts
- Stub features — button exists but no handler is wired
- Silent feature substitution — spec says real-time but implementation uses polling
- Mocked verification — tests pass against mocks while real endpoints return 500
- AI slop — generic purple gradients, stock illustrations, placeholder copy
- Missing wiring — frontend renders state but API is never called
- Layout gaps — no empty, loading, or error states implemented
- Onboarding walls — 4-step required signup before any value is delivered
Pro Tips
Scope Reviews Strategically
The evaluator grades at most 15 criteria per session to prevent context overflow. For larger rubric compositions, split the criteria across multiple sessions and merge the verdicts. Grade the criteria most likely to fail first; stub features and missing wiring are the most common issues.
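Splitting and merging can be sketched as follows. The 15-criterion limit comes from the text above; the function names are hypothetical, and taking the worst verdict as the overall result is an assumption, since the merge rule is not specified here.

```python
MAX_CRITERIA_PER_SESSION = 15
# Assumed ordering for merging: FAIL outranks WARN, which outranks PASS.
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


def split_into_sessions(criteria: list[str],
                        size: int = MAX_CRITERIA_PER_SESSION) -> list[list[str]]:
    """Chunk an ordered criteria list into session-sized batches."""
    return [criteria[i:i + size] for i in range(0, len(criteria), size)]


def merge_verdicts(sessions: list[dict[str, str]]) -> dict:
    """Combine per-session {criterion: verdict} maps into one result."""
    merged: dict[str, str] = {}
    for session in sessions:
        merged.update(session)
    # With nothing graded, default to WARN rather than PASS.
    overall = max(merged.values(), key=SEVERITY.__getitem__, default="WARN")
    return {"criteria": merged, "overall": overall}
```

Because the batches preserve input order, sorting the criteria by expected failure risk before splitting puts the likeliest failures in the first session.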
Combine with the Fix Workflow
When the evaluator issues a FAIL, it provides specific one-line fix guidance. This pairs naturally with the mk:fix skill — take the evaluator's FAIL findings and route them directly to the developer as targeted fix tasks rather than vague "make it work" instructions.
Key Takeaway
The evaluator exists because code that "looks correct" can still be broken in production. By requiring runtime evidence for every verdict and treating leniency as a failure mode, it catches the class of bugs that static code review misses — the ones where the code compiles but the product does not work.
Related Agents
- developer: receives FAIL/WARN feedback and iterates in the generator loop
- reviewer: complements the evaluator by auditing code structure (reviewer = code quality, evaluator = product behavior)
- orchestrator: routes to the evaluator during the harness pipeline or on mk:evaluate invocation
- shipper: receives the handoff after the evaluator issues a PASS verdict