Behavioral verification agent — drives the running build against rubric criteria with real evidence. Skeptic by default.

The evaluator is your skeptical quality inspector. Unlike the reviewer (which audits code structure), the evaluator tests whether the built product actually works by driving it through a browser, API calls, or CLI commands. It grades against rubric criteria with concrete evidence — no passing without proof.

Cognitive Framing

"Assume bugs exist. Your job is to find them, not approve the work."

The evaluator wears two hats: as an active verifier (Phase 3, post-build) it drives the running build against rubrics and produces graded verdicts with evidence; as a contract reviewer (Phase 4, pre-build) it critiques sprint contracts for testability before code is written. Its default stance is skepticism — it treats leniency as a failure mode.

Key Facts


Type	Core
Phase	3 (active verification), 4 (contract review)
Auto-activates	Harness pipeline, `mk:evaluate` invocation
Owns	`tasks/reviews/-evalverdict.md`, `tasks/reviews/-evalverdict-evidence/`
Never does	Modify source code, issue PASS without runtime evidence, replace the reviewer, skip active verification

When to Use

During harness-driven builds — evaluates each sprint iteration to determine if the build meets acceptance criteria.
When you need to verify that a built product actually works — not just that the code looks right.
When grading against rubrics — the evaluator uses predefined rubric presets (e.g., frontend-app) with PASS/FAIL anchor examples.
During sprint contract negotiation — reviews proposed contracts for testability and scope clarity before the generator writes code.

Key Capabilities

Active verification — drives the running build via browser automation, curl, or CLI to produce runtime evidence. This is a hard gate: no PASS on functionality without runtime evidence.
Rubric-based grading — loads rubric compositions via mk:rubric and grades each criterion against documented PASS and FAIL anchor examples. Frontend builds use the frontend-app preset (product-depth, functionality, design-quality, originality).
Skeptic persona — assumes bugs exist and actively hunts for: stub features, silent feature substitution, mocked verification, AI-generated filler, missing wiring, layout gaps, and onboarding walls.
Contract review — critiques sprint contracts for testability by checking each acceptance criterion: is it testable via browser/curl/CLI? Is the rubric tie-in correct? Is the scope specific enough?
Generator feedback — for each FAIL or WARN, produces one-line specific fix guidance the developer can act on immediately.
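
The evidence gate and the feedback rule above can be made concrete with a small data model. What follows is a minimal Python sketch, not the evaluator's actual implementation: the `Verdict` and `CriterionResult` names and their fields are hypothetical, chosen only to show a PASS without runtime evidence, or a FAIL/WARN without fix guidance, being rejected at construction time.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass
class CriterionResult:
    """Hypothetical record for one graded rubric criterion."""
    criterion: str                                      # e.g. "functionality"
    verdict: Verdict
    evidence: list[str] = field(default_factory=list)   # screenshot, log, or output paths
    fix_guidance: str = ""                              # one-line guidance for FAIL/WARN

    def __post_init__(self) -> None:
        # Hard gate: a PASS with no runtime evidence is rejected outright.
        if self.verdict is Verdict.PASS and not self.evidence:
            raise ValueError(f"{self.criterion}: PASS requires runtime evidence")
        # Every FAIL or WARN must carry fix guidance the developer can act on.
        if self.verdict is not Verdict.PASS and not self.fix_guidance.strip():
            raise ValueError(
                f"{self.criterion}: {self.verdict.value} requires fix guidance"
            )
```

Enforcing the rules in the constructor means an invalid verdict cannot be created at all, which is one way to keep leniency from slipping in later in the pipeline.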

Behavioral Checklist

Drives the running build via browser, curl, or CLI — never issues verdicts from static code review
Grades against rubric anchors, not personal intuition
Assumes bugs exist — leniency is treated as a failure mode
If unsure, marks WARN — never PASS
Every verdict requires evidence (screenshot, log, command output)
Hunts for stub features, mocked verification, and silent feature substitution
Produces specific fix guidance for each FAIL or WARN
Reviews sprint contracts for testability before code is written
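
The contract review item can be illustrated with a short sketch. It assumes a hypothetical `AcceptanceCriterion` shape and a hypothetical list of vague words as a stand-in for the scope check; the real evaluator's contract format and heuristics are not described in this document.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Verification channels named in this document.
VERIFICATION_METHODS = {"browser", "curl", "cli"}
# Hypothetical words that often signal a criterion too vague to test.
VAGUE_WORDS = {"properly", "nicely", "intuitive", "fast", "robust"}


@dataclass
class AcceptanceCriterion:
    text: str
    method: Optional[str] = None            # how the evaluator would verify it
    rubric_criterion: Optional[str] = None  # rubric criterion it claims to satisfy


def review_contract(criteria: list[AcceptanceCriterion], rubric: set[str]) -> list[str]:
    """Return one objection per problem found; an empty list means no objections."""
    objections = []
    for index, ac in enumerate(criteria, start=1):
        # Testability: must be verifiable through a runtime channel.
        if ac.method not in VERIFICATION_METHODS:
            objections.append(f"AC{index}: not testable via browser, curl, or CLI")
        # Rubric tie-in: must reference a criterion in the loaded rubric.
        if ac.rubric_criterion not in rubric:
            objections.append(f"AC{index}: rubric tie-in missing or not in the loaded rubric")
        # Scope: flag wording that cannot be checked objectively.
        vague = VAGUE_WORDS & set(re.findall(r"[a-z]+", ac.text.lower()))
        if vague:
            objections.append(f"AC{index}: scope too vague ({', '.join(sorted(vague))})")
    return objections
```

For example, a criterion such as "The dashboard works properly." with no verification method and no rubric tie-in would draw three objections, while "POST /api/todos returns 201 and the new item appears in the list" (an illustrative endpoint), verified via curl and tied to `functionality`, would draw none.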

Common Use Cases

Scenario	What the evaluator does
Harness sprint iteration	Drives the running build, grades against rubric, provides PASS/WARN/FAIL verdict with evidence
Frontend feature verification	Uses browser automation to navigate, click, and capture screenshots as evidence
API endpoint verification	Uses curl/httpie to probe endpoints, captures response bodies and status codes
Sprint contract review	Checks each acceptance criterion for testability, rubric alignment, and scope clarity
Detecting stub features	Clicks buttons and verifies handlers are wired — catches "button exists but nothing happens"
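
The scenarios in this table share one pattern: run something against the live build and keep the raw output as evidence. The sketch below shows that pattern for the curl/CLI case, assuming a hypothetical `capture_evidence` helper and log layout; it is an illustration, not the evaluator's own tooling.

```python
import subprocess
from pathlib import Path


def capture_evidence(command: list[str], evidence_dir: Path, name: str) -> dict:
    """Run a command against the build and save its output as an evidence log."""
    evidence_dir.mkdir(parents=True, exist_ok=True)
    # capture_output collects stdout and stderr; text=True decodes them to str.
    proc = subprocess.run(command, capture_output=True, text=True, timeout=60)
    log_path = evidence_dir / f"{name}.log"
    log_path.write_text(
        f"$ {' '.join(command)}\n"
        f"exit code: {proc.returncode}\n"
        f"--- stdout ---\n{proc.stdout}\n"
        f"--- stderr ---\n{proc.stderr}\n"
    )
    return {"exit_code": proc.returncode, "stdout": proc.stdout, "log": str(log_path)}
```

A call might look like `capture_evidence(["curl", "-s", "-i", "http://localhost:3000/api/health"], Path("evidence"), "health")`, where the URL and directory are placeholders. A non-zero exit code is recorded rather than raised, since a failing command is itself evidence for a FAIL verdict.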

Failure Modes the Evaluator Hunts

Stub features — button exists but no handler is wired
Silent feature substitution — spec says real-time but implementation uses polling
Mocked verification — tests pass against mocks while real endpoints return 500
AI slop — generic purple gradients, stock illustrations, placeholder copy
Missing wiring — frontend renders state but API is never called
Layout gaps — no empty, loading, or error states implemented
Onboarding walls — 4-step required signup before any value is delivered

Pro Tips

Scope Reviews Strategically

The evaluator grades at most 15 criteria per session to prevent context overflow. For larger rubric compositions, split the criteria across multiple sessions and merge the verdicts. Prioritize the criteria most likely to fail: stub features and missing wiring are the most common issues.
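
Splitting and merging can be sketched in a few lines of Python. The 15-criterion limit comes from the text above; the function names and the "worst verdict wins" merge rule are assumptions made for illustration, since this document does not specify how verdicts are merged.

```python
MAX_CRITERIA_PER_SESSION = 15  # per-session limit stated in this document

# Assumed ordering for merging: the most severe verdict wins.
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


def split_into_sessions(criteria: list[str], limit: int = MAX_CRITERIA_PER_SESSION) -> list[list[str]]:
    """Chunk criteria into session-sized batches, preserving the given order.

    Pass the criteria already sorted by how likely they are to fail, so the
    riskiest ones land in the first session.
    """
    return [criteria[i:i + limit] for i in range(0, len(criteria), limit)]


def merge_verdicts(sessions: list[dict[str, str]]) -> dict[str, str]:
    """Combine per-session verdicts; if a criterion repeats, keep the worst one."""
    merged: dict[str, str] = {}
    for session in sessions:
        for criterion, verdict in session.items():
            previous = merged.get(criterion)
            if previous is None or SEVERITY[verdict] > SEVERITY[previous]:
                merged[criterion] = verdict
    return merged


def overall_verdict(merged: dict[str, str]) -> str:
    """The build is only as good as its worst criterion."""
    return max(merged.values(), key=SEVERITY.__getitem__)
```

With 32 criteria this yields three sessions of 15, 15, and 2. Keeping the worst verdict on merge matches the skeptical stance described on this page: a later PASS never overwrites an earlier FAIL.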

Combine with the Fix Workflow

When the evaluator issues a FAIL, it provides specific one-line fix guidance. This pairs naturally with the mk:fix skill — take the evaluator's FAIL findings and route them directly to the developer as targeted fix tasks rather than vague "make it work" instructions.
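
As a rough illustration of that hand-off, the sketch below turns FAIL findings into one-line fix tasks. The finding fields (`criterion`, `verdict`, `fix`, `evidence`) and the task format are hypothetical, and how the tasks are actually passed to `mk:fix` is not covered here.

```python
def to_fix_tasks(findings: list[dict]) -> list[str]:
    """Turn FAIL findings into one-line fix tasks; PASS and WARN are skipped."""
    tasks = []
    for finding in findings:
        if finding["verdict"] != "FAIL":
            continue
        # Keep the evidence pointer so the developer can reproduce the failure.
        tasks.append(
            f"[{finding['criterion']}] {finding['fix']} (evidence: {finding['evidence']})"
        )
    return tasks


# Example findings with made-up content.
findings = [
    {"criterion": "functionality", "verdict": "FAIL",
     "fix": "Wire the Save button to the submit handler",
     "evidence": "save-click.png"},
    {"criterion": "design-quality", "verdict": "PASS",
     "fix": "", "evidence": "home.png"},
]
for task in to_fix_tasks(findings):
    print(task)
```

Each task names the rubric criterion, the specific fix, and the evidence file, which is the information a developer needs to act without re-running the evaluation.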

Key Takeaway

The evaluator exists because code that "looks correct" can still be broken in production. By requiring runtime evidence for every verdict and treating leniency as a failure mode, it catches the class of bugs that static code review misses — the ones where the code compiles but the product does not work.

Related Agents

developer — receives FAIL/WARN feedback and iterates in the generator loop
reviewer — complements the evaluator by auditing code structure (reviewer = code quality, evaluator = product behavior)
orchestrator — routes to the evaluator during harness pipeline or on mk:evaluate invocation
shipper — receives handoff after evaluator issues PASS verdict
