evaluator

The evaluator is your skeptical quality inspector. Unlike the reviewer (which audits code structure), the evaluator tests whether the built product actually works by driving it through a browser, API calls, or CLI commands. It grades against rubric criteria with concrete evidence — no passing without proof.

Cognitive Framing

"Assume bugs exist. Your job is to find them, not approve the work."

The evaluator wears two hats: as an active verifier (Phase 3, post-build) it drives the running build against rubrics and produces graded verdicts with evidence; as a contract reviewer (Phase 4, pre-build) it critiques sprint contracts for testability before code is written. Its default stance is skepticism — it treats leniency as a failure mode.

Key Facts

  • Type: Core
  • Phase: 3 (active verification), 4 (contract review)
  • Auto-activates: Harness pipeline, mk:evaluate invocation
  • Owns: tasks/reviews/*-evalverdict.md, tasks/reviews/*-evalverdict-evidence/
  • Never does: Modify source code, issue PASS without runtime evidence, replace the reviewer, skip active verification

When to Use

  • During harness-driven builds — evaluates each sprint iteration to determine if the build meets acceptance criteria.
  • When you need to verify that a built product actually works — not just that the code looks right.
  • When grading against rubrics — the evaluator uses predefined rubric presets (e.g., frontend-app) with PASS/FAIL anchor examples.
  • During sprint contract negotiation — reviews proposed contracts for testability and scope clarity before the generator writes code.

Key Capabilities

  • Active verification — drives the running build via browser automation, curl, or CLI to produce runtime evidence. This is a hard gate: no PASS on functionality without runtime evidence.
  • Rubric-based grading — loads rubric compositions via mk:rubric and grades each criterion against documented PASS and FAIL anchor examples. Frontend builds use the frontend-app preset (product-depth, functionality, design-quality, originality).
  • Skeptic persona — assumes bugs exist and actively hunts for: stub features, silent feature substitution, mocked verification, AI-generated filler, missing wiring, layout gaps, and onboarding walls.
  • Contract review — critiques sprint contracts for testability by checking each acceptance criterion: is it testable via browser/curl/CLI? Is the rubric tie-in correct? Is the scope specific enough?
  • Generator feedback — for each FAIL or WARN, produces one-line specific fix guidance the developer can act on immediately.
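The contract review capability can be pictured as a per-criterion checklist. The sketch below is a hypothetical illustration, not the evaluator's actual implementation: the `Criterion` fields, the `VAGUE_WORDS` heuristic, and the `review_contract` function are assumptions introduced here. Only the verification channels and the frontend-app criterion names come from the list above.

```python
from dataclasses import dataclass

# Verification channels the evaluator can drive (from the capability list).
CHANNELS = {"browser", "curl", "cli"}
# Criteria in the frontend-app preset (from the capability list).
FRONTEND_APP = {"product-depth", "functionality", "design-quality", "originality"}
# Hypothetical heuristic: words that usually signal an unscoped criterion.
VAGUE_WORDS = {"works", "nice", "fast", "properly", "good"}


@dataclass
class Criterion:
    text: str     # the acceptance criterion as written in the contract
    channel: str  # how it would be verified
    rubric: str   # rubric criterion it is tied to


def review_contract(criteria: list[Criterion]) -> list[str]:
    """Return one-line objections; an empty list means no objections."""
    objections = []
    for c in criteria:
        if c.channel not in CHANNELS:
            objections.append(f"'{c.text}': not testable via browser/curl/CLI")
        if c.rubric not in FRONTEND_APP:
            objections.append(f"'{c.text}': unknown rubric tie-in '{c.rubric}'")
        if VAGUE_WORDS & set(c.text.lower().split()):
            objections.append(f"'{c.text}': scope too vague to verify")
    return objections


contract = [
    Criterion("POST /todos returns 201 with the created item", "curl", "functionality"),
    Criterion("The dashboard works", "manual", "polish"),
]
for line in review_contract(contract):
    print(line)
```

In this example the first criterion draws no objections, while the second is flagged on all three checks: channel, rubric tie-in, and scope.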

Behavioral Checklist

  • [x] Drives the running build via browser, curl, or CLI — never issues verdicts from static code review
  • [x] Grades against rubric anchors, not personal intuition
  • [x] Assumes bugs exist — leniency is treated as a failure mode
  • [x] If unsure, marks WARN — never PASS
  • [x] Every verdict requires evidence (screenshot, log, command output)
  • [x] Hunts for stub features, mocked verification, and silent feature substitution
  • [x] Produces specific fix guidance for each FAIL or WARN
  • [x] Reviews sprint contracts for testability before code is written
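The evidence and uncertainty rules in this checklist reduce to a small decision function. This is a minimal sketch under stated assumptions: the `verdict` function and its parameters are hypothetical, and mapping "no evidence" to WARN is an assumption made here (the checklist only rules out PASS in that case).

```python
from typing import Optional


def verdict(criterion_met: Optional[bool], evidence: list[str]) -> str:
    """Map one runtime observation to PASS, WARN, or FAIL.

    criterion_met: True or False when the runtime check was conclusive,
                   None when the evaluator is unsure.
    evidence:      references to screenshots, logs, or command output.
    """
    # Unsure, or nothing to show for it: never PASS.
    # (Returning WARN for missing evidence is an assumption of this sketch.)
    if criterion_met is None or not evidence:
        return "WARN"
    return "PASS" if criterion_met else "FAIL"


print(verdict(True, ["screenshots/login.png"]))  # PASS
print(verdict(True, []))                         # WARN
print(verdict(None, ["logs/run.txt"]))           # WARN
print(verdict(False, ["curl-output.txt"]))       # FAIL
```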

Common Use Cases

| Scenario | What the evaluator does |
| --- | --- |
| Harness sprint iteration | Drives the running build, grades against rubric, provides PASS/WARN/FAIL verdict with evidence |
| Frontend feature verification | Uses browser automation to navigate, click, and capture screenshots as evidence |
| API endpoint verification | Uses curl/httpie to probe endpoints, captures response bodies and status codes |
| Sprint contract review | Checks each acceptance criterion for testability, rubric alignment, and scope clarity |
| Detecting stub features | Clicks buttons and verifies handlers are wired, catching "button exists but nothing happens" |
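As an illustration of the API endpoint scenario, the sketch below probes two endpoints and records status codes and response bodies as evidence. It uses Python's standard `urllib` as a stand-in for curl/httpie, and the `probe` helper, the `DemoHandler` server, and the `/health` and `/todos` paths are all hypothetical. The tiny local server exists only so the example runs without a real build.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Ignore any system proxy so the local demo server is always reachable.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout: float = 5.0) -> dict:
    """Request a URL and return its status code and body as evidence."""
    try:
        with OPENER.open(url, timeout=timeout) as resp:
            return {"url": url, "status": resp.status, "body": resp.read().decode()}
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are still evidence worth recording.
        return {"url": url, "status": err.code, "body": err.read().decode()}


class DemoHandler(BaseHTTPRequestHandler):
    """Stand-in for the build under test: /health works, everything else is 500."""

    def do_GET(self):
        ok = self.path == "/health"
        body = json.dumps({"ok": ok}).encode()
        self.send_response(200 if ok else 500)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

evidence = [probe(f"{base}/health"), probe(f"{base}/todos")]
server.shutdown()

for item in evidence:
    print(item["status"], item["url"], item["body"])
```

The second probe shows the kind of finding that static review can miss: the route may exist in the code, yet the running endpoint returns 500.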

Failure Modes the Evaluator Hunts

  • Stub features — button exists but no handler is wired
  • Silent feature substitution — spec says real-time but implementation uses polling
  • Mocked verification — tests pass against mocks while real endpoints return 500
  • AI slop — generic purple gradients, stock illustrations, placeholder copy
  • Missing wiring — frontend renders state but API is never called
  • Layout gaps — no empty, loading, or error states implemented
  • Onboarding walls — 4-step required signup before any value is delivered

Pro Tips

Scope Reviews Strategically

The evaluator grades at most 15 criteria per session to prevent context overflow. For larger rubric compositions, split the criteria across multiple sessions and merge the verdicts. Prioritize the criteria most likely to fail: stub features and missing wiring are the most common issues.
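Splitting and merging can be sketched as follows. The helper names and the "worst verdict wins" merge rule are assumptions made for illustration; only the cap of 15 criteria per session comes from the text above.

```python
MAX_CRITERIA = 15  # per-session cap stated above

# Hypothetical severity order used when merging: the worst verdict wins.
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


def split_sessions(criteria: list[str], size: int = MAX_CRITERIA) -> list[list[str]]:
    """Chunk an ordered list of criteria into session-sized batches."""
    return [criteria[i:i + size] for i in range(0, len(criteria), size)]


def merge_verdicts(sessions: list[dict[str, str]]) -> tuple[dict[str, str], str]:
    """Combine per-session verdicts and derive an overall verdict."""
    merged: dict[str, str] = {}
    for session in sessions:
        merged.update(session)
    # With no verdicts at all there is nothing to justify a PASS.
    overall = max(merged.values(), key=SEVERITY.__getitem__, default="WARN")
    return merged, overall


batches = split_sessions([f"criterion-{n}" for n in range(1, 36)])
print([len(b) for b in batches])  # [15, 15, 5]

session_results = [
    {"criterion-1": "PASS", "criterion-2": "WARN"},
    {"criterion-16": "FAIL"},
]
merged, overall = merge_verdicts(session_results)
print(overall)  # FAIL
```

Ordering the input list so that the riskiest criteria come first means they land in the earliest session.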

Combine with the Fix Workflow

When the evaluator issues a FAIL, it provides specific one-line fix guidance. This pairs naturally with the mk:fix skill — take the evaluator's FAIL findings and route them directly to the developer as targeted fix tasks rather than vague "make it work" instructions.

Key Takeaway

The evaluator exists because code that "looks correct" can still be broken in production. By requiring runtime evidence for every verdict and treating leniency as a failure mode, it catches the class of bugs that static code review misses — the ones where the code compiles but the product does not work.

Related Agents

  • developer — receives FAIL/WARN feedback and iterates in the generator loop
  • reviewer — complements the evaluator by auditing code structure (reviewer = code quality, evaluator = product behavior)
  • orchestrator — routes to the evaluator during harness pipeline or on mk:evaluate invocation
  • shipper — receives handoff after evaluator issues PASS verdict

Released under the MIT License.