mk:benchmark — Harness Canary Suite
What This Skill Does
Measures harness performance against a small set of ground-truth tasks. Provides the empirical signal that the dead-weight audit (docs/dead-weight-audit.md) consumes to make load-bearing decisions about each harness component.
When to Use
| Invocation | Tier | Tasks | Cost Cap | Description |
|---|---|---|---|---|
/mk:benchmark run | quick | 5 | ≤$5 | Runs the quick canary suite (default) |
/mk:benchmark run --full | full | 6 | ≤$30 | Quick tier + 1 heavy app-build task |
/mk:benchmark compare A B | — | — | free (reads cache) | Delta table between two prior runs |
Activate before applying a harness change (baseline) and after (verify delta). Also used during the dead-weight audit playbook for component enable/disable cycles.
Skip when: The harness has been run end-to-end manually within the last hour (use that data instead), or budget cap would be hit before the suite finishes.
Core Capabilities
- Run canary suite: Executes spec files from
.claude/benchmarks/canary/quick/(andfull/with--full) viamk:autobuild, recording per-task verdicts, scores, duration, and cost. - Compare runs: Reads two prior run JSONs from
.claude/benchmarks/results/and emits a markdown delta table showing per-task score and cost changes. - Persist baselines: After each completed run, appends a cost baseline to
.claude/memory/cost-log.jsonand emits abenchmark_resultevent to.claude/memory/trace-log.jsonl.
Example Prompt
Run the benchmark canary suite to baseline current harness performance. Then compare the results against the last run to see the delta.Arguments
| Argument | Effect |
|---|---|
run (default) | Execute quick tier (5 tasks, under .claude/benchmarks/canary/quick/) |
run --full | Execute quick + heavy tier (6 tasks total, includes .claude/benchmarks/canary/full/) |
compare RUN-ID-A RUN-ID-B | Diff two prior run JSONs |
--full is strictly opt-in. The heavy task (06-small-app-build) triggers mk:autobuild which can run for hours — the script refuses without the flag to prevent accidental cost burn.
Hard Constraints
- Quick tier ≤$5 total cost. Hard block if projected cost exceeds.
- Full tier ≤$30 total cost. Hard block if projected cost exceeds.
- Results recorded in trace-log.jsonl as
event=benchmark_resultrecords, tagged withbenchmark_version+harness_version+model_version. - NOT a replacement for unit tests. Harness-level measurement only.