Trace & Benchmark Meta-Loop

The Meta-Improvement Cycle

Every harness component encodes an assumption. Assumptions age. The meta-loop is how MeowKit keeps the harness calibrated as models improve — replacing assumption-driven decisions with evidence from real runs.

The loop is intentionally slow and human-gated. Trace content is DATA (per injection-rules.md). No suggestion from trace analysis is applied automatically — ever.

Trace Log Format

Location: .claude/memory/trace-log.jsonl

Append-only JSONL. Every record has the same top-level envelope:

json
{
  "schema_version": "1.0",
  "ts": "2026-04-08T14:30:00Z",
  "event": "build_verify_result",
  "run_id": "260408-1430-myapp",
  "harness_version": "3.0.0",
  "model": "claude-sonnet-4-5",
  "density": "FULL",
  "data": { ... event-specific payload ... }
}
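
As a rough illustration, a record with this envelope could be assembled as follows. The helper name `make_record` and its parameters are assumptions made for this sketch, not part of MeowKit; only the field names and the timestamp shape are taken from the example above.

```python
import json
from datetime import datetime, timezone

# Field order mirrors the envelope shown above.
ENVELOPE_KEYS = ("schema_version", "ts", "event", "run_id",
                 "harness_version", "model", "density", "data")

def make_record(event, run_id, harness_version, model, density, data, now=None):
    """Build one trace record as a single JSONL line (hypothetical helper)."""
    # Normalise to UTC so the trailing "Z" in the timestamp is accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        "schema_version": "1.0",
        "ts": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "event": event,
        "run_id": run_id,
        "harness_version": harness_version,
        "model": model,
        "density": density,
        "data": data,  # event-specific payload
    }
    # json.dumps without indent never emits raw newlines, so the
    # result is always exactly one line, as JSONL requires.
    return json.dumps(record, separators=(",", ":"))
```

Serialising each record to exactly one line is what lets readers process the log line by line without a streaming JSON parser.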

Record types emitted by hooks and harness steps:

| Event type           | Emitter                         | When                                              |
|----------------------|---------------------------------|---------------------------------------------------|
| build_verify_result  | post-write-build-verify.sh      | Compile/lint failure on a file write              |
| loop_warning         | post-write-loop-detection.sh    | Edit count hits threshold (N=4 or N=8)            |
| pre_completion_block | pre-completion-check.sh         | Session blocked for missing verification          |
| session_end          | post-session.sh                 | Session closes normally                           |
| eval_verdict         | mk:evaluate step                | Evaluator emits PASS/WARN/FAIL with rubric scores |
| benchmark_result     | mk:benchmark                    | Canary suite run result                           |

Append safety: append-trace.sh uses flock for atomic appends (falls back to a plain append on macOS where flock(1) is not in the base install). Payloads are secret-scrubbed via lib/secret-scrub.sh before write. Records are never mutated after append.
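
The locking idea can be sketched in Python. This is not the code in append-trace.sh, which is a shell script using flock(1); it swaps in Python's `fcntl.flock` (Unix only) to show the same pattern. The function name `append_record` is an assumption made for this example.

```python
import fcntl
import json
import os

def append_record(path, record):
    """Append one JSON record as a single line while holding an exclusive lock."""
    line = json.dumps(record, separators=(",", ":")) + "\n"
    # O_APPEND makes the write land at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other writers release
        os.write(fd, line.encode("utf-8"))
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

Holding the lock for the duration of a single write keeps concurrent hooks from interleaving partial lines in the log.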

Scatter-Gather Analysis

/mk:trace-analyze [--runs N] (default N=20) is a step-file workflow.

Frequency threshold: a pattern must appear in ≥3 records before becoming a suggestion. Single-occurrence anomalies are noise — the threshold prevents overfit to one bad run.
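
The threshold rule amounts to a count-and-filter step. A minimal sketch, assuming each record has already been reduced to a single pattern key (how keys are extracted is not specified here, and the key strings in the usage example are made up):

```python
from collections import Counter

MIN_OCCURRENCES = 3  # threshold stated above

def patterns_above_threshold(pattern_keys, minimum=MIN_OCCURRENCES):
    """Return {pattern: count} for patterns seen in at least `minimum` records."""
    counts = Counter(pattern_keys)
    return {pattern: n for pattern, n in counts.items() if n >= minimum}
```

For example, three records keyed `"lint-fail"` plus two keyed `"loop-warning"` would yield only `{"lint-fail": 3}`; the two-occurrence pattern stays below the bar.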

HITL gate is mandatory. Each suggestion is presented individually via AskUserQuestion. Bulk-approve is not available. Trace content is DATA; suggestions derived from it are hypotheses, not ground truth.

Output (written to plans/{date}-trace-analysis/):

  • findings.md — patterns above threshold
  • suggestions-draft.md — proposals before human review
  • suggestions.md — approved only
  • rejected.md — rejected with reasons
  • analysis.md — final human-readable summary

Skip conditions: fewer than 3 trace records (insufficient signal), or last analysis ran within 24h with no new records.
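
The two skip conditions can be written as a small predicate. This is a sketch under assumptions: the function name, its parameters, and the idea that the caller already knows the time of the last analysis and the number of records added since then are all illustrative.

```python
from datetime import datetime, timedelta

MIN_RECORDS = 3                      # below this there is insufficient signal
COOLDOWN = timedelta(hours=24)       # window stated above

def should_skip(total_records, new_records, last_analysis, now):
    """Return True when trace analysis should be skipped."""
    if total_records < MIN_RECORDS:
        return True
    # Skip only when BOTH hold: a recent analysis and nothing new to analyse.
    if last_analysis is not None and now - last_analysis < COOLDOWN:
        return new_records == 0
    return False
```

Note that a recent analysis alone does not cause a skip; new records arriving within the 24h window still make a rerun worthwhile.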

Benchmark Canary Suite

/mk:benchmark provides the empirical signal that the dead-weight audit consumes.

Subcommands:

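| Command                  | Tasks             | Cost cap | Purpose                                 |
|--------------------------|-------------------|----------|-----------------------------------------|
| /mk:benchmark run        | 5 quick tasks     | ≤ $5     | Regression check after a harness change |
| /mk:benchmark run --full | 5 quick + 1 heavy | ≤ $30    | Full dead-weight audit cycle            |
| /mk:benchmark compare A B |                  | Free     | Delta table between two prior runs      |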

Quick tier spec files (.claude/benchmarks/canary/quick/): react-component, api-endpoint, bug-fix, refactor, tdd-feature. Each is a focused 1-sprint task runnable through mk:cook.

Full tier adds 06-small-app-build-spec.md — a real product build that runs through mk:autobuild. Requires --full explicitly to prevent accidental cost burn.

Results are written to .claude/benchmarks/results/{run-id}.json and appended to trace-log.jsonl as benchmark_result events — so trace-analyze can correlate benchmark scores with harness run patterns.

Example delta table from /mk:benchmark compare:

| Task                  | Baseline | Disabled | Δ      |
|-----------------------|----------|----------|--------|
| 01-react-component    | 0.92     | 0.88     | -0.04  |
| 02-api-endpoint       | 0.85     | 0.91     | +0.06  |
| 03-bug-fix            | 1.00     | 1.00     |  0.00  |
| TOTAL                 | 0.89     | 0.91     | +0.02  |

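In this table Δ = Disabled − Baseline, so a positive Δ means scores improved with the component disabled: that component is hurting output and is a prune candidate. The sign is the opposite of measured_delta in the dead-weight audit, which is defined as baseline_avg − disabled_avg.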
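
The per-task arithmetic behind such a table is simple. In the sketch below, Δ is computed as disabled minus baseline, which is the convention the example rows follow (0.88 − 0.92 = −0.04). The function name and the dict-of-scores input shape are assumptions; the actual layout of the result files is not described here.

```python
def delta_rows(baseline, disabled):
    """Return (task, baseline, disabled, delta) tuples for tasks present in both runs."""
    rows = []
    for task, base_score in baseline.items():
        if task not in disabled:
            continue  # a task missing from one run cannot be compared
        dis_score = disabled[task]
        # Round to two decimals to hide floating-point noise in the display.
        rows.append((task, base_score, dis_score, round(dis_score - base_score, 2)))
    return rows

baseline = {"01-react-component": 0.92, "02-api-endpoint": 0.85, "03-bug-fix": 1.00}
disabled = {"01-react-component": 0.88, "02-api-endpoint": 0.91, "03-bug-fix": 1.00}
for task, base, dis, delta in delta_rows(baseline, disabled):
    print(f"{task:<22} {base:.2f}  {dis:.2f}  {delta:+.2f}")
```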

Dead-Weight Audit

The benchmark exists to serve the dead-weight audit. The 6-step playbook (from docs/dead-weight-audit.md):

  1. List components — refer to the Assumption Registry in the audit doc
  2. Establish baseline: /mk:benchmark run --full with the component enabled; capture run ID
  3. Disable the component — env var flag, comment out hook registration, or comment out rule import
  4. Re-run: /mk:benchmark run --full with component disabled; capture run ID
  5. Compare: /mk:benchmark compare {baseline-run-id} {disabled-run-id}; examine delta
  6. Decide based on measured_delta = baseline_avg − disabled_avg:
    • Delta ≤ 0 → PRUNE candidate (component not helping or actively hurting)
    • 0 < Delta < 0.02 → WATCH (revisit next cycle)
    • Delta ≥ 0.02 → KEEP with evidence
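
The decision rule in step 6 can be expressed as a small function. This is a sketch: the name `classify` is hypothetical, and the rounding step is an addition of this example to keep floating-point noise from tipping values that sit exactly on the 0.02 boundary.

```python
KEEP_THRESHOLD = 0.02  # boundary between WATCH and KEEP, from step 6

def classify(baseline_avg, disabled_avg):
    """Map measured_delta = baseline_avg - disabled_avg to PRUNE, WATCH, or KEEP."""
    measured_delta = round(baseline_avg - disabled_avg, 4)
    if measured_delta <= 0:
        return "PRUNE"   # component not helping, or actively hurting
    if measured_delta < KEEP_THRESHOLD:
        return "WATCH"   # small benefit; revisit next cycle
    return "KEEP"        # measurable benefit
```

With the TOTAL row from the example delta table, `classify(0.89, 0.91)` gives `"PRUNE"`, since the average rose when the component was disabled.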

Re-enable disabled components before exiting — the audit is non-destructive by design.

When to run: every major model release, quarterly regardless, when mk:trace-analyze surfaces a recurring no-value pattern, or when the harness "feels heavy."

Auto-detection caveat: post-session.sh can detect model changes only through MEOWKIT_MODEL_HINT, because Claude Code does not export CLAUDE_MODEL to hooks. Unless MEOWKIT_MODEL_HINT is set, the audit must be triggered manually.

Full playbook: docs/dead-weight-audit.md.

Log Rotation

When trace-log.jsonl exceeds 50MB, append-trace.sh rotates it:

.claude/memory/trace-log.jsonl              ← active log (reset to empty after rotation)
.claude/memory/trace-log.{YYMMDD-HHMMSS}.jsonl.gz  ← compressed archive

Rotation is triggered on every append that finds the file over the size limit. No cron required.
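
A size-triggered rotation of this kind might look like the following Python sketch. It is not the logic in append-trace.sh; the function name is hypothetical, and reading "50MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
import gzip
import os
import shutil
from datetime import datetime

MAX_BYTES = 50 * 1024 * 1024  # assumed reading of the 50MB limit

def rotate_if_needed(path, max_bytes=MAX_BYTES, now=None):
    """Gzip an oversized log into a timestamped archive and empty the active file.

    Returns the archive path, or None when no rotation was needed.
    """
    if not os.path.exists(path) or os.path.getsize(path) <= max_bytes:
        return None
    stamp = (now or datetime.now()).strftime("%y%m%d-%H%M%S")
    base, ext = os.path.splitext(path)        # ".../trace-log", ".jsonl"
    archive = f"{base}.{stamp}{ext}.gz"
    with open(path, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Truncate the active log only after the archive is fully written.
    with open(path, "wb"):
        pass
    return archive
```

In a real writer, the check and the truncation would need to run under the same lock as the append so that no record is written between the copy and the reset.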

Secret Scrubbing

Every trace write passes through lib/secret-scrub.sh before flushing to the log. The scrubber strips patterns that look like API keys, tokens, and credential strings.

This is a hard requirement — trace records are often read by mk:trace-analyze researchers and included in plan outputs. A single secret in a trace record could propagate widely. The scrubber runs unconditionally; there is no bypass.
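
To make the idea concrete, here is a minimal regex-based scrubber in Python. The patterns are illustrative assumptions and are far from exhaustive; the actual rules in lib/secret-scrub.sh are not reproduced here and may differ.

```python
import re

REDACTED = "[REDACTED]"

# (pattern, replacement) pairs; all patterns are examples only.
SECRET_PATTERNS = [
    # key=value or key: value where the key name suggests a credential
    (re.compile(r"(?i)\b(api[_-]?key|token|password|secret)(\s*[:=]\s*)[^\s\"',]+"),
     r"\1\2" + REDACTED),
    # "Bearer <token>" authorization values
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]{8,}"), "Bearer " + REDACTED),
    # long strings with an "sk-" prefix, a common API-key shape
    (re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"), REDACTED),
]

def scrub(text):
    """Replace credential-looking substrings with a fixed placeholder."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Pattern-based scrubbing is best-effort by nature: it catches strings that look like secrets, so unusual credential formats can still slip through.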

Canonical Sources

  • .claude/memory/trace-log.jsonl — append-only trace store
  • .claude/hooks/append-trace.sh — trace writer (flock + scrub + rotation)
  • docs/dead-weight-audit.md — full 6-step audit playbook + assumption registry
  • .claude/skills/trace-analyze/SKILL.md — scatter-gather workflow spec
  • .claude/skills/benchmark/SKILL.md — canary suite spec

Released under the MIT License.