What MeowKit is, what problem it solves, and why the harness matters more than the model.

The problem

AI coding tools are powerful but undisciplined. Given "build a payment system," they write code immediately: no plan, no tests, no security review. The code compiles but has no tests, hardcoded secrets, and ships directly to main.

A single "implement this feature" prompt can produce code that compiles but has no tests, no review, and secrets hardcoded in source.

What MeowKit does

MeowKit is an AI agent toolkit for Claude Code that adds enforced discipline (hard gates, TDD, security scanning, and human approval) so your coding assistant ships production-quality code instead of untested prototypes.

It installs a .claude/ directory that Claude Code reads automatically. No executable runtime. No external services. Just structured conventions that shape how the AI works.

Core thesis

The model is a commodity. The harness is the product.

With the same model, the same task, and the same compute budget, changing only the environment design raises performance by 64% (SWE-agent, NeurIPS 2024). The model is not the bottleneck. The environment is.

Claude Code is already a harness. It manages context, tools, and sessions. MeowKit is a second harness on top of it: structured workflows, quality gates, memory, multi-agent coordination, and hook-based automation. Claude Code handles the mechanics. MeowKit handles the strategy: what to build, when to stop, how to verify.

How it differs from raw Claude Code

Concern	Raw Claude Code	With MeowKit
Planning	Starts coding immediately	Creates and gets approval for a plan first
Testing	Tests optional, often skipped	TDD opt-in via `--tdd`, strict failing-test-first when enabled
Security	Relies on model knowledge	4-layer defense + security agent + preventive hooks
Review	Ask "review this" and hope	3 parallel adversarial reviewers + triage step
Shipping	`git add -A && git push`	Conventional commits, PR, CI verification, rollback docs
Memory	Forgets everything between sessions	Persists lessons, patterns, and costs across sessions
Model selection	Same model for everything	Domain-adaptive routing, so fintech forces COMPLEX tier
Architecture decisions	Ask and hope for the best	Party Mode: 2-4 agents deliberate, forced synthesis

Architecture at a glance

.claude/
├── agents/          Specialist agents for each phase
├── skills/          Domain skills loaded on demand
├── hooks/           Preventive lifecycle hooks
├── rules/           Enforcement rules loaded every session
├── memory/          Cross-session learnings
└── settings.json    Hook registrations + permissions

CLAUDE.md            Entry point, read at session start

Design principles

Every mistake → a permanent fix

When an agent makes an error, MeowKit builds a hook, rule, or gate so it never makes that mistake again. build-verify.cjs exists because agents introduced syntax errors that cascaded. loop-detection.cjs exists because agents edited the same file 20+ times without progress. Each handler is a crystallized lesson.

Gates are discipline, not suggestions

Gate 1 (plan approved) and Gate 2 (review approved) require explicit human sign-off. No --skip-gates flag exists. No agent can self-approve. Hooks block file writes before Gate 1 passes, so the agent cannot edit source files until a plan is approved.

Security is architecture, not afterthought

Three layers: behavioral rules, preventive hooks (that block .env reads and unapproved writes), and observational hooks (that scan written files). Security hooks are never routed through the dispatcher, so if the dispatcher crashes, security hooks still fire. All file content is DATA; only CLAUDE.md and .claude/rules/ contain instructions.

Dead weight must be pruned

Every harness component encodes an assumption about what the model cannot do. When a new model ships, that assumption may be wrong. Scaffolding that helped Opus 4.5 may hurt Opus 4.7. Adaptive density adjusts automatically: Haiku gets MINIMAL, Sonnet gets FULL, Opus 4.6+ gets LEAN. Every component is measured, and one that costs more than it saves is removed.

Use the cheapest tool that solves the problem

A build-verify linter check ($0) catches syntax errors before they cascade into $5 debugging sessions. A browser health check (~~$1) catches blank pages before a full evaluator (~~$5) is needed. Token efficiency is economic discipline.

Verify by behavior, not by reading code

Tests can pass against mocks while production returns 500. The evaluator must click through the running app: browser navigation, curl against live endpoints, CLI invocation. Static-analysis-only verdicts are rejected.

TDD is opt-in

TDD enforcement moved from default-on to opt-in via --tdd flag or MEOWKIT_TDD=1. Strict TDD added friction for spike work and prototypes. Production-quality work should still enable --tdd.

Learn from every session

Topic files in .meowkit/memory/ are read on demand by consumer skills at task start: fixes.md for bug patterns, review-patterns.md for observations, architecture-decisions.md for design choices. There is no auto-injection pipeline.

Load only what's needed

Skills activate by task domain, not all at once. Each skill's SKILL.md is a compact router, with detailed procedures loaded on demand. The hook system routes events to only matching handlers. This progressive disclosure saves ~70% context per invocation.

Trade-offs we accept

Each of these is a decision that costs something. They are listed because the cost is real, not because it is negligible.

We chose	Over	Because
Human gates (slower)	Auto-approve (faster)	Agent self-approval drifts lenient
Opt-in TDD	Mandatory TDD	Strict TDD costs more than it returns on spikes and prototypes
Independent security hooks	One dispatcher for everything	If the dispatcher crashes, security still fires, which is worth the duplication
Exclusive file ownership per agent	Agents editing wherever they need to	Two agents on one file lose each other's work
Behavioral output limits	Hook-based truncation	Hooks append to tool output, they cannot replace it
Session budget warnings	No budget tracking	A runaway session costs more than a false alarm

What MeowKit does NOT do

No proprietary formats. Standard Markdown and JSON
No telemetry. All data stays project-local
No external services. Zero third-party API dependencies for core workflow
No model lock-in. Adaptive density works across Haiku, Sonnet, and Opus tiers
No experimental surface. Nothing ships to the toolkit that is not ready to use

Where MeowKit came from

MeowKit builds on ideas from earlier toolkits, and it is worth naming them:

ClaudeKit gave this its shape. It was the original structured skill system for Claude Code. Its orchestration protocols, development rules, and cook workflow are what MeowKit's phase-gated pipeline descends from.
Gstack is a broad skill collection covering headless browser automation and QA testing, and the source of the "search before building" habit MeowKit inherits.
Aura-frog is an autonomous agent framework whose multi-agent coordination and self-healing patterns shaped MeowKit's agent architecture and escalation protocols.

What MeowKit adds is enforcement: gates that block rather than advise. The question it exists to answer is what happens when best practice stops being a suggestion.

MeowKit is MIT-licensed and community-driven. Skills, fixes, documentation, and unreasonable ideas for new agent workflows are all welcome. Open a PR.

Next steps

How It Works: architecture architecture walkthrough
Installation: get MeowKit running
Quick Start: your first task in 5 minutes

What Is MeowKit