Stagent

Cookbook

Tame Claude Code with stagent

Scenario 01

Build a deployable v1

Plan, build, verify, review, QA and deploy in one continuous cycle

Symptom

“Build me a v1 of this app.” You get 20 half-stubbed files, no clear deploy story, and the inevitable “almost done” loop. By the time you sit down to actually ship, your local branch and what's running in production have drifted apart — if production exists at all.

Why prompts don't fix it

v1 means deployable, not just compiled. Without one workflow that runs all the way through deploy, the agent stops at “tests pass” or “code looks clean” and you do the deploy by hand a week later. The intent (“ship a v1”) and the agent's stopping condition (“code written”) never line up.

“build me a v1 of <app idea>”

Workflow

Six stages run in sequence: planning interview → executing (subagent writes code) → verifying (quick tests) → reviewing (adversarial code review) → qa-ing (Playwright user-journey tests) → deploy (Vercel CLI to production). The execute/verify/review/qa inner loop converges on a clean build before deploy runs; deploy is interruptible so you can supply env vars or run `vercel login` mid-flow.

cloud://demo
/stagent:start --flow=cloud://demo <task_description>
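
For intuition, here is a minimal sketch of the converge-then-deploy shape this flow describes, in plain Python. It is not stagent's implementation; the check commands and the Vercel call are assumptions about a typical project.

```python
import subprocess

# Hypothetical inner-loop checks; a real flow runs the project's own commands.
CHECKS = {
    "verify": ["npm", "test"],
    "review": ["npm", "run", "lint"],
    "qa":     ["npx", "playwright", "test"],
}

def inner_loop(max_rounds: int = 5) -> bool:
    """Re-run every check until all of them pass in the same round."""
    for _ in range(max_rounds):
        failed = [name for name, cmd in CHECKS.items()
                  if subprocess.run(cmd).returncode != 0]
        if not failed:
            return True
        print(f"still failing: {failed}; executing fixes before the next round")
        # ...the execute stage would patch code here...
    return False

if inner_loop():
    # Deploy only after the build converges; `vercel --prod` targets production.
    subprocess.run(["vercel", "--prod"], check=True)
```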
Scenario 02

Hit a goal, verified

Loops execute → evaluate until an independent evaluator confirms the goal is met

Symptom

You hand the agent a goal ('improve homepage conversion', 'make this slow query fast', 'pass this onboarding flow'), it does some work and reports 'done' — but nobody actually checked whether the goal was met. Either the success bar was never written down, or 'looks roughly fine' got waved through.

Why prompts don't fix it

'Did some work' and 'goal achieved' are different things. 'Re-check before you tell me you're done' to an agent means 'glance at it once more, then say done' — not 'go back, re-run, compare against the bar, and iterate'. Without an evaluator standing between execute and complete, the agent always wins the argument with itself.

“only mark this done when the goal is actually achieved” / “don't forget to verify”

Workflow

Briefing turns the goal into verifiable success criteria you sign off on. Then an execute → evaluate loop runs: the evaluator (a separate subagent that didn't write the code) checks the criteria against fresh evidence; PASS terminates, FAIL bounces back to execute with the gap spelled out — until the bar is actually cleared.

cloud://jie-worldstatelabs/goal-pursuit
/stagent:start --flow=cloud://jie-worldstatelabs/goal-pursuit <task_description>
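
The control shape is easy to see in miniature. A sketch of the execute → evaluate loop with placeholder functions standing in for the executor and the independent evaluator; the names and criteria are invented, not stagent's API.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    gap: str = ""    # what is still missing, fed back to the executor

# Example success criteria the briefing stage would have you sign off on.
CRITERIA = [
    "p95 homepage load under 2s on staging",
    "signup conversion event fires on the thank-you page",
]

def execute(gap: str) -> None:
    """Placeholder: the executor does work, guided by the last reported gap."""
    ...

def evaluate() -> Verdict:
    """Placeholder: a separate evaluator re-checks each item in CRITERIA
    against fresh evidence, not against the executor's own claims."""
    return Verdict(passed=False, gap="p95 is 2.4s; criterion 1 not met")

gap = ""
for _ in range(10):                 # bounded, so a bad goal can't loop forever
    execute(gap)
    verdict = evaluate()
    if verdict.passed:
        print("PASS: goal met against the signed-off criteria")
        break
    gap = verdict.gap               # FAIL bounces back with the gap spelled out
```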
Scenario 03

Research before coding

Forces a cited research pass + plan before any line of code

Symptom

Asked to add a feature, the agent doesn't grep the codebase, doesn't check the official docs, just writes. The API version's wrong, an existing helper gets re-implemented, the new code clashes with the project's style.

Why prompts don't fix it

Saying "look at the existing code first" lasts about 3 turns. By the 10th decision the agent's back to writing from memory — and it judges "have I looked enough?" by feel, not by listing evidence.

"Read the existing code first" / "Check context7 for the API"

Workflow

A subagent greps the codebase, reads related modules, and consults docs, then writes a plan that cites every finding by ID. A separate executor implements strictly to the plan, with a verify loop until clean.

cloud://jie-worldstatelabs/research-based-implemantion
/stagent:start --flow=cloud://jie-worldstatelabs/research-based-implemantion <task_description>
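
A toy version of the cite-every-finding-by-ID contract. The data shapes are invented for illustration, but they show why an executor can be held to the plan: every step has to point at evidence the research pass actually produced.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    id: str      # e.g. "F1"
    source: str  # file:line or doc URL
    note: str

@dataclass
class PlanStep:
    action: str
    cites: list[str]   # finding IDs this step relies on

findings = [
    Finding("F1", "src/http/client.py:42", "retry logic already exists in a helper"),
    Finding("F2", "https://docs.example.com/api/v3", "v3 endpoint replaces the v2 one"),
]
plan = [
    PlanStep("reuse the existing retry helper instead of re-implementing", cites=["F1"]),
    PlanStep("call the v3 endpoint", cites=["F2"]),
]

known = {f.id for f in findings}
uncited = [s.action for s in plan if not s.cites or not set(s.cites) <= known]
assert not uncited, f"plan steps without real evidence: {uncited}"
```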
Scenario 04

Codebase deep-dive in parallel

Decomposes the question into N tasks, fans out N read-only sub-investigators, synthesizes one report

Symptom

Asked to understand a slice of an unfamiliar codebase — say “how does the rendering pipeline work end-to-end” — the agent serially Reads dozens of files in one window, runs out of context, and ends with shallow citations or hallucinated paths.

Why prompts don't fix it

A single agent doing depth-first investigation hits its context ceiling fast on real codebases. By the time it's read 30 files the early ones have rolled out of memory, citations get fuzzy, and what should be 5 focused answers becomes one rambling 4000-word reply that misses half the question.

“Explain how the rendering pipeline works in this repo”

Workflow

decompose splits the unknown into 3–8 well-scoped sub-questions. investigate fans out one read-only sub-investigator per question in parallel — each reads code only in its slice and returns a focused finding with file:line citations. synthesize stitches the per-task findings into one end-to-end answer that covers every question with evidence.

cloud://jie-worldstatelabs/parallel-research
/stagent:start --flow=cloud://jie-worldstatelabs/parallel-research <task_description>
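
The fan-out itself is ordinary concurrency. A sketch with a thread pool, where investigate stands in for a read-only sub-investigator answering one sub-question; the questions and the function are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

sub_questions = [
    "Where does the render loop start and what drives it?",
    "How are draw calls batched before submission?",
    "Where are shaders compiled and cached?",
]

def investigate(question: str) -> str:
    """Placeholder: read only the files in this slice and return a short
    finding with file:line citations."""
    return f"{question} -> finding with citations"

# investigate: one sub-investigator per question, in parallel
with ThreadPoolExecutor(max_workers=len(sub_questions)) as pool:
    findings = list(pool.map(investigate, sub_questions))

# synthesize: stitch the per-question findings into one end-to-end answer
print("\n\n".join(findings))
```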
Scenario 05

Stay strictly in scope

Locks scope upfront; diff hunks outside the contract are auto-reverted

Symptom

You ask for a fix to bug A. The agent also refactors B, renames C, and adds a helper for D. The PR is unreviewable, and a rollback takes the whole basket.

Why prompts don't fix it

"Only change what's necessary" is too subjective. Mid-write the agent thinks "this would be cleaner if I tidied this up too" — that's responsibility from its angle, scope explosion from yours.

"Only touch what you absolutely have to"

Workflow

Pre-declared scope contract + hunk-level diff audit. Any change outside the declared scope is reverted automatically before it reaches the final diff.

cloud://jie-worldstatelabs/scope-discipline
/stagent:start --flow=cloud://jie-worldstatelabs/scope-discipline <task_description>
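
The audit is mechanical once scope is a list instead of a vibe. A sketch at file granularity (the real flow works at hunk level); the scope patterns and git calls are illustrative.

```python
import fnmatch
import subprocess

# The scope contract, declared before any edit.
IN_SCOPE = ["src/billing/*.py", "tests/billing/*"]

changed = subprocess.run(
    ["git", "diff", "--name-only"], capture_output=True, text=True, check=True
).stdout.splitlines()

out_of_scope = [
    path for path in changed
    if not any(fnmatch.fnmatch(path, pattern) for pattern in IN_SCOPE)
]

for path in out_of_scope:
    # Anything the contract never mentioned is reverted before the final diff.
    subprocess.run(["git", "checkout", "--", path], check=True)
    print(f"reverted out-of-scope change: {path}")
```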
Scenario 06

Refactor without breaking behavior

Captures must-preserve behaviors as executable invariants; rollback on any drift

Symptom

You ask Claude to refactor a hot module 'same behavior, cleaner code'. It rewrites the whole file, your tests pass, and a week later prod breaks because Claude quietly dropped a retry, swallowed an exception, or changed timing. The behavior you cared about wasn't in the test suite to begin with.

Why prompts don't fix it

A prompt can't make Claude pin down which behaviors must be preserved as executable checks before any edit, decompose the refactor into atomic, rollback-able steps (each with its own invariant subset), retry with rollback on per-step failure, and then run a full invariant sweep at the end with auto-repair on regressions. Without those gates, the refactor is one big diff that passes or breaks as a unit — and 'tests pass' is not the same as 'invariants hold'.

Refactor this module — same behavior, cleaner code.

Workflow

Captures must-preserve behaviors as executable invariants, decomposes into atomic rollback-able steps, retries with rollback on per-step failure, and ends with a full invariant sweep that auto-repairs regressions.

cloud://jie-worldstatelabs/invariant-safe-refactor
/stagent:start --flow=cloud://jie-worldstatelabs/invariant-safe-refactor <task_description>
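
Invariant here means an executable check, not an intention. A sketch of the per-step gate: each atomic step must keep its invariant subset green or its edits are rolled back; the test paths, step names, and git calls are placeholders.

```python
import subprocess

# Must-preserve behaviors, captured as runnable checks before any edit.
INVARIANTS = {
    "retries_still_happen": ["pytest", "tests/test_retry.py", "-q"],
    "errors_still_surface": ["pytest", "tests/test_errors.py", "-q"],
}

def invariants_hold(names) -> bool:
    return all(subprocess.run(INVARIANTS[n]).returncode == 0 for n in names)

def apply_step(step: dict) -> None:
    """Placeholder: one small, self-contained refactor edit."""
    ...

steps = [
    {"name": "extract retry policy object", "checks": ["retries_still_happen"]},
    {"name": "flatten error handling",      "checks": ["errors_still_surface"]},
]

for step in steps:
    apply_step(step)
    if invariants_hold(step["checks"]):
        subprocess.run(["git", "commit", "-am", f"refactor: {step['name']}"], check=True)
    else:
        # Revert to the last good commit: only this step's edits are discarded.
        subprocess.run(["git", "checkout", "--", "."], check=True)
        print(f"rolled back: {step['name']}")

# Final sweep: every invariant must hold on the finished refactor.
assert invariants_hold(INVARIANTS), "full invariant sweep failed"
```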
Scenario 07

Fix the root cause, not the symptom

Reproduce → cited root cause → fix → independent re-validation

Symptom

Error? The agent adds `try/except: pass`. Assertion fails? It loosens the assertion. Test flakes? It adds a retry. The actual root cause is never found.

Why prompts don't fix it

"Find the root cause, don't add try/catch" — the agent thinks it already analyzed and that the except is "robust". You can't convince it of an attitude.

"Don't catch the exception, find the root cause"

Workflow

Reproduce → diagnose with cited evidence → fix → independent re-validation. Each stage gates the next; no fix lands without a cited root cause.

cloud://jie-worldstatelabs/root-cause-analysis
/stagent:start --flow=cloud://jie-worldstatelabs/root-cause-analysis <task_description>
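
The gate that kills symptom patches is the reproduction script itself: it has to fail before the fix and pass after it, and re-validation reruns the same script instead of trusting the fixer's word. A minimal sketch; the repro command is a placeholder.

```python
import subprocess

REPRO = ["python", "repro_crash.py"]   # hypothetical script that triggers the bug

def repro_fails() -> bool:
    return subprocess.run(REPRO).returncode != 0

def apply_fix() -> None:
    """Placeholder: the fix stage, allowed only with a cited root cause."""
    ...

# Reproduce: the bug must fail on record before anyone touches code.
assert repro_fails(), "cannot reproduce; no fix may land without a repro"

apply_fix()

# Independent re-validation: the same repro, rerun unchanged, must now pass.
assert not repro_fails(), "fix did not clear the original reproduction"
```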
Scenario 08

Repo-wide bug audit

Auto-discovers your toolchain, runs it repo-wide, applies surgical fixes until clean

Symptom

You ask Claude to find bugs across the app. It scans a few files it picked at random, lists generic pitfalls (race conditions! null checks!), edits one or two of them, and stops. Your actual test / lint / type-check / static-analysis toolchain never runs; nothing is re-verified after the edits; and you have no idea whether one round was enough.

Why prompts don't fix it

A prompt can't make Claude (a) auto-discover and run your project's actual detection toolchain across the whole repo, (b) emit findings as a structured ledger instead of prose, (c) verify every fix by replaying the exact detection commands plus regression checks, and (d) keep looping until verification is clean. Without those gates you get a one-round vibes-based audit that quietly leaves real bugs in the codebase.

Audit my app for bugs and fix what you find.

Workflow

Auto-discovers your project's real test / lint / type-check / static-analysis toolchain, runs it across the whole repo, applies surgical fixes, then replays the detection commands plus regression checks until verification is clean.

cloud://jie-worldstatelabs/bug-hunt
/stagent:start --flow=cloud://jie-worldstatelabs/bug-hunt <task_description>
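
Auto-discovery is mostly a lookup over project markers. A simplified sketch; the marker-to-command map is an assumption about common setups, not what the flow actually ships with.

```python
import pathlib
import subprocess

# Project marker -> detection commands it implies (illustrative, not exhaustive).
TOOLCHAIN = {
    "package.json":   [["npm", "test"], ["npx", "eslint", "."], ["npx", "tsc", "--noEmit"]],
    "pyproject.toml": [["pytest", "-q"], ["ruff", "check", "."], ["mypy", "."]],
}

commands = [cmd for marker, cmds in TOOLCHAIN.items()
            if pathlib.Path(marker).exists() for cmd in cmds]

# Loop: apply fixes, then replay the exact detection commands until all are clean.
for round_no in range(5):
    failing = [cmd for cmd in commands if subprocess.run(cmd).returncode != 0]
    if not failing:
        print("verification clean")
        break
    print(f"round {round_no}: {len(failing)} detection commands still failing")
    # ...surgical fixes would be applied here and recorded in a findings ledger...
```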
Scenario 09

Test before code (real TDD)

Proves a failing test in stderr before any implementation lands

Symptom

You say "test first." The agent says "got it." Then writes the implementation directly and leaves the test as a TODO.

Why prompts don't fix it

There's no state machine actually checking that a failing test ran before code went in. "Promise to TDD" doesn't survive the first refactor.

"Strict TDD please — RED before GREEN"

Workflow

Three stages: write a test that's proven failing in stderr → make the minimum change to pass it → refactor only with the whole suite green.

cloud://jie-worldstatelabs/strict-tdd
/stagent:start --flow=cloud://jie-worldstatelabs/strict-tdd <task_description>
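
The RED gate is just an exit-code check that runs before any implementation is allowed. A sketch using pytest; the test path is hypothetical.

```python
import subprocess

def is_red(test_path: str) -> bool:
    """True only if the new test ran and failed.
    pytest exit code 1 means tests were collected and some failed;
    other non-zero codes (collection errors, no tests found) don't count as RED."""
    return subprocess.run(["pytest", test_path, "-q"]).returncode == 1

# Gate: no implementation code may land until the failing test is on record.
assert is_red("tests/test_new_feature.py"), \
    "write a test that is proven failing before touching implementation code"
```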
Scenario 10

Build an end-to-end test suite

Builds a real user-journey test suite; classifies each failure as test-side or app-side

Symptom

You ask Claude for 'a thorough end-to-end test suite covering every feature'. Back come a dozen tests scattered across unit / integration / e2e categories — half of them poking internal helpers instead of real user flows. When a test fails, Claude can't say whether the test is wrong or the app is buggy, and silently edits production code to make tests green.

Why prompts don't fix it

A prompt can't pin down which flows count as 'the full user journey', which test category and framework each one belongs in, or the contract that the generator's only job is to make the suite itself green — with app-side bugs going to a separate ledger instead of blocking. Without those gates, Claude conflates test-side and app-side failures and over-edits production code to chase a pass.

Write a thorough end-to-end test suite for this app — cover every feature.

Workflow

Plan stage interviews scope, category, framework, and pass criteria; the writer subagent touches only test/; dry-run classifies each failure as test-side or app-side, looping until the suite itself runs cleanly. App bugs are logged separately and never block.

cloud://jie-worldstatelabs/prep-test-suite
/stagent:start --flow=cloud://jie-worldstatelabs/prep-test-suite <task_description>
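
For contrast with tests that poke internal helpers, a user-journey test drives the app the way a person would. A sketch in Playwright for Python; the URL, selectors, and credentials are hypothetical.

```python
# test_signup_journey.py, lives under test/ and never touches production code
from playwright.sync_api import sync_playwright

def test_signup_journey():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000/signup")        # hypothetical local app
        page.fill("input[name=email]", "demo@example.com")
        page.fill("input[name=password]", "correct-horse-battery")
        page.click("button[type=submit]")
        # Assert the journey's observable outcome, not a helper's return value.
        page.wait_for_selector("text=Welcome")
        browser.close()
```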
Scenario 11

Screenshot QA after UI change

Scores 8 visual dimensions on real Playwright screenshots until all pass

Symptom

After a UI change the agent inspects the code and declares it done. Visual regressions only show up when YOU open the browser.

Why prompts don't fix it

Code-only judgment can't catch alignment, contrast, or mobile reflow regressions. "Check it in the browser" is the right idea but nothing forces it to actually do it.

Workflow

Translates fuzzy UI goals into machine-verifiable assertions, then scores 8 dimensions from real Playwright screenshots and loops until every dimension clears its threshold.

cloud://jie-worldstatelabs/ui-eval-gate
/stagent:start --flow=cloud://jie-worldstatelabs/ui-eval-gate <task_description>
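
The evidence the gate scores is pixels, not code. A sketch of the capture step with Playwright for Python, taking the desktop and mobile screenshots the visual checks would then grade; the URL and viewport sizes are assumptions.

```python
from playwright.sync_api import sync_playwright

VIEWPORTS = {
    "desktop": {"width": 1440, "height": 900},
    "mobile":  {"width": 375,  "height": 812},
}

with sync_playwright() as p:
    browser = p.chromium.launch()
    for name, viewport in VIEWPORTS.items():
        page = browser.new_page(viewport=viewport)
        page.goto("http://localhost:3000/")              # hypothetical dev server
        page.screenshot(path=f"qa/{name}.png", full_page=True)
        page.close()
    browser.close()

# Each dimension (alignment, contrast, mobile reflow, ...) is then scored
# against these images, and the change loops until every threshold clears.
```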
Scenario 12

Measure-then-optimize perf

Locks a measurement contract; fixes that don't beat baseline by the threshold are auto-reverted

Symptom

You ask Claude to 'make /api/search faster'. It immediately suggests adding a cache, memoizing a hot loop, and batching DB calls — implements all three at once and tells you 'should be much faster now'. Nobody ran the slow request before, nobody runs it after. Whether tail latency went down, stayed flat, or regressed, you never find out.

Why prompts don't fix it

A prompt can't lock a measurement contract — repro script, primary metric (p50/p95/p99), watchdog metrics that must NOT regress, statistical significance threshold, run counts — before any code is touched. It can't constrain Claude to fix one hotspot at a time, re-measure with identical parameters, gate accept/reject mechanically (improvement AND significance AND watchdog budgets), and auto-revert + try the next candidate when a fix fails the gate. Without those, perf work compounds into plausible-looking changes that nobody can prove improved anything.

Make /api/search faster — this endpoint is slow.

Workflow

Locks a measurement contract (repro / metrics / threshold / runs) before any code change, baselines with profiler runs, fixes one hotspot at a time, re-measures with a significance test, and auto-reverts + tries the next candidate on reject.

cloud://jie-worldstatelabs/perf-gate
/stagent:start --flow=cloud://jie-worldstatelabs/perf-gate <task_description>
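
Once the contract exists, accept/reject is arithmetic. A simplified sketch of the gate: p95 over a fixed number of runs, a required improvement, and a watchdog metric that must not regress. The request function, thresholds, and run count are placeholder values, and a real contract would add a significance test.

```python
import statistics
import time

RUNS = 50
MIN_IMPROVEMENT = 0.10    # candidate p95 must beat baseline p95 by at least 10%
WATCHDOG_BUDGET = 0.05    # p50 may not regress by more than 5%

def measure(request_once) -> dict:
    """Time RUNS identical requests and summarize the distribution."""
    samples = []
    for _ in range(RUNS):
        start = time.perf_counter()
        request_once()                    # placeholder: hit /api/search once
        samples.append(time.perf_counter() - start)
    q = statistics.quantiles(samples, n=100)
    return {"p50": q[49], "p95": q[94]}

def gate(baseline: dict, candidate: dict) -> bool:
    improved    = candidate["p95"] <= baseline["p95"] * (1 - MIN_IMPROVEMENT)
    watchdog_ok = candidate["p50"] <= baseline["p50"] * (1 + WATCHDOG_BUDGET)
    return improved and watchdog_ok       # reject -> auto-revert, try next hotspot
```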
Scenario 13

Compliance gate

Walks a numbered regulatory checklist with cited evidence; ships only at zero violations

Symptom

Asking Claude 'is this app release-ready?' produces a confident-sounding summary — but there was never a real checklist to grade against. Half-relevant rules get cited (privacy policy! analytics consent!), the regulations that actually bind this release — driven by market, age target, deployment region, SDK set — get skipped. Severity isn't classified and no certificate survives for the next reviewer.

Why prompts don't fix it

A prompt can't tell Claude which regulations apply. Whether COPPA, GDPR, App Store / Play Store policies, or industry-specific rules bind this release depends on facts spread across the manifest, store listing, and privacy text. Without a discovery pass that infers the applicable regs first, Claude either over-cites generic rules or misses the specific clauses that block release.

Is this app ready to ship? Check it against any compliance rules that apply.

Workflow

discover_scope derives a numbered regulatory checklist from market / age / region / SDK; audit walks every C-ID with cited evidence; fix auto-resolves CRITICAL+HIGH; loops until zero violations and ships a release certificate.

cloud://jie-worldstatelabs/release-compliance-gate
/stagent:start --flow=cloud://jie-worldstatelabs/release-compliance-gate <task_description>
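
The ledger the audit walks is just structured data. A sketch of the shape and the ship gate; the checklist items, IDs, severities, and citations are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class CheckItem:
    cid: str            # numbered checklist ID, e.g. "C-07"
    rule: str
    severity: str       # CRITICAL / HIGH / MEDIUM / LOW
    evidence: str = ""  # citation gathered by the audit (manifest, listing, code)
    violated: bool = False

# discover_scope would derive these from market / age target / region / SDK facts.
checklist = [
    CheckItem("C-01", "privacy policy URL present in the store listing", "HIGH"),
    CheckItem("C-07", "no ad SDK initialized before consent",            "CRITICAL"),
]

def audit(items):
    for item in items:
        item.evidence = "AndroidManifest.xml:12"   # placeholder citation
        item.violated = False                      # placeholder verdict
    return items

for _ in range(3):                                 # audit -> fix -> re-audit until clean
    violations = [i for i in audit(checklist) if i.violated]
    if not violations:
        print("release certificate: every C-ID passes with cited evidence")
        break
    for item in violations:
        if item.severity in ("CRITICAL", "HIGH"):
            ...                                    # placeholder: auto-resolve this violation
```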

Workflow vs direct prompt

                       Direct prompt            Workflow
Rules retention        Diluted with context     Reloaded each stage
Pass/fail judgment     Agent self-approves      State machine enforces
Cross-session          Re-explain every time    Publish to hub once
Artifact audit         None                     Per-stage report
Visualization          Chat log                 Browser timeline + diff + artifacts

Browse what others built

Every scenario above is a starting template. The template hub collects everything anyone has published — use /stagent:start --flow=cloud://author/name to pull one and run it.