Stagent
jie-worldstatelabs/perf-gate (public)

Profile, propose hotspots, fix one, re-measure, accept only if delta clears the threshold and is statistically significant; auto-revert on reject and try the next hotspot.

by jie-worldstatelabs · updated Apr 27, 2026 · 7 stages · 3 runs

Run in Claude Code

/stagent:start --flow=cloud://jie-worldstatelabs/perf-gate <task_description>

Paste in Claude Code and replace <task_description>
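
Before the stage-by-stage blueprint, here is a rough sketch of the loop that the description above compresses. Every helper name in it (apply_fix, measure, clears_gate, revert) is an illustrative placeholder; the real behavior is defined stage by stage in workflow.json.

```python
# Rough control-flow sketch of the perf-gate loop, not the template's code.
def perf_gate_loop(hotspots, baseline_runs, apply_fix, measure, clears_gate, revert):
    for hotspot in hotspots:                  # ranked by the profiling stage
        apply_fix(hotspot)                    # fix exactly one hotspot
        candidate_runs = measure()            # re-run the repro N times
        if clears_gate(baseline_runs, candidate_runs):
            return hotspot                    # accept: delta clears threshold + significance
        revert(hotspot)                       # reject: auto-revert, try the next hotspot
    return None                               # no hotspot cleared the gate
```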

Template blueprint

State machine


Stage: scoping

scoping.md

inline · interruptible · transitions: approved → baseline

Stage: scoping

Runtime config (canonical): workflow.json → stages.scoping

Purpose: lock the measurement contract with the user — domain, repro script, primary + watchdog metrics, threshold, profiler, and N (warmup + measurement runs). Validate the repro is deterministic and runnable before any fix work begins.

Output artifact: write to the absolute path provided in your prompt.

Valid results this stage writes: pending (contract drafted, awaiting user approval), approved (user explicitly confirmed).

<HARD-GATE> Do NOT transition out of this stage until the user explicitly confirms the contract. Write `result: approved` only after they have said so. </HARD-GATE>

This is an interruptible stage — the stop hook allows natural pauses for Q&A.

Note: by the time you read this, state.md already exists with status: scoping and the current epoch is set. Read state.md for the epoch, then immediately write the artifact at the path shown in your I/O context with result: pending so the stop hook knows the stage is in progress. Iterate, then overwrite with result: approved when the user signs off.

Step 1 — Repo recon

Scan the project worktree to detect the stack:

  • package.json → Node.js (look for "scripts": {"start", "build", "test", "bench"} and check dependencies for vite, next, webpack, clinic, hyperfine-as-script, etc.)
  • pyproject.toml / requirements.txt → Python (look for pytest-benchmark, scipy, py-spy)
  • Cargo.toml → Rust (look for [[bench]], criterion, flamegraph)
  • go.mod → Go (go test -bench, pprof)
  • pubspec.yaml → Flutter / Dart
  • Makefile → check for bench, profile, test targets
  • existing bench/, benchmarks/, perf/ directories — surface them; reuse if compatible

Record findings in repo_context = {language, runtime, package_manager, existing_benchmarks}.
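
A minimal sketch of this recon as a script, assuming the marker files listed above sit at the worktree root; the helper name and the mapping values are illustrative, not part of the template:

```python
from pathlib import Path

# Marker file -> (language, package_manager); values are illustrative guesses.
MARKERS = {
    "package.json": ("node", "npm"),
    "pyproject.toml": ("python", "pip"),
    "requirements.txt": ("python", "pip"),
    "Cargo.toml": ("rust", "cargo"),
    "go.mod": ("go", "go"),
    "pubspec.yaml": ("dart", "pub"),
}

def detect_repo_context(root: str = ".") -> dict:
    root_path = Path(root)
    language, package_manager = "unknown", "unknown"
    for marker, (lang, pm) in MARKERS.items():
        if (root_path / marker).exists():
            language, package_manager = lang, pm
            break
    benches = [d for d in ("bench", "benchmarks", "perf") if (root_path / d).is_dir()]
    return {
        "language": language,
        "runtime": language,               # refine by hand, e.g. node vs bun
        "package_manager": package_manager,
        "existing_benchmarks": ", ".join(benches) or "none",
    }
```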

Step 2 — Ask clarifying questions (one per message, multiple choice when possible)

Cover at minimum:

  1. Domain — pick one: api_latency / cli_startup / render_fps / build_time / bundle_size / memory_peak / db_query / custom. If custom, ask for a one-line description of what's being measured.

  2. Repro script — what command reproduces the workload? Constraints:

    • Must be runnable from the project root (or document the cd prefix)
    • Must be deterministic (fixed input, fixed seed, fixed dataset). If env varies, ask the user to pin it.
    • Must complete in a reasonable time (< 5 min per run is recommended; longer is allowed but slows the loop)
  3. Primary metric — {name, unit, lower_is_better}. Examples: {name:"p95_latency_ms", unit:"ms", lower_is_better:true}, {name:"fps_min", unit:"fps", lower_is_better:false}. Exactly ONE primary.

  4. Watchdog metrics — a list of secondary metrics, each {name, unit, lower_is_better, max_regression_pct}. Defaults to suggest based on domain:

    • api_latency / cli_startup → memory_peak (10%)
    • render_fps → frame_p99_ms (15%), memory_peak (10%)
    • build_time → bundle_size (5%)
    • bundle_size → build_time (10%)
    • memory_peak → p95_latency_ms or domain-equivalent throughput (10%)
    • db_query → lock_wait_ms (10%)

    User can add/remove. Watchdogs are optional; default empty if the user opts out.
  5. Threshold — {type, pct_min, absolute_min, significance_required}. Default {type:"pct", pct_min:20, significance_required:true}. Confirm or adjust. (A gate-evaluation sketch follows this list.)

  6. Run counts — repro_warmup_runs (default 1) and repro_measurement_runs (default 5, min 3). Confirm or adjust based on repro cost.

  7. Profiler — recommend per stack from this table, then let the user pick:

    | Domain / stack | Recommended tool |
    | --- | --- |
    | Node.js CPU-bound | node --prof + clinic flame / clinic doctor |
    | Node.js startup | node --cpu-prof + hyperfine for wall time |
    | Browser render / FPS | Chrome DevTools mcp__plugin_chrome-devtools-mcp_chrome-devtools__performance_start_trace |
    | Python | py-spy record -o profile.svg -- <cmd> or cProfile + snakeviz |
    | Go | go test -cpuprofile=cpu.prof or runtime/pprof |
    | Rust | cargo flamegraph or samply |
    | Bundle size | vite-bundle-visualizer / source-map-explorer |
    | CLI wall time | hyperfine --warmup 1 --runs N -- '<cmd>' |
    | DB query | EXPLAIN ANALYZE + pg_stat_statements |
    | Memory peak (C/C++) | valgrind --tool=massif |
    | Memory peak (Node) | node --heap-prof |
    | Memory peak (Python) | tracemalloc |
    | Generic macOS | Instruments (Time Profiler / Allocations) |

    Capture profiler = {tool, command, output_path}; command is the full shell command including the output file path.

  8. Significance test — default mann_whitney (uses scipy if importable; falls back to mean_2sigma). Ask only if the user has a strong preference.

  9. Secrets — if the repro needs credentials/API keys, ask for a secrets_file_path (path reference only — sensitive values stay local, never enter cloud artifacts). Otherwise leave blank.

Stop asking when you have enough; usually 4–7 turns covers it.
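
Taken together, items 4, 5, and 8 define the accept gate the later stages will apply. A minimal sketch of how that gate could be evaluated, assuming mean-based deltas and scipy's mannwhitneyu; the function names and the dict shapes below are illustrative, not fields of workflow.json:

```python
from statistics import mean, stdev

def is_significant(baseline: list[float], candidate: list[float]) -> bool:
    """mann_whitney via scipy when importable, else the mean_2sigma fallback."""
    try:
        from scipy.stats import mannwhitneyu
        return mannwhitneyu(baseline, candidate, alternative="two-sided").pvalue < 0.05
    except ImportError:
        # mean_2sigma: candidate mean must sit more than 2 sigma from the baseline mean
        return abs(mean(candidate) - mean(baseline)) > 2 * stdev(baseline)

def clears_gate(baseline: list[float], candidate: list[float],
                threshold: dict, lower_is_better: bool = True) -> bool:
    """Primary-metric gate: threshold (item 5) plus significance (item 8)."""
    b, c = mean(baseline), mean(candidate)
    delta = (b - c) if lower_is_better else (c - b)
    if threshold.get("pct_min") and 100.0 * delta / b < threshold["pct_min"]:
        return False
    if threshold.get("absolute_min") and delta < threshold["absolute_min"]:
        return False
    if threshold.get("significance_required") and not is_significant(baseline, candidate):
        return False
    return True

def watchdogs_ok(watchdogs: list[dict], baseline: dict, candidate: dict) -> bool:
    """Watchdog gate (item 4): reject if any secondary metric regresses too far."""
    for w in watchdogs:
        b, c = baseline[w["name"]], candidate[w["name"]]
        regression = (c - b) if w["lower_is_better"] else (b - c)
        if 100.0 * regression / b > w["max_regression_pct"]:
            return False
    return True

# Example: five baseline runs around 120 ms vs five candidate runs around 90 ms
# clears a 20% gate on both the delta (~25%) and the significance check.
# clears_gate([120, 118, 121, 119, 122], [90, 92, 88, 91, 89],
#             {"type": "pct", "pct_min": 20, "significance_required": True})  # True
```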

Step 3 — Pre-flight

Run the proposed repro_script once to confirm it works end-to-end. You don't need final timing data here — you're checking:

  • Exit code 0
  • The metric is observable in the output (or the profiler/timer surrounds it correctly)
  • Determinism plausibly holds (no obvious randomness, no time-dependent input)

If the repro fails, iterate with the user. Do NOT advance.
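
A sketch of how this pre-flight could be automated, assuming the repro is a shell command that prints the primary metric to stdout; the helper name, the regex, and the 10% spread heuristic are assumptions for illustration only:

```python
import re
import subprocess

def preflight(repro_script: str,
              metric_pattern: str = r"p95_latency_ms[=:]\s*([\d.]+)") -> dict:
    """Run the repro twice and check the three bullets above."""
    values = []
    for _ in range(2):
        proc = subprocess.run(repro_script, shell=True, capture_output=True, text=True)
        if proc.returncode != 0:                      # bullet 1: exit code 0
            return {"ok": False, "reason": f"exit code {proc.returncode}",
                    "stderr": proc.stderr[-500:]}
        match = re.search(metric_pattern, proc.stdout)
        if not match:                                 # bullet 2: metric observable
            return {"ok": False, "reason": "primary metric not found in output"}
        values.append(float(match.group(1)))
    # bullet 3: determinism plausibly holds; the 10% spread cutoff is a rough
    # heuristic for this sketch, not a rule from the contract
    spread = abs(values[0] - values[1]) / max(values[0], 1e-9)
    return {"ok": True, "observed": values, "plausibly_deterministic": spread < 0.10}
```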

Step 4 — Write the contract

Write the output artifact at the path in your I/O context:

```markdown
---
epoch: <epoch from state.md>
result: pending
---
# Scoping Report: <topic>

## Repo Context
- language: <...>
- runtime: <...>
- package_manager: <...>
- existing_benchmarks: <path or "none">

## Contract

### Domain
<one of the enum values>

### Repro
- script: `<command>`
- warmup_runs: <N>
- measurement_runs: <N>

### Metrics
- primary: `{name, unit, lower_is_better}`
- watchdogs:
  - `{name, unit, lower_is_better, max_regression_pct}`
  - ...

### Threshold
- type: <pct|absolute|both>
- pct_min: <number>
- absolute_min: <number>
- significance_required: <bool>

### Significance Test
<mann_whitney | mean_2sigma>

### Profiler
- tool: <name>
- command: `<full shell command writing to output_path>`
- output_path: `<relative or absolute path>`

### Secrets
<path_reference or "none">

## Pre-flight result
<exit code, observed metric value, determinism notes>
```

Step 5 — Approval

"Contract drafted. Please review and confirm to start measurement, or request changes."

If the user requests changes, iterate the body. Keep result: pending.

Step 6 — Finalize

Once the user explicitly approves, edit the artifact: change `result: pending` → `result: approved`. That is the only action needed here. The main loop reads the artifact's `result:` and calls update-status.sh to advance — do NOT call it yourself.

workflow.json (raw config) drives the state machine above.