Stagent
jie-worldstatelabs/perf-gate (public)

Profile, propose hotspots, fix one, re-measure, accept only if delta clears the threshold and is statistically significant; auto-revert on reject and try the next hotspot.

by jie-worldstatelabs · updated Apr 27, 2026 · 7 stages · 3 runs

Run in Claude Code

/stagent:start --flow=cloud://jie-worldstatelabs/perf-gate <task_description>

Paste in Claude Code and replace <task_description>
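
Before the stage-by-stage blueprint, here is a rough sketch of the loop that the description above compresses. Every helper name in it (apply_fix, measure, clears_gate, revert) is an illustrative placeholder; the real behavior is defined stage by stage in workflow.json.

```python
# Rough control-flow sketch of the perf-gate loop, not the template's code.
def perf_gate_loop(hotspots, baseline_runs, apply_fix, measure, clears_gate, revert):
    for hotspot in hotspots:                  # ranked by the profiling stage
        apply_fix(hotspot)                    # fix exactly one hotspot
        candidate_runs = measure()            # re-run the repro N times
        if clears_gate(baseline_runs, candidate_runs):
            return hotspot                    # accept: delta clears threshold + significance
        revert(hotspot)                       # reject: auto-revert, try the next hotspot
    return None                               # no hotspot cleared the gate
```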

Template blueprint

State machine


Stage: scoping

scoping.md

inline · interruptible · transitions: approved → baseline

Stage: scoping

Runtime config (canonical): workflow.json → stages.scoping

Purpose: lock the measurement contract with the user — domain, repro script, primary + watchdog metrics, threshold, profiler, and N (warmup + measurement runs). Validate the repro is deterministic and runnable before any fix work begins.

Output artifact: write to the absolute path provided in your prompt.

Valid results this stage writes: pending (contract drafted, awaiting user approval), approved (user explicitly confirmed).

<HARD-GATE> Do NOT transition out of this stage until the user explicitly confirms the contract. Write `result: approved` only after they have said so. </HARD-GATE>

This is an interruptible stage — the stop hook allows natural pauses for Q&A.

Note: by the time you read this, state.md already exists with status: scoping and the current epoch is set. Read state.md for the epoch, then immediately write the artifact at the path shown in your I/O context with result: pending so the stop hook knows the stage is in progress. Iterate, then overwrite with result: approved when the user signs off.

Step 1 — Repo recon

Scan the project worktree to detect the stack:

  • package.json → Node.js (look for "scripts": {"start", "build", "test", "bench"} and check dependencies for vite, next, webpack, clinic, hyperfine-as-script, etc.)
  • pyproject.toml / requirements.txt → Python (look for pytest-benchmark, scipy, py-spy)
  • Cargo.toml → Rust (look for [[bench]], criterion, flamegraph)
  • go.mod → Go (go test -bench, pprof)
  • pubspec.yaml → Flutter / Dart
  • Makefile → check for bench, profile, test targets
  • existing bench/, benchmarks/, perf/ directories — surface them; reuse if compatible

Record findings in repo_context = {language, runtime, package_manager, existing_benchmarks}.
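
A minimal sketch of this recon as a script, assuming the marker files listed above sit at the worktree root; the helper name and the mapping values are illustrative, not part of the template:

```python
from pathlib import Path

# Marker file -> (language, package_manager); values are illustrative guesses.
MARKERS = {
    "package.json": ("node", "npm"),
    "pyproject.toml": ("python", "pip"),
    "requirements.txt": ("python", "pip"),
    "Cargo.toml": ("rust", "cargo"),
    "go.mod": ("go", "go"),
    "pubspec.yaml": ("dart", "pub"),
}

def detect_repo_context(root: str = ".") -> dict:
    root_path = Path(root)
    language, package_manager = "unknown", "unknown"
    for marker, (lang, pm) in MARKERS.items():
        if (root_path / marker).exists():
            language, package_manager = lang, pm
            break
    benches = [d for d in ("bench", "benchmarks", "perf") if (root_path / d).is_dir()]
    return {
        "language": language,
        "runtime": language,               # refine by hand, e.g. node vs bun
        "package_manager": package_manager,
        "existing_benchmarks": ", ".join(benches) or "none",
    }
```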

Step 2 — Ask clarifying questions (one per message, multiple choice when possible)

Cover at minimum:

  1. Domain — pick one: api_latency / cli_startup / render_fps / build_time / bundle_size / memory_peak / db_query / custom. If custom, ask for a one-line description of what's being measured.

  2. Repro script — what command reproduces the workload? Constraints:

    • Must be runnable from the project root (or document the cd prefix)
    • Must be deterministic (fixed input, fixed seed, fixed dataset). If env varies, ask the user to pin it.
    • Must complete in a reasonable time (< 5 min per run is recommended; longer is allowed but slows the loop)
  3. Primary metric — {name, unit, lower_is_better}. Examples: {name:"p95_latency_ms", unit:"ms", lower_is_better:true}, {name:"fps_min", unit:"fps", lower_is_better:false}. Exactly ONE primary.

  4. Watchdog metrics — a list of secondary metrics, each {name, unit, lower_is_better, max_regression_pct}. Defaults to suggest based on domain:

    • api_latency / cli_startup → memory_peak (10%)
    • render_fps → frame_p99_ms (15%), memory_peak (10%)
    • build_time → bundle_size (5%)
    • bundle_size → build_time (10%)
    • memory_peak → p95_latency_ms or domain-equivalent throughput (10%)
    • db_query → lock_wait_ms (10%)

    User can add/remove. Watchdogs are optional; default empty if the user opts out.
  5. Threshold — {type, pct_min, absolute_min, significance_required}. Default {type:"pct", pct_min:20, significance_required:true}. Confirm or adjust. (A gate-evaluation sketch follows this list.)

  6. Run counts — repro_warmup_runs (default 1) and repro_measurement_runs (default 5, min 3). Confirm or adjust based on repro cost.

  7. Profiler — recommend per stack from this table, then let the user pick:

    | Domain / stack | Recommended tool |
    | --- | --- |
    | Node.js CPU-bound | node --prof + clinic flame / clinic doctor |
    | Node.js startup | node --cpu-prof + hyperfine for wall time |
    | Browser render / FPS | Chrome DevTools mcp__plugin_chrome-devtools-mcp_chrome-devtools__performance_start_trace |
    | Python | py-spy record -o profile.svg -- <cmd> or cProfile + snakeviz |
    | Go | go test -cpuprofile=cpu.prof or runtime/pprof |
    | Rust | cargo flamegraph or samply |
    | Bundle size | vite-bundle-visualizer / source-map-explorer |
    | CLI wall time | hyperfine --warmup 1 --runs N -- '<cmd>' |
    | DB query | EXPLAIN ANALYZE + pg_stat_statements |
    | Memory peak (C/C++) | valgrind --tool=massif |
    | Memory peak (Node) | node --heap-prof |
    | Memory peak (Python) | tracemalloc |
    | Generic macOS | Instruments (Time Profiler / Allocations) |

    Capture profiler = {tool, command, output_path}; command is the full shell command including the output file path.

  8. Significance test — default mann_whitney (uses scipy if importable; falls back to mean_2sigma). Ask only if the user has a strong preference.

  9. Secrets — if the repro needs credentials/API keys, ask for a secrets_file_path (path reference only — sensitive values stay local, never enter cloud artifacts). Otherwise leave blank.

Stop asking when you have enough; usually 4–7 turns covers it.
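
Taken together, items 4, 5, and 8 define the accept gate the later stages will apply. A minimal sketch of how that gate could be evaluated, assuming mean-based deltas and scipy's mannwhitneyu; the function names and the dict shapes below are illustrative, not fields of workflow.json:

```python
from statistics import mean, stdev

def is_significant(baseline: list[float], candidate: list[float]) -> bool:
    """mann_whitney via scipy when importable, else the mean_2sigma fallback."""
    try:
        from scipy.stats import mannwhitneyu
        return mannwhitneyu(baseline, candidate, alternative="two-sided").pvalue < 0.05
    except ImportError:
        # mean_2sigma: candidate mean must sit more than 2 sigma from the baseline mean
        return abs(mean(candidate) - mean(baseline)) > 2 * stdev(baseline)

def clears_gate(baseline: list[float], candidate: list[float],
                threshold: dict, lower_is_better: bool = True) -> bool:
    """Primary-metric gate: threshold (item 5) plus significance (item 8)."""
    b, c = mean(baseline), mean(candidate)
    delta = (b - c) if lower_is_better else (c - b)
    if threshold.get("pct_min") and 100.0 * delta / b < threshold["pct_min"]:
        return False
    if threshold.get("absolute_min") and delta < threshold["absolute_min"]:
        return False
    if threshold.get("significance_required") and not is_significant(baseline, candidate):
        return False
    return True

def watchdogs_ok(watchdogs: list[dict], baseline: dict, candidate: dict) -> bool:
    """Watchdog gate (item 4): reject if any secondary metric regresses too far."""
    for w in watchdogs:
        b, c = baseline[w["name"]], candidate[w["name"]]
        regression = (c - b) if w["lower_is_better"] else (b - c)
        if 100.0 * regression / b > w["max_regression_pct"]:
            return False
    return True

# Example: five baseline runs around 120 ms vs five candidate runs around 90 ms
# clears a 20% gate on both the delta (~25%) and the significance check.
# clears_gate([120, 118, 121, 119, 122], [90, 92, 88, 91, 89],
#             {"type": "pct", "pct_min": 20, "significance_required": True})  # True
```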

Step 3 — Pre-flight

Run the proposed repro_script once to confirm it works end-to-end. You don't need final timing data here — you're checking:

  • Exit code 0
  • The metric is observable in the output (or the profiler/timer surrounds it correctly)
  • Determinism plausibly holds (no obvious randomness, no time-dependent input)

If the repro fails, iterate with the user. Do NOT advance.
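
A sketch of how this pre-flight could be automated, assuming the repro is a shell command that prints the primary metric to stdout; the helper name, the regex, and the 10% spread heuristic are assumptions for illustration only:

```python
import re
import subprocess

def preflight(repro_script: str,
              metric_pattern: str = r"p95_latency_ms[=:]\s*([\d.]+)") -> dict:
    """Run the repro twice and check the three bullets above."""
    values = []
    for _ in range(2):
        proc = subprocess.run(repro_script, shell=True, capture_output=True, text=True)
        if proc.returncode != 0:                      # bullet 1: exit code 0
            return {"ok": False, "reason": f"exit code {proc.returncode}",
                    "stderr": proc.stderr[-500:]}
        match = re.search(metric_pattern, proc.stdout)
        if not match:                                 # bullet 2: metric observable
            return {"ok": False, "reason": "primary metric not found in output"}
        values.append(float(match.group(1)))
    # bullet 3: determinism plausibly holds; the 10% spread cutoff is a rough
    # heuristic for this sketch, not a rule from the contract
    spread = abs(values[0] - values[1]) / max(values[0], 1e-9)
    return {"ok": True, "observed": values, "plausibly_deterministic": spread < 0.10}
```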

Step 4 — Write the contract

Write the output artifact at the path in your I/O context:

```markdown
---
epoch: <epoch from state.md>
result: pending
---
# Scoping Report: <topic>

## Repo Context
- language: <...>
- runtime: <...>
- package_manager: <...>
- existing_benchmarks: <path or "none">

## Contract

### Domain
<one of the enum values>

### Repro
- script: `<command>`
- warmup_runs: <N>
- measurement_runs: <N>

### Metrics
- primary: `{name, unit, lower_is_better}`
- watchdogs:
  - `{name, unit, lower_is_better, max_regression_pct}`
  - ...

### Threshold
- type: <pct|absolute|both>
- pct_min: <number>
- absolute_min: <number>
- significance_required: <bool>

### Significance Test
<mann_whitney | mean_2sigma>

### Profiler
- tool: <name>
- command: `<full shell command writing to output_path>`
- output_path: `<relative or absolute path>`

### Secrets
<path_reference or "none">

## Pre-flight result
<exit code, observed metric value, determinism notes>
```

Step 5 — Approval

"Contract drafted. Please review and confirm to start measurement, or request changes."

If the user requests changes, iterate the body. Keep result: pending.

Step 6 — Finalize

Once the user explicitly approves, edit the artifact: change `result: pending` → `result: approved`. That is the only action needed here. The main loop reads the artifact's `result:` and calls update-status.sh to advance — do NOT call it yourself.

workflow.json (raw config) drives the state machine above.