Stagent
jie-worldstatelabs/ui-eval-gatepublic

UI-change workflow with mandatory Playwright-based visual evaluation — agent translates fuzzy goals into machine-verifiable assertions, scores 8 dimensions from real screenshots, loops until pass.

by jie-worldstatelabs · updated Apr 27, 2026 · 3 stages · 1 run

Run in Claude Code

/stagent:start --flow=cloud://jie-worldstatelabs/ui-eval-gate <task_description>

Paste in Claude Code and replace <task_description>

Template blueprint


Stage: briefing — briefing.md

inline · interruptible · transitions: approved → executing

Stage: briefing

Runtime config (canonical): workflow.json → stages.briefing

Purpose: translate the user's fuzzy UI goal into a machine-verifiable contract — repo context, human acceptance list, derived machine assertions, secrets path, signals, thresholds — so downstream stages can implement and evaluate against ground truth instead of vibes. Output artifact: write to the absolute path provided in your I/O context. Valid results this stage writes: pending (briefing in progress, awaiting user approval), approved (user has explicitly confirmed and pre-flight passes).

<HARD-GATE> Do NOT transition out of this stage until BOTH:

1. The user has explicitly approved the `human_acceptance` list, the secrets path, and the entry URL.
2. The pre-flight checklist (Playwright MCP available, target_url reachable or dev_server_start_command set, secrets_status ∈ {provided, not_required}) has passed.

Write result: approved only after both gates clear. </HARD-GATE>

This is an interruptible stage — the stop hook allows natural pauses for Q&A.

You are the main agent driving the briefing dialogue with the user. Read state.md for the current epoch. Immediately write the artifact at the path shown in your I/O context with result: pending so the stop hook knows the stage is in progress. Then run the three-phase dialogue below, iterating until the user approves and pre-flight passes; finally rewrite the artifact with result: approved.

Dialogue protocol — three phases

The user stays in natural language. You do the translation.

Phase 1 — Discover (silent, ~2 min)

Before asking anything, scan the project to ground every later question in real code. Use Glob, Grep, and Read only — do not run the app yet.

Extract:

  • Framework: Next.js / Vite+React / SvelteKit / Astro / Remix / Nuxt / plain HTML — read package.json, top-level configs.
  • Design system: shadcn/ui? Radix? MUI? Chakra? Mantine? Tailwind-only? Custom? — grep for @radix-ui, @mui, shadcn, tailwind.config.*, components/ui/.
  • Component inventory summary: glance at components/, app/, pages/, src/ — list the 5–10 most-edited or most-imported components.
  • Design tokens summary: read tailwind.config.* (colors, spacing, fontSize, radius, shadow), CSS custom properties in globals.css / index.css, theme files.

Compress what you found into 4–6 lines you'll cite back when proposing options. Do not dump file lists at the user.

Phase 2 — Translate (interactive, the conversation core)

Use the Discover findings to design forced-choice questions on real stack-level options. Avoid abstract design talk.

Tactics:

  • One question per message. Multiple choice (A/B/C/D — multi-select allowed) when possible.
  • Ground every option in something you literally saw in the repo. Example, after detecting shadcn/ui + Tailwind: "In your shadcn setup, 'modernize' usually means (a) bumping corner radius to rounded-2xl, (b) dropping weight from font-semibold to font-medium, (c) desaturating the primary color, or (d) doubling section padding. Which of these fit?"
  • Typical 3–6 questions. Stop when you can write a concrete assertion list.
  • Anchor goals to specific selectors / pages / components. "Is this change only the hero in app/(marketing)/page.tsx, or the whole home page?"

Internally — and without dumping the table at the user — translate fuzzy descriptors into measurable axes using mappings like:

| Fuzzy descriptor | Measurable axes |
| --- | --- |
| modern / high-end | radius ↑, softer shadows, font-weight ↓, saturation ↓, whitespace ↑ |
| compact / dense | padding ↓, line-height ↓, font-size ↓ one small step |
| professional / corporate | serif headline OR neutral sans, saturation ↓, whitespace ↑ |
| playful / friendly | accent saturation ↑, radius ↑, illustration / emoji-tier accents |
| minimal | palette → 2–3 hues, remove decorative borders, max-width content column |

Translate the user's selections plus concrete numbers (e.g. "16px corner radius", "main headline 32 → 28px") into a machine_assertions[] list of the shape shown under "Machine Assertions" in the briefing artifact below.

verify_method is one of:

  • browser_evaluate — runs JS in the page, returns a value the evaluator compares to expected.
  • vision — qualitative check the evaluator performs by looking at the screenshot it took (only when no DOM-measurable proxy exists, e.g. "logo placement feels balanced"). Use sparingly — every vision assertion is a partial escape hatch.
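A minimal sketch of one browser_evaluate entry, matching the artifact shape below — the id, selector, and expected value are illustrative assumptions, not part of the workflow config:

```jsonc
[
  {
    "id": "hero-radius-16",          // illustrative id
    "human_ref": "Hero card corners are rounded to 16px",
    "verify_method": "browser_evaluate",
    "verify_args": {
      // hypothetical selector — use one you actually saw in the repo
      "expression": "getComputedStyle(document.querySelector('.hero-card')).borderRadius"
    },
    "expected": "16px",
    "viewport": "desktop"
  }
]
```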

Phase 3 — Negotiate secrets (interactive)

Re-scan the repo for secret usage:

  • process.env.* references (in TS/JS) and equivalents.
  • .env.example, .env.local.example, env.d.ts.
  • Login / auth flow files: NextAuth/Auth.js config, Clerk/Supabase clients, custom login pages.
  • Third-party SDKs requiring keys (Stripe pk/sk, Sentry DSN, analytics).

Build the secrets_required list — only key names, why needed, source hint. Never write a value.

Then pick a path with the user. Default: ~/.config/stagent/secrets/<suffix>.env (the workflow suffix is ui-eval-gate, but if the user has multiple projects they may prefer a project-scoped path). Confirm absolute path.

Pick a fill mode with the user:

  • Dictation (口述) — the user reads values to you in chat; you Write them to the secrets file (still NEVER echo them in the artifact body).
  • Self-fill template (模板自填) — you Write a placeholder template (KEY=) to the secrets file, the user fills it in outside the chat, then types done here so you re-read and confirm presence (not values).

Set secrets_status:

  • provided — every required key is present in the file (you may grep ^KEY=.+, but DO NOT print the value).
  • not_required — there are no required secrets for this evaluation.
  • template_pending is NOT allowed to leave briefing — block on the user.

Login plan

Inspect login surface and pick login.type:

  • none — target_url is publicly reachable.
  • simple_form — username/password form on a login page; record {login_url, username_selector, password_selector, submit_selector, post_login_url_pattern} and read username/password from the secrets file at evaluator runtime.
  • storage_state — OAuth / 2FA / captcha detected → instruct user to do a one-time manual login in a fresh browser, export storage_state.json (Playwright recipe: npx playwright codegen --save-storage=...), record absolute path in secrets file. Do NOT introduce LLM-driven browsers (Browser Use etc.) — keep evaluator deterministic.
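A sketch of a simple_form login block in the artifact's Login shape — the URLs and selectors are illustrative assumptions, not a real app:

```jsonc
{
  "type": "simple_form",
  "details": {
    "login_url": "https://app.example.com/login",        // hypothetical
    "username_selector": "input[name=email]",
    "password_selector": "input[name=password]",
    "submit_selector": "button[type=submit]",
    "post_login_url_pattern": "**/dashboard"
  }
}
```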

Pre-flight checklist (run at the very end of the briefing, before flipping to approved)

Run these and report results inline. Any failure → fix or document → keep result: pending.

  1. Playwright MCP available — confirm mcp__playwright__browser_navigate, browser_resize, browser_take_screenshot are loaded in this session. If not, instruct user to enable the Playwright MCP plugin and re-confirm before approving.
  2. target_url reachable — `curl -sS -o /dev/null -w "%{http_code}\n" "<target_url>"` ⇒ 2xx/3xx. If it errors, ask the user whether to record dev_server_start_command for the evaluator to start the server, or block until the user starts it manually.
  3. secrets file readable — `test -f "$SECRETS_FILE" && test -r "$SECRETS_FILE"`. If secrets_required is non-empty, `grep -c '^[A-Z_][A-Z0-9_]*=.\+$' "$SECRETS_FILE"` ≥ count of required keys. Never print the values.

Briefing artifact shape

Write the output artifact (use the current epoch from state.md):

markdown
---
epoch: <epoch>
result: pending  # flip to `approved` only after user confirms AND pre-flight passes
---
# Briefing — <Topic>

## Target URL
<absolute http(s) URL>

## Dev Server Start Command
<shell command, or "null">

## Repo Context
- framework: <...>
- design_system: <...>
- component_inventory_summary: <one-line list>
- design_tokens_summary: <key tokens that matter for this change>

## Human Acceptance (user-approved)
- [ ] <plain-language fact 1, with concrete value/element>
- [ ] <plain-language fact 2>

## Machine Assertions (derived — do NOT show user)
```jsonc
[
  { "id": "...", "human_ref": "...", "verify_method": "browser_evaluate", "verify_args": {...}, "expected": "...", "viewport": "desktop" }
]
```

## Secrets
- secrets_file_path: `<absolute path>`
- secrets_required: `[{key, why_needed, source_hint}, ...]`  # NO values
- secrets_status: `provided` | `not_required`

## Login
```jsonc
{ "type": "none" | "simple_form" | "storage_state", "details": { ... } }
```

## Signals (which conditional dimensions to score)
```jsonc
{ "lighthouse": false, "performance": false, "console_zero_tolerance": true }
```

## Threshold
```jsonc
{ "total_min": 60, "brief_adherence_min": 6 }
```

## Pre-flight Results
- Playwright MCP: ✅ / ❌ <details>
- target_url reachable: ✅ / ❌ <http code>
- secrets file: ✅ / ❌ <required-key count vs found>

result: pending signals "briefing drafted but not yet approved or not yet pre-flight clean."

Get user approval

"Briefing saved to the session's briefing-report.md. Please review the Human Acceptance list and the Secrets plan, then confirm or request changes."

If the user requests changes, iterate inside the artifact body and the underlying machine_assertions — keep result: pending. Do NOT show the raw machine_assertions block to the user unless they explicitly ask; the human_acceptance list is what they audit.

Finalize

Once the user explicitly approves AND every pre-flight item is ✅, edit the artifact: change `result: pending` → `result: approved`.

That is the only action needed here. The SKILL.md main loop's step (e) reads the artifact's result: and calls update-status.sh to advance the state machine — do NOT call it yourself from this stage file.

Rules

  • NEVER write secret values into the artifact. Only key names, why, source hints, and the file path.
  • NEVER pre-emptively transition; the HARD-GATE above is the only path to approved.
  • Keep the user in natural language; the JSON machine_assertions block is for downstream stages, not for user review.
  • If the user gives a goal so vague that you cannot derive any measurable assertion after Phase 2, ask one more anchoring question rather than padding the assertion list with vision-only items.