777genius d5894c029d feat(team): improve runtime bootstrap controls

2026-05-19 22:39:13 +03:00

Debugging Agent Teams

Use this runbook when a team launch hangs, a teammate is marked registered or failed_to_start, messages do not appear, or OpenCode participants look online but do not answer.

First Rule

Do not guess from the UI alone. Always correlate:

UI diagnostics copied from the launch/member detail panel
persisted team files under ~/.claude/teams/<teamName>/
live process table
runtime-specific evidence, especially OpenCode lane manifests
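Capturing the persisted files, the lane store, and the process table at the same moment makes them easier to correlate. The following is a minimal sketch, not part of the app: `snapshot_team_evidence` is a hypothetical helper name, and the output layout is an assumption. UI diagnostics still have to be copied by hand from the detail panel.

```shell
# Sketch: copy the evidence sources into one snapshot directory.
# snapshot_team_evidence is a hypothetical helper, not an app command.
snapshot_team_evidence() {
  local team_dir="$1" out_dir="$2" f
  mkdir -p "$out_dir" || return 1
  # Persisted team files; missing files are skipped, not treated as errors.
  for f in launch-state.json bootstrap-state.json bootstrap-journal.jsonl; do
    [ -f "$team_dir/$f" ] && cp "$team_dir/$f" "$out_dir/"
  done
  # Runtime-specific evidence: the OpenCode lane store, if present.
  if [ -d "$team_dir/.opencode-runtime" ]; then
    cp -R "$team_dir/.opencode-runtime" "$out_dir/opencode-runtime"
  fi
  # Live process table at the time of the snapshot.
  ps -eo pid,ppid,etime,command > "$out_dir/processes.txt" 2>/dev/null || true
  return 0
}
# Usage: snapshot_team_evidence "$HOME/.claude/teams/<team-name>" "/tmp/team-snapshot"
```

Taking the snapshot before any restart or cleanup keeps the process table consistent with the persisted state it is compared against.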

Key Files

Team root:

TEAM="<team-name>"
TEAM_DIR="$HOME/.claude/teams/$TEAM"
TASKS_DIR="$HOME/.claude/tasks/$TEAM"

Important files and folders:

config.json - configured members, provider/model selection, project path
members.meta.json - member metadata, removed members, worktree settings if present
launch-state.json - current app-side truth for member launch/liveness
bootstrap-state.json - bootstrap phase summary when present
bootstrap-journal.jsonl - ordered bootstrap events from the CLI/runtime
inboxes/*.json - durable inbox messages for user, lead, and native teammates
sentMessages.json - app-side sent-message records
$TASKS_DIR/*.json - task board state
.opencode-runtime/lanes.json - OpenCode lane index
.opencode-runtime/lanes/<encoded-lane-id>/manifest.json - lane-scoped runtime store manifest
.opencode-runtime/lanes/<encoded-lane-id>/opencode-sessions.json - committed OpenCode session evidence

Quick inspection:

jq '.teamLaunchState, .summary, .members' "$TEAM_DIR/launch-state.json"
jq '.lanes' "$TEAM_DIR/.opencode-runtime/lanes.json" 2>/dev/null
find "$TEAM_DIR/.opencode-runtime" -maxdepth 3 -type f | sort
tail -80 "$TEAM_DIR/bootstrap-journal.jsonl" 2>/dev/null

Launch Phases

Primary launch and OpenCode secondary lanes are different paths.

Primary CLI members are created by the main provisioning process.
OpenCode secondary members are launched as side lanes after primary filesystem readiness.
Missing inboxes/<opencode-member>.json is not automatically a launch bug. OpenCode side lanes do not need an inbox created by the primary launch before they start.
The UI can show the team still launching while primary members are already usable, because "all teammates joined" waits for secondary lanes too.

When a launch hangs at Prepared communication channels for X/Y members, check whether Y incorrectly includes secondary OpenCode members. The filesystem monitor should wait for effectiveMembers, not every requested member.
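To see which members the X/Y count might be waiting on, it can help to list configured members that have no inbox file yet. This is a rough sketch under a loud assumption: it treats config.json as having a `.members` collection whose items carry a `.name` field, which this runbook does not confirm. `members_without_inbox` is a hypothetical helper name.

```shell
# members_without_inbox TEAM_DIR
# ASSUMPTION: config.json has a .members collection whose items have a .name
# field. Adjust the jq path to the real schema before relying on the output.
members_without_inbox() {
  local team_dir="$1" name
  jq -r '.members[]?.name // empty' "$team_dir/config.json" |
  while IFS= read -r name; do
    # Print only members that have no inbox file on disk.
    [ -f "$team_dir/inboxes/$name.json" ] || printf '%s\n' "$name"
  done
}
# Usage: members_without_inbox "$TEAM_DIR"
```

If every name printed is a secondary OpenCode member, the missing inboxes are expected, and the question becomes whether the monitor is counting those members in Y.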

Teammate Runtime Debug Mode

Desktop launches use the app-managed process backend by default. That is the supported default for normal app launches because the app owns the process lifecycle, runtime logs, cleanup, and bootstrap evidence.

Live Smoke Runtime Launcher

Live/dev smoke tests should use the orchestrator source launcher by default:

CLAUDE_AGENT_TEAMS_ORCHESTRATOR_CLI_PATH=/Users/belief/dev/projects/claude/agent_teams_orchestrator/cli-source \
  pnpm vitest run --maxWorkers 1 --minWorkers 1 test/main/services/team/AnthropicLaunchSelection.live.test.ts

cli-source runs src/entrypoints/cli.tsx directly through Bun. Use it while developing launch/runtime code so the smoke test cannot accidentally pass or fail against a stale dist/local-cli/cli.js bundle.

For release or production-like smoke checks, test the built wrapper explicitly:

cd /Users/belief/dev/projects/claude/agent_teams_orchestrator
bun run build
cd /Users/belief/dev/projects/claude/claude_team
CLAUDE_AGENT_TEAMS_ORCHESTRATOR_CLI_PATH=/Users/belief/dev/projects/claude/agent_teams_orchestrator/cli \
  pnpm vitest run --maxWorkers 1 --minWorkers 1 test/main/services/team/AnthropicLaunchSelection.live.test.ts

The built wrapper cli reads dist/local-cli/cli.js. cli-dev reads dist/local-cli-dev/cli.js; it is useful for dev-bundle checks, but it is not the production wrapper.

For local debugging, force pane-backed teammates through tmux:

CLAUDE_TEAM_TEAMMATE_MODE=tmux pnpm dev

For a single launch from the UI, add this to custom CLI args:

--teammate-mode tmux

Expected behavior:

tmux mode should remove CLAUDE_TEAM_FORCE_PROCESS_TEAMMATES from the launch env.
The desktop app should pass --teammate-mode tmux to the runtime CLI.
The orchestrator should report backend_type: "tmux" and tmux_pane_id like %1.
If tmux is unavailable, the launch dialog should block explicit tmux mode with a tmux readiness message.
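When starting the app from a shell, the same readiness idea can be checked before exporting the mode. This is a minimal sketch; `tmux_ready` and `choose_teammate_mode` are hypothetical helper names, and the check only confirms that a tmux binary is on PATH, not that a tmux server can actually start.

```shell
# tmux_ready succeeds only if a tmux binary is found on PATH.
tmux_ready() {
  command -v tmux >/dev/null 2>&1
}

# choose_teammate_mode prints "tmux" when tmux is available and prints
# nothing otherwise, so the caller keeps the default process backend.
choose_teammate_mode() {
  if tmux_ready; then
    printf 'tmux\n'
  fi
}
# Usage:
#   mode=$(choose_teammate_mode)
#   if [ -n "$mode" ]; then CLAUDE_TEAM_TEAMMATE_MODE="$mode" pnpm dev; else pnpm dev; fi
```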

Use this mode to inspect interactive CLI behavior, terminal prompts, and pane output. Do not treat it as equivalent to the process backend for recovery semantics; persisted pane IDs can help discovery, but app restart does not make old panes a fully app-owned runtime again.

Live Smoke Runtime Launcher

Live/dev smoke checks should run the orchestrator from source unless the test explicitly says it is validating packaged output. This keeps app smoke tests aligned with the source tree and avoids a stale dist bundle hiding runtime changes.

Default live/dev smoke launcher:

export CLAUDE_AGENT_TEAMS_ORCHESTRATOR_CLI_PATH=/Users/belief/dev/projects/claude/agent_teams_orchestrator/cli-source

The source launcher executes src/entrypoints/cli.tsx through Bun. It is the right default for local debug loops, live model/provider checks, and cross-repo runtime fixes. It normalizes inherited NODE_ENV=production to NODE_ENV=development, because source smoke is a dev/runtime validation path. If you need production semantics, run the release smoke path below.

Local live/prove scripts should resolve their default CLI through scripts/lib/live-smoke-runtime.mjs, which points at cli-source unless CLAUDE_AGENT_TEAMS_ORCHESTRATOR_CLI_PATH is explicitly set.

Source-mode teammate startup can be slower than bundled startup, so live smoke harnesses may set CLAUDE_TEAM_PROCESS_RUNTIME_READY_TIMEOUT_MS and CLAUDE_TEAM_PROCESS_INBOX_POLLER_READY_TIMEOUT_MS to larger values when they are validating source behavior instead of watchdog latency.
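A harness might raise the two readiness timeouts like this. The numbers are purely illustrative assumptions; this runbook does not state the defaults or recommended values.

```shell
# Illustrative values only (180000 ms = 3 minutes); the real defaults are not
# documented here. A value already set by the caller wins.
export CLAUDE_TEAM_PROCESS_RUNTIME_READY_TIMEOUT_MS="${CLAUDE_TEAM_PROCESS_RUNTIME_READY_TIMEOUT_MS:-180000}"
export CLAUDE_TEAM_PROCESS_INBOX_POLLER_READY_TIMEOUT_MS="${CLAUDE_TEAM_PROCESS_INBOX_POLLER_READY_TIMEOUT_MS:-180000}"
```

Using the `${VAR:-default}` form keeps an explicitly configured timeout intact, so a run that is meant to measure watchdog latency is not silently loosened.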

Release or production-like smoke checks must validate the built wrapper:

cd /Users/belief/dev/projects/claude/agent_teams_orchestrator
bun run build
export CLAUDE_AGENT_TEAMS_ORCHESTRATOR_CLI_PATH=/Users/belief/dev/projects/claude/agent_teams_orchestrator/cli

cli reads dist/local-cli/cli.js. cli-dev reads dist/local-cli-dev/cli.js, so a passing cli-dev smoke is not proof that the production wrapper is fresh.

Member State Meanings

Common launch-state.json cases:

confirmed_alive with bootstrapConfirmed: true - member is usable.
registered / runtime_pending_bootstrap - process or lane exists, but bootstrap proof is not committed yet.
registered_only - app has persisted metadata, but no live runtime proof.
runtime_process_candidate - process/session was observed, but committed runtime evidence is incomplete or pending.
failed_to_start with runtime_process - a process exists, but the launch gate still failed. Inspect diagnostics and runtime evidence.
failed_to_start with stale_metadata - persisted pid/session is old or dead.

Do not treat member_briefing alone as runtime evidence. For OpenCode, the authoritative proof is committed bootstrap/session evidence in the lane runtime store.
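A quick way to find the members that still need attention is to list everyone whose bootstrapConfirmed is not true. This is a sketch that assumes `.members` in launch-state.json is an object keyed by member name, as the `.members[$member]` lookup in the OpenCode flow suggests; `unconfirmed_members` is a hypothetical helper name.

```shell
# unconfirmed_members LAUNCH_STATE_FILE
# Prints the name of every member whose bootstrapConfirmed is not true.
# ASSUMPTION: .members is an object keyed by member name.
unconfirmed_members() {
  jq -r '.members // {} | to_entries[]
         | select(.value.bootstrapConfirmed != true) | .key' "$1"
}
# Usage: unconfirmed_members "$TEAM_DIR/launch-state.json"
```

A member with a missing bootstrapConfirmed field is listed as well, which matches the rule that only committed proof counts as usable.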

OpenCode Debug Flow

For an OpenCode teammate:

MEMBER="<member-name>"
jq --arg member "$MEMBER" '.members[$member]' "$TEAM_DIR/launch-state.json"
jq '.lanes' "$TEAM_DIR/.opencode-runtime/lanes.json" 2>/dev/null
find "$TEAM_DIR/.opencode-runtime/lanes" -maxdepth 3 -type f | sort

Expected healthy OpenCode lane:

lanes.json lists the lane with state active
lane manifest.json has activeRunId
lane manifest has at least one runtime evidence entry, usually opencode.sessionStore
lane directory has opencode-sessions.json
launch-state.json member has runtimeRunId, runtimeSessionId, and bootstrapConfirmed: true

If the bridge says bootstrap succeeded but the manifest has entries: [], the issue is evidence commit, not model behavior. The member must not be considered deliverable until opencode-sessions.json and its manifest entry exist.

OpenCode bridge ledger, if needed:

LEDGER="$HOME/Library/Application Support/claude-agent-teams-ui/opencode-bridge/command-ledger.json"
jq --arg team "$TEAM" '.data[] | select(.teamName == $team)' "$LEDGER" 2>/dev/null

Live process checks:

PID="<pid-from-pgrep>"
pgrep -af "opencode serve"
ps -p "$PID" -o pid,ppid,etime,command

Do not kill all OpenCode processes as a debugging shortcut. First identify whether the pid belongs to the current team/lane. Some OpenCode temp libopentui.dylib files are held by live opencode serve processes and should only be cleaned after those processes are stopped.

Messaging Debug Flow

Lead and teammates use different delivery paths:

Lead reads stdin. Messages to lead go through relayLeadInboxMessages().
Native teammates read their inbox files directly.
OpenCode teammates receive prompts through runtime delivery and must reply via agent-teams_message_send.
Teammate-to-user replies should appear in inboxes/user.json or app sent-message projections.

If a notification appears but the Messages UI does not show it:

jq '.' "$TEAM_DIR/inboxes/user.json" 2>/dev/null
jq '.' "$TEAM_DIR/sentMessages.json" 2>/dev/null

Check from, to, messageId, relayOfMessageId, and taskRefs. Unknown authors should be rejected or normalized at the write boundary, not silently rendered as fake teammates.
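Projecting only the routing fields makes it easier to scan a long inbox. This is a sketch that assumes each inbox file is a JSON array of message objects, which this runbook does not state explicitly; `message_fields` is a hypothetical helper name.

```shell
# message_fields INBOX_FILE
# Prints one compact JSON object per message with only the routing fields.
# ASSUMPTION: the file is a JSON array of message objects.
# Fields that are absent from a message are printed as null.
message_fields() {
  jq -c '.[]? | {from, to, messageId, relayOfMessageId, taskRefs}' "$1"
}
# Usage: message_fields "$TEAM_DIR/inboxes/user.json"
```

A null `from` or `messageId` in the output is a sign that the record was written without passing the write-boundary validation described above.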

For OpenCode "message saved but not delivered" cases, inspect the OpenCode prompt-delivery ledger and response proof. Do not synthesize visible replies in the frontend.

Task And Work-Stall Debug Flow

For task stalls:

TASK="<short-or-full-task-id>"
rg -n "$TASK" "$TASKS_DIR" "$TEAM_DIR/inboxes" "$TEAM_DIR/bootstrap-journal.jsonl" 2>/dev/null

Important distinctions:

Delivery proof means the agent received the message.
Task progress proof means the agent made meaningful task progress.
A weak comment like "starting work" is not strong progress.
task_add_comment should be evaluated from the actual persisted comment text, not only from the tool call.

Task-stall monitor defaults:

The general task-stall monitor covers all agents.
OpenCode direct remediation is provider-specific and should nudge the OpenCode owner first.
If OpenCode remediation is not accepted, fall back to a lead alert.
Watchdog/remediation must not auto-start new OpenCode processes.

Task Log Stream Debug Flow

Task Log Stream is a projection, not a separate source of truth.

For OpenCode tasks, a healthy stream should show native tool rows such as read, bash, edit, write, plus Agent Teams MCP rows. If it only shows agent-teams_* calls:

confirm the task has OpenCode attribution for the member/session
confirm the OpenCode transcript contains native tools inside the bounded task window
check whether the task was assigned after the native work happened
do not widen attribution so far that unrelated session work is pulled into the task

If Changes says "No file changes recorded" while native write/edit rows exist, inspect the ledger/backfill path. Task logs can show runtime tools even when .board-task-changes/** was not created.

Safe Fix Checklist

Before changing launch or runtime logic:

Preserve stale-run, tombstone, stopped-team, and removed-member guards.
Do not treat member_briefing as runtime evidence.
Do not make delivery/watchdog auto-launch a fresh OpenCode lane.
Keep primary launch readiness separate from secondary OpenCode lane readiness.
Keep runtime evidence lane-scoped. Never let one OpenCode lane satisfy another lane.
Add a regression test for the exact state shape you found in launch-state.json.

Recommended verification:

pnpm test -- test/main/services/team/TeamProvisioningService.test.ts
pnpm test -- test/main/services/team/TeamAgentLaunchMatrix.safe-e2e.test.ts
pnpm typecheck
git diff --check

Use narrower test commands first when editing a focused path, then run the broader suite that covers launch, delivery, and liveness.
