Debugging Agent Teams
Use this runbook when a team launch hangs, a teammate is marked registered or failed_to_start, messages do not appear, or OpenCode participants look online but do not answer.
First Rule
Do not guess from the UI alone. Always correlate:
- UI diagnostics copied from the launch/member detail panel
- persisted team files under ~/.claude/teams/<teamName>/
- live process table
- runtime-specific evidence, especially OpenCode lane manifests
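One way to make that correlation repeatable is to capture the file, process, and lane evidence in a single pass, so all three reflect the same moment. The sketch below is only an illustration: the function name snapshot_team_evidence, the output layout, and the MISSING.txt file are hypothetical, and UI diagnostics still have to be pasted in by hand.

```shell
# Hypothetical helper: snapshot_team_evidence TEAM_DIR OUT_DIR
# Copies persisted team files, the live process table, and OpenCode lane
# files into OUT_DIR. Names and layout are assumptions for illustration.
snapshot_team_evidence() {
  team_dir=$1
  out_dir=$2
  mkdir -p "$out_dir" || return 1

  # Persisted team files: copy what exists and record what does not,
  # because a missing file is evidence too.
  for f in config.json launch-state.json bootstrap-state.json bootstrap-journal.jsonl; do
    if [ -f "$team_dir/$f" ]; then
      cp "$team_dir/$f" "$out_dir/$f"
    else
      printf '%s\n' "$f" >> "$out_dir/MISSING.txt"
    fi
  done

  # Live process table at capture time (falls back to plain ps if the
  # column list is not supported on this platform).
  ps -e -o pid,ppid,etime,command > "$out_dir/processes.txt" 2>/dev/null ||
    ps > "$out_dir/processes.txt"

  # Runtime-specific evidence: OpenCode lane files, when present.
  if [ -d "$team_dir/.opencode-runtime" ]; then
    cp -R "$team_dir/.opencode-runtime" "$out_dir/opencode-runtime"
  fi
}
```

A typical call would be snapshot_team_evidence "$HOME/.claude/teams/<team-name>" "/tmp/team-evidence"; comparing two snapshots taken a minute apart shows whether state is still moving.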
Key Files
Team root:
TEAM="<team-name>"
TEAM_DIR="$HOME/.claude/teams/$TEAM"
TASKS_DIR="$HOME/.claude/tasks/$TEAM"
Important files and folders:
- config.json - configured members, provider/model selection, project path
- members.meta.json - member metadata, removed members, worktree settings if present
- launch-state.json - current app-side truth for member launch/liveness
- bootstrap-state.json - bootstrap phase summary when present
- bootstrap-journal.jsonl - ordered bootstrap events from the CLI/runtime
- inboxes/*.json - durable inbox messages for user, lead, and native teammates
- sentMessages.json - app-side sent-message records
- $TASKS_DIR/*.json - task board state
- .opencode-runtime/lanes.json - OpenCode lane index
- .opencode-runtime/lanes/<encoded-lane-id>/manifest.json - lane-scoped runtime store manifest
- .opencode-runtime/lanes/<encoded-lane-id>/opencode-sessions.json - committed OpenCode session evidence
Quick inspection:
jq '.teamLaunchState, .summary, .members' "$TEAM_DIR/launch-state.json"
jq '.lanes' "$TEAM_DIR/.opencode-runtime/lanes.json" 2>/dev/null
find "$TEAM_DIR/.opencode-runtime" -maxdepth 3 -type f | sort
tail -80 "$TEAM_DIR/bootstrap-journal.jsonl" 2>/dev/null
Launch Phases
Primary launch and OpenCode secondary lanes are different paths.
- Primary CLI members are created by the main provisioning process.
- OpenCode secondary members are launched as side lanes after primary filesystem readiness.
- Missing inboxes/<opencode-member>.json is not automatically a launch bug. OpenCode side lanes do not have to be primary inbox-created before they start.
- The UI can show the team still launching while primary members are already usable, because "all teammates joined" waits for secondary lanes too.
When a launch hangs at Prepared communication channels for X/Y members, check whether Y incorrectly includes secondary OpenCode members. The filesystem monitor should wait for effectiveMembers, not every requested member.
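The X/Y check can be reproduced by hand. The sketch below assumes you have written two plain-text lists yourself, one member name per line: every requested member, and the members that are OpenCode secondary lanes. The function names and list files are hypothetical; only the inboxes/<member>.json layout comes from this runbook.

```shell
# Hypothetical helper: effective_members REQUESTED_FILE SECONDARY_FILE
# Prints requested members that are not OpenCode secondary lanes.
effective_members() {
  requested=$1
  secondary=$2
  if [ -s "$secondary" ]; then
    # -x whole-line match, -F fixed strings, -v keep non-matching lines.
    grep -vxF -f "$secondary" "$requested"
  else
    cat "$requested"
  fi
}

# Hypothetical helper: missing_inboxes TEAM_DIR REQUESTED_FILE SECONDARY_FILE
# Prints effective members that still have no inboxes/<member>.json.
missing_inboxes() {
  effective_members "$2" "$3" | while IFS= read -r member; do
    [ -n "$member" ] || continue
    [ -f "$1/inboxes/$member.json" ] || printf '%s\n' "$member"
  done
}
```

If missing_inboxes prints nothing while the UI still reports X < Y, the denominator most likely counts secondary lanes that were never expected to have a primary inbox.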
Teammate Runtime Debug Mode
Desktop launches use the app-managed process backend by default. That is the supported default for normal app launches because the app owns the process lifecycle, runtime logs, cleanup, and bootstrap evidence.
Live Smoke Runtime Launcher
Live/dev smoke tests should use the orchestrator source launcher by default:
CLAUDE_AGENT_TEAMS_ORCHESTRATOR_CLI_PATH=/Users/belief/dev/projects/claude/agent_teams_orchestrator/cli-source \
pnpm vitest run --maxWorkers 1 --minWorkers 1 test/main/services/team/AnthropicLaunchSelection.live.test.ts
cli-source runs src/entrypoints/cli.tsx directly through Bun. Use it while developing launch/runtime code so the smoke test cannot accidentally pass or fail against a stale dist/local-cli/cli.js bundle.
For release or production-like smoke checks, test the built wrapper explicitly:
cd /Users/belief/dev/projects/claude/agent_teams_orchestrator
bun run build
cd /Users/belief/dev/projects/claude/claude_team
CLAUDE_AGENT_TEAMS_ORCHESTRATOR_CLI_PATH=/Users/belief/dev/projects/claude/agent_teams_orchestrator/cli \
pnpm vitest run --maxWorkers 1 --minWorkers 1 test/main/services/team/AnthropicLaunchSelection.live.test.ts
The built wrapper cli reads dist/local-cli/cli.js. cli-dev reads dist/local-cli-dev/cli.js; it is useful for dev-bundle checks, but it is not the production wrapper.
For local debugging, force pane-backed teammates through tmux:
CLAUDE_TEAM_TEAMMATE_MODE=tmux pnpm dev
For a single launch from the UI, add this to custom CLI args:
--teammate-mode tmux
Expected behavior:
- tmux mode should remove CLAUDE_TEAM_FORCE_PROCESS_TEAMMATES from the launch env.
- The desktop app should pass --teammate-mode tmux to the runtime CLI.
- The orchestrator should report backend_type: "tmux" and tmux_pane_id like %1.
- If tmux is unavailable, the launch dialog should block explicit tmux mode with a tmux readiness message.
Use this mode to inspect interactive CLI behavior, terminal prompts, and pane output. Do not treat it as equivalent to the process backend for recovery semantics; persisted pane IDs can help discovery, but app restart does not make old panes a fully app-owned runtime again.
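To confirm tmux mode from the terminal side, it can help to check that tmux is on PATH and then list the panes it knows about. This is a minimal sketch with hypothetical function names; it uses only standard tmux commands and format variables and does not reproduce the app's own readiness check.

```shell
# Hypothetical helper: succeeds only when a tmux binary is on PATH.
tmux_ready() {
  command -v tmux >/dev/null 2>&1
}

# Hypothetical helper: print pane id, pane pid, and current command for
# every pane on the default tmux server.
list_tmux_panes() {
  tmux_ready || { echo "tmux not found on PATH" >&2; return 1; }
  tmux list-panes -a -F '#{pane_id} #{pane_pid} #{pane_current_command}'
}
```

Once a pane id such as %1 matches the tmux_pane_id reported by the orchestrator, tmux capture-pane -p -t %1 prints that pane's visible output without attaching to it.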
Live Smoke Runtime Launcher
Live/dev smoke checks should run the orchestrator from source unless the test explicitly says it is
validating packaged output. This keeps app smoke tests aligned with the source tree and avoids a stale
dist bundle hiding runtime changes.
Default live/dev smoke launcher:
export CLAUDE_AGENT_TEAMS_ORCHESTRATOR_CLI_PATH=/Users/belief/dev/projects/claude/agent_teams_orchestrator/cli-source
The source launcher executes src/entrypoints/cli.tsx through Bun. It is the right default for local
debug loops, live model/provider checks, and cross-repo runtime fixes.
It normalizes inherited NODE_ENV=production to NODE_ENV=development, because source smoke is a
dev/runtime validation path. If you need production semantics, run the release smoke path below.
Local live/prove scripts should resolve their default CLI through scripts/lib/live-smoke-runtime.mjs,
which points at cli-source unless CLAUDE_AGENT_TEAMS_ORCHESTRATOR_CLI_PATH is explicitly set.
Source-mode teammate startup can be slower than bundled startup, so live smoke harnesses may set
CLAUDE_TEAM_PROCESS_RUNTIME_READY_TIMEOUT_MS and
CLAUDE_TEAM_PROCESS_INBOX_POLLER_READY_TIMEOUT_MS to larger values when they are validating source
behavior instead of watchdog latency.
Release or production-like smoke checks must validate the built wrapper:
cd /Users/belief/dev/projects/claude/agent_teams_orchestrator
bun run build
export CLAUDE_AGENT_TEAMS_ORCHESTRATOR_CLI_PATH=/Users/belief/dev/projects/claude/agent_teams_orchestrator/cli
cli reads dist/local-cli/cli.js. cli-dev reads dist/local-cli-dev/cli.js, so a passing
cli-dev smoke is not proof that the production wrapper is fresh.
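A quick freshness heuristic before a release smoke is to compare modification times between the bundle and the source tree. The sketch below is an assumption-laden shortcut, not part of the orchestrator: the function name is hypothetical, and mtimes can mislead after a fresh checkout, so rebuilding remains the reliable answer.

```shell
# Hypothetical helper: bundle_is_stale BUNDLE SRC_DIR
# Exits 0 when the bundle is missing or any file under SRC_DIR is newer
# than the bundle; exits non-zero when the bundle looks fresh.
bundle_is_stale() {
  bundle=$1
  src_dir=$2
  [ -f "$bundle" ] || return 0
  newer=$(find "$src_dir" -type f -newer "$bundle" 2>/dev/null | head -n 1)
  [ -n "$newer" ]
}
```

From the orchestrator checkout, a call such as bundle_is_stale dist/local-cli/cli.js src && bun run build rebuilds only when the heuristic reports a stale bundle.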
Member State Meanings
Common launch-state.json cases:
- confirmed_alive with bootstrapConfirmed: true - member is usable.
- registered / runtime_pending_bootstrap - process or lane exists, but bootstrap proof is not committed yet.
- registered_only - app has persisted metadata, but no live runtime proof.
- runtime_process_candidate - process/session was observed, but committed runtime evidence is incomplete or pending.
- failed_to_start with runtime_process - a process exists, but the launch gate still failed. Inspect diagnostics and runtime evidence.
- failed_to_start with stale_metadata - persisted pid/session is old or dead.
Do not treat member_briefing alone as runtime evidence. For OpenCode, the authoritative proof is committed bootstrap/session evidence in the lane runtime store.
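The state meanings above can be kept at hand as a small lookup. This sketch takes the state string and an optional detail string as arguments because the exact field names inside launch-state.json are not spelled out here; the function name and the wording of each hint are hypothetical.

```shell
# Hypothetical helper: next_step_for_state STATE [DETAIL]
# Maps a member state (and optional detail such as runtime_process or
# stale_metadata) to the triage hint described in this runbook.
next_step_for_state() {
  state=$1
  detail=${2:-}
  case "$state" in
    confirmed_alive)
      echo "usable if bootstrapConfirmed is true" ;;
    registered|runtime_pending_bootstrap)
      echo "wait: bootstrap proof not committed yet" ;;
    registered_only)
      echo "metadata only: look for live runtime proof" ;;
    runtime_process_candidate)
      echo "observed but unproven: inspect committed runtime evidence" ;;
    failed_to_start)
      case "$detail" in
        runtime_process) echo "process exists, gate failed: inspect diagnostics and runtime evidence" ;;
        stale_metadata)  echo "persisted pid/session is old or dead" ;;
        *)               echo "failed: inspect diagnostics" ;;
      esac ;;
    *)
      echo "unknown state: read launch-state.json directly" ;;
  esac
}
```

For example, next_step_for_state failed_to_start stale_metadata prints the stale-metadata hint, while an unrecognized state falls through to the last branch instead of guessing.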
OpenCode Debug Flow
For an OpenCode teammate:
MEMBER="<member-name>"
jq --arg member "$MEMBER" '.members[$member]' "$TEAM_DIR/launch-state.json"
jq '.lanes' "$TEAM_DIR/.opencode-runtime/lanes.json" 2>/dev/null
find "$TEAM_DIR/.opencode-runtime/lanes" -maxdepth 3 -type f | sort
Expected healthy OpenCode lane:
- lanes.json has the lane state active
- lane manifest.json has activeRunId
- lane manifest has at least one runtime evidence entry, usually opencode.sessionStore
- lane directory has opencode-sessions.json
- launch-state.json member has runtimeRunId, runtimeSessionId, and bootstrapConfirmed: true
If the bridge says bootstrap succeeded but the manifest has entries: [], the issue is evidence commit, not model behavior. The member must not be considered deliverable until opencode-sessions.json and its manifest entry exist.
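The lane-level part of that check can be scripted. The sketch below assumes activeRunId and entries are top-level keys of the lane manifest.json, which is inferred from the wording above and not confirmed against the real schema; the function name is hypothetical, and jq must be installed.

```shell
# Hypothetical helper: lane_evidence_problems LANE_DIR
# Prints one line per missing piece of lane evidence; prints nothing when
# the lane looks healthy. Key names in manifest.json are assumptions.
lane_evidence_problems() {
  lane_dir=$1
  manifest="$lane_dir/manifest.json"
  if [ ! -f "$manifest" ]; then
    echo "manifest.json missing"
    return 0
  fi
  # jq -e exits non-zero when the expression evaluates to false or null.
  jq -e '(.activeRunId // "") != ""' "$manifest" >/dev/null 2>&1 ||
    echo "activeRunId missing"
  jq -e '(.entries // []) | length > 0' "$manifest" >/dev/null 2>&1 ||
    echo "manifest entries empty"
  [ -f "$lane_dir/opencode-sessions.json" ] ||
    echo "opencode-sessions.json missing"
}
```

Run it against "$TEAM_DIR/.opencode-runtime/lanes/<encoded-lane-id>". Empty output only covers the lane directory; the launch-state.json member fields still need to be checked separately.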
OpenCode bridge ledger, if needed:
LEDGER="$HOME/Library/Application Support/claude-agent-teams-ui/opencode-bridge/command-ledger.json"
jq --arg team "$TEAM" '.data[] | select(.teamName == $team)' "$LEDGER" 2>/dev/null
Live process checks:
pgrep -af "opencode serve"
ps -p <pid> -o pid,ppid,etime,command
Do not kill all OpenCode processes as a debugging shortcut. First identify whether the pid belongs to the current team/lane. Some OpenCode temp libopentui.dylib files are held by live opencode serve processes and should only be cleaned after those processes are stopped.
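One low-risk way to see where a pid came from before stopping anything is to walk its parent chain. This is a generic sketch with a hypothetical function name; whether the chain actually leads back to the current team's app or CLI process depends on how the lane was spawned, so treat the output as a hint to correlate with the lane manifest, not as proof.

```shell
# Hypothetical helper: pid_chain PID
# Prints pid, ppid, and command for PID and each ancestor, stopping at
# pid 1 or when a process can no longer be found.
pid_chain() {
  pid=$1
  while [ -n "$pid" ] && [ "$pid" -gt 1 ] 2>/dev/null; do
    ps -p "$pid" -o pid= -o ppid= -o command= || break
    # Move up to the parent; tr strips the padding ps adds around the number.
    pid=$(ps -p "$pid" -o ppid= | tr -d ' ')
  done
}
```

Calling pid_chain with a pid from the pgrep output lists the process first and its ancestors below it.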
Messaging Debug Flow
Lead and teammates use different delivery paths:
- Lead reads stdin. Messages to lead go through relayLeadInboxMessages().
- Native teammates read their inbox files directly.
- OpenCode teammates receive prompts through runtime delivery and must reply via agent-teams_message_send.
- Teammate-to-user replies should appear in inboxes/user.json or app sent-message projections.
If a notification appears but the Messages UI does not show it:
jq '.' "$TEAM_DIR/inboxes/user.json" 2>/dev/null
jq '.' "$TEAM_DIR/sentMessages.json" 2>/dev/null
Check from, to, messageId, relayOfMessageId, and taskRefs. Unknown authors should be rejected or normalized at the write boundary, not silently rendered as fake teammates.
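Scanning those fields is easier in a flat view. The sketch below assumes the file is a top-level JSON array of message objects, which may not match the real layout of inboxes/user.json or sentMessages.json; adjust the leading .[] if the messages are nested. The function name is hypothetical, and jq must be installed.

```shell
# Hypothetical helper: message_index FILE
# Prints one tab-separated row per message: messageId, from, to,
# relayOfMessageId. Missing fields are shown as "-".
# Assumes FILE holds a top-level JSON array of message objects.
message_index() {
  jq -r '.[]
         | [.messageId, .from, .to, .relayOfMessageId]
         | map(. // "-" | tostring)
         | @tsv' "$1"
}
```

Rows whose from column does not match a configured member are the ones to trace back to the write boundary.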
For OpenCode "message saved but not delivered" cases, inspect the OpenCode prompt-delivery ledger and response proof. Do not synthesize visible replies in the frontend.
Task And Work-Stall Debug Flow
For task stalls:
TASK="<short-or-full-task-id>"
rg -n "$TASK" "$TASKS_DIR" "$TEAM_DIR/inboxes" "$TEAM_DIR/bootstrap-journal.jsonl" 2>/dev/null
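On machines without ripgrep, the same search can fall back to grep. This is a small convenience sketch with a hypothetical function name; it searches for the task id as a fixed string and tolerates paths that do not exist yet.

```shell
# Hypothetical helper: search_task TASK_ID PATH...
# Prints file:line:match rows for TASK_ID, using rg when available and
# grep -r otherwise. Errors for missing paths are suppressed.
search_task() {
  task=$1
  shift
  if command -v rg >/dev/null 2>&1; then
    rg -n -H -F -- "$task" "$@" 2>/dev/null
  else
    grep -rnH -F -- "$task" "$@" 2>/dev/null
  fi
}
```

It takes the same paths as the command above, for example search_task "$TASK" "$TASKS_DIR" "$TEAM_DIR/inboxes" "$TEAM_DIR/bootstrap-journal.jsonl".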
Important distinctions:
- Delivery proof means the agent received the message.
- Task progress proof means the agent made meaningful task progress.
- A weak comment like "starting work" is not strong progress.
- task_add_comment should be evaluated from the actual persisted comment text, not only from the tool call.
Task-stall monitor defaults:
- General task-stall monitor is for all agents.
- OpenCode direct remediation is provider-specific and should nudge the OpenCode owner first.
- If OpenCode remediation is not accepted, fall back to a lead alert.
- Watchdog/remediation must not auto-start new OpenCode processes.
Task Log Stream Debug Flow
Task Log Stream is a projection, not a separate source of truth.
For OpenCode tasks, a healthy stream should show native tool rows such as read, bash, edit, write, plus Agent Teams MCP rows. If it only shows agent-teams_* calls:
- confirm the task has OpenCode attribution for the member/session
- confirm the OpenCode transcript contains native tools inside the bounded task window
- check whether the task was assigned after the native work happened
- do not widen attribution so far that unrelated session work is pulled into the task
If Changes says "No file changes recorded" while native write/edit rows exist, inspect the ledger/backfill path. Task logs can show runtime tools even when .board-task-changes/** was not created.
Safe Fix Checklist
Before changing launch or runtime logic:
- Preserve stale-run, tombstone, stopped-team, and removed-member guards.
- Do not make member_briefing runtime evidence.
- Do not make delivery/watchdog auto-launch a fresh OpenCode lane.
- Keep primary launch readiness separate from secondary OpenCode lane readiness.
- Keep runtime evidence lane-scoped. Never let one OpenCode lane satisfy another lane.
- Add a regression test for the exact state shape you found in launch-state.json.
Recommended verification:
pnpm test -- test/main/services/team/TeamProvisioningService.test.ts
pnpm test -- test/main/services/team/TeamAgentLaunchMatrix.safe-e2e.test.ts
pnpm typecheck
git diff --check
Use narrower test commands first when editing a focused path, then run the broader suite that covers launch, delivery, and liveness.