listTeams() deep-cloned ALL team summaries via structuredClone on every call -- even
cache hits and concurrent in-flight awaiters. A heap allocation sample of a launch
put this (listTeams -> cloneTeamSummaries -> structuredClone) as the single largest
memory allocator, driving heap churn + GC pressure during launch (this test setup has ~158
teams, and listTeams is called constantly: startup, notification init, task projection,
IPC polls, provisioning).
Build ONE deep-frozen, independent snapshot per uncached load and hand the same
reference to the cache entry, in-flight awaiters, and every later reader. The single
cloneTeamSummaries keeps it independent of any cached config the loader returns;
freezing lets all readers share it safely. Audited every listTeams consumer -- all
iterate / map / filter / serialize, none mutate -- and the freeze turns any stray
future mutation into a loud error rather than silent cross-caller corruption.
Tests: TeamConfigReader passes 26/26 (added a frozen + same-reference regression
test), and the listTeams consumers (TeamDataService 116, CrossTeamService 26) all
pass under frozen summaries.
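A minimal sketch of the shared frozen snapshot idea, not the actual TeamConfigReader
code. TeamSummary, deepFreeze and TeamListCache are hypothetical names used only for
illustration; the loader is injected so the example stays self-contained.

```typescript
// Hypothetical summary shape; the real type carries more fields.
interface TeamSummary {
  name: string;
  members: { name: string }[];
}

/** Recursively freeze a plain object/array graph in place and return it. */
function deepFreeze<T>(value: T): T {
  if (value !== null && typeof value === 'object' && !Object.isFrozen(value)) {
    Object.freeze(value);
    for (const child of Object.values(value as object)) {
      deepFreeze(child);
    }
  }
  return value;
}

class TeamListCache {
  private snapshot: readonly TeamSummary[] | null = null;
  private inFlight: Promise<readonly TeamSummary[]> | null = null;
  private readonly load: () => Promise<TeamSummary[]>;

  constructor(load: () => Promise<TeamSummary[]>) {
    this.load = load;
  }

  listTeams(): Promise<readonly TeamSummary[]> {
    // Cache hit: hand out the same frozen reference, no clone.
    if (this.snapshot) return Promise.resolve(this.snapshot);
    if (!this.inFlight) {
      this.inFlight = this.load()
        .then((raw) => {
          // One clone per uncached load keeps the snapshot independent of
          // whatever the loader cached; freezing makes sharing safe.
          const frozen: readonly TeamSummary[] = deepFreeze(structuredClone(raw));
          this.snapshot = frozen;
          return frozen;
        })
        .finally(() => {
          this.inFlight = null;
        });
    }
    // Concurrent callers await the same promise and get the same snapshot.
    return this.inFlight;
  }

  invalidate(): void {
    this.snapshot = null;
  }
}
```

In strict-mode code, a write to any object in the snapshot throws a TypeError, which
is the "loud error" behavior described above.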
fileBelongsToTeam streamed the head window via createReadStream + readline. readline's
line iterator runs an expensive Unicode line-break regex and stream/string-decoder
machinery per chunk, which showed up as a top main-thread cost during launch (the line-
split regex alone was ~5.7% in the warm launch profile).
Replace it with a bounded chunked fs.read + a plain '\n' split. JSONL is strictly
newline-delimited and each line is trim()'d (so a trailing CR from CRLF is dropped),
so a '\n' split is cheaper and more correct (it will not split on a bare CR or a
Unicode line/paragraph separator inside a JSON string value, which readline would). A
StringDecoder preserves multi-byte UTF-8 sequences that straddle a chunk boundary.
Semantics are identical to the old loop: inspect up to TEAM_AFFINITY_SCAN_LINES
non-empty lines, first match wins via early break, and a final line is honored even
without a trailing newline. Reads in 64KB chunks so a team decided in its first lines
is not penalized by a huge file. Adds tests for CRLF endings + no-trailing-newline,
a multi-byte char straddling the 64KB boundary, and the 40-line window bound (21 pass).
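The chunked read + '\n' split can be sketched roughly as follows. HeadLineSplitter and
headHasMatch are hypothetical names, and the match predicate is passed in; this is an
illustration of the technique under those assumptions, not the fileBelongsToTeam code.

```typescript
import { open } from 'node:fs/promises';
import { StringDecoder } from 'node:string_decoder';

const CHUNK_BYTES = 64 * 1024;

/** Incremental '\n' splitter: returns the trimmed, non-empty lines each chunk completes. */
class HeadLineSplitter {
  private readonly decoder = new StringDecoder('utf8');
  private carry = '';

  push(chunk: Buffer): string[] {
    // StringDecoder holds back an incomplete multi-byte sequence until the
    // next chunk arrives, so a character split across chunks stays intact.
    const parts = (this.carry + this.decoder.write(chunk)).split('\n');
    this.carry = parts.pop() ?? '';
    return parts.map((line) => line.trim()).filter((line) => line.length > 0);
  }

  /** Flush the final line, which may lack a trailing newline. */
  end(): string[] {
    const last = (this.carry + this.decoder.end()).trim();
    this.carry = '';
    return last.length > 0 ? [last] : [];
  }
}

/** Inspect up to maxLines non-empty head lines; the first match wins. */
async function headHasMatch(
  filePath: string,
  maxLines: number,
  matches: (line: string) => boolean
): Promise<boolean> {
  const handle = await open(filePath, 'r');
  try {
    const splitter = new HeadLineSplitter();
    const buf = Buffer.allocUnsafe(CHUNK_BYTES);
    let inspected = 0;
    let position = 0;
    for (;;) {
      const { bytesRead } = await handle.read(buf, 0, CHUNK_BYTES, position);
      const lines =
        bytesRead > 0 ? splitter.push(buf.subarray(0, bytesRead)) : splitter.end();
      for (const line of lines) {
        if (matches(line)) return true; // early exit, no further reads
        inspected += 1;
        if (inspected >= maxLines) return false; // head window exhausted
      }
      if (bytesRead === 0) return false; // end of file
      position += bytesRead;
    }
  } finally {
    await handle.close();
  }
}
```

Because trim() runs per line, the trailing CR of a CRLF ending is dropped without a
separate code path.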
On the live resolution path collectRootJsonlSessionIds already stat()s each root
jsonl for its mtime-window filter, then fileBelongsToTeam stat()ed the very same
file again for its cache validation -- two fs.stat syscalls (plus two Stats
allocations) per file, every poll. fileBelongsToTeam now takes an optional
precomputed stat and the mtime-filter caller passes the stat it already has, so the
file is statted once. Measured 20 files -> 20 stat calls on the mtime path (was ~40).
Using a single stat snapshot is also slightly more consistent than two reads that
could straddle a concurrent write. The other call site (subagent scan) passes no
stat and is unchanged (fileBelongsToTeam stats it itself). Adds a regression test
that a caller-supplied stat is the one recorded in the affinity cache.
fileBelongsToTeam only cached POSITIVE affinity durably; a negative verdict was
re-decided on any change, so during a launch every non-matching transcript in the
project dir that grew (mtime+size change from an active session) was re-streamed
(createReadStream+readline) and re-parsed (up to 40 head lines) on every bootstrap
poll. A live atlas-hq-5 launch profile put this whole subsystem (readline streaming
+ fileBelongsToTeam + line/team matching) at ~31% of main-thread JS, the single
largest launch cost.
A transcript's first 40 head lines are immutable while the file is append-only, so a
`false` decided from a FULL inspected window (>= TEAM_AFFINITY_SCAN_LINES) stays
valid while the file only grows. Track headWindowFull on the cache entry and short-
circuit such negatives the same way positives are short-circuited (size >= cached).
Short files (partial window) are still re-scanned on growth, so a team mention that
later lands inside the head window is still detected. A shrink/rewrite (size <
cached) forces a re-scan, identical to the positive path.
Behavior-preserving for affinity correctness (no new false negatives); only removes
redundant re-streams. Adds regression tests for both the durable-negative and the
short-file-flips-to-true cases.
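The reuse rule described above can be written as one pure decision function. This is
a sketch: AffinityCacheEntry, FileStatSnapshot and reuseCachedAffinity are hypothetical
names, and the real cache entry may track more state.

```typescript
interface AffinityCacheEntry {
  belongs: boolean;
  size: number;
  mtimeMs: number;
  /** True when the verdict came from a full head window of inspected lines. */
  headWindowFull: boolean;
}

interface FileStatSnapshot {
  size: number;
  mtimeMs: number;
}

/** Returns the cached verdict if it is still trustworthy, or undefined to force a re-scan. */
function reuseCachedAffinity(
  entry: AffinityCacheEntry | undefined,
  stat: FileStatSnapshot
): boolean | undefined {
  if (!entry) return undefined;
  // Unchanged file: any cached verdict is still valid.
  if (stat.size === entry.size && stat.mtimeMs === entry.mtimeMs) return entry.belongs;
  // Shrink or rewrite: the head may have changed, so re-scan.
  if (stat.size < entry.size) return undefined;
  // Growth of an append-only file: a positive verdict is durable.
  if (entry.belongs) return true;
  // A negative is durable only if the whole head window was inspected;
  // a short file may still gain a team mention inside the window.
  return entry.headWindowFull ? false : undefined;
}
```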
Checkpoint of in-progress work:
- renderer: team messages panel/composer, messagesPanelLogic, teamSlice,
AnimatedHeightReveal plus their tests
- main: runtime process usage-stats caching (ignoreCachedMisses, bounded
eviction), alive-run-id helpers, team watch-scope notify wiring
Note: the getTeamAgentRuntimeSnapshot rssBytes expectation in
TeamAgentLaunchMatrix.safe-e2e is environment-dependent and still red.
TeamProvisioningService imports notifyTeamWatchScopeChanged (added with the
setAliveRunId/deleteAliveRunId helpers) but the export was missing, so a clean
checkout of the branch failed to typecheck. Add the export plus a test; the
call-site wiring stays as in-progress work.
A team launch repeatedly changes the watched target set (new dirs appear), and each
change tore down the chokidar watcher and recreated it over the full target set.
On macOS chokidar uses kqueue with one fd per watched file, so every rebuild
re-opened an fd for EVERY watched file (the large always-watched inbox set plus
scoped dirs). Profiling a 6-member mixed launch showed ~54k open() syscalls dominated
by these rebuilds.
Keep one persistent watcher and apply target-set changes with add()/unwatch() on the
delta only, so a reconcile opens fds for just the newly added dirs. The initial
watcher still uses ignoreInitial for a silent startup baseline, and
emitExistingFilesForNewTargets still backfills files already present in newly added
dirs, so the emitted event surface is unchanged. Because the watcher is no longer
recreated per reconcile, the stale-old-generation and close-throws-during-rebuild
failure modes are gone; their tests are replaced with incremental add/unwatch and
persistent-watcher coverage. All 69 watcher tests pass.
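A sketch of applying only the target-set delta to one long-lived watcher. WatcherLike
is a stand-in for the two chokidar FSWatcher methods used here (add and unwatch);
diffTargets and PersistentTargetWatcher are hypothetical names, and the backfill step
(emitExistingFilesForNewTargets) is omitted.

```typescript
/** The subset of a chokidar-style watcher that this sketch relies on. */
interface WatcherLike {
  add(paths: string[]): unknown;
  unwatch(paths: string[]): unknown;
}

interface TargetDelta {
  added: string[];
  removed: string[];
}

function diffTargets(current: ReadonlySet<string>, next: ReadonlySet<string>): TargetDelta {
  return {
    added: Array.from(next).filter((p) => !current.has(p)),
    removed: Array.from(current).filter((p) => !next.has(p)),
  };
}

class PersistentTargetWatcher {
  private targets: ReadonlySet<string> = new Set<string>();
  private readonly watcher: WatcherLike;

  constructor(watcher: WatcherLike) {
    this.watcher = watcher;
  }

  /** Reconcile to the new target set without recreating the watcher. */
  reconcile(nextTargets: Iterable<string>): TargetDelta {
    const next = new Set(nextTargets);
    const delta = diffTargets(this.targets, next);
    // Only the delta touches the watcher, so unchanged targets keep their fds.
    if (delta.removed.length > 0) this.watcher.unwatch(delta.removed);
    if (delta.added.length > 0) this.watcher.add(delta.added);
    this.targets = next;
    return delta;
  }
}
```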
During launch the live-status loop resumes every alive member every audit cycle.
resumeActiveIntervalsForMember runs a synchronous file-lock + full read of every
task file, so for an N-member team with M task files it did N locked passes x M
readFileSync per cycle (e.g. 6 members x 20 task files), blocking the main event
loop. Profiling a 6-member mixed launch showed mutateTeamTasks/withFileLockSync as
a top main-thread cost (~14%).
Add resumeActiveIntervalsForMembers that applies the identical per-member resume
logic against a member set in a single locked pass, and use it in the live-status
loop. Same mutations, but one lock + task read per cycle instead of one per member.
Adds a test covering multi-member resume in one pass.
A team launch creates many directories/files in quick succession (worktrees,
inboxes, session logs), and each addDir/unlinkDir event triggered a full
TeamTaskWatchRegistry reconcile that tore down and recreated the entire chokidar
watcher (re-opening a kqueue fd per watched file on macOS). Profiling a 6-member
mixed-team launch showed kqueue churn (kevent) as a top native cost and watcher
rebuild as the top remaining main-thread JS cost after the transcript fix.
Debounce the event-driven reconcile (250ms) so a burst collapses into one rebuild.
collectTargets re-reads the current directory state and emitExistingFilesForNewTargets
backfills files created before the rebuild, so no change is missed; requestReconcile,
startup, and the periodic 30s reconcile stay immediate. Adds a test asserting a
burst of addDir events yields a single rebuild.
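A sketch of the debounce split between event-driven and explicit reconciles.
DebouncedReconciler and TimerApi are hypothetical names. The trailing-edge reset on
every event, and the explicit path cancelling a pending timer, are assumptions rather
than confirmed behavior of TeamTaskWatchRegistry.

```typescript
/** Injectable timer API so the debounce can be exercised without real time. */
interface TimerApi {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const realTimers: TimerApi = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

class DebouncedReconciler {
  private pending: unknown = null;
  private readonly reconcile: () => void;
  private readonly delayMs: number;
  private readonly timers: TimerApi;

  constructor(reconcile: () => void, delayMs = 250, timers: TimerApi = realTimers) {
    this.reconcile = reconcile;
    this.delayMs = delayMs;
    this.timers = timers;
  }

  /** Event-driven path (addDir/unlinkDir): a burst collapses into one reconcile. */
  onFsEvent(): void {
    if (this.pending !== null) this.timers.clear(this.pending);
    this.pending = this.timers.set(() => {
      this.pending = null;
      this.reconcile();
    }, this.delayMs);
  }

  /** Explicit path (requestReconcile, startup, periodic): runs immediately. */
  requestReconcile(): void {
    if (this.pending !== null) {
      this.timers.clear(this.pending);
      this.pending = null;
    }
    this.reconcile();
  }
}
```

Dropping intermediate events is safe here only because the reconcile re-reads the
current directory state instead of replaying individual events.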
During launch, the bootstrap-wait loop polls each member and, per member, re-read
and re-JSON.parsed the same growing transcript tail (readRecentBootstrapTranscriptOutcome
was the top main-thread JS hotspot at ~21% during bootstrap, ~40% with its helpers).
The same file was parsed once per member per poll.
Memoize the parsed tail by (filePath, mtime, size) in a shared cache so the file is
read + parsed once per change and reused across all members. The per-member filter
and failure/success scan is byte-for-byte the same logic; only the redundant read +
JSON.parse is removed. Cache is bounded (LRU, same cap as the outcome cache) and
invalidated on mtime/size change, matching the existing outcome cache semantics.
Adds a test asserting the tail is parsed once and shared while per-member outcome
detection is unchanged.
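The memoization keyed by (filePath, mtime, size) could look like the sketch below.
ParsedTailCache and TailKey are hypothetical names, the parse step is injected, and
the Map-based LRU is one simple way to bound the cache, not necessarily the one used
by the outcome cache.

```typescript
interface TailKey {
  filePath: string;
  mtimeMs: number;
  size: number;
}

interface TailEntry<T> {
  mtimeMs: number;
  size: number;
  value: T;
}

class ParsedTailCache<T> {
  // Map preserves insertion order, so the first key is the least recently used.
  private readonly entries = new Map<string, TailEntry<T>>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  /** Return the cached parse for an unchanged file, or parse once and store it. */
  getOrParse(key: TailKey, parse: () => T): T {
    const hit = this.entries.get(key.filePath);
    if (hit && hit.mtimeMs === key.mtimeMs && hit.size === key.size) {
      // Refresh recency by re-inserting at the end.
      this.entries.delete(key.filePath);
      this.entries.set(key.filePath, hit);
      return hit.value;
    }
    // Miss or stale (mtime/size changed): parse again and replace the entry.
    const value = parse();
    this.entries.delete(key.filePath);
    this.entries.set(key.filePath, { mtimeMs: key.mtimeMs, size: key.size, value });
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    return value;
  }
}
```

Each member's filter and outcome scan then runs over the shared parsed value, so
per-member logic is untouched while the read + JSON.parse happens once per change.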
The main process watched every team directory under ~/.claude/teams (one shallow
chokidar target per team root, per team inboxes, and per task dir). On macOS this
falls back to kqueue, which needs one fd per watched file, so a workspace with
many teams kept ~1600 descriptors open and made startup and reconcile work scale
with the number of teams on disk.
Scope the team-root and task watching to teams that are running or currently
engaged in the UI. The teams root and every team's inboxes are still watched for
all teams, so cross-team message delivery, the lead inbox->stdin relay, and
notifications are unchanged. Idle teams are static, so dropping their team-root/
task watches is safe; opening a team (getData) or launching it re-adds it via an
immediate watch-scope refresh. The provider falls back to watching every team
when unset, and the EMFILE polling fallback is intentionally left unscoped so a
scope change can never look like a deletion.
Measured on a 162-team workspace: open team fds 1600 -> 730, with team-root
watching restored the moment a team is opened or goes live.
Wire claude-opus-4-8 (+ [1m] variant) through the team model picker
alongside 4.7 and 4.6, point the opus alias label at 4.8, update the
reasoning/long-context predicates, and switch the fast-mode UI
message to mention 4.8.
Collapses the per-member resume scan in getMemberSpawnStatuses into a single
readdir + file lock + pass over the team's tasks. Avoids N x IO when multiple
members become alive at launch. Semantics of the applied-set guard are
preserved 1:1; the single-member API stays as a wrapper around the batch.
readPersistedStatuses ran a full synchronous scan of every task JSON under the file
lock on each call and invoked resumeActiveIntervalsForMember for every member with
runtimeAlive=true, which blocked the main thread for up to 8s on large teams. Now
mark a member as 'resume applied' for as long as it stays alive, and clear the
marker when it transitions to not-alive (via
syncMemberTaskActivityForRuntimeTransition and in the readPersistedStatuses loop).
Resume stays idempotent and materializes intervals from history once per alive
cycle.
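A sketch of the 'resume applied' marker: a member is resumed once per alive cycle and
becomes eligible again only after it has been seen not-alive. ResumeAppliedTracker and
MemberRuntimeStatus are hypothetical names, not the identifiers used in
readPersistedStatuses.

```typescript
interface MemberRuntimeStatus {
  name: string;
  runtimeAlive: boolean;
}

class ResumeAppliedTracker {
  private readonly applied = new Set<string>();

  /** Returns the members that still need a resume pass in this cycle. */
  membersNeedingResume(statuses: readonly MemberRuntimeStatus[]): string[] {
    const pending: string[] = [];
    for (const status of statuses) {
      if (!status.runtimeAlive) {
        // Transition to not-alive clears the marker for the next alive cycle.
        this.applied.delete(status.name);
        continue;
      }
      if (!this.applied.has(status.name)) {
        this.applied.add(status.name);
        pending.push(status.name);
      }
    }
    return pending;
  }
}
```

The returned list can be handed to a batch resume in one locked pass, so steady-state
polls with no liveness transitions do no task-file IO at all.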
When the liveness snapshot returns stale_metadata for a direct-process teammate
whose persisted runtimePid is actually dead, collect cleanup candidates and clear
their runtimePid/bootstrap fields in config.json, using a double check under a
guard for in-flight run/launch state. This removes dead pids from subsequent
snapshots and leaves OpenCode, lane-aware, and runtime-session-bearing entries
untouched.
Also adds a targeted-pid liveness check (using the
TeamRuntimeLivenessResolver.targetedProcess extension) and
shouldUsePersistedLaunchRuntimePidForMetadata, so a stale pid is not pulled into
metricsPid for members with a lane-aware configuration.
resolveBootstrapRuntimeEvidenceBoundaryMs now considers both start-time sources
(firstSpawnAcceptedAt and bootstrapExpectedAfter) and takes the earlier one when
the member and the runtime share the same bootstrapProofToken + runId. This fixes
the case where the proof was signed before the app recorded firstSpawnAcceptedAt
but after the runtime's own bootstrap boundary. The same logic is applied in
isBootstrapMemberEvidenceCurrentForMember for confirmation evidence.
resolveTeamMemberRuntimeLiveness accepts an optional targetedProcess with pid +
command. If the process command line passes team/agent verification, liveness is
reported as 'runtime_process' even when the full process table did not find it
(for example, in a snapshot vs spawn race). In addition, the direct-process
backend may now fall back to --agent-name when the command was launched without
--agent-id.
Recognize a dedicated diagnostic for EPERM when creating the managed node_modules
symlink on Windows and prompt the user to run the app as Administrator. The UI
hint and the provisioning hint are shown only for this case; the regular Windows
access-denied flow is unaffected.
The failure.message passed to ensureOpenCodeProfileNodeModulesJunction
comes from normalizeCommandFailure which may produce a JSON-escaped
string when the error contains structured JSON in stdout. Using the
raw runtimeMessage literal causes a mismatch in CI. Switch to
expect.any(String) to accept any string value for the errorMessage
parameter while still verifying the call happens.
- Extract symlink source/target paths directly from the error message
instead of reconstructing them from process.env (Codex P2 review)
- Add extractSymlinkSourcePath and extractSymlinkTargetPath functions
- Update ensureOpenCodeProfileNodeModulesJunction to accept optional
errorMessage parameter and use extracted paths from it
- Fix unused imports in test (remove 'os', replace 'beforeEach' with
'afterEach' per CodeRabbit review)
- Widen fs.statSync mock signatures to use Parameters<typeof fs.statSync>
per CodeRabbit review
- Add tests for new extraction functions
- Pass errorMessage to ensureOpenCodeProfileNodeModulesJunction calls
in CLI client tests
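A sketch of pulling both paths out of the error text rather than rebuilding them from
process.env. It assumes the message follows the Node-style shape
"symlink '<path>' -> '<path>'"; parseSymlinkErrorPaths is a hypothetical helper, not
the extractSymlinkSourcePath/extractSymlinkTargetPath pair from this change, and it
does not undo JSON escaping.

```typescript
interface SymlinkErrorPaths {
  /** Path on the left of the arrow in the error message. */
  from: string;
  /** Path on the right of the arrow in the error message. */
  to: string;
}

// Assumed message shape: "... symlink 'A' -> 'B'". Paths containing a single
// quote are not handled by this sketch.
const SYMLINK_ERROR_RE = /symlink '([^']+)' -> '([^']+)'/;

function parseSymlinkErrorPaths(message: string): SymlinkErrorPaths | null {
  const match = SYMLINK_ERROR_RE.exec(message);
  if (!match) return null;
  return { from: match[1], to: match[2] };
}

/** True when the message reports an EPERM failure of a symlink call. */
function isEpermSymlinkError(message: string): boolean {
  return message.includes('EPERM') && SYMLINK_ERROR_RE.test(message);
}
```

Returning null on a non-matching message lets the caller fall back to the existing
error response instead of guessing paths.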
On Windows 10 without Developer Mode, the OpenCode runtime fails to create
a symlink from shared-cache/config-node_modules to the profile's
node_modules directory. The EPERM error blocks the entire OpenCode provider
catalog, leaving it unavailable.
Changes:
- New openCodeWindowsNodeModulesJunction module that pre-creates a Windows
directory junction (no Developer Mode required) before the runtime call
when an EPERM symlink error is detected
- On Windows, loadView and loadProviderDirectory now detect EPERM symlink
errors, extract the profile ID, create the junction, and retry the
runtime command once before falling back to the error response
- Updated diagnostic hints to accurately reflect that the runtime does not
yet include junction fallback, and that the next runtime update will
include it
- Added unit tests for the junction module and retry behavior
* Add KiloCode as a first-class provider with HTTP-based model catalog
Implements KiloCode (kilo.ai gateway) support following repo design principles,
independently of the OpenCode implementation.
Key changes:
- Add 'kilocode' to CliProviderId, TeamProviderId, MemberWorkSyncProviderId
- Create kilocode-model-catalog feature: HTTP client fetching models from
kilo.ai /models endpoint (not /v1/models — different gateway path)
- Add KILO_API_KEY env var for authentication
- Wire kilocode into provider routing, capabilities, and UI labels
- Add 'kilo' brand icon alias in providerBrandIcons (auto-fetches from models.dev)
- KiloCode status is managed via the HTTP gateway, not the multimodel bridge
* Fix: preserve non-bridge providers (kilocode) when updating provider status
The multimodel bridge only returns status for anthropic/codex/gemini/opencode.
When checkAuthStatus replaced result.providers with the bridge response,
kilocode was lost from the provider list and never appeared in the UI.
Now merge bridge providers with the initial list, keeping any provider
not covered by the bridge so kilocode shows up in the Extensions panel.
* Fix: resolve KiloCode status after bridge merge, skip bridge refresh for non-bridge providers
- resolveKilocodeStatus() gives kilocode a settled verificationState:'verified' status
so isHydratedMultimodelProviderStatus() returns true and the loading spinner stops
- Status reflects KILO_API_KEY presence: authenticated+supported when set, else clear message
- fetchCliStatus() now skips fetchCliProviderStatus for non-bridge providers (kilocode)
so the Claude Code CLI is not queried for kilocode, preventing error status overwrites
* Add KiloCode to API key provider system in settings dialog
isApiKeyProviderId now includes kilocode, so the API key form renders
in the Provider Settings dialog instead of showing an empty modal.
Adds KILO_API_KEY config with placeholder and description.
* Fix KiloCode models endpoint: /api/gateway/models per docs
* Fix: short-circuit getProviderStatus/verifyProviderModels for kilocode
The Claude Code CLI only accepts anthropic and codex for --provider.
Calling it with kilocode caused the blinking modal error.
resolveKilocodeProviderStatus() returns status directly from env
without touching the CLI binary — no bridge, no --provider flag.
* Fix: resolveKilocodeProviderStatus reads from app key store via enrichProviderStatus
process.env.KILO_API_KEY was only set for users who configured it in their
shell environment. The UI stores the key in the app's encrypted key store
(ApiKeyService), which enrichProviderStatus checks via hasStoredProviderApiKey.
Now resolveKilocodeProviderStatus() calls providerConnectionService.enrichProviderStatus()
so both the app key store and env var are checked — the same way anthropic/gemini work.
* Wire KiloCode model catalog into provider status — models now load from gateway
- ProviderConnectionService: add setKilocodeModelCatalogFeature() and
enrichKilocodeProviderStatus() which fetches models from the gateway API
and populates provider.models when the API key is configured
- main/index.ts: create KilocodeModelCatalogFeature at startup and inject
it into ProviderConnectionService, same lifecycle as Codex catalog
* Fix: skip Claude CLI probe for kilocode in prepareForProvisioning
The generic probe path calls probeClaudeRuntime with CLAUDE_CODE_ENTRY_PROVIDER=kilocode
which causes the CLI to hang — freezing the Create Team dialog until timeout.
Add an explicit kilocode case that short-circuits to an API key presence check
(via providerConnectionService.getConnectionInfo) without touching the Claude binary,
same pattern as the opencode adapter bypass.
* Fix vitest localStorage fallback
* test(kilocode): update provider visibility expectations
---------
Co-authored-by: 777genius <quantjumppro@gmail.com>
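The provider-status merge from the KiloCode change set above (keep any provider the
bridge does not report) can be sketched as follows. ProviderStatus and
mergeProviderStatuses are hypothetical names with a reduced shape; the real
checkAuthStatus result carries more fields.

```typescript
interface ProviderStatus {
  providerId: string;
  authenticated: boolean;
}

/**
 * Overlay bridge-reported statuses on the initial list. Providers the bridge
 * does not know about (for example kilocode) keep their initial entry.
 */
function mergeProviderStatuses(
  initial: readonly ProviderStatus[],
  fromBridge: readonly ProviderStatus[]
): ProviderStatus[] {
  const bridgeById = new Map<string, ProviderStatus>();
  for (const provider of fromBridge) bridgeById.set(provider.providerId, provider);

  // Preserve the initial ordering; replace only entries the bridge covers.
  const merged = initial.map((provider) => bridgeById.get(provider.providerId) ?? provider);

  // Append anything the bridge reported that was not in the initial list.
  const known = new Set(initial.map((provider) => provider.providerId));
  for (const provider of fromBridge) {
    if (!known.has(provider.providerId)) merged.push(provider);
  }
  return merged;
}
```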
* feat(i18n): add CJK app locale support
* feat(i18n): add Spanish Hindi and Portuguese locales
* feat(i18n): add French Arabic and Bengali locales
* feat(i18n): add Urdu Indonesian and German locales
* feat(i18n): add landing locales for Bengali Urdu and Indonesian
* fix(i18n): address locale review feedback
---------
Co-authored-by: iliya <iliyazelenkog@gmail.com>
- Add avatar trigger mode to MemberSelect for dense toolbar surfaces.
- Render the lead log source selector beside compact sidebar log search and filters.
- Cover toolbar accessory rendering, avatar trigger behavior and lead alias detection.
- Treat checking, deferred and loading provider model catalog states as pending instead of unavailable.
- Show selected provider activity inside create and launch dialogs while keeping ready providers visible during checks.
- Remove the global provider status header so provider activity is scoped to launch flows.
- Carry bootstrap run ids from bootstrap-state into member evidence and compare them with current run identity.
- Allow small confirmation clock skew for delayed Anthropic app acceptance without accepting stale rapid relaunch evidence.
- Clean confirmed bootstrap members that only have stale persisted runtime pid diagnostics.
- Cover process-table unavailable, post-stop stale pid and mixed launch reconcile cases.
Keep connected provider details visible while refreshes are in flight, restore reusable provider status UI, and separate fast startup summaries from heavier provider hydration. Replace the fixed 30s startup wait with an idle-aware scheduler with a 30s safety cap and cover the Electron timer binding crash.
Share watcher fallback behavior across project, todo, team, and task file monitoring. Add polling fallback coverage for watcher-limit and startup failure cases so Linux EMFILE conditions degrade instead of amplifying renderer crashes.