agent-ecosystem/docs/research/get-team-data-parallel-read-plan.md

35 KiB

getTeamData Parallel Read Plan

Goal

Reduce team:getData latency by parallelizing only the safe top-level read phase of getTeamData() without changing:

  • TeamData payload shape
  • message merge semantics
  • dedup semantics
  • warning semantics
  • renderer/main truth boundaries
  • launch/process reconciliation behavior

Primary target:

  • repeated visible team detail refreshes

Not a target for this patch:

  • new IPC endpoints
  • renderer refresh changes
  • new caches
  • message payload slimming
  • lower-pipeline refactors

Why This Requires Extra Care

This touches a guarded hot zone:

  • main process / IPC / heavy team payload assembly

Relevant guardrails:

  • preserve state truth and reconciliation
  • avoid heavier monolithic snapshot paths
  • preserve observability
  • prefer the smallest invariant-preserving change

Main failure modes to avoid:

  • one async failure aborts the whole snapshot
  • warning order becomes nondeterministic
  • message merge order changes
  • completion-time diagnostics become misleading
  • side-effecting reads run concurrently with unrelated file-heavy reads
  • naive parallelism creates enough disk contention to erase the perf gain
  • the patch quietly assumes tests cover behavior they currently do not cover
  • step wrappers silently mask wiring bugs because the wrapped closure does too much work
  • nested settlement layers make the control flow harder to audit
  • presenceIndexPromise rejection semantics change by accident
  • shared mutable fallback objects leak across steps or tests
  • diagnostics state is mutated from concurrent branches and becomes harder to audit
  • queued heavy work starts reader I/O before a limiter slot is actually granted
  • a queued heavy step records timing metadata at queue time instead of settle time
  • fallback factories become mini control-flow hooks and start masking programming bugs
  • warning text generation drifts from today because it becomes dynamic instead of literal

Relevant code:

Hard Non-Goals

This patch must not:

  • parallelize readProcesses()
  • parallelize resolveMembers()
  • parallelize getMemberAdvisories(...)
  • parallelize enrichMemberBranches(...)
  • change the post-load message pipeline
  • change warning strings
  • change warning order
  • change any renderer store logic
  • change any InboxMessage or TeamData semantics
  • move concurrency into TeamTaskReader or TeamInboxReader

This patch must also not opportunistically fix adjacent issues such as:

  • reusing listInboxNames() inside getMessages()
  • introducing a single-team worker path for tasks
  • caching lead-session directory listings

Those are separate patches.

Additional rule:

  • keep all new concurrency orchestration local to getTeamData()

Why:

  • moving concurrency into lower-level readers would widen the blast radius
  • it would make it harder to reason about total filesystem pressure from this patch

Dependency Graph

The patch must preserve this dependency graph exactly.

Strict prerequisites

  • config must complete before:
    • extractLeadSessionTexts(config)
    • any downstream assembly

Additional rule:

  • if config is missing, getTeamData() must still throw before any read-phase step promises are created

Why:

  • this preserves the current fast-fail behavior
  • it avoids accidental background work for nonexistent teams

Safe early-start paths

After config:

  • presenceIndexPromise
  • tasks
  • inboxNames
  • messages
  • leadTexts
  • sentMessages
  • metaMembers
  • kanbanState

Assembly prerequisites

resolveMembers(...) requires all of:

  • config
  • metaMembers
  • inboxNames
  • tasksWithKanban
  • final assembled messages

tasksWithKanban requires:

  • tasks
  • kanbanState
  • presenceIndexPromise

presenceIndexPromise itself must preserve current semantics:

  • it starts early
  • it is awaited only after tasksWithKanbanBase
  • if it rejects today, getTeamData() rejects

Rule:

  • do not wrap, settle, catch, delay, or parallelize away that rejection behavior in this patch

Explicit downstream-only paths

These must stay after assembly:

  • getMemberAdvisories(...)
  • enrichMemberBranches(...)
  • readProcesses(teamName)

Why this section exists:

  • the patch is only safe if dependency edges remain explicit
  • parallelization work tends to hide dependency mistakes behind “it still compiled”

Key Observations

1. processes is not a pure read

readProcesses() looks harmless, but controller.processes.listProcesses() can mark dead PIDs and synchronously rewrite processes.json.

Relevant code:

Rule:

  • keep processes outside the parallel read phase

2. Some top-level readers already create their own fan-out

TeamInboxReader.getMessages() already performs bounded concurrent inbox reads.

Relevant code:

Implication:

  • do not start every heavyweight reader at once
  • control top-level heavy-step concurrency explicitly

3. listInboxNames() is duplicated work today

getTeamData() calls listInboxNames() directly, and getMessages() calls it again internally.

Relevant code:

Implication:

  • phase 1 will still duplicate one inbox-directory listing
  • that is acceptable for this patch
  • do not widen the patch by coupling these readers

4. Warning order is currently deterministic

Current semantic order:

  1. tasks
  2. inbox names
  3. messages
  4. lead session texts
  5. sent messages
  6. member metadata
  7. kanban state
  8. runtime advisories
  9. processes

Rule:

  • preserve this order even if reads complete out of order

Additional rule:

  • preserve warning presence and wording exactly
  • do not deduplicate warnings in this patch

Additional note:

  • presenceIndexPromise is intentionally not part of this warning sequence
  • phase 1 must not introduce a new warning for presence loading

5. The lower pipeline is semantically sensitive

After the read phase, getTeamData() performs:

  • message merge
  • lead-session/live dedup
  • session-id enrichment
  • slash-result annotation
  • kanban overlay
  • member resolution

Rule:

  • do not reorder, parallelize, or “clean up” this lower pipeline in phase 1

5a. kanbanGc is part of current diagnostics shape even though no GC runs here

getTeamData() currently records mark('kanbanGc') immediately after kanban load without performing garbage collection.

Relevant code:

Rule:

  • keep this mark where it is in phase 1
  • do not “clean it up” while touching perf code

Why:

  • changing it mixes perf work with observability semantics
  • it would make before/after slow-log comparison noisier

6. Existing tests do not cover the riskiest new behavior

Current coverage exercises many data semantics, but not:

  • warning order under concurrent failures
  • read-phase race behavior
  • proof that processes stayed outside the parallel phase
  • sync-throw fallback behavior for step wrappers

Rule:

  • add those tests explicitly in this patch

Additional note:

  • an existing test already asserts that getTeamData() remains read-only and does not invoke kanban garbage collection
  • that invariant must stay green throughout this patch, not be replaced or weakened

Safe Scope

Safe to parallelize

After config loads successfully:

  • taskReader.getTasks(teamName)
  • inboxReader.listInboxNames(teamName)
  • inboxReader.getMessages(teamName)
  • extractLeadSessionTexts(config)
  • sentMessagesStore.readMessages(teamName)
  • membersMetaStore.getMembers(teamName)
  • kanbanManager.getState(teamName)

Explicitly keep serial

  • configReader.getConfig(teamName)
  • presenceIndexPromise setup logic
  • all message merge/dedup/enrichment logic
  • resolveMembers(...)
  • getMemberAdvisories(...)
  • enrichMemberBranches(...)
  • readProcesses(teamName)

Design Choice

Preferred approach: two-level parallel read phase

After config, split the top read phase into:

Heavy group

Heavier file-system work:

  • getTasks
  • getMessages
  • extractLeadSessionTexts

Run through a small limiter.

Recommended limit:

  • 2

Rules:

  • use a fixed limit in phase 1
  • do not make it adaptive
  • do not make it platform-specific in phase 1

Why:

  • getMessages() already fans out internally
  • getTasks() can read many task files
  • extractLeadSessionTexts() may scan directories and JSONL files
  • fixed 2 is the most predictable low-risk choice before profiling

Light group

Lower-risk direct reads:

  • listInboxNames
  • readMessages(sentMessages)
  • getMembers(metaMembers)
  • getState(kanban)

Run immediately in parallel.

Important nuance:

  • “light” here means “safe to start immediately”, not “guaranteed tiny”
  • if later profiling shows one of these is expensive, that becomes a follow-up patch

Why not one big Promise.all

  • too easy to turn one step failure into whole-snapshot failure
  • too easy to lose deterministic warnings
  • too easy to create uncontrolled disk contention
  • too easy to blur diagnostics and completion timing

Safer fallback if confidence drops during implementation

If implementation reveals unexpected complexity, the fallback patch is:

  • parallelize only the light group
  • keep heavy steps serial

This fallback has lower upside but even lower risk:

  • 🎯 7 🛡️ 10 🧠 3

Rule:

  • prefer shipping the light-group-only version over broadening the patch unsafely

Step Wrapper Design

Do not use scattered try/catch blocks inside concurrent branches.

Introduce a small local helper inside getTeamData() or a tiny private helper in TeamDataService.

Preferred contract:

interface StepResult<T> {
  value: T;
  warning?: string;
  completedAt: number;
}

function startReadStep<T>(options: {
  label: string;
  createFallback: () => T;
  warningText?: string;
  load: () => Promise<T>;
}): Promise<StepResult<T>>

Required semantics:

  • start the load immediately when invoked
  • never reject outward for expected step failures
  • convert both sync throws and async rejects into fallback + warning
  • always record completedAt
  • never mutate shared warning arrays internally
  • keep fallback and warning text explicit at the callsite
  • create fallback values only on failure

Additional rule:

  • warningText should remain a static literal at each callsite
  • do not compute warning text dynamically inside the helper

Why:

  • this patch must preserve wording exactly
  • dynamic warning construction would increase drift risk for no real benefit

Stronger rule:

  • the wrapped load closure should be a thin direct call to one reader/service method
  • do not put assembly logic inside the wrapper closure

Why:

  • broad closures increase the chance that a programming bug gets silently turned into fallback + warning
  • keeping closures thin confines masking behavior to the existing best-effort I/O boundary

Important constraint:

  • keep this helper local to the hot path
  • do not introduce a new shared generic abstraction in the same patch

Why:

  • this is a surgical performance patch
  • broad abstractions increase blast radius and review cost

Additional rule:

  • the helper must not inspect or transform returned values

Additional rule:

  • the helper must not mutate shared outer state such as warnings or marks

Why:

  • any normalization inside the helper would increase the chance of semantic drift
  • the helper should only control error capture and timing metadata

Additional rule:

  • fallback values must be created per step callsite
  • prefer createFallback() over passing a prebuilt object/array
  • do not reuse one module-level empty array or one shared default object between steps

Additional rule:

  • createFallback() must be synchronous, side-effect free, and trivial
  • if createFallback() itself throws, treat that as a programming bug, not an expected degraded read failure

Why:

  • arrays like messages, tasks, and sentMessages are later merged and sometimes mutated
  • shared fallback identity would make accidental aliasing harder to spot in tests
  • lazy fallback creation makes it harder to accidentally share the same object across successes and failures
  • allowing fallback factories to become “smart” would widen the masking boundary beyond the current design intent

Settlement rule

Because each step promise is designed to resolve with a StepResult<T> even on failure:

  • prefer Promise.all(stepPromises) over Promise.allSettled(stepPromises)

Why:

  • Promise.allSettled becomes redundant once the wrapper already settles failures
  • a second settlement layer makes the control flow harder to reason about
  • it increases the chance of accidentally swallowing a bug in the orchestration code itself

Additional rule:

  • do not add per-step logging inside the helper in phase 1

Why:

  • existing observability already relies on slow-log timing and final warnings
  • extra step-local logs would change noise characteristics while the patch goal is latency reduction

Diagnostics And Marks

Current mark(...) timestamps feed the slow log in TeamDataService.ts#L804.

Do not lose that.

Rule

  • each step still needs a mark timestamp
  • mark time should represent actual completion time
  • not “when we finally consumed the result later”

Practical approach

When a step settles, keep completedAt in the returned StepResult.

After Promise.all(...) completes, assign that timestamp into marks[label] from the main control flow.

Example:

  • marks.tasks = tasksStep.completedAt
  • marks.messages = messagesStep.completedAt

Stronger rule:

  • do not mutate marks from inside concurrent branches

Why:

  • the values would still be numerically correct, but the write path becomes harder to audit
  • keeping all diagnostics writes in one deterministic place lowers review risk

Additional rule:

  • do not synthesize marks for unfinished or skipped steps

Why:

  • fake timestamps would make slow logs harder to trust
  • missing marks should remain visible as -1 via current msSince(...) behavior if something truly did not complete

Important nuance

This changes interpretation slightly:

  • before: marks reflected serial stage completion
  • after: marks reflect “time since method start when this async read completed”

That is acceptable, but it should be documented near the helper or marks assignment in code comments.

Warning Discipline

Warnings must remain deterministic.

Do not:

  • call warnings.push(...) inside async branches

Instead:

  1. collect each step result independently
  2. append warnings after the parallel phase settles
  3. append them in the same semantic order as today

Required final order:

  1. Tasks failed to load
  2. Inboxes failed to load
  3. Messages failed to load
  4. Lead session texts failed to load
  5. Sent messages failed to load
  6. Member metadata failed to load
  7. Kanban state failed to load
  8. Member runtime advisories failed to load
  9. Processes failed to load

Phase 1 only changes how steps 1-7 are loaded, not their order.

Exact Safe Execution Order

Phase A - bootstrap

  1. load config
  2. create warnings
  3. compute changePresenceEnabled
  4. compute logSourceSnapshot
  5. start presenceIndexPromise exactly as today

Important rule:

  • do not wrap presenceIndexPromise in startReadStep
  • do not convert a presenceIndexPromise rejection into fallback behavior
  • do not move its await earlier than today

Phase B - parallel read phase

Start these reads:

  • heavy, limiter 2:
    • tasks
    • messages
    • leadTexts
  • light, direct:
    • inboxNames
    • sentMessages
    • metaMembers
    • kanbanState

Wait for all of them to settle.

Preferred implementation detail:

  • create every step promise before awaiting any of them
  • start light steps immediately
  • start heavy steps immediately through the limiter
  • only await results after all step promises exist
  • use one final await point for the whole phase, not separate group awaits

Why:

  • avoids accidental serial gaps
  • makes the intended parallelism explicit
  • prevents “light first, heavy later” drift from sneaking in during refactor

Additional rule:

  • instantiate heavy-step promises in stable semantic order:
    • tasks
    • messages
    • leadTexts

Why:

  • this makes limiter behavior deterministic
  • it simplifies reasoning in tests that verify bounded heavy-step concurrency

Additional rule:

  • do not create step promises inside array literals passed directly into Promise.all

Why:

  • it hides creation order
  • it makes breakpoint/debugging worse
  • it increases the chance of accidental inline logic expansion

Additional rule:

  • queued heavy steps must not execute any reader work before a limiter slot is granted

Why:

  • otherwise the code may look bounded while still front-loading real filesystem work
  • this is an easy place to accidentally defeat the limiter with an eager closure

Preferred composition:

  • queue a thunk that starts the wrapped step only after the limiter grants a slot
  • in other words, prefer runHeavy(() => startReadStep(...)) over a shape that starts startReadStep(...) first and only limits an inner sub-call

Why:

  • this makes “not started yet” observable in tests
  • it avoids subtle eagerness around fallback creation or timing capture
  • it keeps the limiter boundary explicit in code review

Phase C - deterministic assembly

Using the resolved values:

  1. append warnings in fixed order
  2. initialize messages from the resolved inbox messages
  3. append leadTexts
  4. append sentMessages
  5. run existing dedup logic unchanged
  6. run existing session-id enrichment unchanged
  7. run existing slash annotation unchanged
  8. build tasksWithKanbanBase
  9. await presenceIndexPromise
  10. apply presence overlay
  11. call resolveMembers(...)

Important rule:

  • do not mutate the messages array during the parallel read phase

Important rule:

  • preserve the current presenceIndexPromise await point
  • it must still happen after tasksWithKanbanBase exists and before presence overlay is applied

Additional rule:

  • keep local variable names close to the current implementation where practical

Why:

  • smaller diffs reduce review risk in a fragile hot path
  • easier diff-reading improves confidence that only await ordering changed

Phase D - unchanged lower phase

Keep this serial in phase 1:

  1. getMemberAdvisories(...)
  2. enrichMemberBranches(...)
  3. readProcesses(teamName)

Important rule:

  • runtimeAdvisories warning must still precede processes
  • processes warning must still be last

Additional rule:

  • if phase 1 reveals that readProcesses() is also a measurable bottleneck, do not “just parallelize it”
  • handle it in a separate plan because it is a reconcile-like path

Implementation Notes

1. Use a tiny local limiter, not a new dependency

Do not add a library.

Either:

  • reuse a tiny local mapLimit style pattern
  • or add a minimal local limiter helper inside TeamDataService

The helper should be boring and obvious.

2. Preserve fallback defaults exactly

Defaults today:

  • tasks: []
  • inboxNames: []
  • messages: []
  • leadTexts: []
  • sentMessages: []
  • metaMembers: []
  • kanbanState: default empty state
  • processes: []

Do not change any of these defaults.

Stronger rule:

  • instantiate array fallbacks as fresh literals at each step callsite
  • instantiate the empty kanbanState fallback as a fresh object literal at its callsite

Why:

  • it prevents accidental aliasing if later code mutates a returned fallback object
  • it keeps tests honest when assertions depend on reference isolation

3. Keep presenceIndexPromise untouched

It already starts early and is gated correctly.

Changing it would add correctness risk with little payoff.

Stronger rule:

  • preserve both success and failure semantics exactly
  • if presenceIndexPromise rejects today, the method should still reject after the same downstream point

Why:

  • silently degrading change-presence data would be a correctness change, not just a perf change

4. Do not mix adjacent optimizations into this patch

Specifically do not combine:

  • inbox-name reuse between listInboxNames() and getMessages()
  • task reader worker-path changes
  • lead-session dir listing cache

Why:

  • mixed patches are harder to reason about
  • perf attribution gets worse
  • rollback gets harder

5. Prefer the lowest-risk first wrapper conversion during implementation

If doing the implementation incrementally, the first serial wrapper conversion should be one of:

  • membersMetaStore.getMembers(teamName)
  • sentMessagesStore.readMessages(teamName)

Avoid using getMessages() or extractLeadSessionTexts(config) as the first wrapper-conversion step.

Why:

  • they are semantically denser and more file-heavy
  • starting with a lighter reader makes it easier to verify the wrapper boundary before concurrency is introduced

6. Do not add committed benchmark scaffolding in this patch

If extra timing is needed during implementation:

  • use temporary local instrumentation or existing slow-log output
  • remove any temporary timing helpers before committing

Why:

  • the patch goal is a surgical hot-path change, not a new profiling framework
  • leaving ad hoc benchmark scaffolding behind would increase maintenance noise

Edge Cases

1. messages fail but leadTexts succeed

Expected:

  • final messages still contains leadTexts and sentMessages
  • warning includes Messages failed to load

2. leadTexts fail but inbox messages succeed

Expected:

  • final messages remains valid without lead session history
  • warning includes Lead session texts failed to load

3. kanbanState fails

Expected:

  • tasks still load
  • empty kanban fallback is used
  • warning order remains stable
  • renderer still sees safe fallback state

4. metaMembers fail

Expected:

  • resolver still works from config + inbox names + messages
  • warning includes Member metadata failed to load

5. One heavy read is very slow

Expected:

  • light reads are not blocked from starting
  • heavy reads are bounded and do not all pile up at once

6. Duplicate inbox-dir reads happen concurrently

Expected:

  • this is acceptable in phase 1
  • no semantic change occurs
  • this should not be “optimized away” inside the same patch

7. processes marks a dead runtime and writes processes.json

Expected:

  • this still happens
  • it still happens outside the parallel read phase

8. A step throws synchronously before returning a promise

Expected:

  • the step wrapper converts it into fallback + warning
  • getTeamData() still returns a partial snapshot

9. A heavy step finishes before a light step

Expected:

  • marks reflect actual completion times
  • warnings remain in semantic order
  • assembly order remains unchanged

10. A bug occurs in orchestration code outside a step wrapper

Expected:

  • it should still surface as a normal test/runtime failure
  • it must not be hidden behind generic fallback behavior

Why this matters:

  • the wrapper is only meant to preserve current best-effort reader semantics
  • it is not meant to hide mistakes in assembly code

11. presenceIndexPromise rejects

Expected:

  • getTeamData() still rejects
  • no new warning is added
  • the reject point remains after tasksWithKanbanBase construction, as today

Why this matters:

  • changing this would alter correctness semantics for task change presence
  • that is outside the scope of a safe latency patch

12. Heavy-step limiter accidentally serializes everything

Expected:

  • light steps still start immediately
  • at least two heavy steps may be in flight before the first one resolves

Why this matters:

  • a broken limiter implementation can silently collapse back to serial behavior while tests still pass

13. A heavy step is queued behind the limiter and throws synchronously when finally started

Expected:

  • the wrapper still converts it into fallback + warning
  • the limiter slot is released
  • downstream heavy steps can still run

Why this matters:

  • queued execution introduces a second place where sync throws can happen
  • the plan should be explicit that the limiter cannot leak capacity on failure

14. A queued heavy step has not started yet

Expected:

  • its underlying reader has not been called
  • it has not produced a warning
  • it has not stamped completedAt

Why this matters:

  • otherwise the limiter may only be cosmetically bounding concurrency
  • timing diagnostics would become misleading if queue time and completion time are confused

15. createFallback() is evaluated

Expected:

  • it is evaluated only after the corresponding load fails
  • it does not perform I/O or mutate shared state
  • if it throws, the test should fail rather than being silently degraded

Why this matters:

  • fallback factories are part of the masking boundary
  • making them “smart” would quietly widen the behavior of the patch

Testing Plan

Add focused tests in TeamDataService.test.ts.

Preferred test technique

Prefer explicit deferred promises over fake timers for read-phase ordering tests.

Why:

  • fake timers tend to couple tests to implementation details like setTimeout
  • deferred promises make start-order and release-order assertions clearer
  • the existing test file already uses direct promise orchestration patterns successfully

Required tests

  1. Parallel read phase does not abort full getTeamData() when one step fails.

  2. Warning order remains the same as before even when failures resolve out of order.

  3. messages merge order remains unchanged:

    • inbox
    • plus leadTexts
    • plus sentMessages
    • then current dedup/enrichment behavior
  4. resolveMembers() is still called with the same effective inputs.

  5. readProcesses() still runs after the top read phase and still populates processes.

  6. kanban fallback remains safe when kanban read fails.

  7. leadTexts success still contributes messages even if inbox read fails.

  8. slow diagnostics do not crash when marks are produced from async completion times.

  9. a step that throws synchronously still degrades to fallback + warning.

  10. processes remains outside the parallel read phase.

  11. orchestration still fails normally if an error is thrown after step resolution and outside the wrapper boundary.

  12. light steps start even while heavy steps are still blocked.

  13. heavy-step limiter still allows bounded overlap and does not collapse to fully serial execution.

  14. presenceIndexPromise rejection semantics remain unchanged.

  15. no concurrent branch writes directly into warnings or marks.

  16. the existing read-only invariant for getTeamData() remains intact and kanban GC is still not invoked from this path.

  17. the third heavy step does not call its underlying reader before one of the first two heavy steps settles.

  18. a queued heavy step does not stamp completedAt before it actually settles.

  19. missing-team fast-fail still happens before any read-phase step starts.

  20. fallback creation happens only on failure and does not itself become a hidden degraded path.

Nice-to-have tests

  1. prove heavy-group limiter behavior with deferred promises
  2. prove warnings do not depend on completion order
  3. prove light steps can finish before heavy ones without changing assembly order
  4. prove heavy-step start order remains deterministic under the limiter
  5. prove queued heavy-step failure releases limiter capacity for the next heavy step
  6. prove fallback arrays/objects are not shared across unrelated step results
  7. prove missing-team fast-fail still happens before any read-phase step starts
  8. prove fallback creation is lazy and does not run on successful steps

Testing note

Current tests exercise many getTeamData() semantics, but they do not currently prove:

  • warning order
  • read-phase race behavior
  • serial processes

These tests must be added explicitly in this patch.

Additional testing rule:

  • at least one new test should assert start order, not only final output

Why:

  • output-only assertions can miss accidental serialization or accidental over-parallelization

Rollout Strategy

Phase 1

Implement only:

  • local step wrapper
  • small local heavy-step limiter
  • top-level parallel read phase
  • deterministic warning collection
  • explicit tests for warning order and serial processes
  • explicit tests for bounded heavy overlap and immediate light-step start
  • explicit test for unchanged presenceIndexPromise rejection behavior
  1. Add the local StepResult<T> shape and startReadStep(...) helper without changing behavior yet.
  2. Move one low-risk light read to the helper pattern and keep it serial to verify the wrapper boundary stays thin.
  3. Introduce the local heavy limiter with a tiny focused test before wiring all heavy steps through it.
  4. Convert the full top read phase into:
    • light steps started directly
    • heavy steps started via the limiter
  5. Keep warning collection and mark assignment outside the wrappers and outside concurrent branches.
  6. Re-run the lower assembly with local variable names as close to the current code as practical.
  7. Add the ordering and failure-mode tests last, after the implementation shape stabilizes.

Why this order:

  • it keeps each transition auditable
  • it makes it easier to catch accidental semantic drift before the whole method is reshaped

Additional rule:

  • do not combine the helper-contract change and the full concurrency rewrite in one blind edit

Why:

  • separating those transitions makes review of semantic drift much easier

Phase 2 only if needed later

  • optional getLeadSessionJsonlPaths() cache
  • optional inbox-name reuse between listInboxNames() and getMessages()
  • optional broader lower-phase parallelization

Success Criteria

The patch is successful if all are true:

  1. TeamData payload shape is unchanged
  2. existing tests pass
  3. new warning-order and race tests pass
  4. typecheck passes
  5. targeted getTeamData latency improves on repeated refreshes
  6. no warning-order drift
  7. no changes to dedup or renderer-visible semantics

Areas With Lower Confidence

1. Best heavy-step limiter value

Confidence: medium.

Why:

  • 2 is the safest educated default
  • it is not backed by real profiling data yet

Decision:

  • use 2 in phase 1
  • tune only after measurement

2. Real contention impact of concurrent getMessages() plus extractLeadSessionTexts()

Confidence: medium.

Why:

  • both touch team filesystem data
  • one already does internal fan-out

Decision:

  • keep the heavy group bounded
  • do not add more heavy steps to it in phase 1

3. Diagnostic interpretation shift

Confidence: medium-high.

Why:

  • marks will no longer describe a strictly serial stage pipeline
  • but they remain operationally useful

Decision:

  • preserve labels
  • store actual completion times
  • document the interpretation shift

4. Perf upside size

Confidence: medium-high.

Why:

  • the current serial top read phase is clearly a bottleneck candidate
  • exact gains still depend on filesystem behavior and team size

Decision:

  • keep the patch narrow
  • verify with targeted timings before any broader refactor

5. Exact wrapper boundary

Confidence: medium.

Why:

  • too broad a wrapper masks bugs
  • too narrow a wrapper creates repetitive code and inconsistent fallback handling

Decision:

  • wrap only the direct reader/service call
  • keep all merge/assembly logic outside the wrapper

6. presenceIndexPromise semantics

Confidence: medium.

Why:

  • it starts early but is awaited later
  • it currently rejects the whole method instead of degrading to warning
  • it is easy to accidentally “improve” this while introducing wrappers

Decision:

  • preserve current reject semantics exactly
  • add a regression test explicitly for this

7. Test strength for concurrency behavior

Confidence: medium.

Why:

  • concurrency tests often pass while failing to prove the intended scheduling shape
  • the current suite has very limited ordering-specific coverage

Decision:

  • use explicit deferred promises
  • assert both start order and completion-order independence

8. Limiter implementation correctness

Confidence: medium.

Why:

  • small limiters are easy to get almost right while still leaking capacity or starting work eagerly
  • that kind of bug can preserve output correctness while eliminating the perf gain

Decision:

  • keep the limiter tiny and local
  • verify queued-failure release behavior in tests
  • verify that the third heavy step does not start before one of the first two settles

9. Wrapper and limiter composition

Confidence: medium.

Why:

  • the code can look “limited” while still eagerly starting reader work
  • this is one of the easiest places to lose the intended perf behavior without breaking output tests

Decision:

  • prefer queuing a thunk that starts startReadStep(...) only after slot acquisition
  • add a test that asserts the third heavy reader has not been called while both slots are occupied

10. Fallback-factory discipline

Confidence: medium.

Why:

  • createFallback() looks harmless, but it is part of the same masking boundary as the wrapper
  • if it grows side effects or branching logic, the patch stops being a narrow perf change

Decision:

  • keep fallback factories as trivial literals/object constructors
  • keep warning text static at the callsite
  • let thrown fallback-factory errors surface as programming bugs

Residual Risk

Even after this patch, the architecture remains sensitive because:

  • team:getData still assembles a large monolithic snapshot
  • messages[] is still a hot payload
  • first-open cost still exists
  • processes remains a reconcile-like read/write path

This patch should therefore be treated as:

  • a safe latency reduction
  • not a final architectural solution