agent-ecosystem/.planning/research/PITFALLS.md
matt ae4833e310 feat(planning): add project planning documents and configuration for SSH multi-context workspaces
- Introduced new planning files including PROJECT.md, REQUIREMENTS.md, ROADMAP.md, and STATE.md to outline the vision and requirements for SSH multi-context workspaces.
- Added ARCHITECTURE.md and CONCERNS.md to detail the codebase structure and address technical debt, known bugs, and security considerations.
- Created CONVENTIONS.md to establish coding standards and practices for the project.
- Updated .gitignore to exclude demo files and added configuration for planning tools.

This commit lays the groundwork for enhancing SSH functionality and user experience in managing multiple workspaces.
2026-02-12 09:37:58 +09:00

24 KiB

Pitfalls Research

Domain: Multi-context workspace switching for Electron + Zustand apps Researched: 2026-02-12 Confidence: HIGH


Critical Pitfalls

Pitfall 1: Destructive Context Switching with Incomplete State Snapshots

What goes wrong: Switching from SSH to local (or vice versa) calls getFullResetState() which wipes ALL selections and loaded data, destroying the user's previous workspace state. When switching back, users must re-navigate to their project, re-open sessions, and restore their scroll position manually.

Why it happens: Current implementation treats context switch as a "hard reset" instead of maintaining separate state snapshots per context. The connectionSlice spreads getFullResetState() on connect/disconnect, clearing selectedProjectId, selectedSessionId, conversation, tabs, and all derived state.

How to avoid:

  • Snapshot before switching: Capture full AppState before connect/disconnect
  • Scope snapshots by context: Use Map<'local' | SshHostKey, AppState> to store separate state per context
  • Partial restoration: Only restore state compatible with the new context (project IDs may differ between local/SSH)
  • Preserve UI preferences: Tab layout, expanded groups, scroll positions are context-agnostic and should always restore

Warning signs:

  • User complaints about "losing their place" when switching contexts
  • Empty dashboard after SSH connect, despite having viewed sessions before
  • No tabs open after switching back to local mode
  • Context panel expansions reset on every switch

Phase to address: Phase 1 (Core Infrastructure) — state snapshot/restore must be foundational before adding multi-context features.


Pitfall 2: EventEmitter Listener Accumulation on Repeated Context Switches

What goes wrong: FileWatcher, SshConnectionManager, and NotificationManager are EventEmitters. Each context switch may call setFileSystemProvider() or re-initialize services without removing old listeners, causing:

  • Memory leaks (listeners hold references to previous provider instances)
  • Duplicate event emissions (one event triggers 2x, 3x handlers)
  • Stale event handlers executing against wrong context

Why it happens: Services use .on() to attach listeners but don't call .removeAllListeners() or track cleanup functions. The initializeNotificationListeners() function in store/index.ts returns a cleanup function, but there's no guarantee it's called before re-initialization.

How to avoid:

  • Centralized cleanup: Create dispose() methods for all EventEmitter services
  • Lifecycle tracking: Maintain Map<string, () => void> of cleanup functions keyed by service name
  • Pre-switch cleanup: Call all cleanup functions BEFORE provider swap
  • Idempotent initialization: Check if (this.isInitialized) return to prevent double-init
  • Avoid removeAllListeners() on ipcRenderer: Always specify channel name — blanket removal breaks Electron internals

Warning signs:

  • Process memory usage (RSS) grows 50-100MB per context switch
  • DevTools heap snapshot shows multiple FileWatcher/SshConnectionManager instances
  • Console logs show duplicate "file-change" events for same file
  • IPC handlers fire 2x for single user action

Phase to address: Phase 1 (Core Infrastructure) — service lifecycle management is critical before multi-context.

Sources:


Pitfall 3: Stale Closures in Zustand Actions Capturing Old Provider References

What goes wrong: Zustand actions like fetchProjects() capture the FileSystemProvider in their closure at store creation time. After calling sshConnectionManager.getProvider() which returns a new SshFileSystemProvider, existing actions still reference the old LocalFileSystemProvider, causing operations to execute against the wrong context.

Why it happens: JavaScript closures capture variables at definition time. When a Zustand action calls window.electronAPI.getProjects(), the IPC handler calls projectScanner.scan(), which calls this.fsProvider.readdir(). If the scanner's provider reference wasn't updated, it uses the stale provider.

How to avoid:

  • Late binding via getter: Services should call getProvider() on every operation, not cache at construction
  • Provider as parameter: Pass provider explicitly to service methods instead of storing as instance variable
  • Re-initialize services: After provider swap, call fileWatcher.setFileSystemProvider(newProvider) on ALL services
  • Service registry pattern: Centralize provider injection so swap happens in one place
  • Verify in tests: Mock provider swap and assert service calls hit new provider

Warning signs:

  • SSH connect succeeds but dashboard shows local projects
  • File operations fail with "ENOENT" after provider swap
  • Console shows "LocalFileSystemProvider" operations when SSH is connected
  • User switches to SSH but sees local data until app restart

Phase to address: Phase 1 (Core Infrastructure) — provider injection architecture must be correct from the start.

Sources:


Pitfall 4: IPC Race Conditions During Context Switch with In-Flight Requests

What goes wrong: User clicks "Connect SSH" → IPC handler starts scanning remote projects → user quickly clicks "Disconnect" → scan completes and populates store with SSH data → store now shows SSH projects in local mode, creating data corruption.

Why it happens: Async IPC calls (getProjects, getSessions) don't track which context initiated them. By the time a response arrives, the context may have switched, but Zustand applies the stale response anyway.

How to avoid:

  • Context ID stamping: Include contextId (UUID per connection) in every IPC request/response
  • Response validation: Check if (currentContextId !== response.contextId) return before applying
  • Request cancellation: Track in-flight requests via AbortController and cancel on context switch
  • State machine: Use FSM with states: idle → connecting → connected → disconnecting → idle. Reject operations in wrong states.
  • Sequential transitions: Wait for disconnect cleanup to complete before starting connect

Warning signs:

  • Dashboard briefly flickers between local and SSH data during transitions
  • Error notifications appear for wrong context after switching
  • Search results from previous context appear after switch
  • Console shows "Cannot read property of null" after rapid connect/disconnect

Phase to address: Phase 1 (Core Infrastructure) — race condition prevention is non-negotiable for reliability.

Sources:


Pitfall 5: FileWatcher Polling Timer Not Cleared on SSH Disconnect

What goes wrong: SSH mode uses setInterval() polling (5s) instead of fs.watch(). When user disconnects, stop() clears pollingTimer, but the cleanup happens in finally block — if an exception occurs, timer keeps running. After 10 switches, 10 polling timers consume CPU checking a disconnected SSH provider.

Why it happens: Interval timers require explicit clearInterval(), and error paths may skip cleanup. The pollingTimer property is set to null without actually clearing the interval, causing orphaned timers.

How to avoid:

  • Cleanup in finally: Always clear timers in finally block, not just normal path
  • Defensive clearing: Before creating new timer, clear existing: if (this.timer) clearInterval(this.timer)
  • Timer registry: Track all active timers in Set<NodeJS.Timeout> and clear all on dispose
  • Polling refactor: Consider using a single interval that checks isRemote() flag instead of separate modes
  • Memory profiling: Add setInterval/clearInterval tracking to detect orphaned timers

Warning signs:

  • CPU usage increases by 5-10% per context switch
  • Chrome DevTools → Performance shows multiple "pollForChanges" concurrent executions
  • Memory usage grows 20MB per switch (timer closures hold provider references)
  • App becomes sluggish after 5-10 SSH switches

Phase to address: Phase 1 (Core Infrastructure) — timer lifecycle bugs cause performance degradation.


Pitfall 6: SSH Connection Not Properly Disposed, Keeping Socket Open

What goes wrong: SshConnectionManager.disconnect() calls client.end(), but if the SFTP channel is still processing operations, the socket remains open (TCP FIN not sent). After 10 context switches, 10 SSH connections consume file descriptors and remote server resources.

Why it happens: ssh2 Client .end() is graceful (waits for channel closure), but .destroy() is immediate. If you don't wait for SFTP operations to complete before calling .end(), the connection may linger. The dispose() method swallows errors, hiding cleanup failures.

How to avoid:

  • Force close on disconnect: Use client.destroy() instead of .end() for immediate termination
  • SFTP channel close: Call sftp.end() explicitly before client.end()
  • Connection timeout: Set 5s timeout on disconnect — if client doesn't close, force destroy
  • Socket tracking: Expose net.Socket from Client and verify socket.destroyed === true post-cleanup
  • Resource monitoring: Log open file descriptors (lsof) to detect leaks

Warning signs:

  • lsof -p <pid> shows 10+ TCP connections to SSH host
  • SSH host reports "Max sessions exceeded" after many switches
  • netstat shows multiple ESTABLISHED connections to port 22
  • Disconnect takes 5+ seconds (waiting for timeout)

Phase to address: Phase 1 (Core Infrastructure) — connection leaks cause operational issues at scale.

Sources:


Pitfall 7: Tab State Restoration Without Context Validation

What goes wrong: Snapshot captures open tabs with projectId: "abc-local-path". User switches to SSH where projects have different IDs. Restoration tries to open tab with local projectId against SSH provider → tab shows "Project not found" error or stale local data.

Why it happens: ProjectId encoding includes file path (-Users-name-project), so local and SSH projects for the same logical directory have different IDs. Tab restoration doesn't validate whether a project exists in the new context.

How to avoid:

  • Path normalization: Introduce logicalProjectPath (e.g., /home/user/project) separate from encoded ID
  • Cross-context mapping: Maintain Map<logicalPath, {localId, sshId}> to translate IDs during restore
  • Graceful degradation: If project doesn't exist in new context, close tab or show empty state instead of error
  • Restore validation: Before applying snapshot, call window.electronAPI.getProjects() and filter tabs to existing projects
  • User notification: Show toast "3 tabs closed - projects not available in SSH mode"

Warning signs:

  • User switches to SSH, tabs show "Failed to load session"
  • Console errors "Project ID abc-local not found" after context switch
  • Tab bar shows ghost tabs (visible but non-functional)
  • Click on restored tab triggers navigation to dashboard instead

Phase to address: Phase 2 (State Persistence) — after Phase 1 establishes snapshots, validation ensures safe restoration.


Pitfall 8: Partial Snapshot Creates Inconsistent Derived State

What goes wrong: Snapshot captures selectedProjectId, selectedSessionId, but omits sessionDetail, conversation, chunks. Restoration sets selections without corresponding data → store state says "session XYZ selected" but detail is null → UI renders empty or crashes on null access.

Why it happens: Developers manually list fields to snapshot instead of capturing full AppState. Derived state (detail, conversation) is computed from selections, but if selections restore before data loads, the UI sees inconsistent state.

How to avoid:

  • Snapshot everything: Capture AppState wholesale, excluding only ephemeral UI flags (loading, error)
  • Atomic restoration: Use setState(() => snapshot) so all fields update simultaneously
  • Lazy detail loading: On restoration, if selectedSessionId exists but sessionDetail is null, trigger background fetch
  • State versioning: Include snapshotVersion: 1 to handle schema changes (e.g., new fields added)
  • Whitelist vs blacklist: Easier to exclude ephemeral fields (loading, error) than enumerate 50+ persistent fields

Warning signs:

  • After switch, session viewer shows spinner indefinitely
  • Console error "Cannot read property 'chunks' of null"
  • Dashboard shows selected project but no sessions loaded
  • Tab labels show "undefined" or "[object Object]"

Phase to address: Phase 1 (Core Infrastructure) — snapshot completeness is foundational to state restoration.

Sources:


Technical Debt Patterns

Shortcuts that seem reasonable but create long-term problems.

Shortcut Immediate Benefit Long-term Cost When Acceptable
Storing provider as instance variable Simple, no getter needed Stale closure bugs on provider swap Never — always use getter
Not tracking cleanup functions Less boilerplate code Memory leaks accumulate with usage Never — cleanup is mandatory
Hard reset on context switch Easy to implement (3 lines) User loses workspace state Never — destroys UX
Using removeAllListeners() Clears everything at once Breaks Electron IPC internals Never — specify channel
Sync IPC for connect Blocks until complete UI freezes 2-5s on slow networks Never — SSH is async
Single global DataCache Shared cache across contexts Cache pollution (local data in SSH mode) Never — scope cache per context
Manual field-by-field snapshot Fine-grained control Fragile, breaks when state schema evolves Only for specific performance tuning
Not versioning snapshots Don't need migration code Can't safely restore after app updates Only in MVP — add versioning in Phase 2

Integration Gotchas

Common mistakes when connecting to external services.

Integration Common Mistake Correct Approach
SSH2 Client Using .end() without waiting for SFTP close Call sftp.end(), wait 100ms, then client.destroy()
SFTP operations Not catching ENOENT on remote path access Wrap all sftp.stat() in try/catch, validate paths exist
IPC context bridge Passing large objects (10MB+ session JSON) Stream via chunks or use temp file + file path
FileWatcher SSH polling Polling every 1s (too aggressive) 5s minimum — SSH has latency
Zustand subscriptions Subscribing to full store on every render Use selectors: useStore((s) => s.field, shallow)
React useEffect Empty deps array with state access Include all state in deps or use functional updates

Performance Traps

Patterns that work at small scale but fail as usage grows.

Trap Symptoms Prevention When It Breaks
Cache not scoped by context Search returns SSH results in local mode Separate caches: localCache, sshCache After first context switch
Snapshot entire 50MB state 500ms freeze during switch Exclude non-serializable (functions, DOM refs) 10+ open sessions
No debounce on provider swap events 10 re-renders per switch Debounce 100ms, batch provider updates Rapid switching (QA testing)
Loading all tabs eagerly UI freezes restoring 20 tabs Virtual scrolling + lazy load tab content 10+ open tabs
Not canceling in-flight IPC Stale responses overwrite new data AbortController per IPC call Slow network (SSH over VPN)
Synchronous validation during snapshot Blocks UI during switch Validate asynchronously post-restoration Large state (100+ projects)

Security Mistakes

Domain-specific security issues beyond general web security.

Mistake Risk Prevention
Storing SSH passwords in snapshot Plaintext credentials in memory dump Never snapshot credentials — only host/port/user
Not validating SSH host keys MITM attacks Use ssh2 hostVerifier callback, check known_hosts
Exposing SSH provider to renderer Renderer can execute arbitrary commands Keep provider in main process only, expose via IPC
Logging full AppState snapshots Sensitive data in logs Redact: password, privateKey, sessionContent
Not sanitizing project paths Path traversal attacks Validate paths with path.resolve(), reject ".."

UX Pitfalls

Common user experience mistakes in this domain.

Pitfall User Impact Better Approach
No visual feedback during switch User doesn't know if click worked Show connecting spinner + progress bar
Immediate switch without confirmation Accidental clicks lose workspace "Switch will close X tabs. Continue?" dialog
No indication of current context User forgets if in local/SSH Persistent badge in header: "Local" or "SSH: hostname"
Context switch closes all tabs User loses 10+ open sessions Restore tabs if projects exist in new context
No way to view both contexts Can't compare local vs SSH data Split view or dual panes (future enhancement)
Errors shown as raw exceptions "ENOTFOUND" means nothing to user Friendly: "Cannot connect to SSH host. Check network."

"Looks Done But Isn't" Checklist

Things that appear complete but are missing critical pieces.

  • Context switch — Looks like it works because SSH connects, but missing: cleanup old listeners, cancel in-flight IPC, validate restored tabs
  • State snapshot — Appears to save state, but missing: exclude ephemeral fields, version snapshots, validate restorable
  • Provider swap — FileWatcher updated, but missing: update ProjectScanner, SessionParser, ErrorDetector, NotificationManager
  • Tab restoration — Tabs re-open, but missing: validate project exists, load session detail, restore scroll position
  • IPC request tracking — Responses return, but missing: context ID validation, abort on disconnect, timeout handling
  • Service dispose — dispose() method exists, but missing: remove EventEmitter listeners, clear timers, close connections
  • Cache invalidation — Cache cleared on switch, but missing: scope cache per context, invalidate on reconnect
  • Error handling — Try/catch present, but missing: rollback state on failure, show user-friendly message, log to sentry

Recovery Strategies

When pitfalls occur despite prevention, how to recover.

Pitfall Recovery Cost Recovery Steps
Destructive switch lost state MEDIUM Phase 2: Add persistent storage (localStorage) to survive restarts
Listener leaks LOW Add dispose tracking, expose debug.listenerCount() IPC for diagnosis
Stale closure HIGH Refactor service architecture to inject provider per call (breaking change)
IPC race condition MEDIUM Add context ID retroactively, reject responses with mismatched ID
Timer leak LOW DevTools audit: setInterval/clearInterval coverage, add cleanup tests
SSH connection leak LOW Add connection pool with max limit, force destroy after 30s timeout
Tab validation missing MEDIUM Add migration: filter invalid tabs on app startup
Partial snapshot HIGH Ship hotfix: capture full state, deprecate manual field list

Pitfall-to-Phase Mapping

How roadmap phases should address these pitfalls.

Pitfall Prevention Phase Verification
Destructive context switching Phase 1 Unit test: switch → switch back → assert state restored
EventEmitter listener accumulation Phase 1 Memory profiler: assert RSS stable after 10 switches
Stale closures Phase 1 Integration test: mock provider swap, assert service calls new provider
IPC race conditions Phase 1 E2E test: rapid connect/disconnect, assert no stale data
FileWatcher polling timer Phase 1 Unit test: call stop(), assert pollingTimer cleared
SSH connection disposal Phase 1 Integration test: assert socket.destroyed after disconnect
Tab restoration validation Phase 2 E2E test: open tabs → switch SSH → assert only valid tabs restored
Partial snapshot Phase 1 Integration test: snapshot → restore → assert detail populated

Sources

Official Documentation

Electron State Management

Memory Leaks and Cleanup

React Stale Closures

Zustand and State Management

VS Code Multi-Root Workspaces (Reference Architecture)

Dependency Injection and Service Lifecycle


Pitfalls research for: Multi-context workspace switching in Electron + Zustand Researched: 2026-02-12