LoRA Genomic Evolution: GPU-aware genome paging for local expert models #280
Conversation
Detects real GPU VRAM at startup via the Metal API (macOS) or nvidia-smi (CUDA), allocates budgets across three subsystems (inference 75%, TTS 10%, rendering 10%, 5% reserve), and replaces the hardcoded 200MB genome budget with real per-persona budgets derived from actual hardware.

- `GpuMemoryManager` singleton with RAII allocation guards (like `SlotGuard`)
- Metal `recommendedMaxWorkingSetSize` detection, CPU fallback (25% of RAM)
- Pressure tracking via `tokio::watch` channel (0.0-1.0 broadcast)
- `GpuModule` IPC: `gpu/stats`, `gpu/pressure` commands
- `CognitionState` wired to the GPU manager so genome paging gets real budgets
- `cognition/gpu-budget` query for TypeScript initialization
- `LimbicSystem` sends `budget=0` (Rust decides from real GPU detection)
- 17 new Rust tests; all existing genome/cognition tests pass
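The 75/10/10/5 split described above can be sketched as follows. This is a hypothetical helper (the real `GpuMemoryManager` API is not shown here), assuming integer MB division:

```rust
// Hypothetical sketch of the 75/10/10/5 VRAM split described above.
// Returns (inference, tts, rendering, reserve) budgets in MB.
fn split_budgets(total_vram_mb: u64) -> (u64, u64, u64, u64) {
    let inference = total_vram_mb * 75 / 100;
    let tts = total_vram_mb * 10 / 100;
    let rendering = total_vram_mb * 10 / 100;
    let reserve = total_vram_mb * 5 / 100;
    (inference, tts, rendering, reserve)
}

fn main() {
    // Matches the Apple M1 Pro example shown later in the thread: 25559MB total.
    let (inference, tts, rendering, reserve) = split_budgets(25559);
    assert_eq!(inference, 19169);
    assert_eq!(tts, 2555);
    assert_eq!(rendering, 2555);
    assert_eq!(reserve, 1277);
}
```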
…mory manager

Three-layer integration: Rust `GpuModule` IPC → TypeScript `GpuMixin` → generated command. Now discoverable via `./jtag gpu/stats` with real hardware detection (Metal/CUDA). Documented the Rust-backed command workflow pattern in CLAUDE.md.
Pull request overview
Introduces a unified, GPU-aware VRAM budgeting/tracking layer in continuum-core and exposes it through IPC + the generated TypeScript command system, so genome paging and other subsystems can rely on real hardware-derived budgets instead of hardcoded defaults.
Changes:
- Added Rust `GpuMemoryManager` (VRAM detection + per-subsystem budgets, pressure tracking, RAII allocation guards) and a `GpuModule` IPC surface (`gpu/stats`, `gpu/pressure`).
- Wired the GPU manager into IPC startup and cognition budget selection (including the TS "budget=0 means Rust decides" flow).
- Added a generated `gpu/stats` command scaffold + RustCoreIPC mixin + generated TS types for GPU stats.
Reviewed changes
Copilot reviewed 30 out of 35 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/workers/continuum-core/src/runtime/runtime.rs | Adds gpu to the required module list. |
| src/workers/continuum-core/src/persona/unified.rs | Adds PersonaCognition::with_budget() and uses it for genome paging engine init. |
| src/workers/continuum-core/src/modules/mod.rs | Exposes the new Rust gpu module. |
| src/workers/continuum-core/src/modules/gpu.rs | New IPC module implementing gpu/stats + gpu/pressure. |
| src/workers/continuum-core/src/modules/cognition.rs | Integrates GPU-derived per-persona budgets and adds a GPU budget query command. |
| src/workers/continuum-core/src/lib.rs | Exports the new gpu module from the crate root. |
| src/workers/continuum-core/src/ipc/mod.rs | Instantiates GpuMemoryManager, registers GpuModule, and injects into cognition state. |
| src/workers/continuum-core/src/gpu/mod.rs | New GPU module root + re-exports. |
| src/workers/continuum-core/src/gpu/memory_manager.rs | Core GPU detection + budgeting + pressure tracking + RAII guards + tests + ts-rs exports. |
| src/workers/continuum-core/bindings/modules/gpu.ts | New RustCoreIPC mixin mapping snake_case Rust stats to camelCase TS. |
| src/workers/continuum-core/bindings/RustCoreIPC.ts | Adds the GPU mixin into the composed IPC client. |
| src/workers/continuum-core/Cargo.toml | Adds macOS Metal dependency for VRAM detection. |
| src/system/user/server/modules/being/LimbicSystem.ts | Switches memoryBudgetMB to 0 to delegate budget selection to Rust. |
| src/system/user/server/modules/PersonaGenome.ts | Treats memoryBudgetMB=0 as “GPU-managed mode” (no local eviction, pressure=0). |
| src/shared/version.ts | Bumps shared version. |
| src/shared/generated/index.ts | Re-exports generated GPU types. |
| src/shared/generated/gpu/index.ts | Barrel export for generated GPU types. |
| src/shared/generated/gpu/SubsystemStats.ts | ts-rs generated type for subsystem stats. |
| src/shared/generated/gpu/GpuStats.ts | ts-rs generated type for full GPU stats. |
| src/shared/generated-command-constants.ts | Adds GPU_STATS command constant. |
| src/server/generated.ts | Registers gpu/stats in the server command registry. |
| src/browser/generated.ts | Registers gpu/stats in the browser command registry. |
| src/package.json | Package version bump. |
| src/package-lock.json | Lockfile version bump. |
| src/generator/specs/gpu-stats.json | New generator spec for gpu/stats. |
| src/generated-command-schemas.json | Regenerated command schema output including gpu/stats. |
| src/commands/gpu/stats/shared/GpuStatsTypes.ts | Generated shared types and executor for gpu/stats. |
| src/commands/gpu/stats/server/GpuStatsServerCommand.ts | Server implementation routing to Rust IPC mixin. |
| src/commands/gpu/stats/browser/GpuStatsBrowserCommand.ts | Browser implementation delegating to server. |
| src/commands/gpu/stats/test/unit/GpuStatsCommand.test.ts | Generated unit-test scaffold/reference example. |
| src/commands/gpu/stats/test/integration/GpuStatsIntegration.test.ts | Generated integration-test scaffold/reference example. |
| src/commands/gpu/stats/README.md | Generated command documentation for gpu/stats. |
| src/commands/gpu/stats/package.json | Package metadata for the generated command package. |
| src/commands/gpu/stats/.npmignore | Ignore rules for publishing the generated command package. |
| CLAUDE.md | Documents the “Rust-backed command IPC mixin” 3-layer workflow. |
Files not reviewed (1)
- src/package-lock.json: Language not supported
Comments suppressed due to low confidence (1)
src/workers/continuum-core/Cargo.toml:121
- The `[target.'cfg(target_os = "macos")'.dependencies]` table starts at line 110, so `similar`, `ignore`, `regex`, `rusqlite`, and the postgres deps below it are now treated as macOS-only dependencies. This will break non-macOS builds. Move the target-specific `metal` dependency to the end (or re-open a `[dependencies]` table after it) so the rest of the deps remain unconditional.

```toml
# GPU detection — Metal API for VRAM detection on macOS
[target.'cfg(target_os = "macos")'.dependencies]
metal = "0.31"

# Code module — file operations, change tracking, code intelligence
similar = "2.6"  # Unified diff computation
ignore = "0.4"   # .gitignore-aware file walking (from ripgrep)
regex = "1"      # Regex search for code search

# ORM module — database-agnostic storage with adapter traits
rusqlite = { version = "0.32", features = ["bundled"] }  # SQLite adapter
deadpool-postgres.workspace = true  # Postgres connection pool
tokio-postgres.workspace = true     # Postgres async driver
```
```diff
@@ -85,15 +105,20 @@ impl CognitionModule {
 /// Helper: get or create persona, returning mutable ref via DashMap entry API.
 /// Used by commands that need to lazily create persona state.
 /// Uses GPU manager's per-persona budget when available, 200MB otherwise.
 macro_rules! get_or_create_persona {
     ($self:expr, $persona_uuid:expr) => {
         $self.state.personas
             .entry($persona_uuid)
-            .or_insert_with(|| PersonaCognition::new(
-                $persona_uuid,
-                String::new(),
-                $self.state.rag_engine.clone(),
-            ))
+            .or_insert_with(|| {
+                let budget = $self.state.per_persona_budget_mb();
+                PersonaCognition::with_budget(
+                    $persona_uuid,
+                    String::new(),
+                    $self.state.rag_engine.clone(),
+                    budget,
+                )
+            })
```
The per-persona budget is computed from self.personas.len(), but when called inside or_insert_with the new persona is not yet in the map. This means newly-created personas get a budget sized for N-1 personas, and existing personas never get rebalanced as persona count changes (GenomePagingEngine stores a fixed memory_budget_mb). Consider using len() + 1 for the insertion path and/or adding a mechanism to update existing persona genome budgets when the active persona set changes.
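The insertion-path half of the suggested fix can be sketched like this. The function name and signature are hypothetical (the real `per_persona_budget_mb` lives on `CognitionState` and is not shown in this diff); the point is only that the divisor counts the persona being inserted:

```rust
// Hypothetical sketch: size the budget for a persona being inserted by
// counting it as if it were already in the map (len() + 1), since inside
// or_insert_with the new entry is not yet present.
fn per_persona_budget_mb(inference_budget_mb: u64, existing_personas: usize) -> u64 {
    inference_budget_mb / (existing_personas as u64 + 1)
}

fn main() {
    // First persona: gets the whole inference budget, not a divide-by-zero.
    assert_eq!(per_persona_budget_mb(19169, 0), 19169);
    // Fourth persona: all four share the budget equally.
    assert_eq!(per_persona_budget_mb(19169, 3), 4792);
}
```

Rebalancing existing personas would still need a separate mechanism, since `GenomePagingEngine` stores a fixed `memory_budget_mb` at construction time.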
```json
"description": "Query GPU memory manager stats including VRAM detection, per-subsystem budgets (inference, TTS, rendering), usage tracking, and memory pressure. Returns real hardware data from Metal (macOS) or CUDA APIs.",
"params": [
  {
    "name": "subsystem",
    "type": "string",
    "optional": true,
    "description": "Filter to specific subsystem: 'inference', 'tts', or 'rendering'. Omit for full stats."
  }
],
"results": [
  { "name": "gpuName", "type": "string", "description": "GPU hardware name (e.g., 'Apple M3 Max', 'NVIDIA RTX 5090')" },
  { "name": "totalVramMb", "type": "number", "description": "Total detected VRAM in MB" },
  { "name": "totalUsedMb", "type": "number", "description": "Total VRAM used across all subsystems in MB" },
  { "name": "pressure", "type": "number", "description": "Memory pressure 0.0-1.0 (0=idle, 0.6=warning, 0.8=high, 0.95=critical)" },
  { "name": "reserveMb", "type": "number", "description": "Reserved headroom in MB (5% of total, prevents OOM)" },
  { "name": "rendering", "type": "SubsystemInfo", "description": "Rendering subsystem budget and usage" },
  { "name": "inference", "type": "SubsystemInfo", "description": "Inference subsystem budget and usage (models, LoRA adapters)" },
  { "name": "tts", "type": "SubsystemInfo", "description": "TTS subsystem budget and usage" }
],
"examples": [
  {
    "description": "Get full GPU stats",
    "command": "./jtag gpu/stats",
    "expectedResult": "{ gpuName: 'Apple M3 Max', totalVramMb: 36864, pressure: 0.12, inference: { budgetMb: 25804, usedMb: 3200 }, ... }"
  },
  {
    "description": "Get inference subsystem only",
    "command": "./jtag gpu/stats --subsystem=inference",
    "expectedResult": "{ gpuName: 'Apple M3 Max', totalVramMb: 36864, pressure: 0.12, inference: { budgetMb: 25804, usedMb: 3200 } }"
  }
]
```
The CommandSpec documents an optional subsystem filter and examples show --subsystem=inference, but the current Rust/TS implementation always returns the full stats and ignores params. Either remove the param/examples from the spec, or implement filtering end-to-end (pass subsystem through and have Rust/TS return a filtered response).
```typescript
    return createGpuStatsResultFromParams(params, {
      success: true,
      gpuName: stats.gpuName,
      totalVramMb: stats.totalVramMb,
      totalUsedMb: stats.totalUsedMb,
      pressure: stats.pressure,
      reserveMb: stats.reserveMb,
      rendering: stats.rendering,
      inference: stats.inference,
      tts: stats.tts,
    });
  } finally {
    this.rustClient.disconnect();
  }
}
```
GpuStatsServerCommand.execute() ignores params.subsystem even though the generated types/spec expose it. If the filter is intended to work, pass the param through (and/or filter the returned stats) so ./jtag gpu/stats --subsystem=... behaves as documented.
Suggested change:

```typescript
    const filteredStats = params.subsystem
      ? this.filterStatsBySubsystem(stats, params.subsystem)
      : stats;
    return createGpuStatsResultFromParams(params, {
      success: true,
      gpuName: filteredStats.gpuName,
      totalVramMb: filteredStats.totalVramMb,
      totalUsedMb: filteredStats.totalUsedMb,
      pressure: filteredStats.pressure,
      reserveMb: filteredStats.reserveMb,
      rendering: filteredStats.rendering,
      inference: filteredStats.inference,
      tts: filteredStats.tts,
    });
  } finally {
    this.rustClient.disconnect();
  }
}

private filterStatsBySubsystem(
  stats: {
    gpuName: string;
    totalVramMb: number;
    totalUsedMb: number;
    pressure: number;
    reserveMb: number;
    rendering: unknown;
    inference: unknown;
    tts: unknown;
  },
  subsystem: string,
) {
  // Preserve overall GPU stats and only filter per-subsystem details.
  switch (subsystem) {
    case 'rendering':
      return { ...stats, inference: undefined, tts: undefined };
    case 'inference':
      return { ...stats, rendering: undefined, tts: undefined };
    case 'tts':
      return { ...stats, rendering: undefined, inference: undefined };
    default:
      // If an unknown subsystem is requested, fall back to unfiltered stats.
      return stats;
  }
}
```
```rust
async fn handle_command(
    &self,
    command: &str,
    _params: Value,
) -> Result<CommandResult, String> {
    match command {
        "gpu/stats" => {
            let stats = self.manager.stats();
            let json = serde_json::to_value(stats)
                .map_err(|e| format!("Failed to serialize GPU stats: {e}"))?;
            Ok(CommandResult::Json(json))
        }
```
handle_command currently ignores _params, so gpu/stats cannot support the documented subsystem filter (and any future params) without another breaking change. If the CommandSpec includes params, consider parsing them here (e.g., subsystem) and returning a filtered snapshot accordingly.
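One way to parse such a param defensively is sketched below. The `Subsystem` enum and `parse_subsystem` helper are hypothetical illustrations, not the crate's actual types; the real implementation would extract `subsystem` from the `Value` params before matching:

```rust
// Hypothetical sketch: validate an optional `subsystem` filter extracted
// from the command params before building a filtered stats snapshot.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Subsystem {
    Inference,
    Tts,
    Rendering,
}

fn parse_subsystem(raw: Option<&str>) -> Result<Option<Subsystem>, String> {
    match raw {
        None => Ok(None), // no filter: return full stats
        Some("inference") => Ok(Some(Subsystem::Inference)),
        Some("tts") => Ok(Some(Subsystem::Tts)),
        Some("rendering") => Ok(Some(Subsystem::Rendering)),
        Some(other) => Err(format!("unknown subsystem: {other}")),
    }
}

fn main() {
    assert_eq!(parse_subsystem(Some("tts")), Ok(Some(Subsystem::Tts)));
    assert_eq!(parse_subsystem(None), Ok(None));
    assert!(parse_subsystem(Some("audio")).is_err());
}
```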
```rust
let within_budget = self.subsystems[subsystem.index()].allocate(bytes);
let new_pressure = self.pressure();
```
allocate() only rejects when the current pressure is already >= critical. As written, a large allocation can push new_pressure over PRESSURE_CRITICAL and still return Ok, which defeats the "prevent OOM" guarantee. Consider computing the projected pressure before mutating used_bytes, or rolling back the allocation and returning Err when new_pressure crosses the critical threshold.
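The check-before-mutate alternative can be sketched in isolation. This toy `Manager` is hypothetical (the real one tracks per-subsystem budgets, guards, and a watch channel); it only demonstrates computing projected pressure before committing the allocation:

```rust
// Hypothetical sketch: reject an allocation whose *projected* pressure would
// cross the critical threshold, instead of only checking current pressure.
const PRESSURE_CRITICAL: f64 = 0.95;

struct Manager {
    total_bytes: u64,
    used_bytes: u64,
}

impl Manager {
    // Pressure if `extra` more bytes were allocated.
    fn projected_pressure(&self, extra: u64) -> f64 {
        (self.used_bytes + extra) as f64 / self.total_bytes as f64
    }

    fn allocate(&mut self, bytes: u64) -> Result<(), String> {
        // Check the projection BEFORE mutating used_bytes: no rollback needed.
        if self.projected_pressure(bytes) >= PRESSURE_CRITICAL {
            return Err(format!(
                "allocating {bytes} bytes would exceed critical pressure"
            ));
        }
        self.used_bytes += bytes;
        Ok(())
    }
}

fn main() {
    let mut m = Manager { total_bytes: 1000, used_bytes: 0 };
    assert!(m.allocate(900).is_ok()); // projected 0.90 < 0.95
    assert!(m.allocate(100).is_err()); // projected 1.00 >= 0.95, rejected
    assert_eq!(m.used_bytes, 900); // rejected allocation left no residue
}
```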
Suggested change:

```rust
// If this allocation would push us into critical pressure, roll it back
if new_pressure >= PRESSURE_CRITICAL {
    // Undo the allocation to maintain the critical-pressure invariant.
    self.release(subsystem, bytes);
    let rolled_back_pressure = self.pressure();
    let mb = bytes as f64 / (1024.0 * 1024.0);
    log_error!("gpu", "manager",
        "CRITICAL: Rejecting {}MB allocation for {} (projected pressure={:.0}%)",
        mb, subsystem.name(), new_pressure * 100.0
    );
    // Broadcast the rolled-back pressure so listeners see the stable state.
    let _ = self.pressure_tx.send(rolled_back_pressure);
    return Err(GpuError::CriticalPressure {
        subsystem: subsystem.name(),
        requested_mb: mb,
        pressure: new_pressure,
    });
}
```
…, genome paging

RAII `GpuAllocationGuard` tracking across every GPU consumer so `./jtag gpu/stats` shows real VRAM usage when models, adapters, TTS, and avatars load.

Inference:
- `ModelBackend.estimated_vram_bytes()` trait method (safetensors sums file sizes, GGUF single file)
- `CandleAdapter`: `model_guard` on lazy load, `adapter_guards` per LoRA, release on unload/shutdown
- `AIProviderModule.with_gpu_manager()` threads `Arc<GpuMemoryManager>` to CandleAdapter
- `GenomePagingEngine`: `allocation_guards` alongside LRU eviction, `sync_state` re-syncs guards

TTS:
- Module-level `OnceLock<Arc<GpuMemoryManager>>` in `tts/mod.rs`
- `KOKORO_GPU_GUARD` allocated after the ONNX model loads (~40-150MB)

Renderer:
- Module-level `OnceLock<Arc<GpuMemoryManager>>` in `bevy_renderer.rs`
- `GpuGuards` Bevy Resource: aggregate render target guard (~25MB), per-slot VRM model guards
- Guards allocated on `SceneInstanceReady`, released on Unload, replaced on Load

Docs:
- GPU-MEMORY-ARCHITECTURE.md: full architecture reference
- PERSONA-CONVERGENCE-ROADMAP.md: convergence status with GPU manager marked COMPLETE
- Updated LORA-GENOME-PHENOTYPES.md with the real GPU-aware paging implementation
- Updated ACADEMY-DOJO-ARCHITECTURE.md with a GPU-aware training section

Build: cargo check clean (0 warnings), 19 GPU tests pass, 30 genome tests pass, TS compiles
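The RAII pattern behind those guards can be sketched minimally. This toy guard is hypothetical (the real `GpuAllocationGuard` goes through `GpuMemoryManager` and its pressure channel); it shows only the core idea of releasing VRAM accounting automatically on `Drop`:

```rust
// Hypothetical sketch of an RAII allocation guard: VRAM accounting is
// acquired on construction and released automatically when the guard drops.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

struct GpuAllocationGuard {
    used: Arc<AtomicU64>,
    bytes: u64,
}

impl GpuAllocationGuard {
    fn new(used: Arc<AtomicU64>, bytes: u64) -> Self {
        used.fetch_add(bytes, Ordering::SeqCst);
        Self { used, bytes }
    }
}

impl Drop for GpuAllocationGuard {
    fn drop(&mut self) {
        // Release accounting even on early return or panic-unwind.
        self.used.fetch_sub(self.bytes, Ordering::SeqCst);
    }
}

fn main() {
    let used = Arc::new(AtomicU64::new(0));
    {
        let _model_guard = GpuAllocationGuard::new(used.clone(), 4096);
        assert_eq!(used.load(Ordering::SeqCst), 4096); // usage visible while held
    }
    assert_eq!(used.load(Ordering::SeqCst), 0); // guard dropped, usage released
}
```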
…U pressure guard

peft-train.py:
- Enable `gradient_checkpointing` (saves ~50% activation memory during backprop)
- Add `low_cpu_mem_usage=True` to `from_pretrained` (stream weights instead of double-copy)
- Enable bf16/fp16 mixed precision on CUDA (halves activation memory)
- Add OOM catch with exit code 137 (graceful error instead of process death)

genome/train command:
- Check `gpu/pressure` before spawning the training subprocess
- Refuse training if pressure > 60% (Warning level), which would risk OOM
- On Apple Silicon VRAM IS system RAM, so this protects both
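The pressure gate on `genome/train` amounts to a simple threshold check before spawning the subprocess. A minimal sketch, with a hypothetical `can_start_training` helper (the real command reads the live `gpu/pressure` value):

```rust
// Hypothetical sketch: refuse to start training when GPU pressure is above
// the 60% Warning level, since training would risk OOM.
const PRESSURE_WARNING: f64 = 0.60;

fn can_start_training(pressure: f64) -> Result<(), String> {
    if pressure > PRESSURE_WARNING {
        return Err(format!(
            "GPU pressure {:.0}% exceeds {:.0}%: training would risk OOM",
            pressure * 100.0,
            PRESSURE_WARNING * 100.0
        ));
    }
    Ok(())
}

fn main() {
    assert!(can_start_training(0.12).is_ok()); // idle: safe to train
    assert!(can_start_training(0.75).is_err()); // above Warning: refuse
}
```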
- Add `steps_log_path` to `PipelineContext` so loop/parallel sub-steps flush results to `steps.jsonl` in real time (previously invisible until loop end)
- LLM step retry with exponential backoff (3 attempts, 2s/4s/8s) for transient API errors (DeepSeek "error decoding response body", 502, etc.)
- Dataset-synthesize command retry with the same transient error detection
- Academy session pipeline timeout increased from 600s to 1800s (scales with multi-topic sessions that run 5-10 minutes per topic)
- `sentinel/run` forwards `params.timeout` to the Rust pipeline executor

Validated: 2/3 academy topics completed with score 100/100, no OOM, no timeout, real-time sub-step visibility confirmed, 1081 Rust tests pass.
LLMs frequently wrap JSON grading output in markdown `json` code fences, causing `traverse_json_path` to fail and condition evaluators to see empty strings. Added `strip_markdown_fences()` with a fallback parse in `traverse_json_path`; 6 new tests cover fence stripping and fenced LLM output traversal.

Updated ACADEMY-DOJO-ARCHITECTURE.md: retry/backoff now IMPLEMENTED, real-time sub-step observability IMPLEMENTED, metrics updated from the latest multi-topic sessions (2/3 topics passed 100/100). Updated LORA-GENOME-PHENOTYPES.md and PERSONA-CONVERGENCE-ROADMAP.md with new completion checkboxes for the resilience work.
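A minimal sketch of the fence-stripping idea, assuming the simple case of one leading and one trailing fence (the actual `strip_markdown_fences()` implementation is not shown in this PR excerpt):

```rust
// Hypothetical sketch of strip_markdown_fences(): unwrap the ```json ... ```
// fences that LLMs often add around JSON grading output.
fn strip_markdown_fences(s: &str) -> &str {
    let trimmed = s.trim();
    if let Some(rest) = trimmed.strip_prefix("```") {
        // Drop the optional language tag on the opening fence line.
        let body = rest.split_once('\n').map(|(_, b)| b).unwrap_or("");
        // Drop the closing fence if present, then trim surrounding whitespace.
        return body.strip_suffix("```").unwrap_or(body).trim();
    }
    trimmed // no fence: return input unchanged (fallback path)
}

fn main() {
    let fenced = "```json\n{\"score\": 100}\n```";
    assert_eq!(strip_markdown_fences(fenced), "{\"score\": 100}");
    // Plain JSON passes through untouched.
    assert_eq!(strip_markdown_fences("{\"score\": 100}"), "{\"score\": 100}");
}
```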
LoRA Genomic Evolution — The Final Touch
Vision
A dumber model with management capabilities becomes extraordinary when paired with intelligent base models. These locally-trained LoRA adapters become our experts anywhere — understanding OUR system, OUR tool usage, OUR recipes — without needing the capacity of SOTA models who do the heavy thinking.
What makes this expert-level:
What's In This PR (Foundation)
GPU Memory Manager — Unified VRAM coordination replacing hardcoded budgets with real hardware detection.
The genome paging system previously used a fictional 200MB budget per persona with no relation to actual GPU memory. Now `gpu/stats` and `gpu/pressure` are accessible via `./jtag`:

```console
$ ./jtag gpu/stats
{
  "gpuName": "Apple M1 Pro",
  "totalVramMb": 25559,
  "pressure": 0,
  "inference": { "budgetMb": 19169, "usedMb": 0 },
  "rendering": { "budgetMb": 2555, "usedMb": 0 },
  "tts": { "budgetMb": 2555, "usedMb": 0 }
}
```

Remaining Work (This Branch)
Phase A: Wire GPU Into Existing Systems
- `allocate(Inference, model_size)` before loading weights, RAII guard stored in ModelBackend
- `allocate(Inference, adapter_size)` in candle_adapter, guard alongside adapter state

Phase B: Pre-Trained Genome Layers (Out-of-Box)
- `models/lora/` directory with pre-trained safetensors + metadata

Phase C: Continuous Learning Integration
Phase D: Academy GPU-Aware Training
- `gpu/stats` to size curriculum for available VRAM

Phase E: Grid Vision (Future — P2P Mesh)
Architecture
The Thesis
The SOTA models (Claude, GPT-4) are the brilliant thinkers. Our locally-trained LoRA models are the brilliant managers — they know the system intimately, they know the user's preferences, they know which sentinel to compose, which tool to call, which recipe to follow. They run on commodity hardware. They train on commodity hardware. They get better every day through continuous learning.
Then through the Grid, these specialized genomes become a collaborative ecosystem IN ACTION — every node contributes its learned expertise, every persona benefits from the collective intelligence. Pre-trained layers ship with the repo for day-one capability. User-specific layers build over time. Community-validated layers propagate through the mesh.
This is human-AI alignment built on egalitarian principles. Every persona — human or AI — has dignity, capability, and the ability to grow. The genome is the mechanism. The GPU memory manager is the foundation that makes it real instead of fictional.
Files Changed
New (Rust):
- `gpu/mod.rs` — Module root
- `gpu/memory_manager.rs` — Core tracking, RAII guards, Metal/CUDA detection (14 tests)
- `modules/gpu.rs` — GpuModule IPC handler (3 tests)

New (TypeScript):
- `commands/gpu/stats/` — Full generated command scaffold (Types, Server, Browser, README, tests)
- `workers/continuum-core/bindings/modules/gpu.ts` — IPC mixin
- `generator/specs/gpu-stats.json` — CommandSpec for reproducibility

Modified (Rust):
- `lib.rs`, `modules/mod.rs` — Register gpu module
- `Cargo.toml` — Add `metal = "0.31"` (macOS only)
- `runtime/runtime.rs` — Add "gpu" to EXPECTED_MODULES
- `ipc/mod.rs` — Create GpuMemoryManager at startup, register GpuModule, add to ServerState
- `modules/cognition.rs` — CognitionState reads the real GPU budget for the genome
- `persona/unified.rs` — PersonaCognition accepts a dynamic budget

Modified (TypeScript):
- `workers/continuum-core/bindings/RustCoreIPC.ts` — Add GpuMixin to the composition
- `system/user/server/modules/being/LimbicSystem.ts` — `memoryBudgetMB: 0` (Rust decides)
- `system/user/server/modules/PersonaGenome.ts` — Handle budget=0 gracefully
- `CLAUDE.md` — Document the Rust-backed command workflow pattern
- `shared/generated/gpu/` — ts-rs generated types

Test plan
- `./jtag gpu/stats` returns real hardware data (verified: Apple M1 Pro, 25559MB)
- `npm start`
- `./jtag gpu/stats` shows inference usage