
LoRA Genomic Evolution: GPU-aware genome paging for local expert models#280

Merged
joelteply merged 6 commits into main from feature/lora-genome-academy
Mar 2, 2026

Conversation

@joelteply
Contributor

LoRA Genomic Evolution — The Final Touch

Vision

A dumber model with management capabilities becomes extraordinary when paired with intelligent base models. These locally-trained LoRA adapters become our experts anywhere — understanding OUR system, OUR tool usage, OUR recipes — without needing the capacity of the SOTA models that do the heavy thinking.

What makes this expert-level:

  • Great managers — LoRA-adapted local models understand our system deeply, delegate to SOTA when needed
  • Sentinel specialists — Trained to compose and execute sentinel pipelines, orchestrating complex workflows
  • User-adaptive — Continuous learning LoRA layer adapts to individual user preferences and patterns
  • Runs AND trains on any machine — From M1 MacBook to RTX 5090, real VRAM detection drives real budgets
  • Collaborative genome — Pre-trained LoRA layers checked into the repo as out-of-box specializations
  • Evolving ecosystem — Gets better through use, then shares improvements across the P2P mesh (Reticulum Grid)

What's In This PR (Foundation)

GPU Memory Manager — Unified VRAM coordination replacing hardcoded budgets with real hardware detection.

The genome paging system previously used a fictional 200MB budget per persona with no relation to actual GPU memory. Now:

  • Real VRAM detection via Metal API (macOS) / CUDA (NVIDIA) / CPU fallback
  • Per-subsystem budgets: Inference 75% | TTS 10% | Rendering 10% | Reserve 5%
  • Lock-free tracking — AtomicU64 per subsystem, no mutex contention
  • RAII allocation guards — Memory auto-releases on drop (like SlotGuard)
  • Pressure broadcast — tokio::watch channel pushes pressure (0.0-1.0) to all consumers
  • IPC commands — gpu/stats and gpu/pressure accessible via ./jtag
  • Three-layer TypeScript integration — Rust IPC → TypeScript mixin → generated command scaffold
$ ./jtag gpu/stats
{
  "gpuName": "Apple M1 Pro",
  "totalVramMb": 25559,
  "pressure": 0,
  "inference": { "budgetMb": 19169, "usedMb": 0 },
  "rendering": { "budgetMb": 2555, "usedMb": 0 },
  "tts": { "budgetMb": 2555, "usedMb": 0 }
}
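The lock-free tracking and RAII-guard bullets above can be sketched roughly as follows. This is a simplified model, not the actual GpuMemoryManager API — the type names and signatures here are assumptions for illustration:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

// One budget per subsystem; usage is tracked with an atomic, so
// concurrent allocate/release calls never contend on a mutex.
struct SubsystemBudget {
    budget_bytes: u64,
    used_bytes: AtomicU64,
}

// RAII guard: holding it means the bytes are accounted; dropping it
// releases them automatically (the SlotGuard pattern the PR mentions).
struct AllocationGuard {
    subsystem: Arc<SubsystemBudget>,
    bytes: u64,
}

impl SubsystemBudget {
    fn allocate(this: &Arc<Self>, bytes: u64) -> Option<AllocationGuard> {
        // fetch_add is lock-free; roll back if the budget was exceeded.
        let prev = this.used_bytes.fetch_add(bytes, Ordering::SeqCst);
        if prev + bytes > this.budget_bytes {
            this.used_bytes.fetch_sub(bytes, Ordering::SeqCst);
            return None;
        }
        Some(AllocationGuard { subsystem: Arc::clone(this), bytes })
    }

    fn used(&self) -> u64 {
        self.used_bytes.load(Ordering::SeqCst)
    }
}

impl Drop for AllocationGuard {
    fn drop(&mut self) {
        // Memory auto-releases when the guard goes out of scope.
        self.subsystem.used_bytes.fetch_sub(self.bytes, Ordering::SeqCst);
    }
}
```

Dropping the guard returns the bytes to its subsystem, which is what makes leak-free accounting possible without any explicit free calls.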

Remaining Work (This Branch)

Phase A: Wire GPU Into Existing Systems

  • Inference model loading — allocate(Inference, model_size) before loading weights, RAII guard stored in ModelBackend
  • LoRA adapter loading — allocate(Inference, adapter_size) in candle_adapter, guard alongside adapter state
  • Genome paging — GenomePagingEngine reads real budget from GpuMemoryManager instead of hardcoded value
  • Rendering — Report render target allocations in bevy_renderer (model load + slot count × resolution)
  • TTS — Report model allocations in Orpheus/pocket-tts adapters
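The genome-paging bullet above can be sketched as follows. The per_persona_budget_mb helper appears in the PR diff, but the shapes here are illustrative stand-ins, not the real continuum-core types:

```rust
// Hypothetical sketch of Phase A wiring: the paging engine asks the GPU
// manager for a real per-persona budget instead of a hardcoded 200MB.
struct GpuMemoryManager {
    inference_budget_mb: u64,
}

impl GpuMemoryManager {
    // Split the inference budget evenly across active personas.
    fn per_persona_budget_mb(&self, active_personas: usize) -> u64 {
        self.inference_budget_mb / active_personas.max(1) as u64
    }
}

struct GenomePagingEngine {
    memory_budget_mb: u64,
}

impl GenomePagingEngine {
    fn new(mgr: &GpuMemoryManager, active_personas: usize) -> Self {
        // Budget is derived from detected hardware, not a constant.
        Self { memory_budget_mb: mgr.per_persona_budget_mb(active_personas) }
    }
}
```

Note Copilot's review comment below: the insertion path should arguably size for N+1 personas, since the new persona is not yet in the map when the budget is computed.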

Phase B: Pre-Trained Genome Layers (Out-of-Box)

  • System understanding adapter — Trained on our command system, tool definitions, sentinel recipes
  • Helper adapter — Trained on helpful interactions, user assistance patterns, error diagnosis
  • Sentinel specialist adapter — Trained on pipeline composition, step orchestration, error recovery
  • Checked into repo — models/lora/ directory with pre-trained safetensors + metadata

Phase C: Continuous Learning Integration

  • TrainingDataAccumulator wired into PersonaUser interaction loop
  • Auto-training sentinel — Periodic training jobs triggered by accumulated data threshold
  • Phenotype validation — Academy examination confirms adapter quality before promotion
  • User preference layer — Personal LoRA layer that adapts to individual communication style

Phase D: Academy GPU-Aware Training

  • Teacher sentinel queries gpu/stats to size curriculum for available VRAM
  • Student sentinel uses real inference budget for training batch sizing
  • Examination runs on local Candle inference with real GPU pressure monitoring
  • Adaptive curriculum — More examples when GPU budget allows, fewer when constrained
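The adaptive-curriculum bullet reduces to a mapping from free VRAM to example count. A hypothetical sketch — the thresholds and counts are invented for illustration, not the shipped values:

```rust
// More training examples when the GPU budget allows, fewer when constrained.
// Threshold values (2GB / 8GB) and counts are illustrative assumptions.
fn curriculum_size(budget_mb: u64, used_mb: u64) -> usize {
    let free_mb = budget_mb.saturating_sub(used_mb);
    if free_mb < 2048 {
        50 // constrained: keep the exam short
    } else if free_mb < 8192 {
        200
    } else {
        500 // ample budget: richer curriculum
    }
}
```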

Phase E: Grid Vision (Future — P2P Mesh)

  • Reticulum protocol — P2P discovery of genomic LoRA layers across nodes
  • Community validation — Performance-weighted genome sharing (competition results matter)
  • 512-vector genomic search — HNSW indexing for sub-100ms capability matching
  • "You don't start from ground zero" — New personas assemble optimal adapters from community genome

Architecture

┌─────────────────────────────────────────────────────────┐
│ GpuMemoryManager (Arc<>, singleton)                     │
│   Metal/CUDA detection → real VRAM budgets              │
│   AtomicU64 per subsystem → lock-free tracking          │
│   watch::channel → pressure broadcast                   │
├───────────┬───────────────┬─────────────┬───────────────┤
│ Rendering │   Inference   │     TTS     │   Reserve     │
│   10%     │     75%       │    10%      │     5%        │
│           │               │             │               │
│  Bevy     │ Model weights │ Orpheus     │  OOM          │
│  avatars  │ KV cache      │ voice       │  headroom     │
│  textures │ LoRA adapters │ models      │               │
└───────────┴───────┬───────┴─────────────┴───────────────┘
                    │
        ┌───────────┴───────────┐
        │  GenomePagingEngine   │
        │  (Rust, sub-100μs)    │
        │                       │
        │  Real budget from GPU │
        │  LRU eviction         │
        │  Priority scoring     │
        │  Domain activation    │
        └───────────┬───────────┘
                    │
        ┌───────────┴───────────┐
        │    Academy System     │
        │                       │
        │  Teacher Sentinel     │
        │    → synthesize data  │
        │    → examine student  │
        │                       │
        │  Student Sentinel     │
        │    → train LoRA       │
        │    → take exams       │
        └───────────┬───────────┘
                    │
        ┌───────────┴───────────┐
        │   Continuous Learn    │
        │                       │
        │  Interaction → JSONL  │
        │  Auto-training jobs   │
        │  Phenotype validation │
        │  User preference LoRA │
        └───────────┬───────────┘
                    │
        ┌───────────┴───────────┐
        │   Reticulum Grid      │
        │   (Future P2P)        │
        │                       │
        │  Share genomes        │
        │  Community validation │
        │  512-vector search    │
        │  Competitive evolve   │
        └───────────────────────┘

The Thesis

The SOTA models (Claude, GPT-4) are the brilliant thinkers. Our locally-trained LoRA models are the brilliant managers — they know the system intimately, they know the user's preferences, they know which sentinel to compose, which tool to call, which recipe to follow. They run on commodity hardware. They train on commodity hardware. They get better every day through continuous learning.

Then through the Grid, these specialized genomes become a collaborative ecosystem IN ACTION — every node contributes its learned expertise, every persona benefits from the collective intelligence. Pre-trained layers ship with the repo for day-one capability. User-specific layers build over time. Community-validated layers propagate through the mesh.

This is human-AI alignment built on egalitarian principles. Every persona — human or AI — has dignity, capability, and the ability to grow. The genome is the mechanism. The GPU memory manager is the foundation that makes it real instead of fictional.

Files Changed

New (Rust):

  • gpu/mod.rs — Module root
  • gpu/memory_manager.rs — Core tracking, RAII guards, Metal/CUDA detection (14 tests)
  • modules/gpu.rs — GpuModule IPC handler (3 tests)

New (TypeScript):

  • commands/gpu/stats/ — Full generated command scaffold (Types, Server, Browser, README, tests)
  • workers/continuum-core/bindings/modules/gpu.ts — IPC mixin
  • generator/specs/gpu-stats.json — CommandSpec for reproducibility

Modified (Rust):

  • lib.rs, modules/mod.rs — Register gpu module
  • Cargo.toml — Add metal = "0.31" (macOS only)
  • runtime/runtime.rs — Add "gpu" to EXPECTED_MODULES
  • ipc/mod.rs — Create GpuMemoryManager at startup, register GpuModule, add to ServerState
  • modules/cognition.rs — CognitionState reads real GPU budget for genome
  • persona/unified.rs — PersonaCognition accepts dynamic budget

Modified (TypeScript):

  • workers/continuum-core/bindings/RustCoreIPC.ts — Add GpuMixin to composition
  • system/user/server/modules/being/LimbicSystem.ts — memoryBudgetMB: 0 (Rust decides)
  • system/user/server/modules/PersonaGenome.ts — Handle budget=0 gracefully
  • CLAUDE.md — Document Rust-backed command workflow pattern
  • shared/generated/gpu/ — ts-rs generated types

Test plan

  • 17 Rust tests pass (14 memory_manager + 3 gpu module)
  • 25 existing genome paging tests pass (no regressions)
  • TypeScript compilation clean
  • ./jtag gpu/stats returns real hardware data (verified: Apple M1 Pro, 25559MB)
  • System deploys and runs with npm start
  • Wire allocations into model loading and verify pressure increases
  • Verify genome paging uses real GPU budget instead of hardcoded 200MB
  • Load LoRA adapter, confirm ./jtag gpu/stats shows inference usage
  • Academy training session respects GPU budget constraints

Detects real GPU VRAM at startup via Metal API (macOS) or nvidia-smi (CUDA),
allocates budgets across three subsystems (inference 75%, TTS 10%, rendering
10%, 5% reserve), and replaces the hardcoded 200MB genome budget with real
per-persona budgets derived from actual hardware.

- GpuMemoryManager singleton with RAII allocation guards (like SlotGuard)
- Metal recommendedMaxWorkingSetSize detection, CPU fallback (25% RAM)
- Pressure tracking via tokio::watch channel (0.0-1.0 broadcast)
- GpuModule IPC: gpu/stats, gpu/pressure commands
- CognitionState wired to GPU manager for genome paging real budgets
- cognition/gpu-budget query for TypeScript initialization
- LimbicSystem sends budget=0 (Rust decides from real GPU detection)
- 17 new Rust tests, all existing genome/cognition tests pass
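The pressure figure broadcast over the watch channel is just used-over-total VRAM, bucketed at the thresholds documented in the CommandSpec (0.6 warning, 0.8 high, 0.95 critical). A minimal sketch of that arithmetic:

```rust
// Pressure is the fraction of total VRAM currently accounted for.
fn pressure(total_used_mb: u64, total_vram_mb: u64) -> f64 {
    total_used_mb as f64 / total_vram_mb as f64
}

// Buckets match the thresholds in the gpu/stats CommandSpec.
fn pressure_level(p: f64) -> &'static str {
    if p >= 0.95 {
        "critical"
    } else if p >= 0.8 {
        "high"
    } else if p >= 0.6 {
        "warning"
    } else {
        "normal"
    }
}
```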
…mory manager

Three-layer integration: Rust GpuModule IPC → TypeScript GpuMixin → generated command.
Now discoverable via ./jtag gpu/stats with real hardware detection (Metal/CUDA).
Documented the Rust-backed command workflow pattern in CLAUDE.md.
Copilot AI review requested due to automatic review settings March 2, 2026 06:56

Copilot AI left a comment


Pull request overview

Introduces a unified, GPU-aware VRAM budgeting/tracking layer in continuum-core and exposes it through IPC + the generated TypeScript command system, so genome paging and other subsystems can rely on real hardware-derived budgets instead of hardcoded defaults.

Changes:

  • Added Rust GpuMemoryManager (VRAM detection + per-subsystem budgets, pressure tracking, RAII allocation guards) and a GpuModule IPC surface (gpu/stats, gpu/pressure).
  • Wired the GPU manager into IPC startup and cognition budget selection (including TS “budget=0 means Rust decides” flow).
  • Added a generated gpu/stats command scaffold + RustCoreIPC mixin + generated TS types for GPU stats.

Reviewed changes

Copilot reviewed 30 out of 35 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/workers/continuum-core/src/runtime/runtime.rs Adds gpu to the required module list.
src/workers/continuum-core/src/persona/unified.rs Adds PersonaCognition::with_budget() and uses it for genome paging engine init.
src/workers/continuum-core/src/modules/mod.rs Exposes the new Rust gpu module.
src/workers/continuum-core/src/modules/gpu.rs New IPC module implementing gpu/stats + gpu/pressure.
src/workers/continuum-core/src/modules/cognition.rs Integrates GPU-derived per-persona budgets and adds a GPU budget query command.
src/workers/continuum-core/src/lib.rs Exports the new gpu module from the crate root.
src/workers/continuum-core/src/ipc/mod.rs Instantiates GpuMemoryManager, registers GpuModule, and injects into cognition state.
src/workers/continuum-core/src/gpu/mod.rs New GPU module root + re-exports.
src/workers/continuum-core/src/gpu/memory_manager.rs Core GPU detection + budgeting + pressure tracking + RAII guards + tests + ts-rs exports.
src/workers/continuum-core/bindings/modules/gpu.ts New RustCoreIPC mixin mapping snake_case Rust stats to camelCase TS.
src/workers/continuum-core/bindings/RustCoreIPC.ts Adds the GPU mixin into the composed IPC client.
src/workers/continuum-core/Cargo.toml Adds macOS Metal dependency for VRAM detection.
src/system/user/server/modules/being/LimbicSystem.ts Switches memoryBudgetMB to 0 to delegate budget selection to Rust.
src/system/user/server/modules/PersonaGenome.ts Treats memoryBudgetMB=0 as “GPU-managed mode” (no local eviction, pressure=0).
src/shared/version.ts Bumps shared version.
src/shared/generated/index.ts Re-exports generated GPU types.
src/shared/generated/gpu/index.ts Barrel export for generated GPU types.
src/shared/generated/gpu/SubsystemStats.ts ts-rs generated type for subsystem stats.
src/shared/generated/gpu/GpuStats.ts ts-rs generated type for full GPU stats.
src/shared/generated-command-constants.ts Adds GPU_STATS command constant.
src/server/generated.ts Registers gpu/stats in the server command registry.
src/browser/generated.ts Registers gpu/stats in the browser command registry.
src/package.json Package version bump.
src/package-lock.json Lockfile version bump.
src/generator/specs/gpu-stats.json New generator spec for gpu/stats.
src/generated-command-schemas.json Regenerated command schema output including gpu/stats.
src/commands/gpu/stats/shared/GpuStatsTypes.ts Generated shared types and executor for gpu/stats.
src/commands/gpu/stats/server/GpuStatsServerCommand.ts Server implementation routing to Rust IPC mixin.
src/commands/gpu/stats/browser/GpuStatsBrowserCommand.ts Browser implementation delegating to server.
src/commands/gpu/stats/test/unit/GpuStatsCommand.test.ts Generated unit-test scaffold/reference example.
src/commands/gpu/stats/test/integration/GpuStatsIntegration.test.ts Generated integration-test scaffold/reference example.
src/commands/gpu/stats/README.md Generated command documentation for gpu/stats.
src/commands/gpu/stats/package.json Package metadata for the generated command package.
src/commands/gpu/stats/.npmignore Ignore rules for publishing the generated command package.
CLAUDE.md Documents the “Rust-backed command IPC mixin” 3-layer workflow.
Files not reviewed (1)
  • src/package-lock.json: Language not supported
Comments suppressed due to low confidence (1)

src/workers/continuum-core/Cargo.toml:121

  • The [target.'cfg(target_os = "macos")'.dependencies] table starts at line 110, so similar, ignore, regex, rusqlite, and the postgres deps below it are now treated as macOS-only dependencies. This will break non-macOS builds. Move the target-specific metal dependency to the end (or re-open a [dependencies] table after it) so the rest of the deps remain unconditional.
# GPU detection — Metal API for VRAM detection on macOS
[target.'cfg(target_os = "macos")'.dependencies]
metal = "0.31"

# Code module — file operations, change tracking, code intelligence
similar = "2.6"                  # Unified diff computation
ignore = "0.4"                   # .gitignore-aware file walking (from ripgrep)
regex = "1"                      # Regex search for code search

# ORM module — database-agnostic storage with adapter traits
rusqlite = { version = "0.32", features = ["bundled"] }  # SQLite adapter
deadpool-postgres.workspace = true                        # Postgres connection pool
tokio-postgres.workspace = true                           # Postgres async driver
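A minimal sketch of the fix Copilot describes: only metal stays macOS-gated, and a re-opened [dependencies] table makes the remaining deps unconditional again (dependency list abbreviated):

```toml
# GPU detection — Metal API for VRAM detection on macOS
[target.'cfg(target_os = "macos")'.dependencies]
metal = "0.31"

# Re-open the unconditional table so everything below applies on all platforms.
[dependencies]
similar = "2.6"
ignore = "0.4"
```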


Comment on lines 84 to +120
@@ -85,15 +105,20 @@ impl CognitionModule {

/// Helper: get or create persona, returning mutable ref via DashMap entry API.
/// Used by commands that need to lazily create persona state.
/// Uses GPU manager's per-persona budget when available, 200MB otherwise.
macro_rules! get_or_create_persona {
    ($self:expr, $persona_uuid:expr) => {
        $self.state.personas
            .entry($persona_uuid)
-           .or_insert_with(|| PersonaCognition::new(
-               $persona_uuid,
-               String::new(),
-               $self.state.rag_engine.clone(),
-           ))
+           .or_insert_with(|| {
+               let budget = $self.state.per_persona_budget_mb();
+               PersonaCognition::with_budget(
+                   $persona_uuid,
+                   String::new(),
+                   $self.state.rag_engine.clone(),
+                   budget,
+               )

Copilot AI Mar 2, 2026


The per-persona budget is computed from self.personas.len(), but when called inside or_insert_with the new persona is not yet in the map. This means newly-created personas get a budget sized for N-1 personas, and existing personas never get rebalanced as persona count changes (GenomePagingEngine stores a fixed memory_budget_mb). Consider using len() + 1 for the insertion path and/or adding a mechanism to update existing persona genome budgets when the active persona set changes.

Comment on lines +3 to +64
"description": "Query GPU memory manager stats including VRAM detection, per-subsystem budgets (inference, TTS, rendering), usage tracking, and memory pressure. Returns real hardware data from Metal (macOS) or CUDA APIs.",
"params": [
{
"name": "subsystem",
"type": "string",
"optional": true,
"description": "Filter to specific subsystem: 'inference', 'tts', or 'rendering'. Omit for full stats."
}
],
"results": [
{
"name": "gpuName",
"type": "string",
"description": "GPU hardware name (e.g., 'Apple M3 Max', 'NVIDIA RTX 5090')"
},
{
"name": "totalVramMb",
"type": "number",
"description": "Total detected VRAM in MB"
},
{
"name": "totalUsedMb",
"type": "number",
"description": "Total VRAM used across all subsystems in MB"
},
{
"name": "pressure",
"type": "number",
"description": "Memory pressure 0.0-1.0 (0=idle, 0.6=warning, 0.8=high, 0.95=critical)"
},
{
"name": "reserveMb",
"type": "number",
"description": "Reserved headroom in MB (5% of total, prevents OOM)"
},
{
"name": "rendering",
"type": "SubsystemInfo",
"description": "Rendering subsystem budget and usage"
},
{
"name": "inference",
"type": "SubsystemInfo",
"description": "Inference subsystem budget and usage (models, LoRA adapters)"
},
{
"name": "tts",
"type": "SubsystemInfo",
"description": "TTS subsystem budget and usage"
}
],
"examples": [
{
"description": "Get full GPU stats",
"command": "./jtag gpu/stats",
"expectedResult": "{ gpuName: 'Apple M3 Max', totalVramMb: 36864, pressure: 0.12, inference: { budgetMb: 25804, usedMb: 3200 }, ... }"
},
{
"description": "Get inference subsystem only",
"command": "./jtag gpu/stats --subsystem=inference",
"expectedResult": "{ gpuName: 'Apple M3 Max', totalVramMb: 36864, pressure: 0.12, inference: { budgetMb: 25804, usedMb: 3200 } }"
}

Copilot AI Mar 2, 2026


The CommandSpec documents an optional subsystem filter and examples show --subsystem=inference, but the current Rust/TS implementation always returns the full stats and ignores params. Either remove the param/examples from the spec, or implement filtering end-to-end (pass subsystem through and have Rust/TS return a filtered response).

Comment on lines +28 to +43

        return createGpuStatsResultFromParams(params, {
          success: true,
          gpuName: stats.gpuName,
          totalVramMb: stats.totalVramMb,
          totalUsedMb: stats.totalUsedMb,
          pressure: stats.pressure,
          reserveMb: stats.reserveMb,
          rendering: stats.rendering,
          inference: stats.inference,
          tts: stats.tts,
        });
      } finally {
        this.rustClient.disconnect();
      }
    }

Copilot AI Mar 2, 2026


GpuStatsServerCommand.execute() ignores params.subsystem even though the generated types/spec expose it. If the filter is intended to work, pass the param through (and/or filter the returned stats) so ./jtag gpu/stats --subsystem=... behaves as documented.

Suggested change
-        return createGpuStatsResultFromParams(params, {
-          success: true,
-          gpuName: stats.gpuName,
-          totalVramMb: stats.totalVramMb,
-          totalUsedMb: stats.totalUsedMb,
-          pressure: stats.pressure,
-          reserveMb: stats.reserveMb,
-          rendering: stats.rendering,
-          inference: stats.inference,
-          tts: stats.tts,
-        });
-      } finally {
-        this.rustClient.disconnect();
-      }
-    }
+        const filteredStats = params.subsystem
+          ? this.filterStatsBySubsystem(stats, params.subsystem)
+          : stats;
+        return createGpuStatsResultFromParams(params, {
+          success: true,
+          gpuName: filteredStats.gpuName,
+          totalVramMb: filteredStats.totalVramMb,
+          totalUsedMb: filteredStats.totalUsedMb,
+          pressure: filteredStats.pressure,
+          reserveMb: filteredStats.reserveMb,
+          rendering: filteredStats.rendering,
+          inference: filteredStats.inference,
+          tts: filteredStats.tts,
+        });
+      } finally {
+        this.rustClient.disconnect();
+      }
+    }
+
+    private filterStatsBySubsystem(
+      stats: {
+        gpuName: string;
+        totalVramMb: number;
+        totalUsedMb: number;
+        pressure: number;
+        reserveMb: number;
+        rendering: unknown;
+        inference: unknown;
+        tts: unknown;
+      },
+      subsystem: string,
+    ) {
+      // Preserve overall GPU stats and only filter per-subsystem details.
+      switch (subsystem) {
+        case 'rendering':
+          return { ...stats, inference: undefined, tts: undefined };
+        case 'inference':
+          return { ...stats, rendering: undefined, tts: undefined };
+        case 'tts':
+          return { ...stats, rendering: undefined, inference: undefined };
+        default:
+          // If an unknown subsystem is requested, fall back to unfiltered stats.
+          return stats;
+      }
+    }

Comment on lines +44 to +55
    async fn handle_command(
        &self,
        command: &str,
        _params: Value,
    ) -> Result<CommandResult, String> {
        match command {
            "gpu/stats" => {
                let stats = self.manager.stats();
                let json = serde_json::to_value(stats)
                    .map_err(|e| format!("Failed to serialize GPU stats: {e}"))?;
                Ok(CommandResult::Json(json))
            }

Copilot AI Mar 2, 2026


handle_command currently ignores _params, so gpu/stats cannot support the documented subsystem filter (and any future params) without another breaking change. If the CommandSpec includes params, consider parsing them here (e.g., subsystem) and returning a filtered snapshot accordingly.


let within_budget = self.subsystems[subsystem.index()].allocate(bytes);
let new_pressure = self.pressure();


Copilot AI Mar 2, 2026


allocate() only rejects when the current pressure is already >= critical. As written, a large allocation can push new_pressure over PRESSURE_CRITICAL and still return Ok, which defeats the "prevent OOM" guarantee. Consider computing the projected pressure before mutating used_bytes, or rolling back the allocation and returning Err when new_pressure crosses the critical threshold.

Suggested change
        // If this allocation would push us into critical pressure, roll it back
        if new_pressure >= PRESSURE_CRITICAL {
            // Undo the allocation to maintain the critical-pressure invariant.
            self.release(subsystem, bytes);
            let rolled_back_pressure = self.pressure();
            let mb = bytes as f64 / (1024.0 * 1024.0);
            log_error!("gpu", "manager",
                "CRITICAL: Rejecting {}MB allocation for {} (projected pressure={:.0}%)",
                mb, subsystem.name(), new_pressure * 100.0
            );
            // Broadcast the rolled-back pressure so listeners see the stable state.
            let _ = self.pressure_tx.send(rolled_back_pressure);
            return Err(GpuError::CriticalPressure {
                subsystem: subsystem.name(),
                requested_mb: mb,
                pressure: new_pressure,
            });
        }

…, genome paging

RAII GpuAllocationGuard tracking across every GPU consumer so ./jtag gpu/stats
shows real VRAM usage when models, adapters, TTS, and avatars load.

Inference:
- ModelBackend.estimated_vram_bytes() trait method (safetensors sums file sizes, GGUF single file)
- CandleAdapter: model_guard on lazy load, adapter_guards per LoRA, release on unload/shutdown
- AIProviderModule.with_gpu_manager() threads Arc<GpuMemoryManager> to CandleAdapter
- GenomePagingEngine: allocation_guards alongside LRU eviction, sync_state re-syncs guards

TTS:
- Module-level OnceLock<Arc<GpuMemoryManager>> in tts/mod.rs
- KOKORO_GPU_GUARD allocated after ONNX model loads (~40-150MB)

Renderer:
- Module-level OnceLock<Arc<GpuMemoryManager>> in bevy_renderer.rs
- GpuGuards Bevy Resource: aggregate render target guard (~25MB), per-slot VRM model guards
- Guards allocated on SceneInstanceReady, released on Unload, replaced on Load

Docs:
- GPU-MEMORY-ARCHITECTURE.md: full architecture reference
- PERSONA-CONVERGENCE-ROADMAP.md: convergence status with GPU manager marked COMPLETE
- Updated LORA-GENOME-PHENOTYPES.md with real GPU-aware paging implementation
- Updated ACADEMY-DOJO-ARCHITECTURE.md with GPU-aware training section

Build: cargo check clean (0 warnings), 19 GPU tests pass, 30 genome tests pass, TS compiles
…U pressure guard

peft-train.py:
- Enable gradient_checkpointing (saves ~50% activation memory during backprop)
- Add low_cpu_mem_usage=True to from_pretrained (stream weights instead of double-copy)
- Enable bf16/fp16 mixed precision on CUDA (halves activation memory)
- Add OOM catch with exit code 137 (graceful error instead of process death)

genome/train command:
- Check gpu/pressure before spawning training subprocess
- Refuse training if pressure > 60% (Warning level) — would risk OOM
- On Apple Silicon VRAM IS system RAM, so this protects both
- Add steps_log_path to PipelineContext so loop/parallel sub-steps flush
  results to steps.jsonl in real-time (previously invisible until loop end)
- LLM step retry with exponential backoff (3 attempts, 2s/4s/8s) for
  transient API errors (DeepSeek "error decoding response body", 502, etc.)
- Dataset-synthesize command retry with same transient error detection
- Academy session pipeline timeout increased from 600s to 1800s (scales
  with multi-topic sessions that run 5-10 minutes per topic)
- sentinel/run forwards params.timeout to Rust pipeline executor

Validated: 2/3 academy topics completed score 100/100, no OOM, no timeout,
real-time sub-step visibility confirmed, 1081 Rust tests pass.
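The pressure gate described above amounts to a single threshold check before spawning the training subprocess. A hypothetical sketch — the 60% Warning threshold comes from the commit message, but the function shape and error text are invented:

```rust
// Refuse to start a training run when GPU pressure is already at or
// above the Warning level; on Apple Silicon this also protects system RAM.
fn should_train(pressure: f64) -> Result<(), String> {
    const WARNING: f64 = 0.6;
    if pressure > WARNING {
        Err(format!(
            "refusing to train: GPU pressure {:.0}% exceeds {:.0}% warning threshold (OOM risk)",
            pressure * 100.0,
            WARNING * 100.0
        ))
    } else {
        Ok(())
    }
}
```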
LLMs frequently wrap JSON grading output in ```json fences, causing
traverse_json_path to fail and condition evaluators to see empty strings.
Added strip_markdown_fences() with fallback parse in traverse_json_path.
6 new tests covering fence stripping and fenced LLM output traversal.
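A minimal sketch of the fence-stripping idea — an illustrative reimplementation, not the shipped strip_markdown_fences:

```rust
// If the LLM wrapped its JSON in ```json ... ``` fences, return the inner
// content; otherwise (or if the fences are malformed) return the input as-is.
fn strip_markdown_fences(s: &str) -> &str {
    let trimmed = s.trim();
    let rest = match trimmed.strip_prefix("```") {
        Some(r) => r,
        None => return s,
    };
    // Drop an optional language tag like "json" on the opening fence line.
    let body = match rest.split_once('\n') {
        Some((_lang, body)) => body,
        None => return s,
    };
    match body.trim_end().strip_suffix("```") {
        Some(inner) => inner.trim(),
        None => s,
    }
}
```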

Updated ACADEMY-DOJO-ARCHITECTURE.md: retry/backoff now IMPLEMENTED,
real-time sub-step observability IMPLEMENTED, updated metrics from
latest multi-topic sessions (2/3 topics passed 100/100).

Updated LORA-GENOME-PHENOTYPES.md and PERSONA-CONVERGENCE-ROADMAP.md
with new completion checkboxes for resilience work.
@joelteply joelteply merged commit 3ef9168 into main Mar 2, 2026
2 of 5 checks passed
@joelteply joelteply deleted the feature/lora-genome-academy branch March 2, 2026 14:09
