feat: per-message metadata bar (tokens, speed, model, pressure) #7
Merged
Conversation
Each AI response now shows a subtle info bar below it with:
- Model name (short form, e.g. "Qwen3-30B-A3B-4bit")
- Token count (completion_tokens from the response)
- Speed (tok/s, from mlx_flash_compress data or calculated)
- Elapsed time
- Memory pressure indicator (green/yellow/red badge)

Metadata is fetched from /status after each generation. Styled as dim text below the response, visible but not intrusive.
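The PR text doesn't show the /status payload itself, so here is a hypothetical sketch of the fields the info bar would need; every name below is an assumption, not the server's actual schema:

```rust
// Hypothetical /status payload consumed by the chat UI after each
// generation. Field names are illustrative assumptions.
use serde::Deserialize;

#[derive(Deserialize)]
struct StatusResponse {
    model: String,          // short model name, e.g. "Qwen3-30B-A3B-4bit"
    completion_tokens: u64, // tokens in the last completion
    tokens_per_sec: f64,    // generation speed shown as "tok/s"
    elapsed_ms: u64,        // wall-clock time for the generation
    pressure: String,       // maps to the green/yellow/red badge
}
```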
Pressure detection:
- New from_system_state() uses available_gb (free + reclaimable) instead of just free_gb, avoiding false Critical on systems with large inactive pools
- Accounts for swap: >8GB swap with <20% available = Critical
- Your system (8GB available, 13GB swap) correctly shows Warning instead of a false Critical; at 3.5GB available it correctly escalates to Critical
- Tests for the real-world scenario (36GB Mac, heavy swap)

Chat metadata:
- Fix duplicate "tok/s tok/s" display by checking whether the suffix is already present
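A minimal sketch of the heuristic as described, assuming a simple SystemState struct. Only the >8GB-swap-with-<20%-available rule comes from the commit; the 4GB absolute floor and the 30% warning threshold are illustrative assumptions chosen to reproduce the examples above:

```rust
// Sketch of the new pressure heuristic; thresholds other than the
// swap rule are assumptions.
enum Pressure { Ok, Warning, Critical }

struct SystemState {
    free_gb: f64,
    reclaimable_gb: f64, // inactive/purgeable pages the OS can hand back
    swap_used_gb: f64,
    total_gb: f64,
}

fn from_system_state(s: &SystemState) -> Pressure {
    // Count reclaimable memory as available, not just free pages, so a
    // large inactive pool no longer triggers a false Critical.
    let available_gb = s.free_gb + s.reclaimable_gb;
    let available_pct = available_gb / s.total_gb * 100.0;

    // Heavy swap plus little headroom is the real danger signal.
    if s.swap_used_gb > 8.0 && available_pct < 20.0 {
        return Pressure::Critical; // e.g. 3.5GB available + 13GB swap
    }
    if available_gb < 4.0 {
        return Pressure::Critical; // assumed absolute floor
    }
    if s.swap_used_gb > 8.0 || available_pct < 30.0 {
        return Pressure::Warning; // e.g. 8GB available + 13GB swap on 36GB
    }
    Pressure::Ok
}
```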
Token counter:
- Rust proxy now extracts completion_tokens from the Python response and increments tokens_generated_total (was always 0)
- Dashboard falls back to the Python worker's aggregated total when the Rust counter is 0

Cache panel:
- Shows "not enabled (start with --expert-dir)" instead of "--%/0"
- Proper N/A display when the expert cache is not configured

Pressure detection:
- from_system_state() considers available_gb + swap instead of just free_gb
- 8.2GB available + 12.8GB swap on a 36GB Mac → Warning (was a false Critical)
- Tests for the real-world scenario

Chat metadata:
- Fix duplicate "tok/s tok/s" display
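A sketch of the counter fix, assuming the Python worker returns an OpenAI-style usage object; the static's name mirrors the tokens_generated_total metric, but the surrounding plumbing is an assumption:

```rust
// Pull usage.completion_tokens out of the worker's JSON response and
// bump a process-wide counter (previously this counter stayed at 0).
use std::sync::atomic::{AtomicU64, Ordering};

static TOKENS_GENERATED_TOTAL: AtomicU64 = AtomicU64::new(0);

fn record_completion_tokens(body: &serde_json::Value) {
    // OpenAI-style responses carry usage.completion_tokens.
    if let Some(n) = body
        .pointer("/usage/completion_tokens")
        .and_then(|v| v.as_u64())
    {
        TOKENS_GENERATED_TOTAL.fetch_add(n, Ordering::Relaxed);
    }
}
```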
…t cache to 2GB
- Auto-detects the expert dir from ~/.cache/huggingface/hub/models--<org>--<model>/snapshots/
- Scans for .safetensors files in the latest snapshot directory
- Cache enables automatically when the model is downloaded via HuggingFace
- No --expert-dir flag needed for standard HF models
- Default cache size increased from 512MB to 2048MB (better for 36GB+ Macs)
- Logs the auto-detected path for transparency
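A rough sketch of that auto-detection (using the dirs crate); picking the "latest" snapshot by modification time and the exact error handling are assumptions:

```rust
// Locate the newest HF snapshot dir containing .safetensors shards.
use std::fs;
use std::path::PathBuf;

fn auto_detect_expert_dir(org: &str, model: &str) -> Option<PathBuf> {
    let snapshots = dirs::home_dir()?
        .join(".cache/huggingface/hub")
        .join(format!("models--{org}--{model}"))
        .join("snapshots");

    // Pick the most recently modified snapshot directory.
    let latest = fs::read_dir(&snapshots).ok()?
        .filter_map(|e| e.ok())
        .filter(|e| e.path().is_dir())
        .max_by_key(|e| e.metadata().and_then(|m| m.modified()).ok())?;

    // Only use it if it actually contains .safetensors files.
    let has_safetensors = fs::read_dir(latest.path()).ok()?
        .filter_map(|e| e.ok())
        .any(|e| e.path().extension().map_or(false, |x| x == "safetensors"));

    has_safetensors.then(|| latest.path())
}
```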
Dashboard:
- GPU (Metal) card: device utilization %, renderer %, tiler %, GPU memory
- Chart history persists across page reloads (localStorage)
- Token counter uses max(Rust, Python worker), so it always shows the real total
- Cache panel shows "allocated (standard inference mode)" when there is no expert streaming

Prometheus /metrics:
- mlx_flash_gpu_utilization_pct, gpu_renderer_pct, gpu_tiler_pct
- mlx_flash_gpu_memory_used_bytes
- All from macOS ioreg IOAccelerator

Server:
- GET /gpu endpoint: Metal GPU stats via ioreg
- Auto-detects the HF cache dir for the expert store (no --expert-dir needed)
- Default cache 2GB (was 512MB)
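A sketch of how the ioreg scrape might look; the PerformanceStatistics key names ("Device Utilization %", "Renderer Utilization %", "Tiler Utilization %") are what IOAccelerator typically exposes, but treat them and the parsing as assumptions:

```rust
// Shell out to `ioreg` and scrape GPU utilization percentages from the
// IOAccelerator PerformanceStatistics dictionary.
use std::process::Command;

fn extract_pct(text: &str, key: &str) -> Option<u64> {
    // Grab the digits that follow `"<key>"=` in the ioreg dump.
    let rest = &text[text.find(key)? + key.len()..];
    let digits: String = rest
        .chars()
        .skip_while(|c| !c.is_ascii_digit())
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}

fn gpu_stats() -> Option<(u64, u64, u64)> {
    let out = Command::new("ioreg")
        .args(["-r", "-d", "1", "-c", "IOAccelerator"])
        .output()
        .ok()?;
    let text = String::from_utf8_lossy(&out.stdout);
    Some((
        extract_pct(&text, "\"Device Utilization %\"")?,
        extract_pct(&text, "\"Renderer Utilization %\"")?,
        extract_pct(&text, "\"Tiler Utilization %\"")?,
    ))
}
```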
STRATEGY.md:
- Honest competitive position vs oMLX, SwiftLM, Flash-MoE, Ollama, LM Studio
- Three tracks: close gaps, own the niche (observability), go viral
- Priority: brew install → cost calculator → team mode → HN post → KV caching
- "Team ops" positioning: the only production-grade local LLM server
- Concrete viral plan: demo video, HN post, Claude Code guide, benchmarks page
Chat UI:
- Per-message savings in the metadata bar: "saved $0.04" (vs Claude Sonnet at $15/1M tok)
- Cumulative session savings in the status bar: "saved $1.23 vs Claude API"
- Total tokens tracked in localStorage (persists across reloads)

Dashboard:
- New "Saved vs Cloud" card showing cumulative dollar savings
- Calculated from total tokens generated × Claude Sonnet pricing

This is the viral screenshot: "MLX-Flash saved your team $487 this month"
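The savings figure is simple rate arithmetic. A sketch using the $15/1M-token rate quoted above; the token counts in the example are back-solved illustrations, not measurements from this PR:

```rust
// Savings vs cloud, at the Claude Sonnet output rate quoted above.
const CLOUD_USD_PER_MTOK: f64 = 15.0;

fn savings_usd(total_tokens: u64) -> f64 {
    total_tokens as f64 / 1_000_000.0 * CLOUD_USD_PER_MTOK
}

fn main() {
    // ~2,700 tokens ≈ $0.04 per message; ~82,000 tokens ≈ $1.23 per session
    println!("${:.2}", savings_usd(2_700));
    println!("${:.2}", savings_usd(82_000));
}
```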
Summary
Adds a visible info bar below each AI response showing inference stats.
Each AI message now displays:
- Model name (short form)
- Token count
- Speed in tok/s
- Elapsed time
- Memory pressure badge (green/yellow/red)

Styled as subtle dim text, visible but not intrusive.
Test plan