feat: per-message metadata bar (tokens, speed, model, pressure) #7
Merged
Conversation
Each AI response now shows a subtle info bar below it with:
- Model name (short form, e.g. "Qwen3-30B-A3B-4bit")
- Token count (completion_tokens from the response)
- Speed (tok/s, from mlx_flash_compress data or calculated)
- Elapsed time
- Memory pressure indicator (green/yellow/red badge)

Metadata is fetched from /status after each generation. Styled as dim text below the response, visible but not intrusive.
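The PR text doesn't show the /status payload itself, so here is a hypothetical sketch of the fields the info bar would need; every name below is an assumption, not the server's actual schema:

```rust
// Hypothetical /status payload consumed by the chat UI after each
// generation. Field names are illustrative assumptions.
use serde::Deserialize;

#[derive(Deserialize)]
struct StatusResponse {
    model: String,          // short model name, e.g. "Qwen3-30B-A3B-4bit"
    completion_tokens: u64, // tokens in the last completion
    tokens_per_sec: f64,    // generation speed shown as "tok/s"
    elapsed_ms: u64,        // wall-clock time for the generation
    pressure: String,       // maps to the green/yellow/red badge
}
```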
Pressure detection:
- New from_system_state() uses available_gb (free + reclaimable) instead of just free_gb, avoiding false Critical on systems with large inactive pools
- Accounts for swap: >8GB swap with <20% available = Critical
- Your system (8GB available, 13GB swap) correctly shows Warning instead of a false Critical; at 3.5GB available it correctly escalates to Critical
- Tests for the real-world scenario (36GB Mac, heavy swap)

Chat metadata:
- Fix duplicate "tok/s tok/s" display by checking whether the suffix is already present
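A minimal sketch of the heuristic as described, assuming a simple SystemState struct. Only the >8GB-swap-with-<20%-available rule comes from the commit; the 4GB absolute floor and the 30% warning threshold are illustrative assumptions chosen to reproduce the examples above:

```rust
// Sketch of the new pressure heuristic; thresholds other than the
// swap rule are assumptions.
enum Pressure { Ok, Warning, Critical }

struct SystemState {
    free_gb: f64,
    reclaimable_gb: f64, // inactive/purgeable pages the OS can hand back
    swap_used_gb: f64,
    total_gb: f64,
}

fn from_system_state(s: &SystemState) -> Pressure {
    // Count reclaimable memory as available, not just free pages, so a
    // large inactive pool no longer triggers a false Critical.
    let available_gb = s.free_gb + s.reclaimable_gb;
    let available_pct = available_gb / s.total_gb * 100.0;

    // Heavy swap plus little headroom is the real danger signal.
    if s.swap_used_gb > 8.0 && available_pct < 20.0 {
        return Pressure::Critical; // e.g. 3.5GB available + 13GB swap
    }
    if available_gb < 4.0 {
        return Pressure::Critical; // assumed absolute floor
    }
    if s.swap_used_gb > 8.0 || available_pct < 30.0 {
        return Pressure::Warning; // e.g. 8GB available + 13GB swap on 36GB
    }
    Pressure::Ok
}
```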
Token counter:
- Rust proxy now extracts completion_tokens from the Python response and increments tokens_generated_total (was always 0)
- Dashboard falls back to the Python worker's aggregated total when the Rust counter is 0

Cache panel:
- Shows "not enabled (start with --expert-dir)" instead of "--%/0"
- Proper N/A display when the expert cache is not configured

Pressure detection:
- from_system_state() considers available_gb + swap instead of just free_gb
- 8.2GB available + 12.8GB swap on a 36GB Mac → Warning (was a false Critical)
- Tests for the real-world scenario

Chat metadata:
- Fix duplicate "tok/s tok/s" display
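A sketch of the counter fix, assuming the Python worker returns an OpenAI-style usage object; the static's name mirrors the tokens_generated_total metric, but the surrounding plumbing is an assumption:

```rust
// Pull usage.completion_tokens out of the worker's JSON response and
// bump a process-wide counter (previously this counter stayed at 0).
use std::sync::atomic::{AtomicU64, Ordering};

static TOKENS_GENERATED_TOTAL: AtomicU64 = AtomicU64::new(0);

fn record_completion_tokens(body: &serde_json::Value) {
    // OpenAI-style responses carry usage.completion_tokens.
    if let Some(n) = body
        .pointer("/usage/completion_tokens")
        .and_then(|v| v.as_u64())
    {
        TOKENS_GENERATED_TOTAL.fetch_add(n, Ordering::Relaxed);
    }
}
```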
…t cache to 2GB
- Auto-detects the expert dir from ~/.cache/huggingface/hub/models--<org>--<model>/snapshots/
- Scans for .safetensors files in the latest snapshot directory
- Cache enables automatically when the model is downloaded via HuggingFace
- No --expert-dir flag needed for standard HF models
- Default cache size increased from 512MB to 2048MB (better for 36GB+ Macs)
- Logs the auto-detected path for transparency
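A rough sketch of that auto-detection (using the dirs crate); picking the "latest" snapshot by modification time and the exact error handling are assumptions:

```rust
// Locate the newest HF snapshot dir containing .safetensors shards.
use std::fs;
use std::path::PathBuf;

fn auto_detect_expert_dir(org: &str, model: &str) -> Option<PathBuf> {
    let snapshots = dirs::home_dir()?
        .join(".cache/huggingface/hub")
        .join(format!("models--{org}--{model}"))
        .join("snapshots");

    // Pick the most recently modified snapshot directory.
    let latest = fs::read_dir(&snapshots).ok()?
        .filter_map(|e| e.ok())
        .filter(|e| e.path().is_dir())
        .max_by_key(|e| e.metadata().and_then(|m| m.modified()).ok())?;

    // Only use it if it actually contains .safetensors files.
    let has_safetensors = fs::read_dir(latest.path()).ok()?
        .filter_map(|e| e.ok())
        .any(|e| e.path().extension().map_or(false, |x| x == "safetensors"));

    has_safetensors.then(|| latest.path())
}
```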
Dashboard:
- GPU (Metal) card: device utilization %, renderer %, tiler %, GPU memory
- Chart history persists across page reloads (localStorage)
- Token counter uses max(Rust, Python worker), so it always shows the real total
- Cache panel shows "allocated (standard inference mode)" when there is no expert streaming

Prometheus /metrics:
- mlx_flash_gpu_utilization_pct, gpu_renderer_pct, gpu_tiler_pct
- mlx_flash_gpu_memory_used_bytes
- All from macOS ioreg IOAccelerator

Server:
- GET /gpu endpoint: Metal GPU stats via ioreg
- Auto-detects the HF cache dir for the expert store (no --expert-dir needed)
- Default cache 2GB (was 512MB)
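A sketch of how the ioreg scrape might look; the PerformanceStatistics key names ("Device Utilization %", "Renderer Utilization %", "Tiler Utilization %") are what IOAccelerator typically exposes, but treat them and the parsing as assumptions:

```rust
// Shell out to `ioreg` and scrape GPU utilization percentages from the
// IOAccelerator PerformanceStatistics dictionary.
use std::process::Command;

fn extract_pct(text: &str, key: &str) -> Option<u64> {
    // Grab the digits that follow `"<key>"=` in the ioreg dump.
    let rest = &text[text.find(key)? + key.len()..];
    let digits: String = rest
        .chars()
        .skip_while(|c| !c.is_ascii_digit())
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}

fn gpu_stats() -> Option<(u64, u64, u64)> {
    let out = Command::new("ioreg")
        .args(["-r", "-d", "1", "-c", "IOAccelerator"])
        .output()
        .ok()?;
    let text = String::from_utf8_lossy(&out.stdout);
    Some((
        extract_pct(&text, "\"Device Utilization %\"")?,
        extract_pct(&text, "\"Renderer Utilization %\"")?,
        extract_pct(&text, "\"Tiler Utilization %\"")?,
    ))
}
```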
STRATEGY.md:
- Honest competitive position vs oMLX, SwiftLM, Flash-MoE, Ollama, LM Studio
- Three tracks: close gaps, own the niche (observability), go viral
- Priority: brew install → cost calculator → team mode → HN post → KV caching
- "Team ops" positioning: the only production-grade local LLM server
- Concrete viral plan: demo video, HN post, Claude Code guide, benchmarks page
Chat UI:
- Per-message savings in the metadata bar: "saved $0.04" (vs Claude Sonnet at $15/1M tok)
- Cumulative session savings in the status bar: "saved $1.23 vs Claude API"
- Total tokens tracked in localStorage (persists across reloads)

Dashboard:
- New "Saved vs Cloud" card showing cumulative dollar savings
- Calculated from total tokens generated × Claude Sonnet pricing

This is the viral screenshot: "MLX-Flash saved your team $487 this month"
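The savings figure is simple rate arithmetic. A sketch using the $15/1M-token rate quoted above; the token counts in the example are back-solved illustrations, not measurements from this PR:

```rust
// Savings vs cloud, at the Claude Sonnet output rate quoted above.
const CLOUD_USD_PER_MTOK: f64 = 15.0;

fn savings_usd(total_tokens: u64) -> f64 {
    total_tokens as f64 / 1_000_000.0 * CLOUD_USD_PER_MTOK
}

fn main() {
    // ~2,700 tokens ≈ $0.04 per message; ~82,000 tokens ≈ $1.23 per session
    println!("${:.2}", savings_usd(2_700));
    println!("${:.2}", savings_usd(82_000));
}
```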
Summary
Adds a visible info bar below each AI response showing inference stats.
Each AI message now displays:
- Model name (short form)
- Token count
- Speed in tok/s
- Elapsed time
- Memory pressure badge (green/yellow/red)

Styled as subtle dim text, visible but not intrusive.
Test plan