
feat: per-message metadata bar (tokens, speed, model, pressure)#7

Merged
szibis merged 8 commits into main from feat/chat-info-bar
Apr 6, 2026

Conversation

szibis (Owner) commented Apr 6, 2026

Summary

Adds a visible info bar below each AI response showing inference stats.

Each AI message now displays:

  • Model — short name (e.g. "Qwen3-30B-A3B-4bit")
  • Tokens — completion token count
  • Speed — tok/s (from mlx_flash_compress or calculated)
  • Time — elapsed seconds
  • Pressure — color-coded badge (green/yellow/red)

Styled as subtle dim text — visible but not intrusive.

Test plan

  • Send a message in chat, verify metadata bar appears below AI response
  • Metadata shows model name, token count, speed, time
  • Pressure badge shows correct color based on memory state

szibis added 8 commits April 6, 2026 16:04
Each AI response now shows a subtle info bar below it with:
- Model name (short form, e.g. "Qwen3-30B-A3B-4bit")
- Token count (completion_tokens from response)
- Speed (tok/s — from mlx_flash_compress data or calculated)
- Elapsed time
- Memory pressure indicator (green/yellow/red badge)

Metadata fetched from /status after each generation.
Styled as dim text below the response — visible but not intrusive.
Pressure detection:
- New from_system_state() uses available_gb (free + reclaimable) instead of
  just free_gb — avoids false Critical on systems with large inactive pools
- Accounts for swap: >8GB swap with <20% available = Critical
- Your system (8GB available, 13GB swap) correctly shows Warning instead of
  false Critical; at 3.5GB available it correctly escalates to Critical
- Tests for real-world scenario (36GB Mac, heavy swap)
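The pressure logic above can be sketched in Python. The swap rule (>8GB swap with <20% available is Critical) comes from the commit message; the function name and the 15%/30% cutoffs are assumptions chosen so the PR's two worked examples (8GB available with 13GB swap, and 3.5GB available) classify as stated.

```python
def pressure_from_system_state(free_gb: float, reclaimable_gb: float,
                               swap_used_gb: float, total_gb: float) -> str:
    """Classify memory pressure from available memory, not free alone.

    available = free + reclaimable, so systems with large inactive/file-cache
    pools no longer trip a false Critical. Cutoffs below are illustrative
    except the swap rule stated in the PR.
    """
    available_gb = free_gb + reclaimable_gb
    available_pct = available_gb / total_gb * 100

    # PR rule: heavy swap combined with low availability is Critical.
    if swap_used_gb > 8 and available_pct < 20:
        return "critical"
    if available_pct < 15:   # assumed cutoff
        return "critical"
    if available_pct < 30:   # assumed cutoff
        return "warning"
    return "normal"
```

On the 36GB Mac from the commit message: 8GB available with 13GB swap is 22% available, which falls through the swap rule into Warning; 3.5GB available (under 20% with heavy swap) escalates to Critical.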

Chat metadata:
- Fix duplicate "tok/s tok/s" display — checks if suffix already present
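The de-duplication check is trivial but worth pinning down; a sketch (function name assumed):

```python
def format_speed(value, unit: str = "tok/s") -> str:
    """Append the tok/s suffix only if it is not already present,
    avoiding the duplicate "tok/s tok/s" display."""
    text = str(value).strip()
    return text if text.endswith(unit) else f"{text} {unit}"
```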
Token counter:
- Rust proxy now extracts completion_tokens from Python response and
  increments tokens_generated_total (was always 0)
- Dashboard falls back to Python worker aggregated total when Rust counter is 0

Cache panel:
- Shows "not enabled (start with --expert-dir)" instead of "--%/0"
- Proper N/A display when expert cache is not configured

Pressure detection:
- from_system_state() considers available_gb + swap instead of just free_gb
- 8.2GB available + 12.8GB swap on 36GB Mac → Warning (was false Critical)
- Tests for real-world scenario

Chat metadata:
- Fix duplicate "tok/s tok/s" display
…t cache to 2GB

- Auto-detects expert dir from ~/.cache/huggingface/hub/models--<org>--<model>/snapshots/
- Scans for .safetensors files in latest snapshot directory
- Cache enables automatically when model is downloaded via HuggingFace
- No --expert-dir flag needed for standard HF models
- Default cache size increased from 512MB to 2048MB (better for 36GB+ Macs)
- Logs the auto-detected path for transparency
Dashboard:
- GPU (Metal) card: device utilization %, renderer %, tiler %, GPU memory
- Chart history persists across page reload (localStorage)
- Token counter uses max(Rust, Python worker) — always shows real total
- Cache panel shows "allocated (standard inference mode)" when no expert streaming

Prometheus /metrics:
- mlx_flash_gpu_utilization_pct, gpu_renderer_pct, gpu_tiler_pct
- mlx_flash_gpu_memory_used_bytes
- All from macOS ioreg IOAccelerator
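Reading those stats can be sketched as a thin wrapper over `ioreg`. The commit only says the values come from the IOAccelerator class; the `PerformanceStatistics` key names below match what Apple Silicon typically reports, but treat them (and the function names) as assumptions:

```python
import re
import subprocess


def parse_ioreg_gpu(text: str) -> dict:
    """Extract GPU counters from ioreg IOAccelerator output text."""
    stats = {}
    for key, field in [
        ("Device Utilization %", "gpu_utilization_pct"),
        ("Renderer Utilization %", "gpu_renderer_pct"),
        ("Tiler Utilization %", "gpu_tiler_pct"),
        ("In use system memory", "gpu_memory_used_bytes"),
    ]:
        m = re.search(re.escape(f'"{key}"') + r"\s*=\s*(\d+)", text)
        if m:
            stats[field] = int(m.group(1))
    return stats


def read_gpu_stats() -> dict:
    """Run ioreg and parse the IOAccelerator performance statistics."""
    out = subprocess.run(
        ["ioreg", "-r", "-c", "IOAccelerator", "-d", "1"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ioreg_gpu(out)
```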

Server:
- GET /gpu endpoint: Metal GPU stats via ioreg
- Auto-detect HF cache dir for expert store (no --expert-dir needed)
- Default cache 2GB (was 512MB)
STRATEGY.md:
- Honest competitive position vs oMLX, SwiftLM, Flash-MoE, Ollama, LM Studio
- Three tracks: close gaps, own niche (observability), go viral
- Priority: brew install → cost calculator → team mode → HN post → KV caching
- "Team ops" positioning: only production-grade local LLM server
- Concrete viral plan: demo video, HN post, Claude Code guide, benchmarks page
Chat UI:
- Per-message savings in metadata bar: "saved $0.04" (vs Claude Sonnet $15/1M tok)
- Cumulative session savings in status bar: "saved $1.23 vs Claude API"
- Total tokens tracked in localStorage (persists across reloads)

Dashboard:
- New "Saved vs Cloud" card showing cumulative dollar savings
- Calculated from total tokens generated × Claude Sonnet pricing

This is the viral screenshot: "MLX-Flash saved your team $487 this month"
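The savings math above reduces to one line; a sketch using the $15/1M-token Claude Sonnet output price stated in the commit (function name assumed):

```python
CLAUDE_SONNET_USD_PER_MTOK = 15.0  # output-token price used by the PR


def savings_usd(tokens_generated: int) -> float:
    """Dollar savings vs the cloud API, from total tokens served locally."""
    return tokens_generated * CLAUDE_SONNET_USD_PER_MTOK / 1_000_000
```

At that rate the per-message "saved $0.04" badge corresponds to roughly 2,700 completion tokens.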
szibis merged commit f722b8b into main Apr 6, 2026
4 checks passed
