Commit 753cb06
feat: add comprehensive Google TPU monitoring support (#79)
* feat: add Google TPU support via libtpu
Add support for monitoring Google Cloud TPU (Tensor Processing Unit) accelerators.
This implementation follows the existing device reader patterns used for other
NPU/accelerator devices.
Features:
- GoogleTpuReader implementing GpuReader trait
- Support for all TPU generations (v2, v3, v4, v5e, v5p, v6 Trillium, v7 Ironwood)
- TPU device detection via /dev/accel*, sysfs vendor ID, and libtpu.so
- Memory size specifications for each TPU generation
- Python/JAX integration for device enumeration (Option B from issue)
- Unit tests with mocked TPU data
Closes #75
* docs: add Google Cloud TPU support to README
- Added TPU to list of supported accelerators in description
- Added TPU to platform-specific features section
- Added TPU to cross-platform support section
- Added TPU to mock server and API metrics sections
* fix(google_tpu): improve security and code quality in TPU reader
HIGH Priority Fixes:
- Remove unsafe fallback device detection without vendor verification (H1)
Previously returned true for any /dev/accel* device without vendor check,
which could misidentify Intel Gaudi devices as TPUs. Now requires positive
verification of Google vendor ID (0x1ae0).
- Add JSON schema validation for Python script output (H2)
Validates utilization range (0-100%), memory consistency, non-negative
power values, and temperature range (0-200°C) to prevent malformed data.
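The two HIGH priority checks above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `TpuSample` struct and both function names are hypothetical, and only the vendor ID (0x1ae0) and the value ranges come from the message above.

```rust
/// Google PCI vendor ID as quoted in the commit message.
const GOOGLE_VENDOR_ID: u16 = 0x1ae0;

/// Returns true only when the contents of a sysfs `vendor` file
/// (for example "0x1ae0\n") parse to the Google vendor ID.
/// Unparseable input counts as "not a TPU" instead of falling back to true.
fn is_google_vendor(contents: &str) -> bool {
    let trimmed = contents.trim();
    let hex = trimmed.strip_prefix("0x").unwrap_or(trimmed);
    u16::from_str_radix(hex, 16).map_or(false, |id| id == GOOGLE_VENDOR_ID)
}

/// Hypothetical shape of one device record after JSON parsing.
#[derive(Debug, Clone, Copy)]
struct TpuSample {
    utilization_percent: f64,
    memory_used_bytes: u64,
    memory_total_bytes: u64,
    power_watts: f64,
    temperature_celsius: f64,
}

/// Applies the range checks listed under H2. NaN values fail every check.
fn validate_sample(s: &TpuSample) -> Result<(), String> {
    if !(0.0..=100.0).contains(&s.utilization_percent) {
        return Err(format!("utilization out of range: {}", s.utilization_percent));
    }
    if s.memory_used_bytes > s.memory_total_bytes {
        return Err("memory used exceeds memory total".to_string());
    }
    if s.power_watts.is_nan() || s.power_watts < 0.0 {
        return Err(format!("negative or invalid power: {}", s.power_watts));
    }
    if !(0.0..=200.0).contains(&s.temperature_celsius) {
        return Err(format!("temperature out of range: {}", s.temperature_celsius));
    }
    Ok(())
}
```

Rejecting on parse failure is the point of H1: a device node alone is never treated as sufficient evidence.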
MEDIUM Priority Fixes:
- Remove duplicated LIBTPU_PATHS constant (M3)
Now uses constant from device::common::constants::google_tpu module.
- Replace blocking mutex lock with try_lock (M4)
Prevents potential deadlocks in is_tpu_script_available() during
concurrent initialization.
- Replace bare 'except:' with 'except Exception:' in Python code (M5)
Prevents catching system exceptions like KeyboardInterrupt.
Race condition between TPU and Gaudi detection (M6) is already properly
handled in platform_detection.rs with vendor ID verification.
All changes verified with cargo build, cargo clippy, and release build.
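A rough sketch of the try_lock idea from M4, using only `std::sync::Mutex`. The function name and the cache layout are hypothetical; the actual `is_tpu_script_available()` may be structured differently.

```rust
use std::sync::{Mutex, TryLockError};

/// Returns the cached probe result, running `probe` once if the cache is empty.
/// Returns None when another caller currently holds the lock, so this call
/// never blocks and the caller can simply retry later.
fn cached_or_skip(
    cache: &Mutex<Option<bool>>,
    probe: impl FnOnce() -> bool,
) -> Option<bool> {
    match cache.try_lock() {
        Ok(mut guard) => Some(*guard.get_or_insert_with(probe)),
        // Lock is busy: report "unknown for now" instead of waiting.
        Err(TryLockError::WouldBlock) => None,
        // A previous holder panicked; the cached bool is still usable.
        Err(TryLockError::Poisoned(poisoned)) => {
            let mut guard = poisoned.into_inner();
            Some(*guard.get_or_insert_with(probe))
        }
    }
}
```

The trade-off is that a caller may occasionally see "unknown" during concurrent initialization, which is assumed here to be acceptable for an availability check.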
* fix: use centralized LIBTPU_PATHS constant in platform_detection
- Remove duplicate LIBTPU_PATHS constant definition
- Import from device::common::constants::google_tpu module
- Add cfg(target_os = "linux") to match function scope
* feat: add dynamic libtpu path detection for Python environments
Extend libtpu library detection to search in Python environments:
- $HOME/.local/lib/python*/site-packages/libtpu/libtpu.so
- Virtual environments (VIRTUAL_ENV/lib/python*/site-packages)
- Conda/mamba environments (anaconda3, miniconda3, mambaforge, miniforge3)
- System Python site-packages (/usr/lib/python*, /usr/local/lib/python*)
This allows detection of newer libtpu versions installed via pip or
conda that may not be in standard system library paths.
Add find_libtpu_paths() function to enumerate all libtpu locations
and is_libtpu_available() for quick availability check.
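The `python*/site-packages` search described above could look roughly like this with only the standard library. `find_libtpu_under` is a hypothetical helper name, not the `find_libtpu_paths()` added by this commit, and it handles a single base directory such as `$HOME/.local/lib`.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Scans `lib_dir` for `python*/site-packages/libtpu/libtpu.so` and returns
/// every match in sorted order. A missing or unreadable directory yields
/// an empty list instead of an error.
fn find_libtpu_under(lib_dir: &Path) -> Vec<PathBuf> {
    let Ok(entries) = fs::read_dir(lib_dir) else {
        return Vec::new();
    };
    let mut found: Vec<PathBuf> = entries
        .flatten() // skip entries that failed to read
        .filter(|e| e.file_name().to_string_lossy().starts_with("python"))
        .map(|e| {
            e.path()
                .join("site-packages")
                .join("libtpu")
                .join("libtpu.so")
        })
        .filter(|p| p.is_file())
        .collect();
    found.sort();
    found
}
```

Calling it once per base directory (user site, `VIRTUAL_ENV`, conda roots, system paths) and concatenating the results would give the full list.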
* add: TPU behavior test code
* fix: improve TPU detection and suppress JAX logging output
Issues fixed:
1. JAX stdout/stderr pollution causing TUI screen corruption
- Add environment variables to suppress TensorFlow/JAX logging
- Redirect stderr to /dev/null during JAX import
- Suppress warnings and GRPC/ABSL logging
2. TPU v6e detection not working (no /dev/accel* devices)
- Add TPU VM environment variable detection
(TPU_NAME, TPU_CHIPS_PER_HOST_BOUNDS, TPU_ACCELERATOR_TYPE, etc.)
- Add libtpu + TPU worker environment check
- Add GCE metadata server check for TPU VMs
- Support Cloud TPU VMs where TPU access is via gRPC, not /dev/accel
3. False positive TPU detection when only libtpu.so is installed
- Remove standalone libtpu check from has_google_tpu()
- Require actual TPU hardware or TPU VM environment indicators
* fix: remove Python/JAX from TPU detection to prevent TUI corruption
Root cause: Python with JAX import takes 10+ seconds to initialize.
The command timeout (2s) triggers, but the orphaned Python process
continues running and outputs to stdout/stderr, corrupting the TUI.
Solution: Replace Python-based detection with pure Rust implementation:
- Use sysfs (/dev/accel*, /sys/class/accel/) for on-premise TPUs
- Use environment variables for Cloud TPU VMs (v6e, etc.):
- TPU_NAME, TPU_ACCELERATOR_TYPE, TPU_CHIPS_PER_HOST_BOUNDS
- TPU_WORKER_ID, TPU_WORKER_HOSTNAMES
- Parse accelerator type string to detect TPU version
- Detect chip count from TPU_CHIPS_PER_HOST_BOUNDS
This eliminates all external process execution in the TPU reader,
preventing any possibility of output pollution.
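As an illustration of the chip-count step, the sketch below assumes that TPU_CHIPS_PER_HOST_BOUNDS holds comma-separated per-axis bounds such as "2,2,1" and that the chip count is their product. That format is an assumption, and both function names are hypothetical.

```rust
use std::env;

/// Parses per-axis bounds such as "2,2,1" and returns their product.
/// Returns None for empty input, non-numeric parts, zero, or overflow.
fn chips_from_bounds(value: &str) -> Option<u32> {
    let mut product: u32 = 1;
    for part in value.split(',') {
        let n: u32 = part.trim().parse().ok()?;
        if n == 0 {
            return None;
        }
        product = product.checked_mul(n)?;
    }
    Some(product)
}

/// Reads the environment variable without spawning any external process.
fn chips_from_env() -> Option<u32> {
    env::var("TPU_CHIPS_PER_HOST_BOUNDS")
        .ok()
        .and_then(|v| chips_from_bounds(&v))
}
```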
* fix: add TPU v6e support and eliminate process accumulation
- Add V6e variant to TpuGeneration enum with 16GB HBM memory
- Fix parse_accelerator_type() to properly detect v6e before v6
- Remove lspci/curl calls from has_google_tpu() to prevent process accumulation
- TPU v6e now displays as "Google TPU v6e" with correct memory size
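The ordering issue (checking v6e before v6) can be shown with a small prefix table. The enum and function below are a hypothetical sketch, not the commit's `TpuGeneration` or `parse_accelerator_type()`, and the "v5litepod" spelling for v5e is an assumption about accelerator type strings.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Generation {
    V2,
    V3,
    V4,
    V5e,
    V5p,
    V6e,
    V6,
    Unknown,
}

/// Maps an accelerator type string such as "v6e-8" to a generation.
/// Longer prefixes come first so that "v6e" is never matched as "v6".
fn parse_generation(accelerator_type: &str) -> Generation {
    const PREFIXES: [(&str, Generation); 8] = [
        ("v6e", Generation::V6e),
        ("v6", Generation::V6),
        ("v5litepod", Generation::V5e), // assumed alias for v5e
        ("v5e", Generation::V5e),
        ("v5p", Generation::V5p),
        ("v4", Generation::V4),
        ("v3", Generation::V3),
        ("v2", Generation::V2),
    ];
    let s = accelerator_type.trim().to_ascii_lowercase();
    PREFIXES
        .iter()
        .find(|(prefix, _)| s.starts_with(*prefix))
        .map_or(Generation::Unknown, |(_, generation)| *generation)
}
```

If the `("v6", ...)` row were listed before `("v6e", ...)`, every v6e string would be reported as plain v6, which is the class of bug this fix describes.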
* fix: properly kill child processes on command timeout
The previous implementation spawned a detached thread to run commands,
which caused process accumulation when timeouts occurred - the child
process continued running even after the timeout.
Now uses spawn() + try_wait() polling with proper kill() and wait()
on timeout to ensure child processes are terminated and reaped.
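The spawn() + try_wait() pattern described above can be sketched as follows. This is a simplified illustration with a hypothetical function name: output is discarded so that a chatty child cannot block on a full pipe, whereas the real helper presumably also captures stdout.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Runs `cmd` and waits at most `timeout` for it to exit.
/// Returns Ok(Some(status)) on normal exit and Ok(None) on timeout.
/// On timeout the child is killed and then reaped, so no orphan or zombie
/// process is left behind.
fn run_with_timeout(cmd: &mut Command, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // try_wait() never blocks; it returns Some(status) once the child exits.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // Ignore the kill error: the child may have exited in the meantime.
            let _ = child.kill();
            child.wait()?; // reap the process
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

For example, `run_with_timeout(Command::new("sleep").arg("30"), Duration::from_millis(100))` would be expected to return `Ok(None)` shortly after the deadline on a system where `sleep` exists.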
* feat: add libtpuinfo FFI bindings for real TPU metrics
- Add libtpuinfo.rs module with Rust FFI bindings to libtpuinfo.so
- Dynamically load libtpuinfo.so from known paths
- Integrate real TPU metrics (memory usage, duty cycle) into GoogleTpuReader
- Priority: libtpuinfo > sysfs detection > environment variables
- Add libloading dependency for Linux
libtpuinfo provides:
- Device count and IDs
- Memory usage and total memory per device
- Duty cycle percentage (utilization)
* fix: resolve libtpuinfo FFI borrow checker error
* feat: add /dev/vfio detection and tpu-info CLI parsing
- Add detect_tpu_from_vfio() for v6e and newer TPUs using /dev/vfio/*
- Add get_accelerator_type_from_tpu_info() to parse `tpu-info -v` output
- Detection priority: libtpuinfo > /dev/accel* > /dev/vfio > env vars
- No libtpuinfo.so installation required if tpu-info CLI is available
* fix: detect TPU via tpu-info CLI without requiring env vars
* fix: detect TPU without tpu-info CLI using libtpu.so presence
TPU detection priority for /dev/vfio devices:
1. tpu-info CLI (most reliable for version)
2. TPU_ACCELERATOR_TYPE env var
3. TPU_* env vars (TPU_CHIPS_PER_HOST_BOUNDS, etc.)
4. libtpu.so existence
If any of these is true, TPU is detected. Version may be "unknown"
if tpu-info and env vars are unavailable.
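The priority list above amounts to a first-match chain. The sketch below models the four probes as plain data so the ordering is easy to see; all type and field names are hypothetical and the probes themselves are left out.

```rust
/// Results of the individual probes, gathered elsewhere (hypothetical layout).
#[derive(Debug, Default)]
struct Probes {
    tpu_info_version: Option<String>,     // parsed from the tpu-info CLI
    accelerator_type_env: Option<String>, // TPU_ACCELERATOR_TYPE
    other_tpu_env_set: bool,              // e.g. TPU_CHIPS_PER_HOST_BOUNDS
    libtpu_found: bool,                   // libtpu.so exists on disk
}

#[derive(Debug, PartialEq, Eq)]
enum Evidence {
    TpuInfoCli(String),
    AcceleratorTypeEnv(String),
    OtherTpuEnv,
    LibtpuPresent,
}

/// Returns the highest-priority evidence, or None when no probe matched.
fn detect(p: &Probes) -> Option<Evidence> {
    if let Some(v) = &p.tpu_info_version {
        return Some(Evidence::TpuInfoCli(v.clone()));
    }
    if let Some(v) = &p.accelerator_type_env {
        return Some(Evidence::AcceleratorTypeEnv(v.clone()));
    }
    if p.other_tpu_env_set {
        return Some(Evidence::OtherTpuEnv);
    }
    p.libtpu_found.then_some(Evidence::LibtpuPresent)
}

/// Only the first two sources carry a version; the rest report "unknown".
fn display_version(e: &Evidence) -> &str {
    match e {
        Evidence::TpuInfoCli(v) | Evidence::AcceleratorTypeEnv(v) => v,
        Evidence::OtherTpuEnv | Evidence::LibtpuPresent => "unknown",
    }
}
```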
* fix: suppress unused code warnings in TPU readers
* feat: implement native Google TPU monitoring via Sysfs and basic PJRT support
* fix: add PCI scanning fallback for TPU detection
* fix: refine TPU detection and add dynamic library search
* fix: prioritize user-installed libtpu over system paths
* feat: add initial PJRT C API bindings and integration
* feat: implement singleton pattern for PJRT client
* feat: enable active PJRT metrics collection via C API bindings
* feat: move PJRT initialization to background thread to prevent UI blocking
* feat: add notification for TPU initialization status
* fix: disable unsafe PJRT client creation to prevent segfaults and report limited mode
* feat: revert to tpu-info dependency strategy and add installation notification
* feat: implement streaming tpu-info runner for background metrics collection
* feat: parse tpu-info output to populate GPU metrics
* feat: enhance tpu-info output parser with unit handling and debug logs
* fix: update TPU PCI ID mapping to detect v6e devices correctly
* feat: support per-device metrics in tpu-info runner
* feat: robust tpu-info table parser and accurate generation mapping
* fix: adjust tpu-info streaming rate to 2s for optimal performance
* fix: prolong initialization status visibility by delaying update
* feat: add debug logging for tpu-info raw output
* fix: update tpu-info command arguments to use multiple --metric flags
* fix: remove unsupported metrics memory_total and power_usage from tpu-info command
* fix: update tpu-info parser to handle rich tables and restore missing structs
* fix: switch to tpu-info default output mode for better compatibility
* Fix orphaned tpu-info process and silence stderr to prevent deadlocks
* Update TPU runner status only when metrics are successfully parsed
* Cache TPU accelerator type to prevent redundant tpu-info process spawning
* Limit Tokio worker threads to 4 to reduce system resource overhead
* Fix test compilation error by initializing tpu_notification_shown in AppState
* Add Google TPU metrics support to API mode and update documentation
* Fix TPU metrics not updating: strip ANSI codes, cache metadata, fetch VFIO metrics
* Enhance TPU output parsing: support ASCII tables and robust ANSI stripping
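For reference, stripping ANSI escape codes before table parsing can be done with a small state-free scanner. This sketch is not the parser from this commit; it handles only CSI sequences (ESC followed by `[`), which is an assumption about what the styled output contains, and `strip_ansi` is a hypothetical name.

```rust
/// Removes ANSI CSI sequences such as "\x1b[1;32m" from `input`.
/// A CSI sequence is ESC, '[', any parameter or intermediate bytes, and a
/// final byte in the range 0x40..=0x7e. Other escape types are left as-is.
fn strip_ansi(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\u{1b}' && chars.peek() == Some(&'[') {
            chars.next(); // consume '['
            // Skip everything up to and including the final byte.
            for f in chars.by_ref() {
                if ('\u{40}'..='\u{7e}').contains(&f) {
                    break;
                }
            }
        } else {
            out.push(c);
        }
    }
    out
}
```

Box-drawing characters used by rich tables pass through unchanged, so a table parser can run on the result.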
* Fix TPU metrics not captured: switch from streaming to polling mode
The tpu-info --streaming mode uses Rich's Live display with screen=True,
which writes to an alternate screen buffer and cannot be captured via
stdout pipes. This caused all-smi to receive no metric data.
Changes:
- tpu_info_runner.rs: Replace streaming mode with periodic polling
- Run tpu-info without --streaming flag every 2 seconds
- Use Command::output() to capture complete stdout
- Remove unused imports (BufRead, BufReader, Stdio, CommandExt)
- google_tpu.rs: Add dynamic TPU metric exports
- all_smi_tpu_utilization_percent (duty cycle)
- all_smi_tpu_memory_used_bytes
- all_smi_tpu_memory_total_bytes
- all_smi_tpu_memory_utilization_percent
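A sketch of how the four metrics listed above might be rendered, assuming a Prometheus-style text exposition. The metric names come from the commit message; the `index` label, the function names, and the exact formatting are hypothetical.

```rust
/// Memory utilization as a percentage; 0 when the total is unknown (zero).
fn memory_utilization_percent(used: u64, total: u64) -> f64 {
    if total == 0 {
        0.0
    } else {
        used as f64 / total as f64 * 100.0
    }
}

/// Renders one device's metrics, one `name{labels} value` line per metric.
fn export_tpu_metrics(index: u32, duty_cycle_percent: f64, used: u64, total: u64) -> String {
    // `index` is an illustrative label; the real label set may differ.
    let labels = format!("{{index=\"{index}\"}}");
    let mem_util = memory_utilization_percent(used, total);
    format!(
        "all_smi_tpu_utilization_percent{labels} {duty_cycle_percent}\n\
         all_smi_tpu_memory_used_bytes{labels} {used}\n\
         all_smi_tpu_memory_total_bytes{labels} {total}\n\
         all_smi_tpu_memory_utilization_percent{labels} {mem_util}\n"
    )
}
```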
* Add native gRPC client for TPU metrics with adaptive polling
Implement direct gRPC communication with libtpu runtime metrics server
(localhost:8431) to collect TPU metrics without CLI overhead.
Changes:
- Add tonic/prost dependencies for gRPC support
- Add proto/tpu_metric_service.proto for libtpu gRPC definitions
- Add build.rs for proto compilation
- Add src/device/readers/tpu_grpc.rs: native gRPC client
- Connects to libtpu metrics server when TPU workload is running
- Fetches memory usage, total memory, and duty cycle metrics
- Tracks connection state and notifies tpu_info_runner
- Update tpu_info_runner.rs: adaptive polling
- gRPC available: skip CLI execution (gRPC handles metrics)
- gRPC unavailable: poll CLI every 30s (was 2s)
- Cuts CLI invocations 15x (every 30s instead of 2s) when no workload is running
- Update google_tpu.rs: gRPC-first with CLI fallback
- Try gRPC for real-time metrics
- Fall back to CLI polling if gRPC unavailable
- Remove libtpuinfo.rs (unmaintained external dependency)
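The adaptive polling rule described above reduces to a small decision function. This is an illustrative sketch with hypothetical names; only the 2s and 30s intervals come from the commit message.

```rust
use std::time::Duration;

/// CLI polling interval before this change.
const OLD_CLI_INTERVAL: Duration = Duration::from_secs(2);
/// CLI polling interval when the gRPC metrics server is unreachable.
const IDLE_CLI_INTERVAL: Duration = Duration::from_secs(30);

/// Decides whether and when the CLI should be polled next.
/// None means "skip the CLI entirely, gRPC is supplying the metrics".
fn next_cli_poll(grpc_connected: bool) -> Option<Duration> {
    if grpc_connected {
        None
    } else {
        Some(IDLE_CLI_INTERVAL)
    }
}

/// How many times fewer CLI runs happen at the idle interval (30s / 2s).
fn idle_reduction_factor() -> u64 {
    IDLE_CLI_INTERVAL.as_secs() / OLD_CLI_INTERVAL.as_secs()
}
```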
* Add TensorCore utilization display for TPU devices
- Add tensorcore_utilization field to GpuInfo struct
- Display TC gauge bar for TPU (similar to ANE for Apple Silicon)
- Fix metric source: use CLI for tensorcore_util, gRPC for duty_cycle
- TensorCore util comes from libtpu SDK monitoring (not gRPC)
- Update all device readers with new tensorcore_utilization field
* Add HLO metrics display and clean up clippy warnings
- Add HLO Queue Size display in TPU view (shows 0 when unavailable)
- Add HLO Exec Mean metric storage in detail map
- Convert all debug eprintln! statements to tracing::debug!
- Fix all clippy warnings:
- uninlined_format_args
- dead_code (add #[allow] for reserved fields/methods)
- redundant_closure
- bind_instead_of_map
- new_without_default (add Default impl for TpuInfoRunner)
- collapsible_str_replace
- manual_range_contains
- collapsible_if
- enum_variant_names (suppress in build.rs for generated protobuf)
- Improve device_ordinal extraction to handle both String and Int attrs
- Update API documentation for HLO metrics
1 parent d63770f commit 753cb06
File tree
40 files changed (+4243, -55 lines)
- docs
- man
- proto
- src
- api/metrics
- npu
- device
- common
- readers
- metrics
- network
- ui
- renderers
- utils
- view/data_collection
- tests