Commit 753cb06

feat: add comprehensive Google TPU monitoring support (#79)
* feat: add Google TPU support via libtpu

  Add support for monitoring Google Cloud TPU (Tensor Processing Unit) accelerators. This implementation follows the existing device reader patterns used for other NPU/accelerator devices.

  Features:
  - GoogleTpuReader implementing the GpuReader trait
  - Support for all TPU generations (v2, v3, v4, v5e, v5p, v6 Trillium, v7 Ironwood)
  - TPU device detection via /dev/accel*, sysfs vendor ID, and libtpu.so
  - Memory size specifications for each TPU generation
  - Python/JAX integration for device enumeration (Option B from the issue)
  - Unit tests with mocked TPU data

  Closes #75

* docs: add Google Cloud TPU support to README

  - Added TPU to the list of supported accelerators in the description
  - Added TPU to the platform-specific features section
  - Added TPU to the cross-platform support section
  - Added TPU to the mock server and API metrics sections

* fix(google_tpu): improve security and code quality in TPU reader

  HIGH priority fixes:
  - Remove unsafe fallback device detection without vendor verification (H1). Previously the reader returned true for any /dev/accel* device without a vendor check, which could misidentify Intel Gaudi devices as TPUs. Detection now requires positive verification of the Google vendor ID (0x1ae0).
  - Add JSON schema validation for Python script output (H2). Validates the utilization range (0-100%), memory consistency, non-negative power values, and temperature range (0-200°C) to reject malformed data.

  MEDIUM priority fixes:
  - Remove the duplicated LIBTPU_PATHS constant (M3); now uses the constant from the device::common::constants::google_tpu module.
  - Replace a blocking mutex lock with try_lock (M4) to prevent potential deadlocks in is_tpu_script_available() during concurrent initialization.
  - Replace bare `except:` with `except Exception:` in Python code (M5) to avoid catching system exceptions like KeyboardInterrupt.

  The race condition between TPU and Gaudi detection (M6) is already handled in platform_detection.rs with vendor ID verification.
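The vendor-verification fix (H1) can be sketched in Rust roughly as follows. The constant, helper names, and sysfs path handling are illustrative assumptions for this commit's description, not the actual all-smi code:

```rust
use std::fs;
use std::path::Path;

/// Google's PCI vendor ID (0x1ae0), per the commit message.
const GOOGLE_VENDOR_ID: u16 = 0x1ae0;

/// Parse the contents of a sysfs `vendor` file, e.g. "0x1ae0\n".
fn parse_vendor_id(contents: &str) -> Option<u16> {
    let hex = contents.trim().trim_start_matches("0x");
    u16::from_str_radix(hex, 16).ok()
}

/// Positive verification: report a TPU only when the vendor file exists
/// and matches Google's vendor ID. A missing or unreadable file is no
/// longer treated as "assume it is a TPU".
fn is_google_accel_device(vendor_file: &Path) -> bool {
    fs::read_to_string(vendor_file)
        .ok()
        .and_then(|s| parse_vendor_id(&s))
        .map(|id| id == GOOGLE_VENDOR_ID)
        .unwrap_or(false)
}
```

The key behavioral change is the final `unwrap_or(false)`: every failure mode (no file, bad contents, wrong vendor) resolves to "not a TPU" instead of falling through to a permissive default.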
  All changes verified with cargo build, cargo clippy, and a release build.

* fix: use centralized LIBTPU_PATHS constant in platform_detection

  - Remove the duplicate LIBTPU_PATHS constant definition
  - Import from the device::common::constants::google_tpu module
  - Add cfg(target_os = "linux") to match the function's scope

* feat: add dynamic libtpu path detection for Python environments

  Extend libtpu library detection to search in Python environments:
  - $HOME/.local/lib/python*/site-packages/libtpu/libtpu.so
  - Virtual environments (VIRTUAL_ENV/lib/python*/site-packages)
  - Conda/mamba environments (anaconda3, miniconda3, mambaforge, miniforge3)
  - System Python site-packages (/usr/lib/python*, /usr/local/lib/python*)

  This allows detection of newer libtpu versions installed via pip or conda that may not be in the standard system library paths. Adds a find_libtpu_paths() function to enumerate all libtpu locations and is_libtpu_available() for a quick availability check.

* add: TPU behavior test code

* fix: improve TPU detection and suppress JAX logging output

  Issues fixed:
  1. JAX stdout/stderr pollution causing TUI screen corruption
     - Add environment variables to suppress TensorFlow/JAX logging
     - Redirect stderr to /dev/null during JAX import
     - Suppress warnings and GRPC/ABSL logging
  2. TPU v6e detection not working (no /dev/accel* devices)
     - Add TPU VM environment variable detection (TPU_NAME, TPU_CHIPS_PER_HOST_BOUNDS, TPU_ACCELERATOR_TYPE, etc.)
     - Add a libtpu + TPU worker environment check
     - Add a GCE metadata server check for TPU VMs
     - Support Cloud TPU VMs where TPU access is via gRPC, not /dev/accel
  3. False-positive TPU detection when only libtpu.so is installed
     - Remove the standalone libtpu check from has_google_tpu()
     - Require actual TPU hardware or TPU VM environment indicators

* fix: remove Python/JAX from TPU detection to prevent TUI corruption

  Root cause: Python with the JAX import takes 10+ seconds to initialize.
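The Python-environment search can be sketched as below. To keep the logic testable, the search roots (which in the real code come from expanding $HOME/.local, VIRTUAL_ENV, conda prefixes, and system site-packages globs) are passed in explicitly; the function signatures are assumptions based on the commit text:

```rust
use std::path::PathBuf;

/// Return every `libtpu.so` found under the given search roots.
/// The real code builds these roots by expanding globs such as
/// `$HOME/.local/lib/python*/site-packages/libtpu/`.
fn find_libtpu_paths(search_roots: &[PathBuf]) -> Vec<PathBuf> {
    search_roots
        .iter()
        .map(|root| root.join("libtpu.so"))
        .filter(|p| p.is_file())
        .collect()
}

/// Quick availability check: does any root contain a libtpu.so?
fn is_libtpu_available(search_roots: &[PathBuf]) -> bool {
    !find_libtpu_paths(search_roots).is_empty()
}
```

Keeping enumeration (`find_libtpu_paths`) separate from the boolean check means callers that need the concrete path for dynamic loading and callers that only gate detection share one scan.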
  The command timeout (2s) triggers, but the orphaned Python process continues running and writes to stdout/stderr, corrupting the TUI.

  Solution: replace the Python-based detection with a pure Rust implementation:
  - Use sysfs (/dev/accel*, /sys/class/accel/) for on-premise TPUs
  - Use environment variables for Cloud TPU VMs (v6e, etc.): TPU_NAME, TPU_ACCELERATOR_TYPE, TPU_CHIPS_PER_HOST_BOUNDS, TPU_WORKER_ID, TPU_WORKER_HOSTNAMES
  - Parse the accelerator type string to detect the TPU version
  - Detect the chip count from TPU_CHIPS_PER_HOST_BOUNDS

  This eliminates all external process execution in the TPU reader, preventing any possibility of output pollution.

* fix: add TPU v6e support and eliminate process accumulation

  - Add a V6e variant to the TpuGeneration enum with 16GB HBM memory
  - Fix parse_accelerator_type() to detect v6e before v6
  - Remove lspci/curl calls from has_google_tpu() to prevent process accumulation
  - TPU v6e now displays as "Google TPU v6e" with the correct memory size

* fix: properly kill child processes on command timeout

  The previous implementation spawned a detached thread to run commands, which caused process accumulation on timeout: the child process continued running even after the timeout. The new version uses spawn() + try_wait() polling with a proper kill() and wait() on timeout, ensuring child processes are terminated and reaped.
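A sketch of the version and chip-count parsing described above. The match ordering mirrors the commit text ("detect v6e before v6"); the exact accelerator-type format (e.g. "v6e-8") and the comma-separated shape of TPU_CHIPS_PER_HOST_BOUNDS (e.g. "2,2,1") are assumptions for illustration:

```rust
/// TPU generations named in the commit message.
#[derive(Debug, PartialEq)]
enum TpuGeneration { V2, V3, V4, V5e, V5p, V6e, V6, V7, Unknown }

/// Parse an accelerator-type string such as "v6e-8" or "v5p-128".
/// Three-character versions are checked before two-character prefixes,
/// so "v6e-8" is not misread as plain v6.
fn parse_accelerator_type(s: &str) -> TpuGeneration {
    let s = s.to_ascii_lowercase();
    if s.starts_with("v5e") { TpuGeneration::V5e }
    else if s.starts_with("v5p") { TpuGeneration::V5p }
    else if s.starts_with("v6e") { TpuGeneration::V6e }
    else if s.starts_with("v2") { TpuGeneration::V2 }
    else if s.starts_with("v3") { TpuGeneration::V3 }
    else if s.starts_with("v4") { TpuGeneration::V4 }
    else if s.starts_with("v6") { TpuGeneration::V6 }
    else if s.starts_with("v7") { TpuGeneration::V7 }
    else { TpuGeneration::Unknown }
}

/// Chip count from a bounds string like "2,2,1": the product of the axes.
fn chips_from_host_bounds(bounds: &str) -> Option<u32> {
    let mut product = 1u32;
    for part in bounds.split(',') {
        product *= part.trim().parse::<u32>().ok()?;
    }
    Some(product)
}
```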
* feat: add libtpuinfo FFI bindings for real TPU metrics

  - Add a libtpuinfo.rs module with Rust FFI bindings to libtpuinfo.so
  - Dynamically load libtpuinfo.so from known paths
  - Integrate real TPU metrics (memory usage, duty cycle) into GoogleTpuReader
  - Priority: libtpuinfo > sysfs detection > environment variables
  - Add the libloading dependency for Linux

  libtpuinfo provides:
  - Device count and IDs
  - Memory usage and total memory per device
  - Duty cycle percentage (utilization)

* fix: resolve libtpuinfo FFI borrow checker error

* feat: add /dev/vfio detection and tpu-info CLI parsing

  - Add detect_tpu_from_vfio() for v6e and newer TPUs using /dev/vfio/*
  - Add get_accelerator_type_from_tpu_info() to parse `tpu-info -v` output
  - Detection priority: libtpuinfo > /dev/accel* > /dev/vfio > env vars
  - No libtpuinfo.so installation is required if the tpu-info CLI is available

* fix: detect TPU via tpu-info CLI without requiring env vars

* fix: detect TPU without the tpu-info CLI using libtpu.so presence

  TPU detection priority for /dev/vfio devices:
  1. tpu-info CLI (most reliable for the version)
  2. TPU_ACCELERATOR_TYPE env var
  3. TPU_* env vars (TPU_CHIPS_PER_HOST_BOUNDS, etc.)
  4. libtpu.so existence

  If any of these holds, a TPU is detected. The version may be "unknown" when tpu-info and the env vars are unavailable.
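The four-step priority list above amounts to a first-match-wins chain. A minimal sketch, with probe closures standing in for the real tpu-info, env-var, and libtpu checks (the function name and shape are hypothetical):

```rust
/// First-match-wins detection: each probe returns Some(version) when its
/// source is available. The lowest-priority probe may yield "unknown"
/// (libtpu.so present but no version information).
fn detect_tpu_version(probes: &[&dyn Fn() -> Option<String>]) -> Option<String> {
    probes.iter().find_map(|probe| probe())
}
```

A `None` overall result means no TPU was detected at all; `Some("unknown")` means hardware indicators were present but no probe could name the generation.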
* fix: suppress unused code warnings in TPU readers
* feat: implement native Google TPU monitoring via sysfs and basic PJRT support
* fix: add PCI scanning fallback for TPU detection
* fix: refine TPU detection and add dynamic library search
* fix: prioritize user-installed libtpu over system paths
* feat: add initial PJRT C API bindings and integration
* feat: implement singleton pattern for the PJRT client
* feat: enable active PJRT metrics collection via C API bindings
* feat: move PJRT initialization to a background thread to prevent UI blocking
* feat: add notification for TPU initialization status
* fix: disable unsafe PJRT client creation to prevent segfaults and report limited mode
* feat: revert to the tpu-info dependency strategy and add an installation notification
* feat: implement streaming tpu-info runner for background metrics collection
* feat: parse tpu-info output to populate GPU metrics
* feat: enhance the tpu-info output parser with unit handling and debug logs
* fix: update TPU PCI ID mapping to detect v6e devices correctly
* feat: support per-device metrics in the tpu-info runner
* feat: robust tpu-info table parser and accurate generation mapping
* fix: adjust the tpu-info streaming rate to 2s for optimal performance
* fix: prolong initialization status visibility by delaying the update
* feat: add debug logging for tpu-info raw output
* fix: update tpu-info command arguments to use multiple --metric flags
* fix: remove unsupported metrics memory_total and power_usage from the tpu-info command
* fix: update the tpu-info parser to handle Rich tables and restore missing structs
* fix: switch to tpu-info default output mode for better compatibility
* Fix orphaned tpu-info process and silence stderr to prevent deadlocks
* Update TPU runner status only when metrics are successfully parsed
* Cache the TPU accelerator type to prevent redundant tpu-info process spawning
* Limit Tokio worker threads to 4 to reduce system resource overhead
* Fix test compilation error by initializing tpu_notification_shown in AppState
* Add Google TPU metrics support to API mode and update documentation
* Fix TPU metrics not updating: strip ANSI codes, cache metadata, fetch VFIO metrics
* Enhance TPU output parsing: support ASCII tables and robust ANSI stripping
* Fix TPU metrics not captured: switch from streaming to polling mode

  The tpu-info --streaming mode uses Rich's Live display with screen=True, which writes to an alternate screen buffer and cannot be captured via stdout pipes. As a result, all-smi received no metric data.

  Changes:
  - tpu_info_runner.rs: replace streaming mode with periodic polling
    - Run tpu-info without the --streaming flag every 2 seconds
    - Use Command::output() to capture the complete stdout
    - Remove unused imports (BufRead, BufReader, Stdio, CommandExt)
  - google_tpu.rs: add dynamic TPU metric exports
    - all_smi_tpu_utilization_percent (duty cycle)
    - all_smi_tpu_memory_used_bytes
    - all_smi_tpu_memory_total_bytes
    - all_smi_tpu_memory_utilization_percent

* Add native gRPC client for TPU metrics with adaptive polling

  Implement direct gRPC communication with the libtpu runtime metrics server (localhost:8431) to collect TPU metrics without CLI overhead.
  Changes:
  - Add tonic/prost dependencies for gRPC support
  - Add proto/tpu_metric_service.proto for the libtpu gRPC definitions
  - Add build.rs for proto compilation
  - Add src/device/readers/tpu_grpc.rs: a native gRPC client
    - Connects to the libtpu metrics server when a TPU workload is running
    - Fetches memory usage, total memory, and duty cycle metrics
    - Tracks connection state and notifies tpu_info_runner
  - Update tpu_info_runner.rs: adaptive polling
    - gRPC available: skip CLI execution (gRPC handles metrics)
    - gRPC unavailable: poll the CLI every 30s (was 2s)
    - Reduces system overhead 15x when no workload is running
  - Update google_tpu.rs: gRPC-first with CLI fallback
    - Try gRPC for real-time metrics
    - Fall back to CLI polling if gRPC is unavailable
  - Remove libtpuinfo.rs (unmaintained external dependency)

* Add TensorCore utilization display for TPU devices

  - Add a tensorcore_utilization field to the GpuInfo struct
  - Display a TC gauge bar for TPUs (similar to ANE for Apple Silicon)
  - Fix the metric source: use the CLI for tensorcore_util and gRPC for duty_cycle; TensorCore utilization comes from libtpu SDK monitoring, not gRPC
  - Update all device readers with the new tensorcore_utilization field

* Add HLO metrics display and clean up clippy warnings

  - Add an HLO Queue Size display in the TPU view (shows 0 when unavailable)
  - Add HLO Exec Mean metric storage in the detail map
  - Convert all debug eprintln! statements to tracing::debug!
  - Fix all clippy warnings:
    - uninlined_format_args
    - dead_code (add #[allow] for reserved fields/methods)
    - redundant_closure
    - bind_instead_of_map
    - new_without_default (add a Default impl for TpuInfoRunner)
    - collapsible_str_replace
    - manual_range_contains
    - collapsible_if
    - enum_variant_names (suppressed in build.rs for generated protobuf code)
  - Improve device_ordinal extraction to handle both String and Int attrs
  - Update the API documentation for the HLO metrics
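Two of the fixes above can be sketched together: stripping ANSI escape codes so the Rich/ASCII table parser sees plain text, and choosing the CLI poll interval based on gRPC availability. Both functions are illustrative sketches under stated assumptions, not the actual all-smi implementation (which may use a crate for ANSI handling):

```rust
use std::time::Duration;

/// Remove ANSI CSI escape sequences (colors, cursor movement) so the
/// table parser sees plain text. A CSI sequence is ESC '[' followed by
/// parameter bytes, terminated by a final byte in 0x40..=0x7E.
fn strip_ansi(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\x1b' {
            if chars.peek() == Some(&'[') {
                chars.next(); // consume '['
                for c2 in chars.by_ref() {
                    // Stop at the sequence's final byte.
                    if ('\x40'..='\x7e').contains(&c2) { break; }
                }
            }
            // Bare ESC (or ESC + non-CSI) is simply dropped in this sketch.
        } else {
            out.push(c);
        }
    }
    out
}

/// Adaptive polling: when the gRPC metrics client is connected the CLI
/// poll is skipped entirely (None); otherwise fall back to a slow 30s
/// CLI poll. The pre-gRPC code polled every 2 seconds unconditionally.
fn cli_poll_interval(grpc_connected: bool) -> Option<Duration> {
    if grpc_connected { None } else { Some(Duration::from_secs(30)) }
}
```

For example, `strip_ansi("\x1b[1;32mTPU\x1b[0m v6e")` yields `"TPU v6e"`, which the column-oriented table parser can then split safely.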
1 parent d63770f commit 753cb06

40 files changed, +4243 −55 lines changed

API.md

Lines changed: 56 additions & 0 deletions
@@ -284,6 +284,42 @@ Note: Furiosa NPUs use the RNGD architecture with 8 cores per NPU. Each core con
 
 Note: Intel Gaudi NPUs (Gaudi 1/2/3) are monitored via the `hl-smi` command-line tool running as a background process. Device names are automatically mapped from internal identifiers (e.g., HL-325L) to human-friendly names (e.g., Intel Gaudi 3 PCIe LP). The tool supports various form factors including PCIe, OAM, UBB, and HLS variants.
 
+### Google TPU Metrics
+
+#### Basic NPU Metrics
+
+| Metric                                | Description                | Unit    | Labels                                    |
+|---------------------------------------|----------------------------|---------|-------------------------------------------|
+| `all_smi_gpu_utilization`             | TPU utilization percentage | percent | `gpu_index`, `gpu_name`                   |
+| `all_smi_gpu_memory_used_bytes`       | TPU memory used            | bytes   | `gpu_index`, `gpu_name`                   |
+| `all_smi_gpu_memory_total_bytes`      | TPU memory total           | bytes   | `gpu_index`, `gpu_name`                   |
+| `all_smi_gpu_temperature_celsius`     | TPU temperature            | celsius | `gpu_index`, `gpu_name`                   |
+| `all_smi_gpu_power_consumption_watts` | TPU power consumption      | watts   | `gpu_index`, `gpu_name`                   |
+| `all_smi_gpu_frequency_mhz`           | TPU clock frequency        | MHz     | `gpu_index`, `gpu_name`                   |
+| `all_smi_gpu_info`                    | TPU device information     | info    | `gpu_index`, `gpu_name`, `driver_version` |
+
+#### TPU-Specific Metrics
+
+| Metric                                   | Description                           | Unit    | Labels                                        |
+|------------------------------------------|---------------------------------------|---------|-----------------------------------------------|
+| `all_smi_tpu_utilization_percent`        | TPU duty cycle utilization            | percent | `npu`, `instance`, `uuid`, `index`            |
+| `all_smi_tpu_memory_used_bytes`          | TPU HBM memory used                   | bytes   | `npu`, `instance`, `uuid`, `index`            |
+| `all_smi_tpu_memory_total_bytes`         | TPU HBM memory total                  | bytes   | `npu`, `instance`, `uuid`, `index`            |
+| `all_smi_tpu_memory_utilization_percent` | TPU HBM memory utilization percentage | percent | `npu`, `instance`, `uuid`, `index`            |
+| `all_smi_tpu_chip_version_info`          | TPU chip version information          | info    | `npu`, `instance`, `uuid`, `index`, `version` |
+| `all_smi_tpu_accelerator_type_info`      | TPU accelerator type information      | info    | `npu`, `instance`, `uuid`, `index`, `type`    |
+| `all_smi_tpu_core_count`                 | Number of TPU cores                   | gauge   | `npu`, `instance`, `uuid`, `index`            |
+| `all_smi_tpu_tensorcore_count`           | Number of TensorCores per chip        | gauge   | `npu`, `instance`, `uuid`, `index`            |
+| `all_smi_tpu_memory_type_info`           | TPU memory type (HBM2/HBM2e/HBM3e)    | info    | `npu`, `instance`, `uuid`, `index`, `type`    |
+| `all_smi_tpu_runtime_version_info`       | TPU runtime/library version           | info    | `npu`, `instance`, `uuid`, `index`, `version` |
+| `all_smi_tpu_power_max_watts`            | TPU maximum power limit               | watts   | `npu`, `instance`, `uuid`, `index`            |
+| `all_smi_tpu_hlo_queue_size`             | Number of pending HLO programs        | gauge   | `npu`, `instance`, `uuid`, `index`            |
+| `all_smi_tpu_hlo_exec_mean_microseconds` | HLO execution timing (mean)           | µs      | `npu`, `instance`, `uuid`, `index`            |
+| `all_smi_tpu_hlo_exec_p50_microseconds`  | HLO execution timing (P50)            | µs      | `npu`, `instance`, `uuid`, `index`            |
+| `all_smi_tpu_hlo_exec_p90_microseconds`  | HLO execution timing (P90)            | µs      | `npu`, `instance`, `uuid`, `index`            |
+| `all_smi_tpu_hlo_exec_p95_microseconds`  | HLO execution timing (P95)            | µs      | `npu`, `instance`, `uuid`, `index`            |
+| `all_smi_tpu_hlo_exec_p999_microseconds` | HLO execution timing (P99.9)          | µs      | `npu`, `instance`, `uuid`, `index`            |
+
+Note: Google Cloud TPUs (v2-v7/Ironwood) are monitored via the `tpu-info` command-line tool, polled periodically by a background runner. Metrics include duty cycle utilization, HBM memory tracking, and chip configuration details.
+
 ### CPU Metrics (All Platforms)
 
 | Metric | Description | Unit | Labels |
@@ -370,6 +406,7 @@ Runtime environment metrics are detected at startup and provide information abou
 | Linux + Tenstorrent | ✓ Full*** | ✓ Full | ✓ Full | ✗ N/A**** |
 | Linux + Rebellions | ✓ Full | ✓ Full | ✓ Full | ✗ N/A***** |
 | Linux + Furiosa | ✓ Full | ✓ Full | ✓ Full | ✗ N/A****** |
+| Linux + Google TPU | ✓ Full | ✓ Full | ✓ Full | ✗ N/A******** |
 | macOS + Apple Silicon | ✓ Partial* | ✓ Enhanced** | ✓ Full | ✓ Basic |
 | NVIDIA Jetson | ✓ Full + DLA | ✓ Full | ✓ Full | ✓ Full |

@@ -380,6 +417,7 @@ Runtime environment metrics are detected at startup and provide information abou
 *****Rebellions NPUs do not expose per-process GPU usage information
 ******Furiosa NPUs do not expose per-process GPU usage information
 *******Intel Gaudi NPUs do not expose per-process GPU usage information via hl-smi
+********Google Cloud TPUs do not expose per-process GPU usage information via tpu-info
 
 ## Example Prometheus Queries

@@ -510,6 +548,24 @@ count by (internal_name) (all_smi_gaudi_internal_name_info)
 count by (version) (all_smi_gaudi_driver_info) > 1
 ```
 
+### Google TPU Specific
+```promql
+# TPU utilization across all chips
+avg(all_smi_tpu_utilization_percent)
+
+# HBM memory utilization percentage
+all_smi_tpu_memory_utilization_percent
+
+# Count TPUs by accelerator type
+count by (type) (all_smi_tpu_accelerator_type_info)
+
+# Monitor HLO queue size
+all_smi_tpu_hlo_queue_size > 5
+
+# Alert on high HLO execution latency
+all_smi_tpu_hlo_exec_p90_microseconds > 1000000
+```
+
 ### Process Monitoring
 ```promql
 # Top 5 GPU memory consumers
