Commit 4bdd492

liuchangyanhao022
authored and committed
docs: add MetaX GPU monitoring metrics
1 parent 3d4e206 commit 4bdd492

2 files changed: +78 -0 lines changed

docs/key-feature/gpu-metrics_en.md

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
---
title: GPU Metrics
type: docs
description:
author: HUATUO Team
date: 2026-02-25
weight: 4
---

Supported GPU Platforms (Current Version):
- MetaX

|Subsystem|Metric|Description|Unit|Dimensions|Source|
|---|---|---|---|---|---|
|gpu|metax_gpu_sdk_info|GPU SDK info.|-|version|sml.GetSDKVersion|
|gpu|metax_gpu_driver_info|GPU driver info.|-|version|sml.GetGPUVersion with driver unit|
|gpu|metax_gpu_info|GPU info.|-|gpu, model, uuid, bios_version, bdf, mode, die_count|sml.GetGPUInfo|
|gpu|metax_gpu_board_power_watts|GPU board power.|W|gpu|sml.ListGPUBoardWayElectricInfos|
|gpu|metax_gpu_pcie_link_speed_gt_per_second|GPU PCIe current link speed.|GT/s|gpu|sml.GetGPUPcieLinkInfo|
|gpu|metax_gpu_pcie_link_width_lanes|GPU PCIe current link width.|lanes|gpu|sml.GetGPUPcieLinkInfo|
|gpu|metax_gpu_pcie_receive_bytes_per_second|GPU PCIe receive throughput.|B/s|gpu|sml.GetGPUPcieThroughputInfo|
|gpu|metax_gpu_pcie_transmit_bytes_per_second|GPU PCIe transmit throughput.|B/s|gpu|sml.GetGPUPcieThroughputInfo|
|gpu|metax_gpu_metaxlink_link_speed_gt_per_second|GPU MetaXLink current link speed.|GT/s|gpu, metaxlink|sml.ListGPUMetaXLinkLinkInfos|
|gpu|metax_gpu_metaxlink_link_width_lanes|GPU MetaXLink current link width.|lanes|gpu, metaxlink|sml.ListGPUMetaXLinkLinkInfos|
|gpu|metax_gpu_metaxlink_receive_bytes_per_second|GPU MetaXLink receive throughput.|B/s|gpu, metaxlink|sml.ListGPUMetaXLinkThroughputInfos|
|gpu|metax_gpu_metaxlink_transmit_bytes_per_second|GPU MetaXLink transmit throughput.|B/s|gpu, metaxlink|sml.ListGPUMetaXLinkThroughputInfos|
|gpu|metax_gpu_metaxlink_receive_bytes_total|Total data received over GPU MetaXLink.|bytes|gpu, metaxlink|sml.ListGPUMetaXLinkTrafficStatInfos|
|gpu|metax_gpu_metaxlink_transmit_bytes_total|Total data transmitted over GPU MetaXLink.|bytes|gpu, metaxlink|sml.ListGPUMetaXLinkTrafficStatInfos|
|gpu|metax_gpu_metaxlink_aer_errors_total|GPU MetaXLink AER error count.|count|gpu, metaxlink, error_type|sml.ListGPUMetaXLinkAerErrorsInfos|
|gpu|metax_gpu_status|GPU status. 0 means normal; any other value indicates an abnormal state. See the documentation for the exception corresponding to each value.|-|gpu, die|sml.GetDieStatus|
|gpu|metax_gpu_temperature_celsius|GPU temperature.|°C|gpu, die|sml.GetDieTemperature|
|gpu|metax_gpu_utilization_percent|GPU utilization, ranging from 0 to 100.|%|gpu, die, ip|sml.GetDieUtilization|
|gpu|metax_gpu_memory_total_bytes|Total VRAM.|bytes|gpu, die|sml.GetDieMemoryInfo|
|gpu|metax_gpu_memory_used_bytes|Used VRAM.|bytes|gpu, die|sml.GetDieMemoryInfo|
|gpu|metax_gpu_clock_mhz|GPU clock.|MHz|gpu, die, ip|sml.ListDieClocks|
|gpu|metax_gpu_clocks_throttling|Reason(s) for GPU clock throttling.|-|gpu, die, reason|sml.GetDieClocksThrottleStatus|
|gpu|metax_gpu_dpm_performance_level|GPU DPM performance level.|-|gpu, die, ip|sml.GetDieDPMPerformanceLevel|
|gpu|metax_gpu_ecc_memory_errors_total|GPU ECC memory error count.|count|gpu, die, memory_type, error_type|sml.GetDieECCMemoryInfo|
|gpu|metax_gpu_ecc_memory_retired_pages_total|GPU ECC memory retired page count.|count|gpu, die|sml.GetDieECCMemoryInfo|
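
As an illustration of how the `gpu` and `die` dimensions can be used, the following sketch combines `metax_gpu_memory_used_bytes` and `metax_gpu_memory_total_bytes` into a per-die VRAM usage ratio. It assumes the metrics are exposed in the Prometheus text exposition format, which the table does not state explicitly. The sample values and the helper names `parse_samples` and `memory_usage_ratio` are hypothetical.

```python
import re

# Hypothetical sample, assuming Prometheus text exposition format.
# Label values and byte counts are made up for illustration.
SAMPLE = """\
metax_gpu_memory_total_bytes{gpu="0",die="0"} 68719476736
metax_gpu_memory_used_bytes{gpu="0",die="0"} 17179869184
metax_gpu_memory_total_bytes{gpu="1",die="0"} 68719476736
metax_gpu_memory_used_bytes{gpu="1",die="0"} 51539607552
"""

LINE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse labelled sample lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def memory_usage_ratio(samples):
    """Return {(gpu, die): used / total} for every die with both metrics."""
    total, used = {}, {}
    for name, labels, value in samples:
        key = (labels.get("gpu"), labels.get("die"))
        if name == "metax_gpu_memory_total_bytes":
            total[key] = value
        elif name == "metax_gpu_memory_used_bytes":
            used[key] = value
    # Skip dies whose total is missing or zero to avoid division errors.
    return {key: used[key] / total[key] for key in used if total.get(key)}


if __name__ == "__main__":
    ratios = memory_usage_ratio(parse_samples(SAMPLE))
    for (gpu, die), ratio in sorted(ratios.items()):
        print(f"gpu={gpu} die={die} vram_used={ratio:.1%}")
```

The regular expressions cover only simple labelled samples. A real consumer would more likely rely on an existing Prometheus client or query layer.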

docs/key-feature/gpu-metrics_zh.md

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
---
title: GPU Metrics Description
type: docs
description:
author: HUATUO Team
date: 2026-02-25
weight: 4
---

GPU platforms supported in the current version:
- MetaX

|Subsystem|Metric|Description|Unit|Dimensions|Source|
|---|---|---|---|---|---|
|gpu|metax_gpu_sdk_info|GPU SDK information|-|version|sml.GetSDKVersion|
|gpu|metax_gpu_driver_info|GPU driver information|-|version|sml.GetGPUVersion with driver unit|
|gpu|metax_gpu_info|Basic GPU information|-|gpu, model, uuid, bios_version, bdf, mode, die_count|sml.GetGPUInfo|
|gpu|metax_gpu_board_power_watts|GPU board-level power|watts (W)|gpu|sml.ListGPUBoardWayElectricInfos|
|gpu|metax_gpu_pcie_link_speed_gt_per_second|GPU PCIe current link speed|gigatransfers per second (GT/s)|gpu|sml.GetGPUPcieLinkInfo|
|gpu|metax_gpu_pcie_link_width_lanes|GPU PCIe current link width|link width (number of lanes)|gpu|sml.GetGPUPcieLinkInfo|
|gpu|metax_gpu_pcie_receive_bytes_per_second|GPU PCIe receive throughput|bytes/s|gpu|sml.GetGPUPcieThroughputInfo|
|gpu|metax_gpu_pcie_transmit_bytes_per_second|GPU PCIe transmit throughput|bytes/s|gpu|sml.GetGPUPcieThroughputInfo|
|gpu|metax_gpu_metaxlink_link_speed_gt_per_second|GPU MetaXLink current link speed|gigatransfers per second (GT/s)|gpu, metaxlink|sml.ListGPUMetaXLinkLinkInfos|
|gpu|metax_gpu_metaxlink_link_width_lanes|GPU MetaXLink current link width|link width (number of lanes)|gpu, metaxlink|sml.ListGPUMetaXLinkLinkInfos|
|gpu|metax_gpu_metaxlink_receive_bytes_per_second|GPU MetaXLink receive throughput|bytes/s|gpu, metaxlink|sml.ListGPUMetaXLinkThroughputInfos|
|gpu|metax_gpu_metaxlink_transmit_bytes_per_second|GPU MetaXLink transmit throughput|bytes/s|gpu, metaxlink|sml.ListGPUMetaXLinkThroughputInfos|
|gpu|metax_gpu_metaxlink_receive_bytes_total|Total data received over GPU MetaXLink|bytes|gpu, metaxlink|sml.ListGPUMetaXLinkTrafficStatInfos|
|gpu|metax_gpu_metaxlink_transmit_bytes_total|Total data transmitted over GPU MetaXLink|bytes|gpu, metaxlink|sml.ListGPUMetaXLinkTrafficStatInfos|
|gpu|metax_gpu_metaxlink_aer_errors_total|GPU MetaXLink AER error count|count|gpu, metaxlink, error_type|sml.ListGPUMetaXLinkAerErrorsInfos|
|gpu|metax_gpu_status|GPU status (0 means normal; any other value indicates an abnormal state; see the documentation for the specific meaning)|-|gpu, die|sml.GetDieStatus|
|gpu|metax_gpu_temperature_celsius|GPU temperature|degrees Celsius|gpu, die|sml.GetDieTemperature|
|gpu|metax_gpu_utilization_percent|GPU utilization (0 to 100)|%|gpu, die, ip|sml.GetDieUtilization|
|gpu|metax_gpu_memory_total_bytes|Total VRAM capacity|bytes|gpu, die|sml.GetDieMemoryInfo|
|gpu|metax_gpu_memory_used_bytes|Used VRAM capacity|bytes|gpu, die|sml.GetDieMemoryInfo|
|gpu|metax_gpu_clock_mhz|GPU clock frequency|megahertz (MHz)|gpu, die, ip|sml.ListDieClocks|
|gpu|metax_gpu_clocks_throttling|Reason for GPU clock throttling|-|gpu, die, reason|sml.GetDieClocksThrottleStatus|
|gpu|metax_gpu_dpm_performance_level|GPU DPM performance level|-|gpu, die, ip|sml.GetDieDPMPerformanceLevel|
|gpu|metax_gpu_ecc_memory_errors_total|GPU ECC memory error count|count|gpu, die, memory_type, error_type|sml.GetDieECCMemoryInfo|
|gpu|metax_gpu_ecc_memory_retired_pages_total|GPU ECC memory retired page count|count|gpu, die|sml.GetDieECCMemoryInfo|
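
The `_total` metrics in the table read as cumulative values, so consumers usually look at the change between two scrapes instead of the raw number. The sketch below derives an average rate from two samples of a metric such as `metax_gpu_metaxlink_receive_bytes_total`. It assumes the value only increases and restarts from zero after a reset, which is the usual Prometheus counter convention and is not confirmed by the table. The function name `counter_rate` is hypothetical.

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Average per-second increase of a cumulative counter between two samples.

    Timestamps are in seconds. Assumes the counter restarts from zero on reset.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # A drop is treated as a counter reset: count only what has
        # accumulated since the restart.
        delta = curr_value
    return delta / elapsed


if __name__ == "__main__":
    # Two hypothetical scrapes taken 15 seconds apart.
    rate = counter_rate(1_000, 0.0, 4_000, 15.0)
    print(f"average receive rate: {rate:.0f} B/s")
```

The same arithmetic applies to the error and retired-page counts, where the change over a window is typically what gets alerted on.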
