Commit 46904ee

Merge branch 'development' into alex_mod
2 parents 41e0a24 + 20d5fc6 commit 46904ee

22 files changed, +990 -55 lines changed

docs/PLUGIN_DOC.md

Lines changed: 9 additions & 5 deletions
@@ -17,8 +17,8 @@
 | MemoryPlugin | free -b<br>/usr/bin/lsmem<br>wmic OS get FreePhysicalMemory /Value; wmic ComputerSystem get TotalPhysicalMemory /Value | - | [MemoryDataModel](#MemoryDataModel-Model) | [MemoryCollector](#Collector-Class-MemoryCollector) | [MemoryAnalyzer](#Data-Analyzer-Class-MemoryAnalyzer) |
 | NvmePlugin | nvme smart-log {dev}<br>nvme error-log {dev} --log-entries=256<br>nvme id-ctrl {dev}<br>nvme id-ns {dev}{ns}<br>nvme fw-log {dev}<br>nvme self-test-log {dev}<br>nvme get-log {dev} --log-id=6 --log-len=512<br>nvme telemetry-log {dev} --output-file={dev}_{f_name} | - | [NvmeDataModel](#NvmeDataModel-Model) | [NvmeCollector](#Collector-Class-NvmeCollector) | - |
 | OsPlugin | sh -c '( lsb_release -ds \|\| (cat /etc/*release \| grep PRETTY_NAME) \|\| uname -om ) 2>/dev/null \| head -n1'<br>cat /etc/*release \| grep VERSION_ID<br>wmic os get Version /value<br>wmic os get Caption /Value | **Analyzer Args:**<br>- `exp_os`: Union[str, list]<br>- `exact_match`: bool | [OsDataModel](#OsDataModel-Model) | [OsCollector](#Collector-Class-OsCollector) | [OsAnalyzer](#Data-Analyzer-Class-OsAnalyzer) |
-| PackagePlugin | dnf list --installed<br>dpkg-query -W<br>pacman -Q<br>cat /etc/*release<br>wmic product get name,version | **Analyzer Args:**<br>- `exp_package_ver`: Dict[str, Optional[str]]<br>- `regex_match`: bool | [PackageDataModel](#PackageDataModel-Model) | [PackageCollector](#Collector-Class-PackageCollector) | [PackageAnalyzer](#Data-Analyzer-Class-PackageAnalyzer) |
-| PciePlugin | lspci -d {vendor_id}: -nn<br>lspci -x<br>lspci -xxxx<br>lspci -PP<br>lspci -PP -d {vendor_id}:{dev_id}<br>lspci -vt<br>lspci -vvv | **Analyzer Args:**<br>- `exp_speed`: int<br>- `exp_width`: int<br>- `exp_sriov_count`: int<br>- `exp_gpu_count_override`: Optional[int]<br>- `exp_max_payload_size`: Union[Dict[int, int], int, NoneType]<br>- `exp_max_rd_req_size`: Union[Dict[int, int], int, NoneType]<br>- `exp_ten_bit_tag_req_en`: Union[Dict[int, int], int, NoneType] | [PcieDataModel](#PcieDataModel-Model) | [PcieCollector](#Collector-Class-PcieCollector) | [PcieAnalyzer](#Data-Analyzer-Class-PcieAnalyzer) |
+| PackagePlugin | dnf list --installed<br>dpkg-query -W<br>pacman -Q<br>cat /etc/*release<br>wmic product get name,version | **Analyzer Args:**<br>- `exp_package_ver`: Dict[str, Optional[str]]<br>- `regex_match`: bool<br>- `rocm_regex`: Optional[str]<br>- `enable_rocm_regex`: bool | [PackageDataModel](#PackageDataModel-Model) | [PackageCollector](#Collector-Class-PackageCollector) | [PackageAnalyzer](#Data-Analyzer-Class-PackageAnalyzer) |
+| PciePlugin | lspci -d {vendor_id}: -nn<br>lspci -x<br>lspci -xxxx<br>lspci -PP<br>lspci -PP -d {vendor_id}:{dev_id}<br>lspci -vvv<br>lspci -vvvt | **Analyzer Args:**<br>- `exp_speed`: int<br>- `exp_width`: int<br>- `exp_sriov_count`: int<br>- `exp_gpu_count_override`: Optional[int]<br>- `exp_max_payload_size`: Union[Dict[int, int], int, NoneType]<br>- `exp_max_rd_req_size`: Union[Dict[int, int], int, NoneType]<br>- `exp_ten_bit_tag_req_en`: Union[Dict[int, int], int, NoneType] | [PcieDataModel](#PcieDataModel-Model) | [PcieCollector](#Collector-Class-PcieCollector) | [PcieAnalyzer](#Data-Analyzer-Class-PcieAnalyzer) |
 | ProcessPlugin | top -b -n 1<br>rocm-smi --showpids<br>top -b -n 1 -o %CPU | **Analyzer Args:**<br>- `max_kfd_processes`: int<br>- `max_cpu_usage`: float | [ProcessDataModel](#ProcessDataModel-Model) | [ProcessCollector](#Collector-Class-ProcessCollector) | [ProcessAnalyzer](#Data-Analyzer-Class-ProcessAnalyzer) |
 | RocmPlugin | {rocm_path}/opencl/bin/*/clinfo<br>env \| grep -Ei 'rocm\|hsa\|hip\|mpi\|openmp\|ucx\|miopen'<br>ls /sys/class/kfd/kfd/proc/<br>grep -i -E 'rocm' /etc/ld.so.conf.d/*<br>{rocm_path}/bin/rocminfo<br>ls -v -d /opt/rocm*<br>ls -v -d /opt/rocm-[3-7]* \| tail -1<br>ldconfig -p \| grep -i -E 'rocm'<br>/opt/rocm/.info/version-rocm<br>/opt/rocm/.info/version | **Analyzer Args:**<br>- `exp_rocm`: Union[str, list]<br>- `exp_rocm_latest`: str | [RocmDataModel](#RocmDataModel-Model) | [RocmCollector](#Collector-Class-RocmCollector) | [RocmAnalyzer](#Data-Analyzer-Class-RocmAnalyzer) |
 | StoragePlugin | sh -c 'df -lH -B1 \| grep -v 'boot''<br>wmic LogicalDisk Where DriveType="3" Get DeviceId,Size,FreeSpace | - | [StorageDataModel](#StorageDataModel-Model) | [StorageCollector](#Collector-Class-StorageCollector) | [StorageAnalyzer](#Data-Analyzer-Class-StorageAnalyzer) |
@@ -428,7 +428,7 @@ class for collection of PCIe data only supports Linux OS type.

 This class will collect important PCIe data from the system running the commands
 - `lspci -vvv` : Verbose collection of PCIe data
-- `lspci -vt`: Tree view of PCIe data
+- `lspci -vvvt`: Verbose tree view of PCIe data
 - `lspci -PP`: Path view of PCIe data for the GPUs
 - If system interaction level is set to STANDARD or higher, the following commands will be run with sudo:
 - `lspci -xxxx`: Hex view of PCIe data for the GPUs
@@ -448,7 +448,7 @@ class for collection of PCIe data only supports Linux OS type.

 - **SUPPORTED_OS_FAMILY**: `{<OSFamily.LINUX: 3>}`
 - **CMD_LSPCI_VERBOSE**: `lspci -vvv`
-- **CMD_LSPCI_TREE**: `lspci -vt`
+- **CMD_LSPCI_VERBOSE_TREE**: `lspci -vvvt`
 - **CMD_LSPCI_PATH**: `lspci -PP`
 - **CMD_LSPCI_HEX_SUDO**: `lspci -xxxx`
 - **CMD_LSPCI_HEX**: `lspci -x`
@@ -466,8 +466,8 @@ PcieDataModel
 - lspci -xxxx
 - lspci -PP
 - lspci -PP -d {vendor_id}:{dev_id}
-- lspci -vt
 - lspci -vvv
+- lspci -vvvt

 ## Collector Class ProcessCollector

@@ -810,6 +810,8 @@ Pacakge data contains the package data for the system
 ### Model annotations and fields

 - **version_info**: `dict[str, str]`
+- **rocm_regex**: `str`
+- **enable_rocm_regex**: `bool`

 ## PcieDataModel Model

@@ -1322,6 +1324,8 @@ Check sysctl matches expected sysctl details

 - **exp_package_ver**: `Dict[str, Optional[str]]`
 - **regex_match**: `bool`
+- **rocm_regex**: `Optional[str]`
+- **enable_rocm_regex**: `bool`

 ## Analyzer Args Class PcieAnalyzerArgs

nodescraper/interfaces/task.py

Lines changed: 2 additions & 0 deletions
@@ -107,6 +107,8 @@ def _build_event(
             data = {"task_name": self.__class__.__name__, "task_type": self.TASK_TYPE}

         else:
+            # Copy to avoid mutating the caller's dict
+            data = copy.copy(data)
             data["task_name"] = self.__class__.__name__
             data["task_type"] = self.TASK_TYPE

nodescraper/plugins/inband/device_enumeration/device_enumeration_collector.py

Lines changed: 54 additions & 11 deletions
@@ -26,7 +26,7 @@
 from typing import Optional

 from nodescraper.base import InBandDataCollector
-from nodescraper.connection.inband.inband import CommandArtifact
+from nodescraper.connection.inband.inband import CommandArtifact, TextFileArtifact
 from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus, OSFamily
 from nodescraper.models import TaskResult

@@ -38,9 +38,10 @@ class DeviceEnumerationCollector(InBandDataCollector[DeviceEnumerationDataModel,

     DATA_MODEL = DeviceEnumerationDataModel

-    CMD_CPU_COUNT_LINUX = "lscpu | grep Socket | awk '{ print $2 }'"
     CMD_GPU_COUNT_LINUX = "lspci -d {vendorid_ep}: | grep -i 'VGA\\|Display\\|3D' | wc -l"
     CMD_VF_COUNT_LINUX = "lspci -d {vendorid_ep}: | grep -i 'Virtual Function' | wc -l"
+    CMD_LSCPU_LINUX = "lscpu"
+    CMD_LSHW_LINUX = "lshw"

     CMD_CPU_COUNT_WINDOWS = (
         'powershell -Command "(Get-WmiObject -Class Win32_Processor | Measure-Object).Count"'
@@ -61,9 +62,8 @@ def _warning(
             description=description,
             data={
                 "command": command.command,
-                "stdout": command.stdout,
-                "stderr": command.stderr,
                 "exit_code": command.exit_code,
+                "stderr": command.stderr,
             },
             priority=EventPriority.WARNING,
         )
@@ -75,8 +75,7 @@ def collect_data(self, args=None) -> tuple[TaskResult, Optional[DeviceEnumeratio
         On Windows, use WMI and hyper-v cmdlets
         """
         if self.system_info.os_family == OSFamily.LINUX:
-            # Count CPU sockets
-            cpu_count_res = self._run_sut_cmd(self.CMD_CPU_COUNT_LINUX)
+            lscpu_res = self._run_sut_cmd(self.CMD_LSCPU_LINUX, log_artifact=False)

             # Count all AMD GPUs
             vendor_id = format(self.system_info.vendorid_ep, "x")
@@ -86,17 +85,42 @@ def collect_data(self, args=None) -> tuple[TaskResult, Optional[DeviceEnumeratio

             # Count AMD Virtual Functions
             vf_count_res = self._run_sut_cmd(self.CMD_VF_COUNT_LINUX.format(vendorid_ep=vendor_id))
+
+            # Collect lshw output
+            lshw_res = self._run_sut_cmd(self.CMD_LSHW_LINUX, sudo=True, log_artifact=False)
         else:
             cpu_count_res = self._run_sut_cmd(self.CMD_CPU_COUNT_WINDOWS)
             gpu_count_res = self._run_sut_cmd(self.CMD_GPU_COUNT_WINDOWS)
             vf_count_res = self._run_sut_cmd(self.CMD_VF_COUNT_WINDOWS)

         device_enum = DeviceEnumerationDataModel()

-        if cpu_count_res.exit_code == 0:
-            device_enum.cpu_count = int(cpu_count_res.stdout)
+        if self.system_info.os_family == OSFamily.LINUX:
+            if lscpu_res.exit_code == 0 and lscpu_res.stdout:
+                # Extract socket count from lscpu output
+                for line in lscpu_res.stdout.splitlines():
+                    if line.startswith("Socket(s):"):
+                        try:
+                            device_enum.cpu_count = int(line.split(":")[1].strip())
+                            break
+                        except (ValueError, IndexError):
+                            self._warning(
+                                description="Cannot parse CPU count from lscpu output",
+                                command=lscpu_res,
+                            )
+                device_enum.lscpu_output = lscpu_res.stdout
+                self._log_event(
+                    category=EventCategory.PLATFORM,
+                    description="Collected lscpu output",
+                    priority=EventPriority.INFO,
+                )
+            else:
+                self._warning(description="Cannot collect lscpu output", command=lscpu_res)
         else:
-            self._warning(description="Cannot determine CPU count", command=cpu_count_res)
+            if cpu_count_res.exit_code == 0:
+                device_enum.cpu_count = int(cpu_count_res.stdout)
+            else:
+                self._warning(description="Cannot determine CPU count", command=cpu_count_res)

         if gpu_count_res.exit_code == 0:
             device_enum.gpu_count = int(gpu_count_res.stdout)
@@ -112,14 +136,33 @@ def collect_data(self, args=None) -> tuple[TaskResult, Optional[DeviceEnumeratio
                 category=EventCategory.SW_DRIVER,
             )

+        # Collect lshw output on Linux
+        if self.system_info.os_family == OSFamily.LINUX:
+            if lshw_res.exit_code == 0 and lshw_res.stdout:
+                device_enum.lshw_output = lshw_res.stdout
+                self.result.artifacts.append(
+                    TextFileArtifact(filename="lshw.txt", contents=lshw_res.stdout)
+                )
+                self._log_event(
+                    category=EventCategory.PLATFORM,
+                    description="Collected lshw output",
+                    priority=EventPriority.INFO,
+                )
+            else:
+                self._warning(description="Cannot collect lshw output", command=lshw_res)
+
         if device_enum.cpu_count or device_enum.gpu_count or device_enum.vf_count:
+            log_data = device_enum.model_dump(
+                exclude_none=True,
+                exclude={"lscpu_output", "lshw_output", "task_name", "task_type", "parent"},
+            )
             self._log_event(
                 category=EventCategory.PLATFORM,
                 description=f"Counted {device_enum.cpu_count} CPUs, {device_enum.gpu_count} GPUs, {device_enum.vf_count} VFs",
-                data=device_enum.model_dump(exclude_none=True),
+                data=log_data,
                 priority=EventPriority.INFO,
             )
-            self.result.message = f"Device Enumeration: {device_enum.model_dump(exclude_none=True)}"
+            self.result.message = f"Device Enumeration: {log_data}"
             self.result.status = ExecutionStatus.OK
             return self.result, device_enum
         else:
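
The collector now runs plain `lscpu` and extracts the socket count in Python instead of piping through `grep` and `awk`. A minimal sketch of that parsing step follows, assuming output with one `Key: value` pair per line. The helper name `parse_socket_count` and the sample text are illustrative and not taken from the repository, and the sketch returns `None` where the collector logs a warning.

```python
from typing import Optional

# Made-up sample in the usual "Key:   value" layout of lscpu output.
SAMPLE_LSCPU = """\
Architecture:        x86_64
CPU(s):              128
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
"""


def parse_socket_count(lscpu_stdout: str) -> Optional[int]:
    """Return the integer on the 'Socket(s):' line, or None if absent or unparsable."""
    for line in lscpu_stdout.splitlines():
        if line.startswith("Socket(s):"):
            try:
                return int(line.split(":", 1)[1].strip())
            except ValueError:
                # The collector in the diff emits a warning event at this point.
                return None
    return None


socket_count = parse_socket_count(SAMPLE_LSCPU)
```

Parsing in Python keeps the full `lscpu` text available for `lscpu_output` while still deriving `cpu_count` from the same command result.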

nodescraper/plugins/inband/device_enumeration/deviceenumdata.py

Lines changed: 2 additions & 0 deletions
@@ -32,3 +32,5 @@ class DeviceEnumerationDataModel(DataModel):
     cpu_count: Optional[int] = None
     gpu_count: Optional[int] = None
     vf_count: Optional[int] = None
+    lscpu_output: Optional[str] = None
+    lshw_output: Optional[str] = None
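
The two new fields hold raw command output, which the collector keeps out of the logged summary by passing an `exclude` set to `model_dump`. The sketch below shows the same filtering idea with a standard-library dataclass standing in for the Pydantic model, so `DeviceEnumSketch` and `log_payload` are hypothetical names, not project code.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional

# Fields that carry bulky raw text and should stay out of log events.
RAW_OUTPUT_FIELDS = {"lscpu_output", "lshw_output"}


@dataclass
class DeviceEnumSketch:
    """Dataclass stand-in for DeviceEnumerationDataModel (assumption)."""

    cpu_count: Optional[int] = None
    gpu_count: Optional[int] = None
    vf_count: Optional[int] = None
    lscpu_output: Optional[str] = None
    lshw_output: Optional[str] = None


def log_payload(model: DeviceEnumSketch) -> dict[str, Any]:
    """Drop unset fields and raw outputs, keeping only the compact counts."""
    return {
        key: value
        for key, value in asdict(model).items()
        if value is not None and key not in RAW_OUTPUT_FIELDS
    }


model = DeviceEnumSketch(cpu_count=2, gpu_count=8, lscpu_output="Socket(s): 2\n")
payload = log_payload(model)
```

The raw text still travels with the data model; it is only omitted from the event data and the result message.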

nodescraper/plugins/inband/dimm/dimm_collector.py

Lines changed: 21 additions & 0 deletions
@@ -26,6 +26,7 @@
 from typing import Optional

 from nodescraper.base import InBandDataCollector
+from nodescraper.connection.inband import TextFileArtifact
 from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus, OSFamily
 from nodescraper.models import TaskResult

@@ -40,6 +41,7 @@ class DimmCollector(InBandDataCollector[DimmDataModel, DimmCollectorArgs]):

     CMD_WINDOWS = "wmic memorychip get Capacity"
     CMD = """sh -c 'dmidecode -t 17 | tr -s " " | grep -v "Volatile\\|None\\|Module" | grep Size' 2>/dev/null"""
+    CMD_DMIDECODE_FULL = "dmidecode"

     def collect_data(
         self,
@@ -72,6 +74,25 @@ def collect_data(
                 self.result.message = "Skipping sudo plugin"
                 self.result.status = ExecutionStatus.NOT_RAN
                 return self.result, None
+
+            # Collect full dmidecode output as artifact
+            dmidecode_full_res = self._run_sut_cmd(self.CMD_DMIDECODE_FULL, sudo=True)
+            if dmidecode_full_res.exit_code == 0 and dmidecode_full_res.stdout:
+                self.result.artifacts.append(
+                    TextFileArtifact(filename="dmidecode.txt", contents=dmidecode_full_res.stdout)
+                )
+            else:
+                self._log_event(
+                    category=EventCategory.OS,
+                    description="Could not collect full dmidecode output",
+                    data={
+                        "command": dmidecode_full_res.command,
+                        "exit_code": dmidecode_full_res.exit_code,
+                        "stderr": dmidecode_full_res.stderr,
+                    },
+                    priority=EventPriority.WARNING,
+                )
+
             res = self._run_sut_cmd(self.CMD, sudo=True)
             if res.exit_code == 0:
                 total = 0

nodescraper/plugins/inband/kernel/analyzer_args.py

Lines changed: 1 addition & 1 deletion
@@ -61,4 +61,4 @@ def build_from_model(cls, datamodel: KernelDataModel) -> "KernelAnalyzerArgs":
         Returns:
             KernelAnalyzerArgs: instance of analyzer args class
         """
-        return cls(exp_kernel=datamodel.kernel_info)
+        return cls(exp_kernel=datamodel.kernel_version)

nodescraper/plugins/inband/memory/analyzer_args.py

Lines changed: 14 additions & 0 deletions
@@ -25,7 +25,21 @@
 ###############################################################################
 from nodescraper.models.analyzerargs import AnalyzerArgs

+from .memorydata import MemoryDataModel
+

 class MemoryAnalyzerArgs(AnalyzerArgs):
     ratio: float = 0.66
     memory_threshold: str = "30Gi"
+
+    @classmethod
+    def build_from_model(cls, datamodel: MemoryDataModel) -> "MemoryAnalyzerArgs":
+        """build analyzer args from data model
+
+        Args:
+            datamodel (MemoryDataModel): data model for plugin
+
+        Returns:
+            MemoryAnalyzerArgs: instance of analyzer args class
+        """
+        return cls(memory_threshold=datamodel.mem_total)

nodescraper/plugins/inband/memory/memory_plugin.py

Lines changed: 2 additions & 0 deletions
@@ -39,3 +39,5 @@ class MemoryPlugin(InBandDataPlugin[MemoryDataModel, None, MemoryAnalyzerArgs]):
     COLLECTOR = MemoryCollector

     ANALYZER = MemoryAnalyzer
+
+    ANALYZER_ARGS = MemoryAnalyzerArgs

nodescraper/plugins/inband/os/analyzer_args.py

Lines changed: 1 addition & 1 deletion
@@ -61,4 +61,4 @@ def build_from_model(cls, datamodel: OsDataModel) -> "OsAnalyzerArgs":
         Returns:
             OsAnalyzerArgs: instance of analyzer args class
         """
-        return cls(exp_os=datamodel.os_name)
+        return cls(exp_os=datamodel.os_name, exact_match=True)

nodescraper/plugins/inband/package/analyzer_args.py

Lines changed: 9 additions & 1 deletion
@@ -34,7 +34,15 @@
 class PackageAnalyzerArgs(AnalyzerArgs):
     exp_package_ver: Dict[str, Optional[str]] = Field(default_factory=dict)
     regex_match: bool = False
+    # rocm_regex is optional and should be specified in plugin_config.json if needed
+    rocm_regex: Optional[str] = None
+    enable_rocm_regex: bool = False

     @classmethod
     def build_from_model(cls, datamodel: PackageDataModel) -> "PackageAnalyzerArgs":
-        return cls(exp_package_ver=datamodel.version_info)
+        # Use custom rocm_regex from collection_args if enable_rocm_regex is true
+        rocm_regex = None
+        if datamodel.enable_rocm_regex and datamodel.rocm_regex:
+            rocm_regex = datamodel.rocm_regex
+
+        return cls(exp_package_ver=datamodel.version_info, rocm_regex=rocm_regex)
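
The gating in `build_from_model` can be sketched in isolation with dataclasses in place of the Pydantic models. `PackageDataSketch` and `PackageArgsSketch` are hypothetical stand-ins, and the data-model defaults below are assumptions, since the diff only shows the field types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PackageDataSketch:
    """Stand-in for PackageDataModel; default values are assumed."""

    version_info: dict[str, str] = field(default_factory=dict)
    rocm_regex: str = ""
    enable_rocm_regex: bool = False


@dataclass
class PackageArgsSketch:
    """Stand-in for PackageAnalyzerArgs with the fields shown in the diff."""

    exp_package_ver: dict[str, Optional[str]] = field(default_factory=dict)
    regex_match: bool = False
    rocm_regex: Optional[str] = None
    enable_rocm_regex: bool = False

    @classmethod
    def build_from_model(cls, datamodel: PackageDataSketch) -> "PackageArgsSketch":
        # Forward the regex only when it is both enabled and non-empty.
        rocm_regex = None
        if datamodel.enable_rocm_regex and datamodel.rocm_regex:
            rocm_regex = datamodel.rocm_regex
        return cls(exp_package_ver=dict(datamodel.version_info), rocm_regex=rocm_regex)


enabled = PackageArgsSketch.build_from_model(
    PackageDataSketch({"rocm-core": "6.2.0"}, rocm_regex="rocm", enable_rocm_regex=True)
)
disabled = PackageArgsSketch.build_from_model(
    PackageDataSketch({"rocm-core": "6.2.0"}, rocm_regex="rocm")
)
```

As in the hunk above, the sketch forwards only `rocm_regex`; `enable_rocm_regex` on the resulting args keeps its default of `False` unless it is set elsewhere.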
