Commit 6288e8b

Merge branch 'development' into alex_dmidecode
2 parents bcba1e7 + 95d6ac9 commit 6288e8b

13 files changed, +359 -42 lines changed

docs/PLUGIN_DOC.md

Lines changed: 9 additions & 5 deletions
@@ -17,8 +17,8 @@
 | MemoryPlugin | free -b<br>/usr/bin/lsmem<br>wmic OS get FreePhysicalMemory /Value; wmic ComputerSystem get TotalPhysicalMemory /Value | - | [MemoryDataModel](#MemoryDataModel-Model) | [MemoryCollector](#Collector-Class-MemoryCollector) | [MemoryAnalyzer](#Data-Analyzer-Class-MemoryAnalyzer) |
 | NvmePlugin | nvme smart-log {dev}<br>nvme error-log {dev} --log-entries=256<br>nvme id-ctrl {dev}<br>nvme id-ns {dev}{ns}<br>nvme fw-log {dev}<br>nvme self-test-log {dev}<br>nvme get-log {dev} --log-id=6 --log-len=512<br>nvme telemetry-log {dev} --output-file={dev}_{f_name} | - | [NvmeDataModel](#NvmeDataModel-Model) | [NvmeCollector](#Collector-Class-NvmeCollector) | - |
 | OsPlugin | sh -c '( lsb_release -ds \|\| (cat /etc/*release \| grep PRETTY_NAME) \|\| uname -om ) 2>/dev/null \| head -n1'<br>cat /etc/*release \| grep VERSION_ID<br>wmic os get Version /value<br>wmic os get Caption /Value | **Analyzer Args:**<br>- `exp_os`: Union[str, list]<br>- `exact_match`: bool | [OsDataModel](#OsDataModel-Model) | [OsCollector](#Collector-Class-OsCollector) | [OsAnalyzer](#Data-Analyzer-Class-OsAnalyzer) |
-| PackagePlugin | dnf list --installed<br>dpkg-query -W<br>pacman -Q<br>cat /etc/*release<br>wmic product get name,version | **Analyzer Args:**<br>- `exp_package_ver`: Dict[str, Optional[str]]<br>- `regex_match`: bool | [PackageDataModel](#PackageDataModel-Model) | [PackageCollector](#Collector-Class-PackageCollector) | [PackageAnalyzer](#Data-Analyzer-Class-PackageAnalyzer) |
-| PciePlugin | lspci -d {vendor_id}: -nn<br>lspci -x<br>lspci -xxxx<br>lspci -PP<br>lspci -PP -d {vendor_id}:{dev_id}<br>lspci -vt<br>lspci -vvv | **Analyzer Args:**<br>- `exp_speed`: int<br>- `exp_width`: int<br>- `exp_sriov_count`: int<br>- `exp_gpu_count_override`: Optional[int]<br>- `exp_max_payload_size`: Union[Dict[int, int], int, NoneType]<br>- `exp_max_rd_req_size`: Union[Dict[int, int], int, NoneType]<br>- `exp_ten_bit_tag_req_en`: Union[Dict[int, int], int, NoneType] | [PcieDataModel](#PcieDataModel-Model) | [PcieCollector](#Collector-Class-PcieCollector) | [PcieAnalyzer](#Data-Analyzer-Class-PcieAnalyzer) |
+| PackagePlugin | dnf list --installed<br>dpkg-query -W<br>pacman -Q<br>cat /etc/*release<br>wmic product get name,version | **Analyzer Args:**<br>- `exp_package_ver`: Dict[str, Optional[str]]<br>- `regex_match`: bool<br>- `rocm_regex`: Optional[str]<br>- `enable_rocm_regex`: bool | [PackageDataModel](#PackageDataModel-Model) | [PackageCollector](#Collector-Class-PackageCollector) | [PackageAnalyzer](#Data-Analyzer-Class-PackageAnalyzer) |
+| PciePlugin | lspci -d {vendor_id}: -nn<br>lspci -x<br>lspci -xxxx<br>lspci -PP<br>lspci -PP -d {vendor_id}:{dev_id}<br>lspci -vvv<br>lspci -vvvt | **Analyzer Args:**<br>- `exp_speed`: int<br>- `exp_width`: int<br>- `exp_sriov_count`: int<br>- `exp_gpu_count_override`: Optional[int]<br>- `exp_max_payload_size`: Union[Dict[int, int], int, NoneType]<br>- `exp_max_rd_req_size`: Union[Dict[int, int], int, NoneType]<br>- `exp_ten_bit_tag_req_en`: Union[Dict[int, int], int, NoneType] | [PcieDataModel](#PcieDataModel-Model) | [PcieCollector](#Collector-Class-PcieCollector) | [PcieAnalyzer](#Data-Analyzer-Class-PcieAnalyzer) |
 | ProcessPlugin | top -b -n 1<br>rocm-smi --showpids<br>top -b -n 1 -o %CPU | **Analyzer Args:**<br>- `max_kfd_processes`: int<br>- `max_cpu_usage`: float | [ProcessDataModel](#ProcessDataModel-Model) | [ProcessCollector](#Collector-Class-ProcessCollector) | [ProcessAnalyzer](#Data-Analyzer-Class-ProcessAnalyzer) |
 | RocmPlugin | {rocm_path}/opencl/bin/*/clinfo<br>env \| grep -Ei 'rocm\|hsa\|hip\|mpi\|openmp\|ucx\|miopen'<br>ls /sys/class/kfd/kfd/proc/<br>grep -i -E 'rocm' /etc/ld.so.conf.d/*<br>{rocm_path}/bin/rocminfo<br>ls -v -d /opt/rocm*<br>ls -v -d /opt/rocm-[3-7]* \| tail -1<br>ldconfig -p \| grep -i -E 'rocm'<br>/opt/rocm/.info/version-rocm<br>/opt/rocm/.info/version | **Analyzer Args:**<br>- `exp_rocm`: Union[str, list]<br>- `exp_rocm_latest`: str | [RocmDataModel](#RocmDataModel-Model) | [RocmCollector](#Collector-Class-RocmCollector) | [RocmAnalyzer](#Data-Analyzer-Class-RocmAnalyzer) |
 | StoragePlugin | sh -c 'df -lH -B1 \| grep -v 'boot''<br>wmic LogicalDisk Where DriveType="3" Get DeviceId,Size,FreeSpace | - | [StorageDataModel](#StorageDataModel-Model) | [StorageCollector](#Collector-Class-StorageCollector) | [StorageAnalyzer](#Data-Analyzer-Class-StorageAnalyzer) |
@@ -428,7 +428,7 @@ class for collection of PCIe data only supports Linux OS type.

 This class will collect important PCIe data from the system running the commands
 - `lspci -vvv` : Verbose collection of PCIe data
-- `lspci -vt`: Tree view of PCIe data
+- `lspci -vvvt`: Verbose tree view of PCIe data
 - `lspci -PP`: Path view of PCIe data for the GPUs
 - If system interaction level is set to STANDARD or higher, the following commands will be run with sudo:
 - `lspci -xxxx`: Hex view of PCIe data for the GPUs
@@ -448,7 +448,7 @@ class for collection of PCIe data only supports Linux OS type.

 - **SUPPORTED_OS_FAMILY**: `{<OSFamily.LINUX: 3>}`
 - **CMD_LSPCI_VERBOSE**: `lspci -vvv`
-- **CMD_LSPCI_TREE**: `lspci -vt`
+- **CMD_LSPCI_VERBOSE_TREE**: `lspci -vvvt`
 - **CMD_LSPCI_PATH**: `lspci -PP`
 - **CMD_LSPCI_HEX_SUDO**: `lspci -xxxx`
 - **CMD_LSPCI_HEX**: `lspci -x`
@@ -466,8 +466,8 @@ PcieDataModel
 - lspci -xxxx
 - lspci -PP
 - lspci -PP -d {vendor_id}:{dev_id}
-- lspci -vt
 - lspci -vvv
+- lspci -vvvt

 ## Collector Class ProcessCollector

@@ -810,6 +810,8 @@ Pacakge data contains the package data for the system
 ### Model annotations and fields

 - **version_info**: `dict[str, str]`
+- **rocm_regex**: `str`
+- **enable_rocm_regex**: `bool`

 ## PcieDataModel Model

@@ -1322,6 +1324,8 @@ Check sysctl matches expected sysctl details

 - **exp_package_ver**: `Dict[str, Optional[str]]`
 - **regex_match**: `bool`
+- **rocm_regex**: `Optional[str]`
+- **enable_rocm_regex**: `bool`

 ## Analyzer Args Class PcieAnalyzerArgs

nodescraper/interfaces/task.py

Lines changed: 2 additions & 0 deletions
@@ -107,6 +107,8 @@ def _build_event(
             data = {"task_name": self.__class__.__name__, "task_type": self.TASK_TYPE}

         else:
+            # Copy to avoid mutating the caller's dict
+            data = copy.copy(data)
             data["task_name"] = self.__class__.__name__
             data["task_type"] = self.TASK_TYPE
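
The effect of the added `copy.copy(data)` call can be seen in isolation with a minimal sketch. `build_event_data` and the `"DATA_COLLECTOR"` value are hypothetical stand-ins for illustration, not names from nodescraper. Note that `copy.copy` is a shallow copy: top-level keys are protected, but nested objects are still shared with the caller.

```python
import copy
from typing import Optional


def build_event_data(data: Optional[dict], task_name: str, task_type: str) -> dict:
    """Hypothetical stand-in for the dict handling shown in the hunk above."""
    if data is None:
        data = {}
    else:
        # Shallow copy: new top-level dict, nested values are shared.
        data = copy.copy(data)
    data["task_name"] = task_name
    data["task_type"] = task_type
    return data


caller_data = {"command": "lscpu", "details": {"exit_code": 0}}
event_data = build_event_data(caller_data, "DemoTask", "DATA_COLLECTOR")
```

After the call, `caller_data` has no `task_name` key, while `event_data["details"]` is still the same object as `caller_data["details"]`.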

nodescraper/plugins/inband/device_enumeration/device_enumeration_collector.py

Lines changed: 54 additions & 11 deletions
@@ -26,7 +26,7 @@
 from typing import Optional

 from nodescraper.base import InBandDataCollector
-from nodescraper.connection.inband.inband import CommandArtifact
+from nodescraper.connection.inband.inband import CommandArtifact, TextFileArtifact
 from nodescraper.enums import EventCategory, EventPriority, ExecutionStatus, OSFamily
 from nodescraper.models import TaskResult

@@ -38,9 +38,10 @@ class DeviceEnumerationCollector(InBandDataCollector[DeviceEnumerationDataModel,

     DATA_MODEL = DeviceEnumerationDataModel

-    CMD_CPU_COUNT_LINUX = "lscpu | grep Socket | awk '{ print $2 }'"
     CMD_GPU_COUNT_LINUX = "lspci -d {vendorid_ep}: | grep -i 'VGA\\|Display\\|3D' | wc -l"
     CMD_VF_COUNT_LINUX = "lspci -d {vendorid_ep}: | grep -i 'Virtual Function' | wc -l"
+    CMD_LSCPU_LINUX = "lscpu"
+    CMD_LSHW_LINUX = "lshw"

     CMD_CPU_COUNT_WINDOWS = (
         'powershell -Command "(Get-WmiObject -Class Win32_Processor | Measure-Object).Count"'
@@ -61,9 +62,8 @@ def _warning(
             description=description,
             data={
                 "command": command.command,
-                "stdout": command.stdout,
-                "stderr": command.stderr,
                 "exit_code": command.exit_code,
+                "stderr": command.stderr,
             },
             priority=EventPriority.WARNING,
         )
@@ -75,8 +75,7 @@ def collect_data(self, args=None) -> tuple[TaskResult, Optional[DeviceEnumeratio
         On Windows, use WMI and hyper-v cmdlets
         """
         if self.system_info.os_family == OSFamily.LINUX:
-            # Count CPU sockets
-            cpu_count_res = self._run_sut_cmd(self.CMD_CPU_COUNT_LINUX)
+            lscpu_res = self._run_sut_cmd(self.CMD_LSCPU_LINUX, log_artifact=False)

             # Count all AMD GPUs
             vendor_id = format(self.system_info.vendorid_ep, "x")
@@ -86,17 +85,42 @@ def collect_data(self, args=None) -> tuple[TaskResult, Optional[DeviceEnumeratio

             # Count AMD Virtual Functions
             vf_count_res = self._run_sut_cmd(self.CMD_VF_COUNT_LINUX.format(vendorid_ep=vendor_id))
+
+            # Collect lshw output
+            lshw_res = self._run_sut_cmd(self.CMD_LSHW_LINUX, sudo=True, log_artifact=False)
         else:
             cpu_count_res = self._run_sut_cmd(self.CMD_CPU_COUNT_WINDOWS)
             gpu_count_res = self._run_sut_cmd(self.CMD_GPU_COUNT_WINDOWS)
             vf_count_res = self._run_sut_cmd(self.CMD_VF_COUNT_WINDOWS)

         device_enum = DeviceEnumerationDataModel()

-        if cpu_count_res.exit_code == 0:
-            device_enum.cpu_count = int(cpu_count_res.stdout)
+        if self.system_info.os_family == OSFamily.LINUX:
+            if lscpu_res.exit_code == 0 and lscpu_res.stdout:
+                # Extract socket count from lscpu output
+                for line in lscpu_res.stdout.splitlines():
+                    if line.startswith("Socket(s):"):
+                        try:
+                            device_enum.cpu_count = int(line.split(":")[1].strip())
+                            break
+                        except (ValueError, IndexError):
+                            self._warning(
+                                description="Cannot parse CPU count from lscpu output",
+                                command=lscpu_res,
+                            )
+                device_enum.lscpu_output = lscpu_res.stdout
+                self._log_event(
+                    category=EventCategory.PLATFORM,
+                    description="Collected lscpu output",
+                    priority=EventPriority.INFO,
+                )
+            else:
+                self._warning(description="Cannot collect lscpu output", command=lscpu_res)
         else:
-            self._warning(description="Cannot determine CPU count", command=cpu_count_res)
+            if cpu_count_res.exit_code == 0:
+                device_enum.cpu_count = int(cpu_count_res.stdout)
+            else:
+                self._warning(description="Cannot determine CPU count", command=cpu_count_res)

         if gpu_count_res.exit_code == 0:
             device_enum.gpu_count = int(gpu_count_res.stdout)
@@ -112,14 +136,33 @@ def collect_data(self, args=None) -> tuple[TaskResult, Optional[DeviceEnumeratio
                 category=EventCategory.SW_DRIVER,
             )

+        # Collect lshw output on Linux
+        if self.system_info.os_family == OSFamily.LINUX:
+            if lshw_res.exit_code == 0 and lshw_res.stdout:
+                device_enum.lshw_output = lshw_res.stdout
+                self.result.artifacts.append(
+                    TextFileArtifact(filename="lshw.txt", contents=lshw_res.stdout)
+                )
+                self._log_event(
+                    category=EventCategory.PLATFORM,
+                    description="Collected lshw output",
+                    priority=EventPriority.INFO,
+                )
+            else:
+                self._warning(description="Cannot collect lshw output", command=lshw_res)
+
         if device_enum.cpu_count or device_enum.gpu_count or device_enum.vf_count:
+            log_data = device_enum.model_dump(
+                exclude_none=True,
+                exclude={"lscpu_output", "lshw_output", "task_name", "task_type", "parent"},
+            )
             self._log_event(
                 category=EventCategory.PLATFORM,
                 description=f"Counted {device_enum.cpu_count} CPUs, {device_enum.gpu_count} GPUs, {device_enum.vf_count} VFs",
-                data=device_enum.model_dump(exclude_none=True),
+                data=log_data,
                 priority=EventPriority.INFO,
             )
-            self.result.message = f"Device Enumeration: {device_enum.model_dump(exclude_none=True)}"
+            self.result.message = f"Device Enumeration: {log_data}"
             self.result.status = ExecutionStatus.OK
             return self.result, device_enum
         else:
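
The collector now parses the socket count in Python instead of piping `lscpu` through `grep` and `awk`. Below is a minimal standalone sketch of that parsing step, assuming the `Socket(s):` label format shown in the diff. `parse_socket_count` and the sample text are illustrative only, not part of nodescraper or captured from a real host.

```python
from typing import Optional


def parse_socket_count(lscpu_stdout: str) -> Optional[int]:
    """Return the value of the 'Socket(s):' line, or None if absent or unparsable."""
    for line in lscpu_stdout.splitlines():
        if line.startswith("Socket(s):"):
            try:
                return int(line.split(":")[1].strip())
            except (ValueError, IndexError):
                # The collector logs a warning here; the sketch just gives up.
                return None
    return None


# Illustrative sample, trimmed to a few lines.
SAMPLE_LSCPU = """Architecture:        x86_64
CPU(s):              64
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           2
"""

socket_count = parse_socket_count(SAMPLE_LSCPU)
```

Matching with `startswith("Socket(s):")` avoids picking up `Core(s) per socket:`, which the old `grep Socket` pattern happened to skip only because of letter case.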

nodescraper/plugins/inband/device_enumeration/deviceenumdata.py

Lines changed: 2 additions & 0 deletions
@@ -32,3 +32,5 @@ class DeviceEnumerationDataModel(DataModel):
     cpu_count: Optional[int] = None
     gpu_count: Optional[int] = None
     vf_count: Optional[int] = None
+    lscpu_output: Optional[str] = None
+    lshw_output: Optional[str] = None
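
The two new fields hold full command output, which is why the collector excludes them when it builds the log payload with `model_dump(exclude_none=True, exclude={...})`. The sketch below imitates that filtering with a plain dataclass so it runs without pydantic. `DeviceEnumSketch` and `summary_dict` are hypothetical names, and the dict comprehension is a stand-in for `model_dump`, not the project's code.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DeviceEnumSketch:
    """Dataclass stand-in for DeviceEnumerationDataModel (illustration only)."""
    cpu_count: Optional[int] = None
    gpu_count: Optional[int] = None
    vf_count: Optional[int] = None
    lscpu_output: Optional[str] = None
    lshw_output: Optional[str] = None


# Raw command output is kept on the model but left out of log messages.
RAW_FIELDS = {"lscpu_output", "lshw_output"}


def summary_dict(model: DeviceEnumSketch) -> dict:
    """Drop None values and the bulky raw-output fields."""
    return {
        key: value
        for key, value in asdict(model).items()
        if value is not None and key not in RAW_FIELDS
    }


model = DeviceEnumSketch(cpu_count=2, gpu_count=8, lscpu_output="Socket(s): 2")
log_data = summary_dict(model)
```

Here `log_data` contains only `cpu_count` and `gpu_count`, while `model.lscpu_output` stays available to analyzers.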

nodescraper/plugins/inband/package/analyzer_args.py

Lines changed: 9 additions & 1 deletion
@@ -34,7 +34,15 @@
 class PackageAnalyzerArgs(AnalyzerArgs):
     exp_package_ver: Dict[str, Optional[str]] = Field(default_factory=dict)
     regex_match: bool = False
+    # rocm_regex is optional and should be specified in plugin_config.json if needed
+    rocm_regex: Optional[str] = None
+    enable_rocm_regex: bool = False

     @classmethod
     def build_from_model(cls, datamodel: PackageDataModel) -> "PackageAnalyzerArgs":
-        return cls(exp_package_ver=datamodel.version_info)
+        # Use custom rocm_regex from collection_args if enable_rocm_regex is true
+        rocm_regex = None
+        if datamodel.enable_rocm_regex and datamodel.rocm_regex:
+            rocm_regex = datamodel.rocm_regex
+
+        return cls(exp_package_ver=datamodel.version_info, rocm_regex=rocm_regex)

nodescraper/plugins/inband/package/package_analyzer.py

Lines changed: 58 additions & 10 deletions
@@ -44,7 +44,7 @@ def regex_version_data(
         package_data: dict[str, str],
         key_search: re.Pattern[str],
         value_search: Optional[Pattern[str]],
-    ) -> bool:
+    ) -> tuple[bool, list[tuple[str, str, str]]]:
         """Searches the package values for the key and value search patterns

         Args:
@@ -53,10 +53,12 @@ def regex_version_data(
             value_search (Optional[Pattern[str]]): a compiled regex pattern to search for the package version, if None then any version is accepted

         Returns:
-            bool: A boolean indicating if the value was found
+            tuple: (value_found, version_mismatches) where value_found is a bool and
+            version_mismatches is a list of (package_name, expected_pattern, found_version) tuples
         """

         value_found = False
+        version_mismatches = []
         for name, version in package_data.items():
             self.logger.debug("Package data: %s, %s", name, version)
             key_search_res = key_search.search(name)
@@ -66,6 +68,7 @@ def regex_version_data(
                     continue
                 value_search_res = value_search.search(version)
                 if not value_search_res:
+                    version_mismatches.append((name, value_search.pattern, version))
                     self._log_event(
                         EventCategory.APPLICATION,
                         f"Package {key_search.pattern} Version Mismatch, Expected {value_search.pattern} but found {version}",
@@ -77,7 +80,7 @@ def regex_version_data(
                             "found_version": version,
                         },
                     )
-        return value_found
+        return value_found, version_mismatches

     def package_regex_search(
         self, package_data: dict[str, str], exp_package_data: dict[str, Optional[str]]
@@ -87,16 +90,23 @@ def package_regex_search(
         Args:
             package_data (dict[str, str]): a dictionary of package names and versions
             exp_package_data (dict[str, Optional[str]]): a dictionary of expected package names and versions
+
+        Returns:
+            tuple: (not_found_keys, regex_errors, version_mismatches) containing lists of errors
         """
         not_found_keys = []
+        regex_errors = []
+        version_mismatches = []
+
         for exp_key, exp_value in exp_package_data.items():
             try:
                 if exp_value is not None:
                     value_search = re.compile(exp_value)
                 else:
                     value_search = None
                 key_search = re.compile(exp_key)
-            except re.error:
+            except re.error as e:
+                regex_errors.append((exp_key, exp_value, str(e)))
                 self._log_event(
                     EventCategory.RUNTIME,
                     f"Regex Compile Error either {exp_key} {exp_value}",
@@ -108,10 +118,13 @@ def package_regex_search(
                 )
                 continue

-            key_found = self.regex_version_data(package_data, key_search, value_search)
+            key_found, mismatches = self.regex_version_data(package_data, key_search, value_search)
+
+            # Collect version mismatches
+            version_mismatches.extend(mismatches)

             if not key_found:
-                not_found_keys.append(exp_key)
+                not_found_keys.append((exp_key, exp_value))
                 self._log_event(
                     EventCategory.APPLICATION,
                     f"Package {exp_key} not found in the package list",
@@ -123,7 +136,8 @@ def package_regex_search(
                         "found_version": None,
                     },
                 )
-        return not_found_keys
+
+        return not_found_keys, regex_errors, version_mismatches

     def package_exact_match(
         self, package_data: dict[str, str], exp_package_data: dict[str, Optional[str]]
@@ -190,9 +204,43 @@ def analyze_data(
             return self.result

         if args.regex_match:
-            not_found_keys = self.package_regex_search(data.version_info, args.exp_package_ver)
-            self.result.message = f"Packages not found: {not_found_keys}"
-            self.result.status = ExecutionStatus.ERROR
+            not_found_keys, regex_errors, version_mismatches = self.package_regex_search(
+                data.version_info, args.exp_package_ver
+            )
+
+            # Adding details for err message
+            error_parts = []
+            if not_found_keys:
+                packages_detail = ", ".join(
+                    [
+                        f"'{pkg}' (expected version: {ver if ver else 'any'})"
+                        for pkg, ver in not_found_keys
+                    ]
+                )
+                error_parts.append(f"Packages not found: {packages_detail}")
+
+            if regex_errors:
+                regex_detail = ", ".join(
+                    [f"'{pkg}' pattern (version: {ver})" for pkg, ver, _ in regex_errors]
+                )
+                error_parts.append(f"Regex compile errors: {regex_detail}")
+
+            if version_mismatches:
+                version_detail = ", ".join(
+                    [
+                        f"'{pkg}' (expected: {exp}, found: {found})"
+                        for pkg, exp, found in version_mismatches
+                    ]
+                )
+                error_parts.append(f"Version mismatches: {version_detail}")
+
+            total_errors = len(not_found_keys) + len(regex_errors) + len(version_mismatches)
+            if total_errors > 0:
+                self.result.message = f"{'; '.join(error_parts)}"
+                self.result.status = ExecutionStatus.ERROR
+            else:
+                self.result.message = "All packages found and versions matched"
+                self.result.status = ExecutionStatus.OK
         else:
             self.logger.info("Expected packages: %s", list(args.exp_package_ver.keys()))
             not_found_match, not_found_version = self.package_exact_match(
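
Before this change the regex path always set `ExecutionStatus.ERROR`. It now reports three separate failure classes and returns OK when all are empty. The sketch below condenses that flow into two free functions so it can be run on its own. `regex_package_check`, `summarize` and the sample package data are hypothetical; event logging and the analyzer classes are left out.

```python
import re
from typing import Optional


def regex_package_check(installed: dict[str, str], expected: dict[str, Optional[str]]):
    """Return (not_found, regex_errors, mismatches), mirroring the tuple in the diff."""
    not_found, regex_errors, mismatches = [], [], []
    for exp_key, exp_value in expected.items():
        try:
            value_search = re.compile(exp_value) if exp_value is not None else None
            key_search = re.compile(exp_key)
        except re.error as err:
            regex_errors.append((exp_key, exp_value, str(err)))
            continue
        found = False
        for name, version in installed.items():
            if not key_search.search(name):
                continue
            found = True
            # A None version pattern accepts any installed version.
            if value_search is not None and not value_search.search(version):
                mismatches.append((name, value_search.pattern, version))
        if not found:
            not_found.append((exp_key, exp_value))
    return not_found, regex_errors, mismatches


def summarize(not_found, regex_errors, mismatches) -> str:
    """Join the non-empty failure classes with '; ', as the analyzer message does."""
    parts = []
    if not_found:
        detail = ", ".join(f"'{p}' (expected version: {v if v else 'any'})" for p, v in not_found)
        parts.append(f"Packages not found: {detail}")
    if regex_errors:
        detail = ", ".join(f"'{p}' pattern (version: {v})" for p, v, _ in regex_errors)
        parts.append(f"Regex compile errors: {detail}")
    if mismatches:
        detail = ", ".join(f"'{p}' (expected: {e}, found: {f})" for p, e, f in mismatches)
        parts.append(f"Version mismatches: {detail}")
    return "; ".join(parts) if parts else "All packages found and versions matched"


installed = {"rocm-core": "6.2.0", "openssl": "3.0.2"}
expected = {"rocm-core": r"^6\.3", "openssl": None, "missing-pkg": None, "bad(": None}
message = summarize(*regex_package_check(installed, expected))
```

With this sample input each failure class has one entry, so `message` contains all three sections separated by `; `.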
