Commit 084d6fa

feat: unit test metric tracking (#40)
Signed-off-by: Terry Kong <terryk@nvidia.com>
1 parent cca2aec commit 084d6fa

File tree: 8 files changed, +347 −21 lines

.github/workflows/_run_test.yml

Lines changed: 16 additions & 2 deletions
@@ -27,11 +27,16 @@ on:
         default: 10
       SCRIPT:
         type: string
-        description: Test script to execute
+        description: Test script to execute in container
         required: true
       AFTER_SCRIPT:
         type: string
-        description: Script to run after main test
+        description: Script to run after main test in container
+        required: false
+        default: ":"
+      FINAL_SCRIPT_EXTERNAL:
+        type: string
+        description: Script to run after SCRIPT and AFTER_SCRIPT, but outside container (useful for logging)
         required: false
         default: ":"
       IS_OPTIONAL:
IS_OPTIONAL:
@@ -163,6 +168,15 @@ jobs:
           )
           docker exec nemo_container_${{ github.run_id }} bash -eux -o pipefail -c "$cmd"

+      - name: final_script_external
+        if: always() && inputs.FINAL_SCRIPT_EXTERNAL != ':'
+        run: |
+          cmd=$(cat <<"RUN_TEST_EOF"
+          ${{ inputs.FINAL_SCRIPT_EXTERNAL }}
+          RUN_TEST_EOF
+          )
+          bash -eux -o pipefail -c "$cmd"
+
       - name: Container shutdown
         if: always()
         run: |
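A note on the pattern added here: quoting the heredoc delimiter (`<<"RUN_TEST_EOF"`) suppresses parameter and command substitution while the caller-supplied script is captured into `$cmd`. Any `$VAR` or `$(...)` inside `FINAL_SCRIPT_EXTERNAL` therefore expands only when `bash -eux -o pipefail -c "$cmd"` actually executes it, directly on the runner rather than inside the container.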

.github/workflows/cicd-main.yml

Lines changed: 7 additions & 0 deletions
@@ -154,6 +154,13 @@ jobs:
       SCRIPT: |
         cd ${REINFORCER_REPO_DIR}
         uv run --extra test bash -x ./tests/run_unit.sh
+      FINAL_SCRIPT_EXTERNAL: |
+        cat <<EOF | tee -a $GITHUB_STEP_SUMMARY
+        # Unit test results
+        \`\`\`json
+        $(cat tests/unit/unit_results.json || echo "n/a")
+        \`\`\`
+        EOF
     secrets:
       HF_TOKEN: ${{ secrets.HF_TOKEN }}

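The backslash-escaped fences (`\`\`\``) in this heredoc are deliberate: because the `EOF` delimiter is unquoted, the body undergoes command substitution, which lets `$(cat tests/unit/unit_results.json || echo "n/a")` inline the results file but would also treat bare backticks as legacy command substitution, so they must be escaped to reach `$GITHUB_STEP_SUMMARY` literally.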

docs/testing.md

Lines changed: 65 additions & 0 deletions
@@ -32,6 +32,71 @@ CONTAINER=... bash tests/run_unit_in_docker.sh
 
 The required `CONTAINER` can be built by following the instructions in the [docker documentation](docker.md).
 
+### Tracking metrics in unit tests
+
+Unit tests may also log metrics to a fixture. The fixture is called `tracker` and has the following API:
+```python
+# Track an arbitrary metric (must be json serializable)
+tracker.track(metric_name, metric_value)
+# Log the maximum memory across the entire cluster. Okay for tests since they are run serially.
+tracker.log_max_mem(metric_name)
+# Returns the maximum memory. Useful if you are measuring changes in memory.
+tracker.get_max_mem()
+```
+
+Including the `tracker` fixture also tracks the elapsed time for the test implicitly.
+
+Here is an example test:
+```python
+def test_exponentiate(tracker):
+    starting_mem = tracker.get_max_mem()
+    base = 2
+    exponent = 4
+    result = base ** exponent
+    tracker.track("result", result)
+    tracker.log_max_mem("memory_after_exponentiating")
+    change_in_mem = tracker.get_max_mem() - starting_mem
+    tracker.track("change_in_mem", change_in_mem)
+    assert result == 16
+```
+
+Which would produce this file in `tests/unit/unit_results.json`:
+```json
+{
+  "exit_status": 0,
+  "git_commit": "f1062bd3fd95fc64443e2d9ee4a35fc654ba897e",
+  "start_time": "2025-03-24 23:34:12",
+  "metrics": {
+    "test_hf_ray_policy::test_hf_policy_generation": {
+      "avg_prob_mult_error": 1.0000039339065552,
+      "mean_lps": -1.5399343967437744,
+      "_elapsed": 17.323044061660767
+    }
+  },
+  "gpu_types": [
+    "NVIDIA H100 80GB HBM3"
+  ],
+  "coverage": 24.55897613282601
+}
+```
+
+:::{tip}
+Past unit test results are logged in `tests/unit/unit_results/`. These are helpful to view trends over time and commits.
+
+Here's an example `jq` command to view trends:
+
+```sh
+jq -r '[.start_time, .git_commit, .metrics["test_hf_ray_policy::test_hf_policy_generation"].avg_prob_mult_error] | @tsv' tests/unit/unit_results/*
+
+# Example output:
+#2025-03-24 23:35:39	778d288bb5d2edfd3eec4d07bb7dffffad5ef21b	1.0000039339065552
+#2025-03-24 23:36:37	778d288bb5d2edfd3eec4d07bb7dffffad5ef21b	1.0000039339065552
+#2025-03-24 23:37:37	778d288bb5d2edfd3eec4d07bb7dffffad5ef21b	1.0000039339065552
+#2025-03-24 23:38:14	778d288bb5d2edfd3eec4d07bb7dffffad5ef21b	1.0000039339065552
+#2025-03-24 23:38:50	778d288bb5d2edfd3eec4d07bb7dffffad5ef21b	1.0000039339065552
+```
+:::
+
 ## Functional tests
 
 :::{important}
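Returning to the `tracker` fixture documented above: as a rough illustration of how a fixture with this shape could be wired up, here is a hypothetical `conftest.py` sketch. The commit's real implementation lives in one of the eight changed files not shown in this excerpt, and every name below (`_Tracker`, `_RESULTS`) is assumed. It covers `track` plus the implicit `_elapsed` timing and omits the cluster-memory helpers, which would need Ray-side introspection:

```python
# Hypothetical sketch -- not the commit's actual fixture implementation.
import json
import time

import pytest

_RESULTS: dict = {}  # test nodeid -> metrics; assumed to be dumped into unit_results.json


class _Tracker:
    def __init__(self) -> None:
        self.metrics: dict = {}

    def track(self, name: str, value) -> None:
        json.dumps(value)  # fail fast if the value is not JSON-serializable
        self.metrics[name] = value


@pytest.fixture
def tracker(request):
    t = _Tracker()
    start = time.monotonic()
    yield t
    # Elapsed time is recorded implicitly, as the docs above describe.
    t.metrics["_elapsed"] = time.monotonic() - start
    _RESULTS[request.node.nodeid] = t.metrics
```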

nemo_reinforcer/utils/logger.py

Lines changed: 82 additions & 14 deletions
@@ -144,6 +144,11 @@ def log_hyperparams(self, params: Dict[str, Any]) -> None:
         self.run.config.update(params)
 
 
+class GpuMetricSnapshot(TypedDict):
+    step: int
+    metrics: Dict[str, Any]
+
+
 class RayGpuMonitorLogger:
     """Monitor GPU utilization across a Ray cluster and log metrics to a parent logger."""
 
@@ -163,7 +168,9 @@ def __init__(
         self.collection_interval = collection_interval
         self.flush_interval = flush_interval
         self.parent_logger = parent_logger
-        self.metrics_buffer = []  # Store metrics with timestamps
+        self.metrics_buffer: list[
+            GpuMetricSnapshot
+        ] = []  # Store metrics with timestamps
         self.last_flush_time = time.time()
         self.is_running = False
         self.collection_thread = None
@@ -228,7 +235,9 @@ def _collection_loop(self):
 
                 time.sleep(self.collection_interval)
             except Exception as e:
-                print(f"Error in GPU monitoring collection loop: {e}")
+                print(
+                    f"Error in GPU monitoring collection loop or stopped abruptly: {e}"
+                )
                 time.sleep(self.collection_interval)  # Continue despite errors
 
     def _parse_gpu_metric(self, sample: Sample, node_idx: int) -> Dict[str, Any]:
@@ -241,7 +250,6 @@ def _parse_gpu_metric(self, sample: Sample, node_idx: int) -> Dict[str, Any]:
         Returns:
             Dictionary with metric name and value
         """
-        # TODO: Consider plumbing {'GpuDeviceName': 'NVIDIA H100 80GB HBM3'}
         # Expected labels for GPU metrics
         expected_labels = ["GpuIndex"]
         for label in expected_labels:
@@ -266,12 +274,72 @@ def _parse_gpu_metric(self, sample: Sample, node_idx: int) -> Dict[str, Any]:
             metric_name = f"node.{node_idx}.gpu.{index}.{metric_name}"
         return {metric_name: value}
 
+    def _parse_gpu_sku(self, sample: Sample, node_idx: int) -> Dict[str, str]:
+        """Parse a GPU metric sample into a standardized format.
+
+        Args:
+            sample: Prometheus metric sample
+            node_idx: Index of the node
+
+        Returns:
+            Dictionary with metric name and value
+        """
+        # TODO: Consider plumbing {'GpuDeviceName': 'NVIDIA H100 80GB HBM3'}
+        # Expected labels for GPU metrics
+        expected_labels = ["GpuIndex", "GpuDeviceName"]
+        for label in expected_labels:
+            if label not in sample.labels:
+                # This is probably a CPU node
+                return {}
+
+        metric_name = sample.name
+        # Only return SKU if the metric is one of these which publish these metrics
+        if (
+            metric_name != "ray_node_gpus_utilization"
+            and metric_name != "ray_node_gram_used"
+        ):
+            # Skip unexpected metrics
+            return {}
+
+        labels = sample.labels
+        index = labels["GpuIndex"]
+        value = labels["GpuDeviceName"]
+
+        metric_name = f"node.{node_idx}.gpu.{index}.type"
+        return {metric_name: value}
+
+    def _collect_gpu_sku(self) -> Dict[str, str]:
+        """Collect GPU SKU from all Ray nodes.
+
+        Note: This is an internal API and users are not expected to call this.
+
+        Returns:
+            Dictionary of SKU types on all Ray nodes
+        """
+        # TODO: We can re-use the same path for metrics because even though both utilization and memory metrics duplicate
+        # the GPU metadata information; since the metadata is the same for each node, we can overwrite it and expect them to
+        # be the same
+        return self._collect(sku=True)
+
     def _collect_metrics(self) -> Dict[str, Any]:
         """Collect GPU metrics from all Ray nodes.
 
         Returns:
             Dictionary of collected metrics
         """
+        return self._collect(metrics=True)
+
+    def _collect(self, metrics: bool = False, sku: bool = False) -> Dict[str, Any]:
+        """Collect GPU metrics from all Ray nodes.
+
+        Returns:
+            Dictionary of collected metrics
+        """
+        assert metrics ^ sku, (
+            f"Must collect either metrics or sku, not both: {metrics=}, {sku=}"
+        )
+        parser_fn = self._parse_gpu_metric if metrics else self._parse_gpu_sku
+
         if not ray.is_initialized():
             print("Ray is not initialized. Cannot collect GPU metrics.")
             return {}
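To make the new `_parse_gpu_sku` path above concrete, here is a small self-contained sketch (not part of the commit) of how a Prometheus sample carrying Ray's GPU labels turns into the flattened `node.<node>.gpu.<index>.type` key; the exposition line is fabricated for illustration:

```python
# Illustrative only: feed one fabricated Ray-style metric line through the
# prometheus_client parser and flatten it the way _parse_gpu_sku does.
from prometheus_client.parser import text_string_to_metric_families

EXPOSITION = 'ray_node_gram_used{GpuIndex="0",GpuDeviceName="NVIDIA H100 80GB HBM3"} 512.0\n'

for family in text_string_to_metric_families(EXPOSITION):
    for sample in family.samples:
        # CPU nodes lack these labels, so they are skipped, as in the real parser.
        if {"GpuIndex", "GpuDeviceName"} <= sample.labels.keys():
            key = f"node.0.gpu.{sample.labels['GpuIndex']}.type"
            print({key: sample.labels["GpuDeviceName"]})
            # -> {'node.0.gpu.0.type': 'NVIDIA H100 80GB HBM3'}
```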
@@ -295,7 +363,9 @@ def _collect_metrics(self) -> Dict[str, Any]:
             # Process each node's metrics
             collected_metrics = {}
             for node_idx, metric_address in enumerate(unique_metric_addresses):
-                gpu_metrics = self._fetch_and_parse_metrics(node_idx, metric_address)
+                gpu_metrics = self._fetch_and_parse_metrics(
+                    node_idx, metric_address, parser_fn
+                )
                 collected_metrics.update(gpu_metrics)
 
             return collected_metrics
@@ -304,7 +374,7 @@ def _collect_metrics(self) -> Dict[str, Any]:
             print(f"Error collecting GPU metrics: {e}")
             return {}
 
-    def _fetch_and_parse_metrics(self, node_idx, metric_address):
+    def _fetch_and_parse_metrics(self, node_idx, metric_address, parser_fn):
         """Fetch metrics from a node and parse GPU metrics.
 
         Args:
@@ -335,7 +405,7 @@ def _fetch_and_parse_metrics(self, node_idx, metric_address):
                 continue
 
             for sample in family.samples:
-                metrics = self._parse_gpu_metric(sample, node_idx)
+                metrics = parser_fn(sample, node_idx)
                 gpu_metrics.update(metrics)
 
         return gpu_metrics
@@ -346,18 +416,16 @@ def _fetch_and_parse_metrics(self, node_idx, metric_address):
 
     def flush(self):
         """Flush collected metrics to the parent logger."""
-        if not self.parent_logger:
-            return
-
         with self.lock:
             if not self.metrics_buffer:
                 return
 
-            # Log each set of metrics with its original step
-            for entry in self.metrics_buffer:
-                step = entry["step"]
-                metrics = entry["metrics"]
-                self.parent_logger.log_metrics(metrics, step, prefix="ray")
+            if self.parent_logger:
+                # Log each set of metrics with its original step
+                for entry in self.metrics_buffer:
+                    step = entry["step"]
+                    metrics = entry["metrics"]
+                    self.parent_logger.log_metrics(metrics, step, prefix="ray")
 
             # Clear buffer after logging
             self.metrics_buffer = []
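Worth noting about this last hunk: moving the `parent_logger` check inside the lock means the buffer is now cleared even when no parent logger is attached, whereas the old early `return` let a logger-less monitor accumulate snapshots indefinitely.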

tests/run_unit.sh

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ export RAY_DEDUP_LOGS=0
 
 # Run unit tests
 echo "Running unit tests..."
-if ! pytest unit/ -s -rA "$@"; then
+if ! pytest unit/ --cov=nemo_reinforcer --cov-report=term --cov-report=json -s -rA "$@"; then
     echo "[ERROR]: Unit tests failed."
     exit 1
 fi
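Since the run script now passes `--cov-report=json`, pytest-cov writes a `coverage.json` report (coverage.py's JSON format) alongside the terminal summary; its overall percentage sits under `totals.percent_covered`, which plausibly is what feeds the `coverage` field of `unit_results.json` shown earlier. A quick way to read it:

```python
# Read the overall coverage percentage from pytest-cov's JSON report
# (written to coverage.json by default when --cov-report=json is used).
import json

with open("coverage.json") as f:
    report = json.load(f)

print(report["totals"]["percent_covered"])  # e.g. 24.55897613282601
```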
