Commit 53a159e

gchlebus and prokotg authored
feat(interceptors): add reasoning ratio stats (#618)
- Introduced two new statistics, `reasoning_unfinished_count` and `reasoning_finished_ratio`, to track responses where reasoning started but did not complete and the ratio of finished reasoning to all responses with reasoning.
- Updated the logic in `ResponseReasoningInterceptor` to increment this count appropriately.
- Added unit tests to validate the correct tracking of reasoning states, ensuring the mathematical invariant between started and finished counts is maintained.
- Updated documentation to reflect the new statistic and its significance in evaluating reasoning performance.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added two reasoning metrics: reasoning_unfinished_count (counts started-but-incomplete reasoning) and reasoning_finished_ratio (fraction of completed reasoning).
* **Documentation**
  * Updated evaluation, interceptor, and tutorial docs to include the new metrics in examples, metric tables, and artifact descriptions.
* **Tests**
  * Added parameterized tests covering finished, unfinished, not-started, explicit-content, and edge-case reasoning scenarios to validate counts and ratio.

<sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
Signed-off-by: Tomasz Grzegorzek <tgrzegorzek@nvidia.com>
Co-authored-by: Tomasz Grzegorzek <tgrzegorzek@nvidia.com>
1 parent 94f7b5c commit 53a159e

5 files changed: +183 -5 lines changed

docs/evaluation/run-evals/reasoning.md

Lines changed: 3 additions & 1 deletion
@@ -227,7 +227,9 @@ When the reasoning interceptor is enabled, this file contains a `reasoning` key
     "total_responses": 3672,
     "responses_with_reasoning": 2860,
     "reasoning_finished_count": 2860,
+    "reasoning_finished_ratio": 1.0,
     "reasoning_started_count": 2860,
+    "reasoning_unfinished_count": 0,
     "avg_reasoning_words": 153.21,
     "avg_original_content_words": 192.17,
     "avg_updated_content_words": 38.52,
@@ -248,7 +250,7 @@ When the reasoning interceptor is enabled, this file contains a `reasoning` key

 In the example above, the model used reasoning for 2860 out of 3672 responses (approximately 78%).

-The matching values for `reasoning_started_count` and `reasoning_finished_count` indicate that the `max_new_tokens` parameter was set sufficiently high, allowing the model to complete all reasoning traces without truncation.
+The matching values for `reasoning_started_count` and `reasoning_finished_count` (and `reasoning_unfinished_count` being 0) indicate that the `max_new_tokens` parameter was set sufficiently high, allowing the model to complete all reasoning traces without truncation.

 These statistics also enable cost analysis for reasoning operations.
 While the endpoint in this example does not return reasoning token usage statistics (the `*_tokens` fields are null or zero), you can still analyze computational cost using the word count metrics from the responses.
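
As a rough illustration of how the statistics in this documentation hunk might be read, the sketch below takes the `reasoning` block from the example (values copied from the diff) and derives the share of responses that used reasoning, a truncation flag, and a word-based cost proxy. The helper name `summarize_reasoning` and the choice to divide `avg_reasoning_words` by `avg_updated_content_words` are assumptions made for illustration; neither is part of the library.

```python
import json

# Values copied from the documentation example in this diff. In practice the
# dict would come from the `reasoning` key of eval_factory_metrics.json.
SAMPLE = json.loads(
    """
    {
      "reasoning": {
        "total_responses": 3672,
        "responses_with_reasoning": 2860,
        "reasoning_finished_count": 2860,
        "reasoning_finished_ratio": 1.0,
        "reasoning_started_count": 2860,
        "reasoning_unfinished_count": 0,
        "avg_reasoning_words": 153.21,
        "avg_original_content_words": 192.17,
        "avg_updated_content_words": 38.52
      }
    }
    """
)


def summarize_reasoning(stats: dict) -> dict:
    """Hypothetical helper: derive a few headline numbers from the stats."""
    total = stats["total_responses"]
    with_reasoning = stats["responses_with_reasoning"]
    # Averages may be None before any response has been processed.
    reasoning_words = stats.get("avg_reasoning_words") or 0.0
    answer_words = stats.get("avg_updated_content_words") or 0.0
    return {
        # Fraction of all responses that contained reasoning.
        "reasoning_share": with_reasoning / total if total else 0.0,
        # Any unfinished trace suggests the token budget cut reasoning short.
        "truncation_detected": stats.get("reasoning_unfinished_count", 0) > 0,
        # Assumed cost proxy: reasoning words spent per word of final answer.
        "reasoning_words_per_answer_word": (
            reasoning_words / answer_words if answer_words else None
        ),
    }


summary = summarize_reasoning(SAMPLE["reasoning"])
print(summary)
```

If `truncation_detected` comes back true, the documentation above points to `max_new_tokens` as the setting to revisit.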

docs/libraries/nemo-evaluator/interceptors/reasoning.md

Lines changed: 2 additions & 0 deletions
@@ -75,7 +75,9 @@ The interceptor automatically tracks the following statistics:
 | `total_responses` | Total number of responses processed |
 | `responses_with_reasoning` | Number of responses containing reasoning content |
 | `reasoning_finished_count` | Number of responses where reasoning completed (end token found) |
+| `reasoning_finished_ratio` | Percentage (expressed as ratio within 0-1) of responses where reasoning completed to all responses with reasoning |
 | `reasoning_started_count` | Number of responses where reasoning started |
+| `reasoning_unfinished_count` | Number of responses where reasoning started but did not complete (end token not found) |
 | `avg_reasoning_words` | Average word count in reasoning content |
 | `avg_reasoning_tokens` | Average token count in reasoning content |
 | `avg_original_content_words` | Average word count in original content (before processing) |
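
The relationships between the counters in this table can be checked mechanically. The sketch below is a hypothetical consistency checker (the function name and messages are not from the library); the relations it tests are the ones stated in this commit: unfinished equals started minus finished, and the ratio is finished divided by responses with reasoning, documented as lying within 0-1.

```python
from typing import List


def check_reasoning_counters(stats: dict) -> List[str]:
    """Return a list of violated relationships (empty if all hold)."""
    problems = []
    started = stats["reasoning_started_count"]
    finished = stats["reasoning_finished_count"]
    unfinished = stats["reasoning_unfinished_count"]
    with_reasoning = stats["responses_with_reasoning"]
    ratio = stats["reasoning_finished_ratio"]

    # Invariant stated in the unit test docstring of this commit.
    if unfinished != started - finished:
        problems.append("unfinished != started - finished")
    # Range stated in the metric table.
    if not 0 <= ratio <= 1:
        problems.append("ratio outside [0, 1]")
    # Definition used by the interceptor's ratio update.
    if with_reasoning and abs(ratio - finished / with_reasoning) > 1e-9:
        problems.append("ratio != finished / responses_with_reasoning")
    return problems


consistent = {
    "responses_with_reasoning": 2860,
    "reasoning_started_count": 2860,
    "reasoning_finished_count": 2860,
    "reasoning_unfinished_count": 0,
    "reasoning_finished_ratio": 1.0,
}
# Deliberately inconsistent copy: claims one unfinished trace.
inconsistent = dict(consistent, reasoning_unfinished_count=1)

print(check_reasoning_counters(consistent))
print(check_reasoning_counters(inconsistent))
```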

docs/tutorials/how-to/reasoning.md

Lines changed: 2 additions & 0 deletions
@@ -226,6 +226,8 @@ After evaluation completes, check these key artifacts:
 - **`eval_factory_metrics.json`**: Contains reasoning statistics under the `reasoning` key, including:
   - `responses_with_reasoning`: How many responses included reasoning traces
   - `reasoning_finished_count` vs `reasoning_started_count`: If these match, your `max_new_tokens` was sufficient
+  - `reasoning_unfinished_count`: Number of responses where reasoning started but was truncated (didn't reach end token)
+  - `reasoning_finished_ratio`: Percentage (expressed as ratio within 0-1) of responses where reasoning completed to all responses with reasoning
   - `avg_reasoning_words` and other word- and tokens count metrics: Use these for cost analysis

 :::{tip}

packages/nemo-evaluator/src/nemo_evaluator/adapters/interceptors/reasoning_interceptor.py

Lines changed: 19 additions & 4 deletions
@@ -127,6 +127,8 @@ def __init__(self, params: Params):
             "responses_with_reasoning": 0,
             "reasoning_finished_count": 0,
             "reasoning_started_count": 0,
+            "reasoning_unfinished_count": 0,
+            "reasoning_finished_ratio": 0,
             "avg_reasoning_words": None,
             "avg_original_content_words": None,
             "avg_updated_content_words": None,
@@ -281,12 +283,18 @@ def _update_reasoning_stats(self, reasoning_info: dict) -> None:
         )

         # Increment counters
-        if reasoning_words > 0:
+        if (
+            reasoning_words == "unknown"
+            and reasoning_info.get("reasoning_started") is True
+        ) or (isinstance(reasoning_words, int) and reasoning_words > 0):
+            # if reasoning started but not finished, or finished and we have non-zero reasoning words
             self._reasoning_stats["responses_with_reasoning"] += 1
-        if reasoning_info.get("reasoning_started"):
+        if reasoning_info.get("reasoning_started") is True:
             self._reasoning_stats["reasoning_started_count"] += 1
-        if reasoning_info.get("reasoning_finished"):
-            self._reasoning_stats["reasoning_finished_count"] += 1
+            if reasoning_info.get("reasoning_finished"):
+                self._reasoning_stats["reasoning_finished_count"] += 1
+            else:
+                self._reasoning_stats["reasoning_unfinished_count"] += 1

         # Update running averages
         for stat_key, value in [
for stat_key, value in [
@@ -340,6 +348,13 @@ def _update_reasoning_stats(self, reasoning_info: dict) -> None:
                     updated_content_tokens
                 )

+        # Update ratio
+        if self._reasoning_stats["responses_with_reasoning"]:
+            self._reasoning_stats["reasoning_finished_ratio"] = (
+                self._reasoning_stats["reasoning_finished_count"]
+                / self._reasoning_stats["responses_with_reasoning"]
+            )
+
         # Log aggregated stats at specified interval
         if (
             self._reasoning_stats["total_responses"]
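
To see the counter logic from the two hunks above in isolation, here is a minimal standalone sketch that mirrors the branch structure shown in the diff. It is not the interceptor itself: `new_stats` and `update_counters` are hypothetical names, the stats dict holds only the counters touched here, and running averages and logging are omitted.

```python
def new_stats() -> dict:
    """Hypothetical subset of the interceptor's stats dict."""
    return {
        "responses_with_reasoning": 0,
        "reasoning_started_count": 0,
        "reasoning_finished_count": 0,
        "reasoning_unfinished_count": 0,
        "reasoning_finished_ratio": 0,
    }


def update_counters(stats: dict, info: dict) -> None:
    """Mirror of the counter and ratio updates shown in the diff."""
    words = info.get("reasoning_words")
    # Only a literal True counts as started; the string "unknown" does not.
    started = info.get("reasoning_started") is True

    if (words == "unknown" and started) or (isinstance(words, int) and words > 0):
        stats["responses_with_reasoning"] += 1

    if started:
        stats["reasoning_started_count"] += 1
        # Finished/unfinished are only counted for responses that started,
        # which keeps unfinished == started - finished.
        if info.get("reasoning_finished"):
            stats["reasoning_finished_count"] += 1
        else:
            stats["reasoning_unfinished_count"] += 1

    # Guard against division by zero before any reasoning has been seen.
    if stats["responses_with_reasoning"]:
        stats["reasoning_finished_ratio"] = (
            stats["reasoning_finished_count"] / stats["responses_with_reasoning"]
        )


# The five (started, finished) combinations exercised by the unit tests.
cases = [(True, True), (True, False), (False, False), (False, True), ("unknown", False)]
stats = new_stats()
for started, finished in cases:
    update_counters(
        stats,
        {
            "reasoning_words": 10 if started else 0,
            "reasoning_started": started,
            "reasoning_finished": finished,
        },
    )
print(stats)
```

Nesting the finished check under the started check is what prevents the "finished flag true but never started" edge case from inflating `reasoning_finished_count`.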

packages/nemo-evaluator/tests/unit_tests/adapters/interceptors/test_reasoning.py

Lines changed: 157 additions & 0 deletions
@@ -449,6 +449,115 @@ def test_get_reasoning_info_explicit_content(
     assert reasoning_info["reasoning_started"]


+@pytest.mark.parametrize(
+    "test_name,reasoning_started,reasoning_finished,expected_started_count,expected_finished_count,expected_unfinished_count",
+    [
+        (
+            "reasoning_started_and_finished",
+            True,
+            True,
+            1,  # Started
+            1,  # Finished
+            0,  # Reasoning completed, not unfinished
+        ),
+        (
+            "reasoning_started_not_finished",
+            True,
+            False,
+            1,  # Started
+            0,  # Not finished
+            1,  # Reasoning started but truncated
+        ),
+        (
+            "reasoning_not_started",
+            False,
+            False,
+            0,  # Not started
+            0,  # Not finished
+            0,  # Reasoning never started
+        ),
+        (
+            "reasoning_not_started_but_finished_flag_true",
+            # Edge case: reasoning_content is empty but content is non-empty
+            # This can happen when reasoning_content="" and content="Final answer"
+            # In this case, reasoning_finished=True but reasoning_started=False
+            # We should NOT count this as finished since it never started
+            False,
+            True,
+            0,  # Not started
+            0,  # Should NOT be counted as finished since it never started
+            0,  # Not unfinished either since it never started
+        ),
+        (
+            "reasoning_started_unknown",
+            # Edge case: start_reasoning_token is None and no end token found
+            # In this case, reasoning_started="unknown" (truthy string)
+            # We should NOT count this as started since we don't know
+            "unknown",
+            False,
+            0,  # Unknown should NOT be counted as started
+            0,  # Not finished
+            0,  # Not unfinished since we don't know if it started
+        ),
+    ],
+)
+def test_reasoning_unfinished_count(
+    test_name,
+    reasoning_started,
+    reasoning_finished,
+    expected_started_count,
+    expected_finished_count,
+    expected_unfinished_count,
+):
+    """Test that reasoning_unfinished_count is correctly tracked.
+
+    Maintains the mathematical invariant:
+    unfinished_count = started_count - finished_count
+    """
+    interceptor = ResponseReasoningInterceptor(
+        params=ResponseReasoningInterceptor.Params(
+            add_reasoning=True,
+            enable_reasoning_tracking=True,
+            enable_caching=False,
+        )
+    )
+
+    # Simulate reasoning info from _process_reasoning_message
+    reasoning_info = {
+        "reasoning_words": 10 if reasoning_started else 0,
+        "original_content_words": 15 if reasoning_started else 5,
+        "updated_content_words": 5,
+        "reasoning_finished": reasoning_finished,
+        "reasoning_started": reasoning_started,
+        "reasoning_tokens": "unknown",
+        "updated_content_tokens": "unknown",
+    }
+
+    # Update stats with the reasoning info
+    interceptor._update_reasoning_stats(reasoning_info)
+
+    # Verify the counts
+    assert (
+        interceptor._reasoning_stats["reasoning_started_count"]
+        == expected_started_count
+    )
+    assert (
+        interceptor._reasoning_stats["reasoning_finished_count"]
+        == expected_finished_count
+    )
+    assert (
+        interceptor._reasoning_stats["reasoning_unfinished_count"]
+        == expected_unfinished_count
+    )
+
+    # Verify the mathematical invariant: unfinished = started - finished
+    assert (
+        interceptor._reasoning_stats["reasoning_unfinished_count"]
+        == interceptor._reasoning_stats["reasoning_started_count"]
+        - interceptor._reasoning_stats["reasoning_finished_count"]
+    )
+
+
 @pytest.mark.parametrize(
     "test_name,message_content,expected_reasoning_words,expected_original_content_words,expected_reasoning_finished",
     [
454563
[
@@ -499,6 +608,54 @@ def test_get_reasoning_info_embedded_content(
     assert reasoning_info["reasoning_started"] == "unknown"


+def test_reasoning_ratio():
+    """Test _process_reasoning_message when reasoning content is embedded in the message content."""
+    interceptor = ResponseReasoningInterceptor(
+        params=ResponseReasoningInterceptor.Params(
+            add_reasoning=True,
+            enable_reasoning_tracking=True,
+            end_reasoning_token="</think>",
+            start_reasoning_token="<think>",
+            enable_caching=False,
+        )
+    )
+    # Create message with embedded reasoning content
+    messages = []
+    n_finished_reasoning = 7
+    n_unfinished_reasoning = 3
+    n_no_reasoning = 20
+
+    messages.extend(
+        [
+            {
+                "role": "assistant",
+                "content": "<think> thinking trace </think> rest of the message",
+            }
+            for _ in range(n_finished_reasoning)
+        ]
+    )
+    messages.extend(
+        [
+            {"role": "assistant", "content": "<think> thinking trace unfinished"}
+            for _ in range(n_unfinished_reasoning)
+        ]
+    )
+    messages.extend(
+        [
+            {"role": "assistant", "content": "no thinking trace"}
+            for _ in range(n_no_reasoning)
+        ]
+    )
+
+    reasoning_info = None
+
+    # Test the _process_reasoning_message method directly
+    for message in messages:
+        _, reasoning_info = interceptor._process_reasoning_message(message)
+        interceptor._update_reasoning_stats(reasoning_info)
+    assert interceptor._reasoning_stats["reasoning_finished_ratio"] == 0.7
+
+
 @pytest.mark.parametrize(
     "test_name,include_if_not_finished,message_content,expected_content,expected_reasoning_words,expected_original_content_words",
     [
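
The expected value of 0.7 in `test_reasoning_ratio` follows from dividing by responses with reasoning rather than by all responses. The sketch below reproduces that arithmetic with a simplified, hypothetical token check (`classify`); the real `_process_reasoning_message` is not shown in this diff and may classify messages differently.

```python
START, END = "<think>", "</think>"


def classify(content: str) -> tuple:
    """Hypothetical stand-in: detect start/end tokens by substring search."""
    started = START in content
    finished = started and END in content
    return started, finished


# Same message mix as the unit test: 7 finished, 3 unfinished, 20 without reasoning.
contents = (
    ["<think> thinking trace </think> rest of the message"] * 7
    + ["<think> thinking trace unfinished"] * 3
    + ["no thinking trace"] * 20
)

started_count = 0
finished_count = 0
for content in contents:
    started, finished = classify(content)
    if started:
        started_count += 1
        if finished:
            finished_count += 1

unfinished_count = started_count - finished_count
# In this simplified setting every started response counts as "with reasoning".
ratio = finished_count / started_count
ratio_over_all = finished_count / len(contents)

print(ratio)  # 0.7
print(ratio_over_all)
```

The 20 messages without a reasoning trace do not enter the denominator, which is why the ratio is 7/10 and not 7/30.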
