test: add property-based tests for sequence probability pipeline #224
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the robustness and long-term correctness of the log probability extraction and sentence scoring pipeline by introducing comprehensive property-based tests. These new tests validate the public API contracts of the vLLM and OpenAI logprob parsers, as well as the SentenceProbabilityScorer, ensuring their behavior aligns with expectations across a wide range of inputs. The changes also include an update to the NumPy dependency to version 2.0.0, reflecting the project's actual runtime environment.
Code Review
This pull request introduces a comprehensive suite of property-based tests for the sequence probability pipeline, using the Hypothesis library. The tests are well-designed to document the public API contracts and uncover several critical bugs in the current implementation related to handling ragged arrays, empty sequences, and numerical underflow. The addition of these failing tests is an excellent practice for test-driven development and ensuring long-term correctness. The bump of the NumPy version to >=2.0.0 is also justified. My main feedback is a suggestion to refactor the OpenAI parser tests to reduce code duplication, which would improve maintainability.
/gemini review
Code Review
This pull request introduces a comprehensive suite of property-based tests for the sequence probability pipeline using Hypothesis. The tests are well-structured, thoroughly document the public API contracts of the components, and effectively capture several critical bugs related to handling ragged arrays, empty sequences, and numerical underflow. Bumping the NumPy version to 2.0.0 is a necessary change to reveal these issues. The new test files are excellent additions that will significantly improve the long-term correctness and robustness of the library. I have a few minor suggestions to simplify some of the Hypothesis strategies for better readability and maintainability.
/gemini review
Code Review
This pull request introduces a robust suite of property-based tests using Hypothesis for the sequence probability pipeline components. This is an excellent approach to ensure the long-term correctness and numerical stability of the library, especially given the identified issues with ragged arrays and numerical underflow. The tests are well-structured, cover critical contract properties, and effectively target the known defects, which will be addressed in PR #223. The dependency bump for NumPy to 2.0.0 is also a necessary and appropriate change.
Tests covering long-term correctness contracts for:

- `vllm_sampled_tokens_logprobs` (test_vllm_sampled_logprobs.py)
- OpenAI Chat Completion & Responses API parsers (test_openai_sampled_logprobs.py)
- `SentenceProbabilityScorer` (test_sentence_proba_properties.py)

Uses Hypothesis for property-based testing to verify value ranges, cardinality invariants, monotonicity, determinism, and robustness against edge cases (empty input, ragged sequences, underflow).
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Note to self: raise a Warning when the sequence probability is smaller than the minimal representable float, to signal that a returned 0.0 is expected here.
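One way the note above could be implemented — a hedged sketch only: the helper name `sentence_probability` and the exact threshold are assumptions for illustration, not the library's API.

```python
import warnings

import numpy as np

# Smallest positive normal float64 (~2.2e-308); below this, precision is lost
# or the value has underflowed to 0.0 entirely.
TINY = np.finfo(np.float64).tiny

def sentence_probability(logprobs: np.ndarray) -> float:
    """Joint probability exp(sum(logprobs)), warning when float64 underflows.

    Hypothetical helper illustrating the note above, not the scorer's real code.
    """
    total = float(np.sum(logprobs))
    proba = float(np.exp(total))
    if proba < TINY and total > -np.inf:
        warnings.warn(
            f"Joint probability underflowed float64 (sum of logprobs = {total:.1f}); "
            "a score of 0.0 is expected here — consider log-space arithmetic.",
            RuntimeWarning,
        )
    return proba
```

With this, a 1500-token sequence at −0.5 logprob per token still returns 0.0, but the caller gets an explicit `RuntimeWarning` instead of a silent zero.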
Summary
Property-based tests (Hypothesis) for the three components introduced in PR #223. These tests document the public API contracts and target long-term library correctness, not just the bugs found during review.
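The style of contract being tested can be sketched as follows. This is illustrative only: `score` is a stand-in for the scorer's per-sequence computation (exp of the summed logprobs), not the actual implementation or test files.

```python
import numpy as np
from hypothesis import given, strategies as st

# Stand-in for the per-sequence scorer: joint probability = exp(sum(logprobs)).
def score(logprobs):
    return float(np.exp(np.sum(logprobs)))

# Finite, non-positive token logprobs, at least one token.
logprob_lists = st.lists(
    st.floats(min_value=-10.0, max_value=0.0, allow_nan=False), min_size=1
)

@given(logprob_lists)
def test_score_is_a_probability(logprobs):
    # Value-range contract: a joint probability lies in [0, 1].
    assert 0.0 <= score(logprobs) <= 1.0

@given(logprob_lists)
def test_score_is_deterministic(logprobs):
    # Determinism contract: same input, same output.
    assert score(logprobs) == score(logprobs)

@given(logprob_lists, st.floats(min_value=-10.0, max_value=0.0, allow_nan=False))
def test_appending_a_token_never_increases_score(logprobs, extra):
    # Monotonicity contract (small tolerance for float summation order).
    assert score(logprobs + [extra]) <= score(logprobs) + 1e-9
```

Hypothesis generates hundreds of such inputs per property, which is how the ragged-array and underflow edge cases below were surfaced.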
Also bumps the NumPy lower bound from `>=1.24.0` to `>=2.0.0` to match the version the project actually runs on.

New test files

- `tests/preprocessing/test_vllm_sampled_logprobs.py` — 12 tests for `vllm_sampled_tokens_logprobs`
- `tests/preprocessing/test_openai_sampled_logprobs.py` — 20 tests for `sampled_tokens_logprobs_chat_completion_api` and `sampled_tokens_logprobs_responses_api`, using realistic API response shapes that mirror the actual OpenAI wire format
- `tests/scoring/probability_methods/test_sentence_proba_properties.py` — 14 tests for `SentenceProbabilityScorer`

Failing tests — why the current implementation does not fulfil its intended contract
test_variable_length_sequences_do_not_crash (`vllm_parser.py:79`)

`vllm_sampled_tokens_logprobs` is designed to extract the sampled-token logprob for each completion when the model is asked to produce multiple sequences (iterations > 1). In practice each completion independently reaches its stop token at a different position, so variable-length outputs are the normal case. The function collects per-sequence lists into `sampled_logprobs` and wraps the whole thing with `np.array(sampled_logprobs)`, hoping to return a 2D array. NumPy 2 refuses to construct a 2D array from a ragged list and raises `ValueError`. The feature is entirely unusable whenever sequences differ in length.

test_chat_completion_ragged_batch_does_not_crash / test_responses_api_ragged_batch_does_not_crash (`openai_parser.py:209`, `openai_parser.py:240`)

Both OpenAI parsers document a return type of "A 2D Numpy array of shape (n_sequences, n_tokens)". When more than one completion is requested (OpenAI `n > 1`), each choice can have a different number of tokens. Both functions collect per-choice lists into `all_sampled` and pass the ragged structure to `np.array()`. Same NumPy 2 crash — the parsers fail exactly when multiple completions are requested with free generation length.

test_empty_sequence_does_not_silently_return_one / test_multiple_empty_sequences_raise (`sentence_proba.py:23-24`)

`SentenceProbabilityScorer.compute` is supposed to measure model confidence in a generated response as the joint probability of all tokens (`exp(sum(logprobs))`). When the logprob extraction pipeline returns no data for a sequence, `inputs` contains an empty array. `np.sum([]) = 0.0`, so `np.exp(0.0) = 1.0`: the scorer silently reports the maximum possible confidence for a sequence where it received no data at all. In an uncertainty library this is the worst possible silent failure — data absence is indistinguishable from absolute certainty.

test_very_long_sequence_underflow_is_surfaced (`sentence_proba.py:23-24`)

The scorer is supposed to rank sequences by confidence, giving lower scores to long or low-probability responses. For a sufficiently long sequence (e.g. 1500 tokens at −0.5 logprob each, sum = −750), `np.exp(-750)` underflows to `0.0` in float64. The score is mathematically wrong — the true probability is tiny but non-zero — and is indistinguishable from a sequence with −∞ total logprob. The scorer returns `0.0` silently, giving callers no signal that the score is invalid or that log-space arithmetic is needed.

Test plan
```shell
uv run --extra test python -m pytest \
  tests/preprocessing/test_vllm_sampled_logprobs.py \
  tests/preprocessing/test_openai_sampled_logprobs.py \
  tests/scoring/probability_methods/test_sentence_proba_properties.py \
  -v
```

37 tests pass (stable contracts confirmed). 6 tests fail (known defects documented above). All 6 should pass once PR #223 addresses them.
🤖 Generated with Claude Code