
test: add property-based tests for sequence probability pipeline #224

Draft
chicham wants to merge 4 commits into main from pr-223-property-tests

Conversation


@chicham chicham commented Feb 27, 2026

Summary

Property-based tests (Hypothesis) for the three components introduced in PR #223. These tests document the public API contracts and target long-term library correctness, not just the bugs found during review.

Also bumps the NumPy lower bound from >=1.24.0 to >=2.0.0 to match the version the project actually runs on.

New test files

  • tests/preprocessing/test_vllm_sampled_logprobs.py — 12 tests for vllm_sampled_tokens_logprobs
  • tests/preprocessing/test_openai_sampled_logprobs.py — 20 tests for sampled_tokens_logprobs_chat_completion_api and sampled_tokens_logprobs_responses_api, using realistic API response shapes that mirror the actual OpenAI wire format
  • tests/scoring/probability_methods/test_sentence_proba_properties.py — 14 tests for SentenceProbabilityScorer

Failing tests — why the current implementation does not fulfil its intended contract

test_variable_length_sequences_do_not_crash (vllm_parser.py:79)

vllm_sampled_tokens_logprobs is designed to extract the sampled-token logprob for each completion when the model is asked to produce multiple sequences (iterations > 1). In practice each completion independently reaches its stop token at a different position, so variable-length outputs are the normal case. The function collects per-sequence lists into sampled_logprobs and wraps the whole thing with np.array(sampled_logprobs), hoping to return a 2D array. NumPy 2 refuses to construct a 2D array from a ragged list and raises ValueError. The feature is entirely unusable whenever sequences differ in length.

test_chat_completion_ragged_batch_does_not_crash / test_responses_api_ragged_batch_does_not_crash (openai_parser.py:209, openai_parser.py:240)

Both OpenAI parsers document a return type of "A 2D Numpy array of shape (n_sequences, n_tokens)". When more than one completion is requested (OpenAI n > 1), each choice can have a different number of tokens. Both functions collect per-choice lists into all_sampled and pass the ragged structure to np.array(). Same NumPy 2 crash — the parsers fail exactly when multiple completions are requested with free generation length.
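One way the parsers could honour the documented (n_sequences, n_tokens) shape despite ragged choices is to pad short rows to the longest one. This is a hypothetical fix sketch, not the parsers' actual code; the fill value would need to be agreed with downstream scorers:

```python
import numpy as np

def pad_to_rectangular(rows, fill=np.nan):
    """Pad variable-length rows with `fill` so np.array gets rectangular input.

    Hypothetical helper: `fill` marks positions past a sequence's real end.
    """
    width = max((len(r) for r in rows), default=0)
    return np.array([list(r) + [fill] * (width - len(r)) for r in rows])

padded = pad_to_rectangular([[-0.1, -0.5, -0.2], [-0.3, -0.7]])
print(padded.shape)  # (2, 3); the short row ends in NaN
```

Downstream code then has to ignore the padding (e.g. with np.nansum), so a ragged return type or explicit lengths array may be the cleaner contract.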

test_empty_sequence_does_not_silently_return_one / test_multiple_empty_sequences_raise (sentence_proba.py:23-24)

SentenceProbabilityScorer.compute is supposed to measure model confidence in a generated response as the joint probability of all tokens (exp(sum(logprobs))). When the logprob extraction pipeline returns no data for a sequence, inputs contains an empty array. np.sum([]) = 0.0, so np.exp(0.0) = 1.0: the scorer silently reports the maximum possible confidence for a sequence where it received no data at all. In an uncertainty library this is the worst possible silent failure — data absence is indistinguishable from absolute certainty.
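The silent failure, and one possible guard, can be sketched as follows (`sentence_probability` is a hypothetical stand-in for the scorer's internal computation, not its real API):

```python
import numpy as np

empty = np.array([])          # pipeline extracted no logprobs
print(np.exp(np.sum(empty)))  # 1.0 -- "absolute certainty" from no data

def sentence_probability(logprobs):
    """Hypothetical guarded version: refuse to score an empty sequence."""
    if logprobs.size == 0:
        raise ValueError("empty logprob sequence: probability is undefined")
    return float(np.exp(np.sum(logprobs)))
```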

test_very_long_sequence_underflow_is_surfaced (sentence_proba.py:23-24)

The scorer is supposed to rank sequences by confidence, giving lower scores to long or low-probability responses. For a sufficiently long sequence (e.g. 1500 tokens at −0.5 logprob each, sum = −750) np.exp(−750) underflows to 0.0 in float64. The score is mathematically wrong — the true probability is tiny but non-zero — and is indistinguishable from a sequence with −∞ total logprob. The scorer returns 0.0 silently, giving callers no signal that the score is invalid or that log-space arithmetic is needed.
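The underflow is straightforward to demonstrate, and keeping the comparison in log space (a standard remedy, not what the scorer currently does) preserves the ranking:

```python
import numpy as np

logprobs = np.full(1500, -0.5)  # 1500 tokens at -0.5 logprob each
total = logprobs.sum()          # -750.0
prob = np.exp(total)
print(prob)                     # 0.0 -- underflows float64

other = -900.0                  # an even less probable sequence
print(np.exp(other) == prob)    # True: indistinguishable after exp
print(total > other)            # True: still distinguishable in log space
```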

Test plan

uv run --extra test python -m pytest \
  tests/preprocessing/test_vllm_sampled_logprobs.py \
  tests/preprocessing/test_openai_sampled_logprobs.py \
  tests/scoring/probability_methods/test_sentence_proba_properties.py \
  -v

37 tests pass (stable contracts confirmed). 6 tests fail (known defects documented above). All 6 should pass once PR #223 addresses them.

🤖 Generated with Claude Code

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and long-term correctness of the log probability extraction and sentence scoring pipeline by introducing comprehensive property-based tests. These new tests validate the public API contracts of the vLLM and OpenAI logprob parsers, as well as the SentenceProbabilityScorer, ensuring their behavior aligns with expectations across a wide range of inputs. The changes also include an update to the NumPy dependency to version 2.0.0, reflecting the project's actual runtime environment.

Highlights

  • New Property-Based Tests: Introduced comprehensive property-based tests (using Hypothesis) for the vllm_sampled_tokens_logprobs, sampled_tokens_logprobs_chat_completion_api, sampled_tokens_logprobs_responses_api functions, and the SentenceProbabilityScorer class. These tests aim to verify public API contracts and ensure long-term library correctness.
  • Identified and Documented Existing Bugs: The new tests documented several existing defects and contract violations in the logprob extraction and scoring pipeline. These include issues with ragged arrays in NumPy 2, silent failures with empty sequences, and unhandled numerical underflow for very long or low-confidence sequences.
  • Dependency Update: Updated the minimum required NumPy version from >=1.24.0 to >=2.0.0 to align with the project's actual runtime environment.


Changelog
  • pyproject.toml
    • Updated NumPy lower bound to >=2.0.0.
  • tests/preprocessing/test_openai_sampled_logprobs.py
    • Added new property-based tests for sampled_tokens_logprobs_chat_completion_api and sampled_tokens_logprobs_responses_api verifying sampled logprob identity, value fidelity, value range, cardinality, length fidelity, variable-length batches, empty responses, and cross-format consistency.
  • tests/preprocessing/test_vllm_sampled_logprobs.py
    • Added new property-based tests for vllm_sampled_tokens_logprobs covering sampled logprob identity, value range, cardinality, length fidelity, variable-length sequences, empty data handling, and determinism.
  • tests/scoring/probability_methods/test_sentence_proba_properties.py
    • Added new property-based tests for SentenceProbabilityScorer covering value range, consistency between compute and compute_token_scores, cardinality, monotonicity, handling of list-of-arrays input, empty sequences, numerical underflow, and determinism.
Activity
  • Property-based tests were added for three key components of the sequence probability pipeline.
  • 37 tests currently pass, confirming stable contracts.
  • 6 tests currently fail, documenting known defects in the existing implementation that are expected to be addressed by PR #223 (Sequence prob computation).


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive suite of property-based tests for the sequence probability pipeline, using the Hypothesis library. The tests are well-designed to document the public API contracts and uncover several critical bugs in the current implementation related to handling ragged arrays, empty sequences, and numerical underflow. The addition of these failing tests is an excellent practice for test-driven development and ensuring long-term correctness. The bump of the NumPy version to >=2.0.0 is also justified. My main feedback is a suggestion to refactor the OpenAI parser tests to reduce code duplication, which would improve maintainability.


chicham commented Feb 27, 2026

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive suite of property-based tests for the sequence probability pipeline using Hypothesis. The tests are well-structured, thoroughly document the public API contracts of the components, and effectively capture several critical bugs related to handling ragged arrays, empty sequences, and numerical underflow. Bumping the NumPy version to 2.0.0 is a necessary change to reveal these issues. The new test files are excellent additions that will significantly improve the long-term correctness and robustness of the library. I have a few minor suggestions to simplify some of the Hypothesis strategies for better readability and maintainability.


chicham commented Feb 27, 2026

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust suite of property-based tests using Hypothesis for the sequence probability pipeline components. This is an excellent approach to ensure the long-term correctness and numerical stability of the library, especially given the identified issues with ragged arrays and numerical underflow. The tests are well-structured, cover critical contract properties, and effectively target the known defects, which will be addressed in PR #223. The dependency bump for NumPy to 2.0.0 is also a necessary and appropriate change.

@CharlesMoslonka CharlesMoslonka marked this pull request as draft March 5, 2026 17:13
@chicham chicham marked this pull request as ready for review March 6, 2026 16:57
Base automatically changed from sequence_prob_computation to main March 9, 2026 09:59
@CharlesMoslonka CharlesMoslonka force-pushed the pr-223-property-tests branch from 3b519a4 to 4bc579d March 9, 2026 10:58
chicham and others added 4 commits March 9, 2026 13:05
Tests covering long-term correctness contracts for:
- vllm_sampled_tokens_logprobs (test_vllm_sampled_logprobs.py)
- OpenAI Chat Completion & Responses API parsers (test_openai_sampled_logprobs.py)
- SentenceProbabilityScorer (test_sentence_proba_properties.py)

Uses Hypothesis for property-based testing to verify value ranges,
cardinality invariants, monotonicity, determinism, and robustness
against edge cases (empty input, ragged sequences, underflow).
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@CharlesMoslonka CharlesMoslonka marked this pull request as draft March 9, 2026 12:34
@CharlesMoslonka
Collaborator

Note to self: raise a Warning if the sequence probability is smaller than the minimal representable float (to signal that a returned 0.0 is expected here).
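That idea could look something like the sketch below (a hypothetical helper, assuming the scorer works on a 1D array of logprobs; the threshold uses the smallest positive normal float64, below which exp() degrades to subnormals and then 0.0):

```python
import warnings

import numpy as np

# exp(total) loses full precision once total drops below log(tiny).
_LOG_TINY = np.log(np.finfo(np.float64).tiny)

def sentence_probability(logprobs):
    """Hypothetical scorer step that warns when exp() underflows."""
    total = float(np.sum(logprobs))
    if total < _LOG_TINY:
        warnings.warn(
            "joint probability underflows float64; a 0.0 score is expected",
            RuntimeWarning,
        )
    return float(np.exp(total))
```

Callers that need to rank such sequences could then catch the warning and fall back to comparing the log-space totals directly.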

@CharlesMoslonka CharlesMoslonka marked this pull request as ready for review March 11, 2026 10:33
@CharlesMoslonka CharlesMoslonka marked this pull request as draft March 11, 2026 10:33
