test: add property-based tests for sequence probability pipeline #224
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the robustness and long-term correctness of the log probability extraction and sentence scoring pipeline by introducing comprehensive property-based tests. These new tests validate the public API contracts of the vLLM and OpenAI logprob parsers, as well as the SentenceProbabilityScorer, ensuring their behavior aligns with expectations across a wide range of inputs. The changes also include an update to the NumPy dependency to version 2.0.0, reflecting the project's actual runtime environment.
Code Review
This pull request introduces a comprehensive suite of property-based tests for the sequence probability pipeline, using the Hypothesis library. The tests are well-designed to document the public API contracts and uncover several critical bugs in the current implementation related to handling ragged arrays, empty sequences, and numerical underflow. The addition of these failing tests is an excellent practice for test-driven development and ensuring long-term correctness. The bump of the NumPy version to >=2.0.0 is also justified. My main feedback is a suggestion to refactor the OpenAI parser tests to reduce code duplication, which would improve maintainability.
/gemini review
Code Review
This pull request introduces a comprehensive suite of property-based tests for the sequence probability pipeline using Hypothesis. The tests are well-structured, thoroughly document the public API contracts of the components, and effectively capture several critical bugs related to handling ragged arrays, empty sequences, and numerical underflow. Bumping the NumPy version to 2.0.0 is a necessary change to reveal these issues. The new test files are excellent additions that will significantly improve the long-term correctness and robustness of the library. I have a few minor suggestions to simplify some of the Hypothesis strategies for better readability and maintainability.
/gemini review
Code Review
This pull request introduces a robust suite of property-based tests using Hypothesis for the sequence probability pipeline components. This is an excellent approach to ensure the long-term correctness and numerical stability of the library, especially given the identified issues with ragged arrays and numerical underflow. The tests are well-structured, cover critical contract properties, and effectively target the known defects, which will be addressed in PR #223. The dependency bump for NumPy to 2.0.0 is also a necessary and appropriate change.
Tests covering long-term correctness contracts for:

- `vllm_sampled_tokens_logprobs` (test_vllm_sampled_logprobs.py)
- OpenAI Chat Completion & Responses API parsers (test_openai_sampled_logprobs.py)
- `SentenceProbabilityScorer` (test_sentence_proba_properties.py)

Uses Hypothesis for property-based testing to verify value ranges, cardinality invariants, monotonicity, determinism, and robustness against edge cases (empty input, ragged sequences, underflow).
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Note to self: raise a Warning when the sequence probability is smaller than the minimal representable float, to signal that a returned 0.0 is expected here.
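One way the note above could be implemented — a hedged sketch only: the helper name `sentence_probability` and the exact threshold are assumptions for illustration, not the library's API.

```python
import warnings

import numpy as np

# Smallest positive normal float64 (~2.2e-308); below this, precision is lost
# or the value has underflowed to 0.0 entirely.
TINY = np.finfo(np.float64).tiny

def sentence_probability(logprobs: np.ndarray) -> float:
    """Joint probability exp(sum(logprobs)), warning when float64 underflows.

    Hypothetical helper illustrating the note above, not the scorer's real code.
    """
    total = float(np.sum(logprobs))
    proba = float(np.exp(total))
    if proba < TINY and total > -np.inf:
        warnings.warn(
            f"Joint probability underflowed float64 (sum of logprobs = {total:.1f}); "
            "a score of 0.0 is expected here — consider log-space arithmetic.",
            RuntimeWarning,
        )
    return proba
```

With this, a 1500-token sequence at −0.5 logprob per token still returns 0.0, but the caller gets an explicit `RuntimeWarning` instead of a silent zero.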
Summary
Property-based tests (Hypothesis) for the three components introduced in PR #223. These tests document the public API contracts and target long-term library correctness, not just the bugs found during review.
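The style of contract being tested can be sketched as follows. This is illustrative only: `score` is a stand-in for the scorer's per-sequence computation (exp of the summed logprobs), not the actual implementation or test files.

```python
import numpy as np
from hypothesis import given, strategies as st

# Stand-in for the per-sequence scorer: joint probability = exp(sum(logprobs)).
def score(logprobs):
    return float(np.exp(np.sum(logprobs)))

# Finite, non-positive token logprobs, at least one token.
logprob_lists = st.lists(
    st.floats(min_value=-10.0, max_value=0.0, allow_nan=False), min_size=1
)

@given(logprob_lists)
def test_score_is_a_probability(logprobs):
    # Value-range contract: a joint probability lies in [0, 1].
    assert 0.0 <= score(logprobs) <= 1.0

@given(logprob_lists)
def test_score_is_deterministic(logprobs):
    # Determinism contract: same input, same output.
    assert score(logprobs) == score(logprobs)

@given(logprob_lists, st.floats(min_value=-10.0, max_value=0.0, allow_nan=False))
def test_appending_a_token_never_increases_score(logprobs, extra):
    # Monotonicity contract (small tolerance for float summation order).
    assert score(logprobs + [extra]) <= score(logprobs) + 1e-9
```

Hypothesis generates hundreds of such inputs per property, which is how the ragged-array and underflow edge cases below were surfaced.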
Also bumps the NumPy lower bound from `>=1.24.0` to `>=2.0.0` to match the version the project actually runs on.

New test files

- `tests/preprocessing/test_vllm_sampled_logprobs.py` — 12 tests for `vllm_sampled_tokens_logprobs`
- `tests/preprocessing/test_openai_sampled_logprobs.py` — 20 tests for `sampled_tokens_logprobs_chat_completion_api` and `sampled_tokens_logprobs_responses_api`, using realistic API response shapes that mirror the actual OpenAI wire format
- `tests/scoring/probability_methods/test_sentence_proba_properties.py` — 14 tests for `SentenceProbabilityScorer`

Failing tests — why the current implementation does not fulfil its intended contract
test_variable_length_sequences_do_not_crash (`vllm_parser.py:79`)

`vllm_sampled_tokens_logprobs` is designed to extract the sampled-token logprob for each completion when the model is asked to produce multiple sequences (iterations > 1). In practice each completion independently reaches its stop token at a different position, so variable-length outputs are the normal case. The function collects per-sequence lists into `sampled_logprobs` and wraps the whole thing with `np.array(sampled_logprobs)`, hoping to return a 2D array. NumPy 2 refuses to construct a 2D array from a ragged list and raises `ValueError`. The feature is entirely unusable whenever sequences differ in length.

test_chat_completion_ragged_batch_does_not_crash / test_responses_api_ragged_batch_does_not_crash (`openai_parser.py:209`, `openai_parser.py:240`)

Both OpenAI parsers document a return type of "A 2D Numpy array of shape (n_sequences, n_tokens)". When more than one completion is requested (OpenAI `n > 1`), each choice can have a different number of tokens. Both functions collect per-choice lists into `all_sampled` and pass the ragged structure to `np.array()`. Same NumPy 2 crash — the parsers fail exactly when multiple completions are requested with free generation length.

test_empty_sequence_does_not_silently_return_one / test_multiple_empty_sequences_raise (`sentence_proba.py:23-24`)

`SentenceProbabilityScorer.compute` is supposed to measure model confidence in a generated response as the joint probability of all tokens (`exp(sum(logprobs))`). When the logprob extraction pipeline returns no data for a sequence, `inputs` contains an empty array. `np.sum([]) = 0.0`, so `np.exp(0.0) = 1.0`: the scorer silently reports the maximum possible confidence for a sequence where it received no data at all. In an uncertainty library this is the worst possible silent failure — data absence is indistinguishable from absolute certainty.

test_very_long_sequence_underflow_is_surfaced (`sentence_proba.py:23-24`)

The scorer is supposed to rank sequences by confidence, giving lower scores to long or low-probability responses. For a sufficiently long sequence (e.g. 1500 tokens at −0.5 logprob each, sum = −750), `np.exp(-750)` underflows to `0.0` in float64. The score is mathematically wrong — the true probability is tiny but non-zero — and is indistinguishable from a sequence with −∞ total logprob. The scorer returns `0.0` silently, giving callers no signal that the score is invalid or that log-space arithmetic is needed.

Test plan
```shell
uv run --extra test python -m pytest \
  tests/preprocessing/test_vllm_sampled_logprobs.py \
  tests/preprocessing/test_openai_sampled_logprobs.py \
  tests/scoring/probability_methods/test_sentence_proba_properties.py \
  -v
```

37 tests pass (stable contracts confirmed). 6 tests fail (known defects documented above). All 6 should pass once PR #223 addresses them.
🤖 Generated with Claude Code