
Improve unit tests #690

@cdc-mitzimorris

Description


Given this instruction:

The existing unit tests for the code added in PRs 644 and 650 were developed to meet code coverage standards. They need to be reviewed thoroughly to see whether the unit tests actually test the API in a meaningful way, i.e., whether they test for reasonable real-world use cases. The accompanying tutorials in the directory docs/tutorials explain the intent of these classes. The directory datasets contains real-world data.

An additional concern is the amount of repeated, boilerplate code across unit tests. Ideally, the code has a set of test fixtures that demonstrate to maintainers how to ensure that PyRenew components, given valid inputs, produce valid outputs, and that, given invalid inputs, they detect problems and return understandable errors. The tests need to be as maintainable as the code they are testing.

Claude produced the following report:

Test Suite Review: PRs 644 and 650

Executive Summary

The test suite was written to meet code coverage targets, not to verify that the API works correctly for real-world use cases. Evidence: extensive repr testing, no-op validate() testing, shape-only assertions, massive boilerplate, and zero integration with the real datasets that ship with the library. The tutorials demonstrate compelling workflows that the tests never exercise.

  1. Boilerplate and Maintainability Issues

Duplicated code across files:

  • create_mock_infections is defined identically in both conftest.py:367-402 and test_observation_counts.py:24-58
  • HierarchicalNormalNoise setup (VectorizedRV + DistributionalVariable for mode and sd) is repeated 8 times verbatim in test_observation_measurements.py
  • Counts(...) construction with DeterministicVariable/DeterministicPMF is repeated in nearly every test class in test_observation_counts.py, often not using the fixtures at all

conftest.py has too many nearly-identical PMF fixtures:

  • realistic_delay_pmf (10 elements) and long_delay_pmf (10 elements) are nearly identical
  • simple_delay_pmf and simple_shedding_pmf are both [1.0] — identical arrays with different names
  • 8 PMF fixtures total, but many tests ignore them and create their own inline

Missing fixtures for commonly-constructed objects:

  • No fixture for HierarchicalNormalNoise (the most repeated setup)
  • No fixture for ConcreteMeasurements (test implementation is 30+ lines)
  • No fixture for HierarchicalInfections shared across test files
  • No fixture for a configured PyrenewBuilder with multiple observations
  • No fixture that loads real data from pyrenew.datasets
  • CountsProcessFactory exists but there's no equivalent for Measurements, noise models, or the builder

  2. Coverage-Driven Tests with No Behavioral Value

These tests exist solely for coverage and should be replaced with meaningful behavioral tests:

repr tests (5+ tests across files):

  • TestNoiseRepr.test_poisson_noise_repr — tests string formatting
  • TestNoiseRepr.test_negative_binomial_noise_repr — tests string formatting
  • TestCountsRepr.test_counts_repr — tests string formatting
  • TestCountsRepr.test_counts_by_subpop_repr — tests string formatting
  • TestTemporalProcessRepr (3 tests) — tests string formatting
  • TestConcreteMeasurements.test_repr — tests string formatting
  • TestHierarchicalNormalNoise.test_repr — tests string formatting

No-op validate tests (3+ tests):

  • TestHierarchicalNormalPrior.test_validate_method — validate is a pass statement
  • TestGammaGroupSdPrior.test_validate_method — same
  • TestStudentTGroupModePrior.test_validate_method — same

Internal/private API tests:

  • TestCountBaseInternalMethods.test_count_base_infection_resolution_raises — tests that an abstract internal base raises NotImplementedError
  • TestBaseObservationProcessValidation.test_validate_pmf_* (3 tests) — tests private _validate_pmf method directly
  • TestMeasurementsBase.test_is_base_observation_process — tests class hierarchy with issubclass

  3. Weak Assertions (Shape-Only Tests)

Many tests construct realistic-looking scenarios but then only assert shapes:

  • TestCountsBasics.test_sample_returns_correct_shape — Checks shape and ndim but never verifies that predicted values approximate infections × ascertainment_rate convolved with the delay PMF. A test that returned random garbage of the right shape would pass.
  • TestCountsEdgeCases.test_zero_infections — Checks shape and >= 0 but never asserts that predicted values are all zero. Zero infections convolved with any delay at any ascertainment rate must produce zero predicted counts.
  • TestCountsEdgeCases.test_small_infections — Checks shape and >= 0 but doesn't verify that observed counts are small. With 10 infections/day and 1% ascertainment, expected counts are ~0.1/day — the test should verify counts are in a plausible range.
  • TestCountsWithPriors.test_with_stochastic_ascertainment — Checks shape and no-NaN but never verifies that the stochastic ascertainment rate actually produces variable results across seeds. The whole point of a stochastic parameter is that it introduces uncertainty.
  • TestConcreteMeasurements.test_sample_shape — Checks shape only. Doesn't verify predicted values are finite, on the expected log scale, or that they reflect the infection levels of the subpopulations being observed.
  • TestHierarchicalInfectionsSample.test_sample_returns_correct_shapes — Checks shape only. Doesn't verify that infections are positive or that aggregate equals the weighted sum of subpopulations. (The weighted-sum property is tested in a separate test, but positivity is never tested anywhere.)
  • TestTemporalProcesses (11 of 18 tests) — Assert only output shape (n_timepoints, n_processes). Never check that AR1 mean-reverts, that DifferencedAR1 allows persistent trends, or that RandomWalk variance grows over time. The 7 remaining tests that do check behavioral properties (innovation_sd comparisons) are the strongest tests in the file.
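As a sketch of how a shape-only test can be tightened into a behavioral one (plain NumPy as a stand-in for the PyRenew sampling path; `predicted_counts`, `delay_pmf`, and `ascertainment` are illustrative names, not the library's API):

```python
import numpy as np

def predicted_counts(infections, delay_pmf, ascertainment):
    """Stand-in for the Counts forward pass: ascertained infections
    convolved with the delay PMF, truncated to the observation window."""
    ascertained = infections * ascertainment
    return np.convolve(ascertained, delay_pmf)[: len(infections)]

delay_pmf = np.array([0.2, 0.5, 0.3])

# Zero infections must yield exactly zero predicted counts,
# not merely an array of the right shape.
zero = predicted_counts(np.zeros(30), delay_pmf, ascertainment=0.3)
assert zero.shape == (30,)
assert np.all(zero == 0.0)

# Small infections at low ascertainment must yield small counts:
# 10 infections/day at 1% ascertainment -> ~0.1 expected counts/day.
small = predicted_counts(np.full(30, 10.0), delay_pmf, ascertainment=0.01)
assert np.all(small >= 0.0) and np.all(small <= 0.2)
```

A test like test_zero_infections passes this kind of assertion only if the forward pass is actually correct, whereas a shape check passes for any garbage output of the right size.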

  4. Missing Tests: Real-World Use Cases from Tutorials

The tutorials demonstrate these workflows, but the tests never exercise them:

a) End-to-end multi-signal model (the central use case):
The building_multisignal_models.qmd tutorial shows building a model with BOTH hospital admissions (aggregate, Counts) AND wastewater (subpop, Measurements). No test ever builds a model with more than one observation process. test_pyrenew_builder.py only ever adds a single Counts observation.

b) Real data loading and model fitting:
The tutorials load CA hospital admissions and wastewater data, prepare it with pad_observations/shift_times, and run inference. No test exercises this data-to-inference pipeline. test_datasets_nwss_hosp.py only checks that loading works — it never feeds data into a model.

c) Prior predictive checking:
The latent_hierarchical_infections.qmd tutorial uses numpyro.infer.Predictive for prior predictive checks. No test verifies this workflow.

d) pad_observations and shift_times correctness:
These helper methods are tested only indirectly (through test_run_with_population_structure). No test verifies that pad_observations prepends the correct number of NaN values, or that shift_times adds the correct offset.
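A direct pad_observations test could pin the behavior down against a hand-written reference. The helper below is hypothetical (the real method's name and signature may differ); it only illustrates the property worth asserting:

```python
import numpy as np

def pad_front_with_nan(obs, n_initialization):
    """Reference behavior a direct pad_observations test could check:
    prepend exactly n_initialization NaN values (hypothetical helper,
    not the library implementation)."""
    return np.concatenate([np.full(n_initialization, np.nan), obs])

obs = np.array([3.0, 7.0, 2.0])
padded = pad_front_with_nan(obs, n_initialization=5)

assert padded.shape == (8,)                  # length grew by exactly n
assert np.all(np.isnan(padded[:5]))          # exactly the padded region is NaN
assert np.array_equal(padded[5:], obs)       # original data untouched
```

An analogous direct test for shift_times would assert that every timestamp moves by exactly the requested offset and nothing else changes.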

e) Sparse observation patterns:
The tutorials show wastewater observations with irregular time sampling, multiple sensors per subpopulation, and sparse subpopulation coverage. No test exercises irregular timing or missing subpopulations.

  5. Missing Tests: Mathematical/Behavioral Properties

Convolution correctness:
No test verifies that the delay convolution produces hand-computable results. For a spike input of 1000 infections on day 10 and a delay PMF of [0.5, 0.5], the predicted counts on day 10 should be 500 and on day 11 should be 500. test_delay_convolution checks only that NaN appears in the first entries.
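The spike example above is fully hand-computable and takes only a few lines with plain NumPy (no library calls involved):

```python
import numpy as np

# A spike of 1000 infections on day 10 convolved with delay PMF [0.5, 0.5]
# must put 500 predicted counts on day 10 and 500 on day 11.
infections = np.zeros(20)
infections[10] = 1000.0
delay_pmf = np.array([0.5, 0.5])

counts = np.convolve(infections, delay_pmf)[: len(infections)]

assert counts[10] == 500.0
assert counts[11] == 500.0
# Total mass is conserved and no counts leak to other days.
assert counts.sum() == 1000.0
```

The same spike-plus-PMF pattern, fed through the actual Counts sampling path, would give the suite a convolution-correctness test with an exactly known expected output.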

Observation conditioning (obs passthrough):
test_sample_with_observations in measurements tests this, but test_observation_counts.py never verifies that providing obs=actual_data returns those exact values.

NaN masking effectiveness:
test_dense_observations_with_nan_padding checks shapes but never verifies that NaN-padded initialization days are actually masked from the likelihood. This is a critical correctness property.

Temporal process dynamics:

  • No test that AR1 with |autoreg| < 1 produces stationary trajectories (variance stabilizes)
  • No test that DifferencedAR1 allows persistent trends while stabilizing growth rate
  • No test comparing RandomWalk variance growth (unbounded) vs AR1 variance (bounded)
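The bounded-vs-unbounded variance contrast above is cheap to verify by Monte Carlo with stand-in processes (plain NumPy, not the library's temporal process classes; parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_traj, n_steps, phi, sd = 2000, 100, 0.7, 1.0

eps = rng.normal(0.0, sd, size=(n_traj, n_steps))

# AR1: x_t = phi * x_{t-1} + eps_t, |phi| < 1 -> stationary
ar1 = np.zeros((n_traj, n_steps))
for t in range(1, n_steps):
    ar1[:, t] = phi * ar1[:, t - 1] + eps[:, t]

# Random walk: cumulative sum of the same innovations -> variance ~ t
rw = np.cumsum(eps, axis=1)

var_ar1 = ar1[:, -1].var()  # should settle near sd^2 / (1 - phi^2) ~ 1.96
var_rw = rw[:, -1].var()    # grows roughly linearly with t, ~ 100 here

assert var_ar1 < 5.0        # bounded (stationary)
assert var_rw > 50.0        # unbounded growth over time
assert var_rw > 10 * var_ar1
```

Run against the suite's actual AR1 and RandomWalk samples with a fixed seed, assertions of this form distinguish the two dynamics where shape checks cannot.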

Infections positivity and epidemiological plausibility:
No test verifies that HierarchicalInfections produces positive infections, or that Rt > 1 leads to growing infections and Rt < 1 leads to declining infections.

  6. Missing Tests: API Boundary Conditions

  • Single subpopulation (K=1) — does the model degenerate correctly?
  • Many subpopulations (K=10+) — does vectorization work?
  • Different temporal processes in the same HierarchicalInfections (AR1, DifferencedAR1 — currently only RandomWalk is tested)
  • Longer simulations (90+ days, matching tutorial lengths)
  • CountsBySubpop with real-world-like observation patterns (weekly sampling, missing sites)
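The K=1 degenerate case in the list above can be pinned down with a tiny stand-in check (plain NumPy; the weighted-sum body is illustrative, not the library's sampling call):

```python
import numpy as np

# With a single subpopulation whose fraction is 1.0, the aggregate
# must equal that subpopulation's trajectory exactly.
fractions = np.array([1.0])
subpop_infections = np.abs(
    np.random.default_rng(1).normal(100.0, 10.0, size=(1, 30))
)

aggregate = (fractions[:, None] * subpop_infections).sum(axis=0)

assert aggregate.shape == (30,)
assert np.allclose(aggregate, subpop_infections[0])
```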

  7. Specific Recommendations

A. Consolidate fixtures in conftest.py:

Add these missing fixtures:

  • hierarchical_normal_noise — eliminates 8x boilerplate
  • concrete_measurements — eliminates repeated ConcreteMeasurements setup
  • gen_int_rv — shared across latent and builder tests
  • hierarchical_infections — configured instance
  • multi_signal_builder — builder with both Counts and Measurements
  • ca_hospital_data / ca_wastewater_data — real data loading

Remove: duplicate PMF fixtures (realistic_delay_pmf vs long_delay_pmf, simple_delay_pmf vs simple_shedding_pmf). Remove the duplicate create_mock_infections.
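One maintainable shape for these fixtures is a factory-plus-fixture pattern in conftest.py. The sketch below uses a stand-in dataclass because the real HierarchicalNormalNoise constructor (and its parameter names) would come from the library:

```python
from dataclasses import dataclass

import pytest

@dataclass
class NoiseConfig:
    """Stand-in for the HierarchicalNormalNoise setup; the real fixture
    would build the actual object from VectorizedRV/DistributionalVariable."""
    mode_loc: float
    mode_scale: float
    sd_rate: float

def make_noise(mode_loc=0.0, mode_scale=1.0, sd_rate=2.0):
    """Single place to change the default noise setup; replaces the
    eight verbatim copies in test_observation_measurements.py."""
    return NoiseConfig(mode_loc, mode_scale, sd_rate)

@pytest.fixture
def hierarchical_normal_noise():
    return make_noise()

# Tests that need a non-default variant call the factory directly:
variant = make_noise(sd_rate=0.5)
assert variant.sd_rate == 0.5
```

The factory makes variants explicit at the call site, while the fixture covers the common default case, so tests stop re-stating identical constructor boilerplate.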

B. Replace coverage tests with behavioral tests:

Delete all repr, no-op validate(), and internal-method tests. Replace with:

  • Convolution correctness (hand-computable spike + PMF → expected output)
  • Obs conditioning passthrough (provide obs, get same obs back)
  • NaN masking effectiveness (NaN entries don't affect likelihood)
  • Temporal process behavioral properties (mean reversion, trend persistence, drift)
  • Infections positivity and weighted-sum property

C. Add integration tests with real data:

At minimum, one test that:

  1. Loads CA hospital data via load_hospital_data_for_state()
  2. Loads CA wastewater data via load_wastewater_data_for_state()
  3. Builds a multi-signal model with both observations
  4. Runs a short inference (5 warmup, 5 samples)
  5. Verifies posterior samples have expected shapes and reasonable values

D. Use @pytest.mark.parametrize:

Replace repeated test patterns with parametrized versions:

  • Temporal processes: parametrize over (AR1, DifferencedAR1, RandomWalk)
  • PMF lengths: parametrize over short/medium/long
  • Subpopulation counts: parametrize over K=1, 3, 5, 10
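The subpopulation-count parametrization might look like the following. The test body is a stand-in weighted-sum computation, not the real HierarchicalInfections sampling call:

```python
import numpy as np
import pytest

@pytest.mark.parametrize("n_subpops", [1, 3, 5, 10])
def test_subpop_shapes(n_subpops):
    """One parametrized test replacing near-identical per-K copies;
    the weighted-sum body stands in for the real sampling call."""
    fractions = np.full(n_subpops, 1.0 / n_subpops)
    subpops = np.ones((n_subpops, 30))
    aggregate = (fractions[:, None] * subpops).sum(axis=0)
    assert aggregate.shape == (30,)
    assert np.allclose(aggregate, 1.0)

# Parametrized test functions remain directly callable:
test_subpop_shapes(3)
```

The same pattern collapses the temporal-process and PMF-length duplicates into one test each, with the parameter grid documenting exactly which cases are covered.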

E. Test fixture hierarchy:

Organize fixtures into levels:

  • Level 0: Primitives — PMFs, infection arrays, RandomVariables
  • Level 1: Components — configured Counts, Measurements, HierarchicalInfections
  • Level 2: Composed models — builder + latent + observations
  • Level 3: Models with real data — ready for inference

This gives maintainers clear examples of how to construct each layer of the API.

Summary

Of the 113 total tests, approximately 40 are either pure coverage filler or check only shapes. The remaining tests check meaningful properties but miss the high-value real-world workflows the tutorials demonstrate.

  • test_observation_counts.py (25 tests, 12 weak) — Missing: hand-computable convolution correctness (spike + PMF = expected output), verification that providing obs returns those exact values, verification that NaN-masked initialization days don't affect the likelihood.
  • test_observation_measurements.py (10 tests, 5 weak) — Missing: tests that different sensors produce different biases, irregular/sparse observation timing patterns, any use of real wastewater data from pyrenew.datasets.
  • test_base_latent_process.py (11 tests, all meaningful) — Missing: single-subpopulation (K=1) degenerate case, fractions summing to 1.0 at the edge of tolerance (e.g., 1.0 ± 1e-7).
  • test_hierarchical_infections.py (11 tests, 2 weak) — Missing: assertion that infections are always positive, tests using AR1 or DifferencedAR1 (only RandomWalk is ever used), tests with the real generation interval PMF from pyrenew.datasets.
  • test_hierarchical_priors.py (11 tests, 5 weak) — Missing: assertion that GammaGroupSdPrior always produces positive values (it's a SD prior), test of the sd_min=0 boundary case.
  • test_temporal_processes.py (18 tests, 14 weak) — Missing: AR1 stationarity (variance stabilizes over time when |autoreg| < 1), AR1 mean reversion (trajectories return toward 0), DifferencedAR1 trend persistence (level can drift while growth rate stabilizes), RandomWalk unbounded variance growth vs AR1 bounded variance.
  • test_pyrenew_builder.py (15 tests, 2 weak) — Missing: building a model with both aggregate and subpop observations simultaneously (the entire point of "multi-signal"), direct tests of pad_observations and shift_times correctness, any use of real data from pyrenew.datasets.
  • test_datasets_nwss_hosp.py (12 tests, all meaningful for data loading) — Missing: any test that feeds the loaded data into model components. The data loaders are tested in isolation but never connected to the API they exist to serve.
