Add comprehensive benchmark tests with quality validation#33

Merged

Anselmoo merged 2 commits into main from feature/comprehensive-benchmark-tests on Dec 21, 2025

Conversation

@Anselmoo (Owner)

Summary

This PR adds comprehensive benchmark tests that validate optimizer solutions against known optimal points, ensuring optimizers actually find correct solutions rather than just running without errors.

Changes

New Test Files

  • conftest.py: Pytest fixtures with 17 benchmark functions and their known optima
  • test_benchmarks.py: Quality tolerance tests for all optimizer categories
  • test_performance.py: Performance regression baselines and critical path tests

Key Features

Solution Quality Validation

  • Tests validate solutions against known optimal points (e.g., shifted_ackley optimal at [1.0, 0.5])
  • Critical failures flagged when solution distance exceeds tolerance (e.g., (1.2, 0.7) is flagged as failure for shifted_ackley)
  • Different tolerance levels for easy/medium/hard benchmark functions
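The distance check described in these bullets can be sketched as follows (a minimal helper under assumed names; the PR's actual conftest utilities may differ):

```python
import numpy as np

def assert_near_optimum(solution, optimum, tolerance):
    """Flag a critical failure when the Euclidean distance between the
    found solution and the known optimum exceeds the tolerance."""
    distance = float(np.linalg.norm(np.asarray(solution) - np.asarray(optimum)))
    assert distance <= tolerance, (
        f"CRITICAL: distance {distance:.4f} from optimum exceeds tolerance {tolerance}"
    )
    return distance
```

For shifted_ackley with optimum [1.0, 0.5] and tolerance 0.2, a solution of (1.2, 0.7) lies about 0.283 away and would be flagged, matching the example above.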

xfail Markers for Known Issues
The following optimizers are marked with @pytest.mark.xfail as they converge to local minima on multimodal functions like shifted_ackley:

  • BFGS - converges to local minimum
  • LBFGS - converges to local minimum
  • NelderMead - converges to local minimum
  • GreyWolfOptimizer - convergence issues

These tests still run but don't fail the suite, allowing tracking of potential improvements.
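The marker pattern might look like this sketch (the parametrization and test body are illustrative, not the PR's actual code):

```python
import pytest

# Optimizers listed in the PR as prone to local minima on shifted_ackley.
LOCAL_MINIMA_PRONE = ["BFGS", "LBFGS", "NelderMead", "GreyWolfOptimizer"]

@pytest.mark.parametrize("optimizer_name", LOCAL_MINIMA_PRONE)
@pytest.mark.xfail(
    reason="converges to a local minimum on multimodal functions",
    strict=False,  # strict=False: an unexpected pass (XPASS) does not fail the suite
)
def test_multimodal_quality(optimizer_name):
    # Run the optimizer on shifted_ackley and assert the distance to
    # [1.0, 0.5] is within tolerance; expected to fail for these optimizers.
    ...
```

With `strict=False`, the tests keep running on every CI pass, so an optimizer that starts converging shows up as XPASS rather than silently going untested.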

Test Coverage

  • 356 test cases total
  • Critical shifted_ackley benchmark tests
  • Sphere, rosenbrock, and other benchmark functions
  • Performance regression baselines
  • Statistical consistency tests
  • Reproducibility and bounds checking

Test Results

15 passed, 4 xfailed in 10.25s

The 4 xfailed tests are the known local-minima-prone optimizers which are documented above.

Benchmark Functions with Known Optima

| Function       | Optimal Point | Tolerance | Difficulty |
|----------------|---------------|-----------|------------|
| sphere         | [0, 0]        | 0.1       | easy       |
| shifted_ackley | [1.0, 0.5]    | 0.2       | medium     |
| rosenbrock     | [1, 1]        | 0.5       | hard       |
| himmelblau     | 4 optima      | 0.5       | medium     |
| ...            | ...           | ...       | ...        |
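The shifted_ackley row corresponds to the standard Ackley function evaluated at x minus a shift of [1.0, 0.5], which places the global minimum f = 0 at the shift. A sketch (the PR's conftest implementation may differ):

```python
import numpy as np

SHIFT = np.array([1.0, 0.5])

def shifted_ackley(x):
    """Ackley function shifted so the global minimum f(SHIFT) = 0."""
    z = np.asarray(x, dtype=float) - SHIFT
    n = z.size
    term1 = -20.0 * np.exp(-0.2 * np.sqrt(np.sum(z**2) / n))
    term2 = -np.exp(np.sum(np.cos(2.0 * np.pi * z)) / n)
    return term1 + term2 + 20.0 + np.e
```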

Usage

# Run all benchmark tests
uv run pytest opt/test/test_benchmarks.py -v

# Run only critical shifted_ackley tests
uv run pytest opt/test/test_benchmarks.py::TestShiftedAckleyBenchmark -v

# Deselect the xfail-marked tests
uv run pytest opt/test/test_benchmarks.py -m "not xfail"

- Add conftest.py with 17 benchmark functions and known optima
- Add test_benchmarks.py with quality tolerance tests for all optimizer categories
- Add test_performance.py with regression baselines and critical path tests

Test improvements:
- Validate solutions against known optimal points (shifted_ackley at [1.0, 0.5])
- Flag solutions deviating from optimum as critical failures (distance > 0.2)
- Add xfail markers for optimizers prone to local minima on multimodal functions
- Include Himmelblau multi-optima handling
- Add reproducibility and bounds checking tests

356 test cases covering optimizer quality validation
Copilot AI review requested due to automatic review settings December 21, 2025 08:02
@Anselmoo Anselmoo merged commit ae3aa00 into main Dec 21, 2025
2 of 5 checks passed
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds comprehensive benchmark tests that validate optimizer solutions against known optimal points, ensuring optimizers find correct solutions rather than just executing without errors. The tests include quality validation, performance regression detection, and critical path testing for the optimization library.

Key Changes

  • Introduces quality tolerance tests validating solutions against known optima (e.g., shifted_ackley optimal at [1.0, 0.5])
  • Adds performance regression baselines for tracking optimizer behavior changes
  • Implements xfail markers for known optimizer limitations (BFGS, LBFGS, NelderMead, GreyWolfOptimizer on multimodal functions)

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.

File Description
opt/test/conftest.py Adds pytest fixtures for 17 benchmark functions with known optima and helper utilities for quality assessment
opt/test/test_benchmarks.py Implements quality tolerance tests across optimizer categories with varying difficulty levels
opt/test/test_performance.py Adds performance regression baselines and critical path tests for shifted_ackley, sphere, and rosenbrock functions
.github/workflows/python-publish.yaml Updates sigstore action from v3.1.0 to v3.2.0 for package signing

Attributes:
optimizer_class: The optimizer class.
function_name: Name of the benchmark function.
expected_fitness_upper: Upper bound on expected fitness (worse case).
Copilot AI Dec 21, 2025

Typo in the docstring: "worse case" should be "worst case".

Suggested change
expected_fitness_upper: Upper bound on expected fitness (worse case).
expected_fitness_upper: Upper bound on expected fitness (worst case).

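The `Attributes` docstring quoted above hints at a baseline record along these lines (a hypothetical reconstruction; the dataclass form and field types are assumptions):

```python
from dataclasses import dataclass

@dataclass
class PerformanceBaseline:
    """Regression baseline for one optimizer on one benchmark function."""
    optimizer_class: type          # the optimizer class under test
    function_name: str             # name of the benchmark function
    expected_fitness_upper: float  # upper bound on expected fitness (worst case)
```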
mean_fitness = np.mean(results)
std_fitness = np.std(results)
best_fitness = min(results)
max(results)
Copilot AI Dec 21, 2025

The result of max(results) is computed but never assigned, so the value is discarded. Remove this line, or assign it to worst_fitness if you intend to use it in future assertions.

Suggested change
max(results)

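Applied, the fix reads roughly like this (the results values are illustrative, not from the PR):

```python
import numpy as np

results = [0.02, 0.05, 0.01, 0.08]  # fitness values from repeated runs (illustrative)

mean_fitness = np.mean(results)
std_fitness = np.std(results)
best_fitness = min(results)
worst_fitness = max(results)  # assigned so it can back an assertion

# Example consistency assertion that actually uses the worst case:
assert worst_fitness - best_fitness < 1.0, "fitness spread across runs too large"
```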
dim=2,
max_iter=200,
)
solution, _fitness = optimizer.search()
Copilot AI Dec 21, 2025

The fitness variable is assigned but never used in these critical tests. Consider adding an assertion to check the fitness value is within acceptable bounds, similar to the pattern used in test_particle_swarm_critical (line 309), or use an underscore prefix to indicate it's intentionally unused.

dim=2,
max_iter=300,
)
solution, _fitness = optimizer.search()
Copilot AI Dec 21, 2025

The fitness variable is assigned but never used in this critical test. Consider adding an assertion to check the fitness value is within acceptable bounds, or use an underscore prefix to indicate it's intentionally unused.

Comment on lines +251 to +253
solution, _fitness = optimizer.search()

distance = np.linalg.norm(solution - self.OPTIMAL_POINT)
Copilot AI Dec 21, 2025

The fitness variable is assigned but never used in this test. Consider adding an assertion to check the fitness value or use an underscore prefix to indicate it's intentionally unused.

Suggested change
solution, _fitness = optimizer.search()
distance = np.linalg.norm(solution - self.OPTIMAL_POINT)
solution, fitness = optimizer.search()
distance = np.linalg.norm(solution - self.OPTIMAL_POINT)
assert np.isfinite(fitness), (
    f"{optimizer_class.__name__} returned non-finite fitness {fitness} "
    "on shifted_ackley for medium-performance benchmark."
)


OPTIMAL_POINT = np.array([1.0, 0.5])
CRITICAL_TOLERANCE = 0.2 # Distance > 0.2 is a critical failure
WARNING_TOLERANCE = 0.1 # Distance > 0.1 but <= 0.2 is a warning
Copilot AI Dec 21, 2025

The class attribute WARNING_TOLERANCE is defined but never used in any of the test methods. Consider removing this constant or implementing warning-level assertions if it was intended to provide additional validation between the critical tolerance and warning tolerance thresholds.

Suggested change
WARNING_TOLERANCE = 0.1 # Distance > 0.1 but <= 0.2 is a warning

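One way to act on the suggestion above is to classify a solution against both thresholds, emitting a warning in the intermediate band (a sketch, not the PR's code):

```python
import warnings
import numpy as np

CRITICAL_TOLERANCE = 0.2  # distance > 0.2 is a critical failure
WARNING_TOLERANCE = 0.1   # distance in (0.1, 0.2] is a warning

def classify_distance(solution, optimum):
    """Return 'ok', 'warning', or 'critical' for the distance to the optimum."""
    distance = float(np.linalg.norm(np.asarray(solution) - np.asarray(optimum)))
    if distance > CRITICAL_TOLERANCE:
        return "critical"
    if distance > WARNING_TOLERANCE:
        warnings.warn(f"solution is {distance:.3f} from optimum (> {WARNING_TOLERANCE})")
        return "warning"
    return "ok"
```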

OPTIMAL_POINT = np.array([0.0, 0.0])
TIGHT_TOLERANCE = 0.1
RELAXED_TOLERANCE = 0.5
Copilot AI Dec 21, 2025

The class attribute RELAXED_TOLERANCE is defined but never used in the test method. Consider removing this constant or adding a test method for medium/lower performance optimizers that uses this relaxed tolerance threshold.

Suggested change
RELAXED_TOLERANCE = 0.5

Comment on lines +145 to +189
VARIABLE_PERFORMANCE_OPTIMIZERS = [
ArtificialFishSwarm,
CatSwarmOptimization,
GlowwormSwarmOptimization,
SquirrelSearchAlgorithm,
CollidingBodiesOptimization,
EagleStrategy,
CulturalAlgorithm,
EstimationOfDistributionAlgorithm,
ImperialistCompetitiveAlgorithm,
ParticleFilter,
ShuffledFrogLeapingAlgorithm,
StochasticDiffusionSearch,
StochasticFractalSearch,
VariableDepthSearch,
VariableNeighborhoodSearch,
VeryLargeScaleNeighborhood,
]

# Gradient-based optimizers (may converge to local optima)
GRADIENT_OPTIMIZERS = [
AdaDelta,
ADAGrad,
AdaMax,
AdamW,
ADAMOptimization,
AMSGrad,
Nadam,
NesterovAcceleratedGradient,
RMSprop,
SGD,
SGDMomentum,
ConjugateGradient,
TrustRegion,
HillClimbing,
TabuSearch,
]

# Constrained/Probabilistic optimizers
SPECIALIZED_OPTIMIZERS = [
AugmentedLagrangian,
SuccessiveLinearProgramming,
LDAnalysis,
ParzenTreeEstimator,
]
Copilot AI Dec 21, 2025

The optimizer category lists VARIABLE_PERFORMANCE_OPTIMIZERS, GRADIENT_OPTIMIZERS, and SPECIALIZED_OPTIMIZERS are defined but not used in any test methods. Consider either removing these unused constants or adding test methods that utilize them. If they are intended for future use, add a comment indicating this.

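If the category lists are kept for future use, they could feed a parametrized smoke test along these lines (string stand-ins replace the real optimizer classes here; the test body is illustrative):

```python
import pytest

# Illustrative stand-ins for the real optimizer classes in the lists above.
GRADIENT_OPTIMIZERS = ["SGD", "ADAMOptimization"]
SPECIALIZED_OPTIMIZERS = ["AugmentedLagrangian"]

# Tag each parameter with its category so failures are easy to attribute.
ALL_CATEGORIES = (
    [pytest.param(o, id=f"gradient-{o}") for o in GRADIENT_OPTIMIZERS]
    + [pytest.param(o, id=f"specialized-{o}") for o in SPECIALIZED_OPTIMIZERS]
)

@pytest.mark.parametrize("optimizer", ALL_CATEGORIES)
def test_category_smoke(optimizer):
    # Run the optimizer on an easy benchmark (e.g. sphere) with a relaxed tolerance.
    ...
```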
Comment on lines +188 to +195
_solution, fitness = optimizer.search()

assert fitness <= baseline.expected_fitness_upper, (
f"REGRESSION: {baseline.optimizer_class.__name__} on rosenbrock: "
f"fitness {fitness:.4f} > {baseline.expected_fitness_upper:.4f}"
)


Copilot AI Dec 21, 2025

The solution variable is assigned but never used in this test. Either remove the underscore prefix from _solution and add an assertion about the solution's distance from the optimum (similar to test_sphere_regression), or remove the variable assignment entirely if only fitness checking is needed.

Suggested change
_solution, fitness = optimizer.search()
assert fitness <= baseline.expected_fitness_upper, (
    f"REGRESSION: {baseline.optimizer_class.__name__} on rosenbrock: "
    f"fitness {fitness:.4f} > {baseline.expected_fitness_upper:.4f}"
)
solution, fitness = optimizer.search()
assert fitness <= baseline.expected_fitness_upper, (
    f"REGRESSION: {baseline.optimizer_class.__name__} on rosenbrock: "
    f"fitness {fitness:.4f} > {baseline.expected_fitness_upper:.4f}"
)
distance = np.linalg.norm(solution - OPTIMAL_POINTS["rosenbrock"])
assert distance <= baseline.max_distance_from_optimum, (
    f"REGRESSION: {baseline.optimizer_class.__name__} distance {distance:.4f} "
    f"exceeds {baseline.max_distance_from_optimum:.4f}"
)

n_bats=30,
max_iter=200,
)
solution, _fitness = optimizer.search()
Copilot AI Dec 21, 2025

The fitness variable is assigned but never used in this test. Consider adding an assertion to check the fitness value or use an underscore prefix to indicate it's intentionally unused.
