Skip to content

Add correctness verification to measure with optional scenarios/{benchmark}/verifier.uplc and introduce cape submission verify #19

@Unisay

Description

@Unisay

Summary

Ensure benchmark result correctness by adding a verification step to the measure tool, using an optional per-benchmark verifier program (test.uplc), and a cape test subcommand that runs negative fixture submissions stored outside the main submissions/ directory.

Motivation

  • Incorrect submissions can currently pass measurement for benchmarks like Fibonacci and Factorial.
  • We need a deterministic, reproducible correctness check that is separate from performance metrics.
  • Negative test submissions should not pollute normal submissions.

Functional Requirements

  • Submissions remain fully-applied, closed UPLC programs (no parameters).
  • Optional per-benchmark verifier program:
    • Path: scenarios/{benchmark}/test.uplc
    • Interface: UPLC function taking exactly one argument (the actual result from the submission program) and returning BuiltinUnit (unitval) on success; MUST error on mismatch.
    • Optionality rule:
      • If the submission evaluates to BuiltinUnit (unitval), no test.uplc is required and may be absent.
      • If the submission evaluates to a non-unit value, test.uplc is required and must validate the result.

Measure Tool Changes (measure/app/Main.hs)

  1. Load and evaluate the submission program with CEK under the same budget/settings used for measurement.
  2. If the result is unitval: mark correctness passed; do not require or load test.uplc.
  3. Otherwise: load scenarios/{benchmark}/test.uplc (required). If missing, fail with a specific error (see Errors).
  4. Apply (testProgram submissionResult) at UPLC level and evaluate with CEK.
    • Success: reduces to BuiltinUnit.
    • Failure: CEK error or non-unit → exit non-zero.
  5. Perform correctness verification before recording metrics; on failure, skip metrics and fail fast.
  6. BuiltinUnit detection should use the canonical unit value.

CLI / UX

  • Correctness verification enforced by default.
  • Optional hidden/debug flag to bypass verification locally (off by default).
  • Output should state whether the submission returned unitval directly or required test.uplc.
  • Suggested exit codes:
    • 0: success
    • 2: missing required test.uplc
    • 3: verification failed (non-unit or CEK error after applying verifier)
    • 4: malformed/invalid UPLC file
    • 5: internal tool error

Fixtures (Negative Tests)

  • Store negative fixture submissions outside normal submissions:
    • Root: test/fixtures/submissions/{benchmark}/{Compiler}_{version}_{contributor}/
  • Normal workflows (listing, measurement reports) ignore fixtures.
  • cape test enumerates fixtures only and runs verification via the measure tool.

cape test Subcommand

  • Implement at scripts/cape-subcommands/test.
  • Discover fixtures under test/fixtures/submissions/**.
  • Run the measure tool’s verification for each fixture; expect at least one negative fixture per benchmark to fail with exit code 3.
  • Never persist metrics or generate reports for fixtures.
  • Exit non-zero if any verification fails or a required test.uplc is missing.

File Discovery and Structure

  • Verifier program (optional):
    • scenarios/{benchmark}/test.uplc
  • Fixture submissions (isolated):
    • test/fixtures/submissions/{benchmark}/{Compiler}_{version}_{contributor}/

Example layout:

scenarios/
  fibonacci/
    test.uplc
  factorial/
    test.uplc
submissions/
  fibonacci/Plinth_1.52.0.0_Unisay/...
  factorial/Plinth_1.52.0.0_Unisay/...
test/fixtures/submissions/
  fibonacci/BadResult_1.0_Test/...
  factorial/BadResult_1.0_Test/...

Canonical Errors

  • Missing verifier (required):

    Missing test.uplc for benchmark '{benchmark}' at 'scenarios/{benchmark}/test.uplc' (required because the program does not return BuiltinUnit).

  • Verification failed:

    Verification failed for benchmark '{benchmark}' submission '{id}': (testProgram submissionResult) did not reduce to BuiltinUnit.

  • Program already unit:

    Submission result is BuiltinUnit; verifier not required.

Deliverables

  • measure/app/Main.hs: evaluation-first correctness check; conditional verifier application; standardized errors/exit codes.
  • scenarios/fibonacci/test.uplc and scenarios/factorial/test.uplc (verifiers for non-unit outputs).
  • Negative fixtures under test/fixtures/submissions/{benchmark}/... (at least one per benchmark).
  • scripts/cape-subcommands/test: fixture discovery and execution; honors exit codes; no metrics/reports.
  • Documentation updates (README.md or doc/usage.md): optionality rule & interface for test.uplc; fixture layout; cape test behavior and failure modes.

Acceptance Criteria

  • Unit-returning submissions pass without requiring test.uplc.
  • Correct Fibonacci/Factorial submissions pass with verifier.
  • Negative fixtures fail cape test with exit code 3 and clear errors; exit code 2 when a required verifier is missing.
  • Normal submission workflows ignore fixtures.
  • CI fails when verification fails or a required test.uplc is missing.

References

  • measure/app/Main.hs
  • scenarios/*/test.uplc
  • test/fixtures/submissions/
  • scripts/cape.sh, scripts/cape-subcommands/
  • .github/copilot-instructions.md

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions