Summary
Ensure benchmark result correctness by adding a verification step to the measure tool, using an optional per-benchmark verifier program (`test.uplc`), and a `cape test` subcommand that runs negative fixture submissions stored outside the main `submissions/` directory.
Motivation
- Incorrect submissions can currently pass measurement for benchmarks like Fibonacci and Factorial.
- We need a deterministic, reproducible correctness check that is separate from performance metrics.
- Negative test submissions should not pollute normal submissions.
Functional Requirements
- Submissions remain fully applied, closed UPLC programs (no parameters).
- Optional per-benchmark verifier program:
  - Path: `scenarios/{benchmark}/test.uplc`
  - Interface: a UPLC function that takes exactly one argument (the actual result of the submission program) and returns `BuiltinUnit` (`unitval`) on success; it MUST error on mismatch.
  - Optionality rule:
    - If the submission evaluates to `BuiltinUnit` (`unitval`), no `test.uplc` is required and the file may be absent.
    - If the submission evaluates to a non-unit value, `test.uplc` is required and must validate the result.
Measure Tool Changes (`measure/app/Main.hs`)
- Load and evaluate the submission program with the CEK machine under the same budget/settings used for measurement.
- If the result is `unitval`: mark correctness as passed; do not require or load `test.uplc`.
- Otherwise: load `scenarios/{benchmark}/test.uplc` (required). If it is missing, fail with a specific error (see Canonical Errors).
- Apply `(testProgram submissionResult)` at the UPLC level and evaluate with CEK.
  - Success: the application reduces to `BuiltinUnit`.
  - Failure: a CEK error or non-unit result → exit non-zero.
- Perform correctness verification before recording metrics; on failure, skip metrics and fail fast.
- `BuiltinUnit` detection should use the canonical unit value.
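As a sketch of this control flow only (the real implementation is Haskell in `measure/app/Main.hs`; `eval_submission` and `apply_verifier` below are hypothetical stand-ins for CEK evaluation, and the literal `unit` stands in for `BuiltinUnit`):

```shell
#!/usr/bin/env sh
# Sketch of the verification flow with the spec's suggested exit codes.
# eval_submission / apply_verifier are hypothetical stand-ins for CEK
# evaluation; the string "unit" stands in for BuiltinUnit.
verify_submission() {
    benchmark="$1"; submission="$2"
    result=$(eval_submission "$submission") || return 4  # malformed/invalid UPLC
    if [ "$result" = "unit" ]; then
        return 0                             # already BuiltinUnit: no verifier needed
    fi
    verifier="scenarios/$benchmark/test.uplc"
    [ -f "$verifier" ] || return 2           # missing required test.uplc
    apply_verifier "$verifier" "$result" || return 3  # did not reduce to BuiltinUnit
    return 0
}
```

Note that the missing-verifier check (exit 2) happens only after the submission has evaluated to a non-unit value, matching the optionality rule above.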
CLI / UX
- Correctness verification is enforced by default.
- An optional hidden/debug flag may bypass verification locally (off by default).
- Output should state whether the submission returned `unitval` directly or required `test.uplc`.
- Suggested exit codes:
  - 0: success
  - 2: missing required `test.uplc`
  - 3: verification failed (non-unit result or CEK error after applying the verifier)
  - 4: malformed/invalid UPLC file
  - 5: internal tool error
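A caller such as a CI step could branch on these codes as follows (`run_measure` is a mock standing in for however the measure tool is actually invoked):

```shell
#!/usr/bin/env sh
# Map the suggested exit codes to messages. run_measure is a mock;
# in practice it would invoke the measure tool on a submission.
run_measure() { return "$1"; }

interpret() {
    status=0
    run_measure "$1" || status=$?
    case "$status" in
        0) echo "success" ;;
        2) echo "missing required test.uplc" ;;
        3) echo "verification failed" ;;
        4) echo "malformed/invalid UPLC file" ;;
        5) echo "internal tool error" ;;
        *) echo "unknown exit code" ;;
    esac
}
```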
Fixtures (Negative Tests)
- Store negative fixture submissions outside the normal submissions tree:
  - Root: `test/fixtures/submissions/{benchmark}/{Compiler}_{version}_{contributor}/`
- Normal workflows (listing, measurement reports) ignore fixtures.
- `cape test` enumerates fixtures only and runs verification via the measure tool.
`cape test` Subcommand
- Implement at `scripts/cape-subcommands/test`.
- Discover fixtures under `test/fixtures/submissions/**`.
- Run the measure tool's verification for each fixture; expect at least one negative fixture per benchmark to fail with exit code 3.
- Never persist metrics or generate reports for fixtures.
- Exit non-zero if any verification fails or a required `test.uplc` is missing.
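A minimal sketch of that discovery-and-expect-failure loop, parameterized over the verification command (`verify_cmd` is a placeholder for whatever runs the measure tool's verification; names are illustrative):

```shell
#!/usr/bin/env sh
# Enumerate fixtures under a root directory and expect each verification
# to fail with exit code 3. verify_cmd is a placeholder for the real
# measure-tool invocation.
run_fixtures() {
    verify_cmd="$1"
    root="${2:-test/fixtures/submissions}"
    failures=0
    for fixture in "$root"/*/*/; do
        [ -d "$fixture" ] || continue
        status=0
        "$verify_cmd" "$fixture" || status=$?
        if [ "$status" -ne 3 ]; then
            echo "expected exit code 3 for $fixture, got $status" >&2
            failures=$((failures + 1))
        fi
    done
    return "$failures"
}
```

Returning the failure count keeps the subcommand's "exit non-zero on any problem" contract without persisting any metrics.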
File Discovery and Structure
- Verifier program (optional): `scenarios/{benchmark}/test.uplc`
- Fixture submissions (isolated): `test/fixtures/submissions/{benchmark}/{Compiler}_{version}_{contributor}/`
Example layout:

    scenarios/
      fibonacci/
        test.uplc
      factorial/
        test.uplc
    submissions/
      fibonacci/Plinth_1.52.0.0_Unisay/...
      factorial/Plinth_1.52.0.0_Unisay/...
    test/fixtures/submissions/
      fibonacci/BadResult_1.0_Test/...
      factorial/BadResult_1.0_Test/...
Canonical Errors
- Missing verifier (required):
  `Missing test.uplc for benchmark '{benchmark}' at 'scenarios/{benchmark}/test.uplc' (required because the program does not return BuiltinUnit).`
- Verification failed:
  `Verification failed for benchmark '{benchmark}' submission '{id}': (testProgram submissionResult) did not reduce to BuiltinUnit.`
- Program already unit (informational, not an error):
  `Submission result is BuiltinUnit; verifier not required.`
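Keeping these messages canonical is easiest with a single formatter per message; for example, the missing-verifier text could be produced like this (function name is illustrative):

```shell
#!/usr/bin/env sh
# Emit the canonical missing-verifier error for a benchmark name.
missing_verifier_error() {
    printf "Missing test.uplc for benchmark '%s' at 'scenarios/%s/test.uplc' (required because the program does not return BuiltinUnit).\n" "$1" "$1"
}
```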
Deliverables
- `measure/app/Main.hs`: evaluation-first correctness check; conditional verifier application; standardized errors and exit codes.
- `scenarios/fibonacci/test.uplc` and `scenarios/factorial/test.uplc`: verifiers for non-unit outputs.
- Negative fixtures under `test/fixtures/submissions/{benchmark}/...` (at least one per benchmark).
- `scripts/cape-subcommands/test`: fixture discovery and execution; honors exit codes; no metrics/reports.
- Documentation updates (README.md or `doc/usage.md`): optionality rule and interface for `test.uplc`; fixture layout; `cape test` behavior and failure modes.
Acceptance Criteria
- Unit-returning submissions pass without requiring `test.uplc`.
- Correct Fibonacci/Factorial submissions pass with their verifiers.
- Negative fixtures fail `cape test` with exit code 3 and clear errors; exit code 2 when a required verifier is missing.
- Normal submission workflows ignore fixtures.
- CI fails when verification fails or a required `test.uplc` is missing.
References
- `measure/app/Main.hs`
- `scenarios/*/test.uplc`
- `test/fixtures/submissions/`
- `scripts/cape.sh`, `scripts/cape-subcommands/`
- `.github/copilot-instructions.md`