Automated testing of testing tools #469

mzuenni · 2025-10-11T22:23:27Z

Addresses #460

This adds the new command bt check_testing_tool, which uses the downloadable samples as well as the files found in the new directory data/testing_tool_test.

This is a bit hacky but should work in most cases?

- It assumes that the testing tool is called testing_tool<.ext>.
- It assumes that the testing tool can be called with -f <test file> <submission run command>.
- It uses a man-in-the-middle (MITM) script to get the exit codes of both the submission and the testing tool.
- The MITM script only works if the testing tool did not change the working directory.
- It assumes that the testing tool uses a non-zero exit code if something goes wrong.
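
The assumptions above can be sketched roughly as follows. This is only an illustration, not the code in bin/testing_tool.py: the function name `run_testing_tool`, its parameters and the 10 second default timeout are made up for this example, and the MITM wrapper that also captures the submission's exit code is left out.

```python
import subprocess
import sys
from typing import Sequence, Tuple


def run_testing_tool(
    tool_cmd: Sequence[str],
    test_file: str,
    submission_cmd: Sequence[str],
    timeout: float = 10.0,
) -> Tuple[bool, str]:
    """Run `<tool> -f <test file> <submission run command>` and return (ok, reason)."""
    cmd = [*tool_cmd, "-f", test_file, *submission_cmd]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # The tool (or the submission it drives) did not finish in time.
        return False, "timeout"
    if result.returncode != 0:
        # By assumption, a non-zero exit code means something went wrong.
        return False, f"exit code {result.returncode}"
    return True, "ok"


if __name__ == "__main__":
    # Stand-in "testing tool" that ignores its arguments and exits with code 3.
    fake_tool = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_testing_tool(fake_tool, "1.in", ["./submission"]))
```

Note that a timeout applied this way only kills the testing tool process itself; a submission started by the tool may need separate cleanup.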

@paul-wild want to try this?

paul-wild · 2025-10-12T11:43:04Z

I tested this on three interactive problems, Slot Machine (WF'2025 I), Where Am I Now? (WF'2024 L) and Lateral Damage (NWERC'2023 L). It works pretty well already (in particular it was quite easy to rediscover the bugs in the testing tools for the former two), but there are a few things that could/should be improved:

- The testing tool might take a different input format than the input validator (this is the case for the latter two examples), so the generate command should not try to validate the data in testing_tool_test.
- I think that two very common use cases for testing tool tests are "all secret data" or "all secret data as modified by the following script", and it would be convenient to have a shorthand for that. In the former case you can just copy all generator lines from secret, but then bt generate will shout at you because of duplicated entries. Plus, if you do it manually, you need to remember to update this when adding new secret data.
- It would be nice if the name and arguments of the testing tool were not assumed to follow a fixed scheme. I could imagine some testing tools whose behaviour may depend on additional flags instead of just an input file. This point is also linked to the previous points; you could have a wrapper script that specifies how to feed a secret case into the testing tool.
- When testing this with a buggy testing tool, a submission ran into an infinite loop. Maybe apply a timeout?
- I don't know whether you care, but this new command makes the set of bt commands no longer prefix-free (because run also exists).

I did not yet test this on any multipass problems.
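
On the prefix-free point: whether a set of command names is prefix-free can be checked mechanically. A minimal sketch, where the function name `prefix_conflicts` and the command names in the usage note are placeholders and not the actual list of bt subcommands:

```python
from itertools import combinations


def prefix_conflicts(names):
    """Return (shorter, longer) pairs where one name is a proper prefix of the other."""
    conflicts = []
    # After sorting, a name always comes before every name it is a prefix of,
    # so it is enough to test each pair in sorted order.
    for shorter, longer in combinations(sorted(set(names)), 2):
        if longer.startswith(shorter):
            conflicts.append((shorter, longer))
    return conflicts
```

For example, `prefix_conflicts(["run", "run_tool", "test"])` reports the pair `("run", "run_tool")`, while a list without such a pair yields an empty result.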

mzuenni · 2025-10-12T12:45:21Z

> The testing tool might take a different input format than the input validator (this is the case for the latter two examples), so the generate command should not try to validate the data in testing_tool_test.

~~I thought I already disabled all validation... but maybe I only disabled answer validation... need to check~~ fixed?

> I think that two very common use cases for testing tool tests are "all secret data" or "all secret data as modified by the following script" and it would be convenient to have a shorthand for that.

If it's identical, you can use the include feature. For the second, I am unsure whether a shorthand for that is good.

> When testing this with a buggy testing tool, a submission ran into an infinite loop. Maybe apply a timeout?

~~I thought there is a timeout... need to check again~~ fixed?

> I don't know whether you care, but this new command makes the set of bt commands no longer prefix-free (because run also exists).

Hmm, any name recommendations?

mzuenni · 2025-10-12T13:24:56Z

> It would be nice if the name and arguments of the testing tool were not assumed to follow a fixed scheme. I could imagine some testing tools whose behaviour may depend on additional flags instead of just an input file.

I think you can already put it inside a directory and add a run script, not sure though.

paul-wild · 2025-10-13T10:38:09Z

After the most recent change, I now receive "Error occurred during initialization of VM" for any Java/Kotlin submissions. To be specific, this is what those submissions output, which of course causes the interaction with the testing tool to fail.

The input files in testing_tool_test now indeed are no longer validated, and a timeout is properly applied.

> > I think that two very common use cases for testing tool tests are "all secret data" or "all secret data as modified by the following script" and it would be convenient to have a shorthand for that.
>
> If it's identical, you can use the include feature. For the second, I am unsure whether a shorthand for that is good.

Ah yes, include works really well in the first case. I guess it's alright if the second case doesn't get a shorthand, but it is quite inconvenient to achieve this effect currently, unless I'm missing a cleaner method. I basically placed the transformed inputs inside a subdirectory of generators/ and used copy, but of course this results in a lot of clutter.

> > It would be nice if the name and arguments of the testing tool were not assumed to follow a fixed scheme. I could imagine some testing tools whose behaviour may depend on additional flags instead of just an input file.
>
> I think you can already put it inside a directory and add a run script, not sure though.

But you wouldn't want that stuff in the attachments/ directory, right? So I'm not sure I understand this comment.
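
For illustration, the manual workaround described here (producing a transformed copy of every secret input) could look roughly like the following standalone sketch. It is not a BAPCtools feature; the function name `transform_inputs` and the idea of passing the transformation as a Python callable are assumptions made for this example.

```python
from pathlib import Path
from typing import Callable, List


def transform_inputs(
    src_dir: Path, dst_dir: Path, transform: Callable[[str], str]
) -> List[Path]:
    """Write transform(content) of every *.in file in src_dir to dst_dir, keeping names."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.in")):
        dst = dst_dir / src.name
        dst.write_text(transform(src.read_text()))
        written.append(dst)
    return written
```

The output directory would still have to be referenced from generators.yaml (for example via copy), which is exactly the clutter mentioned above.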

To give some examples for why I think more flexibility would be nice:

- If you have more than one interactive/multipass problem in a contest, you might use <problemname>_testing_tool.py to differentiate them.
- Existing testing tools might use some different conventions. The -f flag is kind of redundant, so a testing tool might not actually have it. A testing tool might also not want to use input files, e.g. if the input is just a single integer, which you could then specify as command line argument instead.

Of course one can always rewrite testing tools to fit the given format, but if one were instead able to specify where the testing tool is located and how it should be run, then you could perhaps do something like this inside the testing_tool_test entry in the generators.yaml (or even for each testdata group inside of it):

    # no -f needed, can specify path
    testing_tool: attachments/slotmachine_testing_tool.py testcase.in {solution}

    # chaining with an input transformation script
    testing_tool: generators/transform-input.py < testcase.in > testcase.in.transformed; attachments/whereaminow_testing_tool.py testcase.in.transformed {solution}
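
A sketch of how such a testing_tool entry might be expanded into an argument list. The testing_tool key and the {solution} placeholder are only the proposal from this comment, and the helper name `expand_testing_tool_command` is hypothetical. The sketch covers the first example only; the second one uses shell redirection and `;`, which would need a shell rather than a plain argument list.

```python
import shlex
from typing import List, Sequence


def expand_testing_tool_command(template: str, solution_cmd: Sequence[str]) -> List[str]:
    """Split a command template into argv, replacing the {solution} token."""
    argv = []
    for token in shlex.split(template):
        if token == "{solution}":
            # Splice in the submission's run command as separate arguments.
            argv.extend(solution_cmd)
        else:
            argv.append(token)
    return argv
```

Splicing the run command in as separate arguments, instead of substituting it into the string before splitting, avoids quoting problems when the submission path contains spaces.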

paul-wild · 2025-10-13T10:44:46Z

Though thinking about it a bit more, for newly developed interactive problems it should always be possible to choose the input format such that any valid input for the testing tool is also valid input for the interactor, even if the latter features additional behaviour such as adaptiveness or different strategies of playing a game.

mzuenni · 2025-10-13T11:57:16Z

> After the most recent change, I now receive "Error occurred during initialization of VM" for any Java/Kotlin submissions.

Argh... that's because the JVM does not handle memory restrictions and our exec call does not see the submission but only the testing_tool. Will be fixed.

mzuenni · 2025-10-13T13:19:16Z

> for newly developed interactive problems it should always be possible to choose the input format such that any valid input for the testing tool is also valid input for the interactor

Yes, and I think the same is true for the -f flag. I would also argue that this makes usage easier for teams (if we always use the same arguments).

There is still the issue with the name... run is really commonly used, so a different prefix would be nice, but bt test also already exists...

RagnarGrootKoerkamp

Didn't have a close look at testing_tool.py:run but otherwise lgtm.

You should add some docs though to explain precisely what is run when doing bt check_testing_tool, and what is the expected invocation of the testing tool.

bin/problem.py

bin/testing_tool.py

mzuenni added 2 commits October 12, 2025 00:15

add testing tool tests

36e07be

rename to resolve conflicts

ce0bfa6

mzuenni added 2 commits October 12, 2025 14:53

dont validate testing tool input

e40dc81

use program exec method to apply timeout

6a04d5a

mzuenni added 3 commits October 13, 2025 13:58

disable memory limit

e81f955

use memory limit

f8c4132

fix sanitizer for interactive problems

340a289

mzuenni added 8 commits October 13, 2025 15:27

preexec_fn = callable was never used and the callable is not even pas…

92ce0e1

…sed to Popen

cleanup

b32aa4b

more cleanup

bf64a5c

update tests

935caa2

update name and tests

8e74cbc

remove kotlin submission

3413801

dedublicate mem check

3c2eee4

cleaner check

f92c071

mzuenni requested a review from RagnarGrootKoerkamp October 20, 2025 15:26

RagnarGrootKoerkamp approved these changes Oct 20, 2025

mzuenni added 2 commits October 20, 2025 21:18

fix prints

87d5e2b

add doc

c4bf08d

mzuenni merged commit 5eff3a0 into main Oct 20, 2025
6 checks passed

mzuenni deleted the testing-tool-test branch October 20, 2025 20:21
