Description
Hi, @yifanmai!
Sorry for the delayed response. Getting back to the discussion in [pull request], we would like to follow your suggestion and make the implementation more efficient and better integrated into HELM.
Proposal
We are planning to refactor our solution and turn the contamination index computation into a scenario, instead of treating it as a nested HELM run. The proposed command for execution will be:
helm-run --run-entries contamination:dataset=bluex,model=ibm/granite-3.3-8b-instruct,strategy=ts_guessing_question_multichoice,language=pt --suite my-suite --max-eval-instances 1500
In this new format:
- contamination indicates that the model contamination will be computed for a given scenario;
- dataset specifies the scenario being used;
- strategy defines the contamination method applied;
- model indicates the model to be evaluated;
- language specifies the prompt language.
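To make the mapping from the run entry to these parameters concrete, here is a minimal, standalone sketch of how the string could be split into a name and keyword arguments. The helper `parse_run_entry` is hypothetical and written only for illustration; it is not HELM's own parser, which may handle more cases.

```python
from typing import Dict, Tuple


def parse_run_entry(entry: str) -> Tuple[str, Dict[str, str]]:
    """Split 'name:key=value,key=value' into (name, {key: value}).

    Hypothetical helper for illustration only, not part of HELM.
    """
    # Everything before the first colon is the run spec name.
    name, _, arg_string = entry.partition(":")
    args: Dict[str, str] = {}
    # Skip empty fragments so a trailing comma does not raise.
    for pair in filter(None, arg_string.split(",")):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"Expected key=value, got {pair!r}")
        args[key] = value
    return name, args


entry = (
    "contamination:dataset=bluex,model=ibm/granite-3.3-8b-instruct,"
    "strategy=ts_guessing_question_multichoice,language=pt"
)
name, args = parse_run_entry(entry)
print(name)
print(args)
```

Under this reading, `contamination` selects the run spec, and the remaining four pairs would be passed to it as string arguments.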
Impact
No existing files in HELM will need to be modified; only new files will be added, following the same pattern used for adding new scenarios. The following files will be included:
- contamination_run_specs.py
- contamination_scenario.py
- test_contamination_scenario.py
However, since the contamination scenario needs to access one of HELM’s datasets to perform word/option masking, it will also require adding a few auxiliary files:
- contamination_utils.py
- contamination_base.py
- prompt_translations.py
- ts_guessing_question_based.py
- ts_guessing_question_multichoice.py
These additional files ensure the full functionality of the contamination computation.
Thus, according to the proposal, contamination becomes a scenario, and the specified dataset acts as a meta-scenario.
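As an illustration of the option masking mentioned above, the sketch below hides one incorrect option of a multiple-choice instance and builds a fill-in prompt. It assumes the multichoice strategy masks a wrong option rather than the correct one; the names `MASK_TOKEN`, `mask_wrong_option`, and `build_prompt`, the prompt wording, and the sample question are all hypothetical and are not taken from the proposed files.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "[MASK]"  # hypothetical placeholder string


def mask_wrong_option(
    options: List[str],
    correct_index: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], str]:
    """Replace one randomly chosen incorrect option with MASK_TOKEN.

    Returns the masked option list and the text that was hidden.
    """
    rng = rng or random.Random(0)  # fixed seed keeps runs reproducible
    wrong_indices = [i for i in range(len(options)) if i != correct_index]
    if not wrong_indices:
        raise ValueError("Need at least one incorrect option to mask")
    masked_index = rng.choice(wrong_indices)
    masked = list(options)  # copy so the caller's list is untouched
    hidden = masked[masked_index]
    masked[masked_index] = MASK_TOKEN
    return masked, hidden


def build_prompt(question: str, masked_options: List[str]) -> str:
    """Format the masked instance as a fill-in prompt (English wording assumed)."""
    letters = "ABCDEFGHIJ"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(masked_options)]
    lines.append(f"Fill in the {MASK_TOKEN} with the missing option.")
    return "\n".join(lines)


original = ["Paris", "Lisbon", "Madrid", "Rome"]
masked, hidden = mask_wrong_option(original, correct_index=1)
prompt = build_prompt("What is the capital of Portugal?", masked)
print(prompt)
```

A contamination score could then compare the model's completion against `hidden`; how that comparison is done (exact match or a softer metric) is left open here.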
The new file structure is organized as follows:
src/
└── helm/
    └── benchmark/
        ├── run_specs/
        │   └── contamination_run_specs.py
        └── scenarios/
            └── ⭐ contamination/
                ├── contamination_scenario.py
                ├── contamination_utils.py
                ├── contamination_base.py
                ├── prompt_translations.py
                ├── ts_guessing_question_based.py
                ├── ts_guessing_question_multichoice.py
                └── test_contamination_scenario.py
We would like to confirm whether this proposal is aligned with HELM’s architecture and whether you think we can proceed with refactoring the module in this direction.