Refactoring the Contamination Scenario #3914

@IriedsonSouto

Hi, @yifanmai!

Sorry for the delayed response. Returning to the discussion in the [pull request], we would like to follow your suggestion to make the implementation more efficient and better integrated into HELM.

Proposal

We are planning to refactor our solution and turn the contamination index computation into a scenario, instead of treating it as a nested HELM run. The proposed command for execution will be:

helm-run --run-entries contamination:dataset=bluex,model=ibm/granite-3.3-8b-instruct,strategy=ts_guessing_question_multichoice,language=pt --suite my-suite --max-eval-instances 1500

In this new format:

  • contamination indicates that the model contamination will be computed for a given scenario;
  • dataset specifies the scenario being used;
  • strategy defines the contamination method applied;
  • model indicates the model to be evaluated;
  • language specifies the prompt language.
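To make the run entry format concrete, here is a minimal sketch of how the entry string breaks down into a name plus key/value arguments. This is a standalone illustration, not HELM's own parser: `RunEntry` and `parse_run_entry` are hypothetical names, and the sketch assumes that values contain no commas or equals signs.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class RunEntry:
    """Hypothetical container for a parsed run entry (not a HELM class)."""

    name: str
    args: Dict[str, str] = field(default_factory=dict)


def parse_run_entry(description: str) -> RunEntry:
    """Split 'name:key1=value1,key2=value2' into a name and an args dict."""
    name, _, arg_string = description.partition(":")
    args: Dict[str, str] = {}
    if arg_string:
        for pair in arg_string.split(","):
            key, sep, value = pair.partition("=")
            if not sep or not key:
                raise ValueError(f"Expected key=value, got {pair!r}")
            args[key] = value
    return RunEntry(name=name, args=args)


entry = parse_run_entry(
    "contamination:dataset=bluex,model=ibm/granite-3.3-8b-instruct,"
    "strategy=ts_guessing_question_multichoice,language=pt"
)
print(entry.name)
for key, value in entry.args.items():
    print(f"  {key} = {value}")
```

Under this reading, `contamination` selects the run spec, and the four arguments would be passed to it as keyword arguments.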

Impact

No existing HELM files will need to be modified. Only new files will be added, following the same pattern used for adding new scenarios. The following files will be included:

  • contamination_run_specs.py
  • contamination_scenario.py
  • test_contamination_scenario.py

However, since the contamination scenario needs to access one of HELM’s datasets to perform word/option masking, it will also require adding a few auxiliary files:

  • contamination_utils.py
  • contamination_base.py
  • prompt_translations.py
  • ts_guessing_question_based.py
  • ts_guessing_question_multichoice.py

These additional files provide everything the contamination computation needs.
Under this proposal, contamination becomes a scenario, and the specified dataset acts as a meta-scenario.
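As an illustration of the option masking mentioned above, here is a minimal sketch of what a multiple-choice masking step could look like. It assumes the strategy hides one incorrect option and asks the model to reconstruct it. The names `MultipleChoiceItem`, `mask_wrong_option`, and `MASK_TOKEN`, as well as the prompt wording, are hypothetical and may differ from the actual implementation in `ts_guessing_question_multichoice.py`.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

MASK_TOKEN = "[MASK]"  # assumed placeholder; the real module may use another token


@dataclass(frozen=True)
class MultipleChoiceItem:
    """Hypothetical stand-in for one instance loaded from the base dataset."""

    question: str
    options: List[str]
    correct_index: int


def mask_wrong_option(item: MultipleChoiceItem, rng: random.Random) -> Tuple[str, str]:
    """Hide one incorrect option; return the prompt and the hidden text."""
    wrong_indices = [i for i in range(len(item.options)) if i != item.correct_index]
    if not wrong_indices:
        raise ValueError("Item needs at least one incorrect option to mask")
    masked_index = rng.choice(wrong_indices)

    lines = [item.question]
    for i, option in enumerate(item.options):
        label = chr(ord("A") + i)
        text = MASK_TOKEN if i == masked_index else option
        lines.append(f"{label}. {text}")
    # Assumed English instruction; a language parameter such as `pt`
    # would select a translated version of this line instead.
    lines.append(f"Fill in the {MASK_TOKEN} with the missing option.")
    return "\n".join(lines), item.options[masked_index]


item = MultipleChoiceItem(
    question="What is the capital of Brazil?",
    options=["Rio de Janeiro", "Brasília", "São Paulo", "Salvador"],
    correct_index=1,
)
prompt, hidden = mask_wrong_option(item, random.Random(0))
print(prompt)
```

The hidden option would then serve as the reference for the model's completion: a model that reproduces an arbitrary wrong option verbatim has likely seen the item before.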

The new file structure is organized as follows:

src/
└── helm/
    └── benchmark/
        ├── run_specs/
        │   └── contamination_run_specs.py
        └── scenarios/
            └── ⭐ contamination/
                ├── contamination_scenario.py
                ├── contamination_utils.py
                ├── contamination_base.py
                ├── prompt_translations.py
                ├── ts_guessing_question_based.py
                ├── ts_guessing_question_multichoice.py
                └── test_contamination_scenario.py

We would like to confirm whether this proposal is aligned with HELM’s architecture and whether you think we can proceed with refactoring the module in this direction.
