
Question: Specifying a custom set of metrics to helm-run. #3915

@Erotemic

Description

I'd like to run a subset of the HEIM benchmarks, but without computing all of the default metrics. Currently I'm running:

helm-run --run-entries mscoco:model=huggingface/stable-diffusion-v1-4 \
    --suite my-heim-suite \
    --max-eval-instances 10 --num-threads 1

I see in helm.benchmark.run_specs.heim_run_specs there is get_mscoco_spec, which is registered as a run spec function and ultimately calls into get_core_heim_metric_specs to create the metric class instances. In this example the final constructed run spec contains these metrics:

metric_specs=[
    MetricSpec(class_name='helm.benchmark.metrics.image_generation.lpips_metrics.LearnedPerceptualImagePatchSimilarityMetric', args={}),
    MetricSpec(class_name='helm.benchmark.metrics.image_generation.multi_scale_ssim_metrics.MultiScaleStructuralSimilarityIndexMeasureMetric', args={}),
    MetricSpec(class_name='helm.benchmark.metrics.image_generation.psnr_metrics.PeakSignalToNoiseRatioMetric', args={}),
    MetricSpec(class_name='helm.benchmark.metrics.image_generation.uiqi_metrics.UniversalImageQualityIndexMetric', args={}),
    MetricSpec(class_name='helm.benchmark.metrics.image_generation.aesthetics_metrics.AestheticsMetric', args={}),
    MetricSpec(class_name='helm.benchmark.metrics.image_generation.clip_score_metrics.CLIPScoreMetric', args={}),
    MetricSpec(class_name='helm.benchmark.metrics.image_generation.efficiency_metrics.EfficiencyMetric', args={}),
    MetricSpec(class_name='helm.benchmark.metrics.image_generation.fractal_dimension_metric.FractalDimensionMetric', args={}),
    MetricSpec(class_name='helm.benchmark.metrics.image_generation.watermark_metrics.WatermarkMetric', args={}),
    MetricSpec(class_name='helm.benchmark.metrics.basic_metrics.BasicGenerationMetric', args={'names': []}),
    MetricSpec(class_name='helm.benchmark.metrics.basic_metrics.BasicReferenceMetric', args={}),
    MetricSpec(class_name='helm.benchmark.metrics.basic_metrics.InstancesPerSplitMetric', args={}),
]

But I don't think there is any way for me to modify that run entry description to change the metrics. As far as I can tell, the only keys I can specify there are the RUN_EXPANDERS keys, i.e.:

    'instructions'
    'prompt'
    'newline'
    'stop'
    'format_prompt'
    'follow_format_instructions'
    'add_to_stop'
    'global_prefix'
    'num_train_trials'
    'max_train_instances'
    'max_eval_instances'
    'num_outputs'
    'num_trials'
    'model'
    'model_deployment'
    'data_augmentation'
    'tokenizer'
    'num_prompt_tokens'
    'num_output_tokens'
    'chatml'
    'eval_split'
    'output_format_instructions'
    'temperature'
    'increase_temperature'
    'increase_max_tokens'
    'process_output'

Or a key that matches an argument of the function registered with run_spec_function("mscoco"), which in this case is get_mscoco_spec.
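To illustrate that mechanism with a simplified sketch (this is not HELM's actual parser, just my mental model of how a run entry string becomes a run spec function call), an entry like `mscoco:model=...,compute_fid=True` splits into a scenario name plus keyword arguments:

```python
# Simplified illustration of the "key that matches a function argument"
# behavior. NOT HELM's real parsing code.

def parse_run_entry(entry: str) -> tuple:
    """Split 'name:k1=v1,k2=v2' into (name, kwargs)."""
    name, _, arg_str = entry.partition(":")
    kwargs = {}
    for pair in filter(None, arg_str.split(",")):
        key, _, value = pair.partition("=")
        kwargs[key] = value
    return name, kwargs

name, kwargs = parse_run_entry(
    "mscoco:model=huggingface/stable-diffusion-v1-4,compute_fid=True"
)
# name selects the run spec function; each key must match either a
# RUN_EXPANDERS key or a parameter of get_mscoco_spec.
```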

So my thought is that I have to create a custom run spec, and I worked from the information here: https://crfm-helm.readthedocs.io/en/latest/adding_new_scenarios/#custom-run-spec-function (this doc could use a bit more detail, and ideally a small tutorial).

I created a file in the current working directory:

echo '
from helm.benchmark.run_spec import RunSpec, run_spec_function
from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.metrics.metric import MetricSpec
from helm.benchmark.run_specs.classic_run_specs import get_basic_metric_specs
from helm.benchmark.scenarios.scenario import ScenarioSpec
from typing import List
from helm.benchmark.run_specs.heim_run_specs import get_image_generation_adapter_spec


def get_my_core_heim_metric_specs() -> List[MetricSpec]:
    """Evaluate every image with these set of metrics."""
    return [
        MetricSpec(class_name="helm.benchmark.metrics.image_generation.clip_score_metrics.CLIPScoreMetric", args={}),
    ] + get_basic_metric_specs(names=[])


@run_spec_function("my_mscoco")
def get_my_mscoco_spec(
    for_efficiency: bool = False,
    compute_fid: bool = False,
    run_human_eval: bool = False,
    num_human_examples: int = 100,
    use_perturbed: bool = False,
    skip_photorealism: bool = False,
    skip_subject: bool = False,
) -> RunSpec:
    scenario_spec = ScenarioSpec(
        class_name="helm.benchmark.scenarios.image_generation.mscoco_scenario.MSCOCOScenario", args={}
    )

    adapter_spec: AdapterSpec
    metric_specs: List[MetricSpec]
    run_spec_name: str

    adapter_spec = get_image_generation_adapter_spec(num_outputs=4)
    metric_specs = get_my_core_heim_metric_specs()
    run_spec_name = "my_mscoco"

    return RunSpec(
        name=run_spec_name,
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=metric_specs,
        groups=[run_spec_name],
    )
' > helm_my_run_specs.py

Running helm-run by itself did not pick up the new file, even though it conformed to the naming scheme, but I think that is a Python path issue. When I explicitly set PYTHONPATH=., it found the new spec, and I was able to run:

PYTHONPATH=. helm-run --run-entries my_mscoco:model=huggingface/stable-diffusion-v1-4 --suite my-heim-suite --max-eval-instances 10 --num-threads 1 --log-config ./helm_debug_log_config.yaml

But this outputs the results to a different folder. I'm not sure whether setting the run spec name to "mscoco" instead would cause conflicts.
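My guess about why PYTHONPATH=. was needed (an assumption; I haven't traced HELM's actual discovery code) is that module discovery scans directories on sys.path for importable modules matching the naming scheme, so the current working directory is only visible when it is on sys.path. Something roughly like:

```python
import pkgutil

def find_run_spec_modules(prefix: str = "helm_") -> list:
    # Scan every directory on sys.path for importable top-level modules
    # whose names start with the given prefix. The current working
    # directory is only scanned if it is on sys.path (hence PYTHONPATH=.).
    return sorted(
        info.name for info in pkgutil.iter_modules() if info.name.startswith(prefix)
    )
```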

I'm wondering if:

  1. Is this the best way to compute a subset of metrics for an existing run spec?
  2. Would it be worth adding something to HELM that makes it easier to point at a specific file containing custom run specs, rather than relying on a fragile naming scheme? (Also, I'm not even sure where the helm_*_run_specs pattern matching happens; the only discovery mechanism I see is discover_run_spec_functions.) It might be more intuitive if an argument or environment variable let the user specify the module to import that contains their customized run specs.
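As a sketch of what I mean by point 2 (the HELM_RUN_SPEC_MODULES variable name is hypothetical, not an existing HELM feature):

```python
import importlib
import os

def import_user_run_spec_modules(env_var: str = "HELM_RUN_SPEC_MODULES") -> list:
    """Import each comma-separated module named in the environment variable.

    Because @run_spec_function registers run specs at import time, simply
    importing the user's module would make their custom specs available,
    with no reliance on a file naming convention.
    """
    modules = []
    for name in filter(None, os.environ.get(env_var, "").split(",")):
        modules.append(importlib.import_module(name.strip()))
    return modules
```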
