best_metric_key cannot be set in config #971

@jbross-ibm-research

Description

Describe the bug

Setting best_metric_key in the config raises the exception: CheckpointingConfig.__init__() got an unexpected keyword argument 'best_metric_key'.


In https://github.com/NVIDIA-NeMo/Automodel/blob/main/nemo_automodel/recipes/llm/train_ft.py#L1044,
best_metric_key is read from configuration (e.g. to select a best metric when computing validation loss on multiple validation sets) like this:

self.best_metric_key = self.cfg.get("checkpoint.best_metric_key", "default")

However, this conflicts with the dataclass definition in https://github.com/NVIDIA-NeMo/Automodel/blob/main/nemo_automodel/components/checkpoint/checkpointing.py#L78, which does not support the best_metric_key key/parameter.

Adding the parameter to the config like this:

checkpoint:
  enabled: true
  ...
  best_metric_key: my_best_metric

leads to the exception CheckpointingConfig.__init__() got an unexpected keyword argument 'best_metric_key'.
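The failure mode can be reproduced in isolation with a stand-in dataclass. This is only a sketch: ToyCheckpointingConfig and its single field are hypothetical and do not mirror the real CheckpointingConfig, and it assumes (as the error message suggests) that the keys of the checkpoint section are passed to the dataclass as keyword arguments.

```python
from dataclasses import dataclass


@dataclass
class ToyCheckpointingConfig:
    # Hypothetical stand-in; the real CheckpointingConfig declares other fields.
    enabled: bool = True


# The "checkpoint" section of the config, represented as a plain dict.
checkpoint_section = {"enabled": True, "best_metric_key": "my_best_metric"}

try:
    ToyCheckpointingConfig(**checkpoint_section)
    error_message = None
except TypeError as exc:
    # A dataclass-generated __init__ accepts only declared fields,
    # so the undeclared key is rejected with a TypeError.
    error_message = str(exc)

print(error_message)
```

Any key that the recipe reads from the checkpoint section but that the dataclass does not declare would fail in the same way.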

Adding a best_metric_key field to the dataclass would be a quick fix, but it is unclear whether that is the desired solution.
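For discussion, here is a minimal sketch of two possible directions, using hypothetical stand-in classes rather than the real CheckpointingConfig. Option A declares the field on the dataclass, with the same "default" fallback the recipe uses. Option B leaves the dataclass unchanged and separates out undeclared keys before construction; split_known_fields is an illustrative helper, not an existing function in the repository.

```python
from dataclasses import dataclass, fields


@dataclass
class ConfigWithField:
    # Option A (hypothetical stand-in): declare the key on the dataclass.
    enabled: bool = True
    best_metric_key: str = "default"


@dataclass
class ConfigWithoutField:
    # Option B (hypothetical stand-in): the dataclass stays as it is.
    enabled: bool = True


def split_known_fields(config_cls, section):
    """Split a config section into declared dataclass fields and extra keys."""
    known_names = {f.name for f in fields(config_cls)}
    known = {k: v for k, v in section.items() if k in known_names}
    extra = {k: v for k, v in section.items() if k not in known_names}
    return known, extra


section = {"enabled": True, "best_metric_key": "my_best_metric"}

# Option A: the section is accepted unchanged.
option_a = ConfigWithField(**section)

# Option B: only declared fields reach the dataclass; the recipe reads the rest.
known, extra = split_known_fields(ConfigWithoutField, section)
option_b = ConfigWithoutField(**known)
best_metric_key = extra.get("best_metric_key", "default")
```

Option A keeps all checkpoint settings in one place, while Option B avoids putting a recipe-level setting on the checkpointing component. Which one fits the project's design is a question for the maintainers.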

Steps/Code to reproduce bug

Try using best_metric_key in config like this:

checkpoint:
  enabled: true
  ...
  best_metric_key: my_best_metric
