Add support for collecting metrics programmatically #33

@VassilisVassiliadis

Issue

I would like to use fms-hf-tuning to collect system-level metrics while fine-tuning models. These include model load time as well as device-related metrics (like those that AIM collects).

One way would be to rely only on the measurements that AIM collects: point fms-hf-tuning at an AIM server, then query AIM to retrieve the data. However, this is restrictive: we cannot collect custom data, we cannot sample system metrics at a period other than the 30 seconds that AIM uses, and we cannot collect any data at all unless we spin up an AIM server.

A more convenient solution would be to add an optional parameter to train() that accepts a list of callbacks. train() currently has this signature:

def train(
    model_args: configs.ModelArguments,
    data_args: configs.DataArguments,
    train_args: configs.TrainingArguments,
    peft_config: Optional[Union[peft_config.LoraConfig, peft_config.PromptTuningConfig]] = None,
):
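To make the request concrete, here is a minimal sketch of the kind of callback I have in mind. It assumes the callbacks would follow the `transformers.TrainerCallback` interface (`on_train_begin`, `on_log`, `on_train_end`). The class name `MetricsCollectorCallback` and the `callbacks=` parameter on train() are hypothetical; neither exists in fms-hf-tuning today.

```python
import time
from typing import Any, Dict, List, Optional

try:
    from transformers import TrainerCallback
except ImportError:
    # Fallback so this sketch still runs where transformers is not installed.
    class TrainerCallback:  # type: ignore[no-redef]
        pass


class MetricsCollectorCallback(TrainerCallback):
    """Hypothetical callback that keeps metrics in memory for the caller."""

    def __init__(self) -> None:
        self.train_started_at: Optional[float] = None
        self.train_seconds: Optional[float] = None
        self.logs: List[Dict[str, Any]] = []

    def on_train_begin(self, args, state, control, **kwargs):
        # Record when the training loop starts.
        self.train_started_at = time.perf_counter()

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Keep a copy of every metrics dict the trainer logs.
        if logs:
            self.logs.append(dict(logs))

    def on_train_end(self, args, state, control, **kwargs):
        # Wall-clock duration of the training loop.
        if self.train_started_at is not None:
            self.train_seconds = time.perf_counter() - self.train_started_at
```

With the proposed (not yet existing) parameter, the caller would pass an instance as `train(model_args, data_args, train_args, callbacks=[collector])` and read `collector.logs` afterwards. Device metrics could be sampled in the same hooks at whatever period the caller chooses.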

In the same spirit, I'd like to get access to the TrainOutput object that sft_trainer.train() returns here (input_tokens_per_second, train_runtime, etc.):

trainer.train()

(just by returning the output of trainer.train() as the output of train()).
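A rough sketch of the change, using stand-ins so it runs on its own: `FakeTrainer` is a hypothetical placeholder for the trainer that train() builds internally, the local `TrainOutput` only mirrors the fields of `transformers.trainer_utils.TrainOutput` (`global_step`, `training_loss`, `metrics`), and the numbers are placeholders rather than real measurements.

```python
from typing import Any, Dict, List, NamedTuple, Optional


class TrainOutput(NamedTuple):
    """Stand-in with the same fields as transformers.trainer_utils.TrainOutput."""
    global_step: int
    training_loss: float
    metrics: Dict[str, float]


class FakeTrainer:
    """Hypothetical stand-in for the trainer created inside train()."""

    def __init__(self, callbacks: Optional[List[Any]] = None) -> None:
        self.callbacks = list(callbacks or [])

    def train(self) -> TrainOutput:
        # Placeholder values; a real trainer reports measured metrics.
        return TrainOutput(
            global_step=10,
            training_loss=0.25,
            metrics={"train_runtime": 1.5},
        )


def train(
    model_args=None,
    data_args=None,
    train_args=None,
    peft_config=None,
    callbacks: Optional[List[Any]] = None,  # proposed parameter, an assumption
) -> TrainOutput:
    trainer = FakeTrainer(callbacks=callbacks)
    # The requested change: return the result instead of discarding it.
    return trainer.train()
```

The caller could then read `train(...).metrics["train_runtime"]` directly instead of parsing logs.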

Done when

  • support collecting custom metrics via custom callbacks
  • return the output of trainer.train() to the caller of train()
