Monitoring system resources during training using MLFlow #7404

kavmar · 2024-01-18T13:28:34Z

Hi,

I found a cool feature in a recent MLFlow release: it can monitor and log system resources (GPU, CPU, memory, network, disk, ...) during training. I am using it in Engine-based training as follows:

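import mlflow as resource_monitor

# mlflow_uri and exp_name are defined elsewhere in my script
resource_monitor.set_tracking_uri(mlflow_uri)
resource_monitor.set_experiment(experiment_name=exp_name)
# sample system metrics every 2 seconds
resource_monitor.set_system_metrics_sampling_interval(interval=2)
# start a run that also logs system metrics
resource_monitor.start_run(log_system_metrics=True)
# remember the run name so the training logs can go to the same run
run_name = resource_monitor.active_run().info.run_name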

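and then, for both training and validation, I create the handler in the same way and end the run once training is done:

from monai.handlers import MLFlowHandler

mlflow_handler = MLFlowHandler(tracking_uri=mlflow_uri, experiment_name=exp_name, run_name=run_name, ....)
# ... attach the handler and run training/validation ...
# mlflow has no stop_run(); the call that ends the active run is end_run()
resource_monitor.end_run()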

This way both the resource metrics and the training logs go to the same experiment and run. In a way this suffices, but resource_monitor is driven linearly, outside the Engine/Event paradigm.
I would love to hear whether it makes sense to think about enhancing this approach.
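
To make the idea a bit more concrete, here is a rough sketch of what an event-driven version might look like. It is only an illustration under assumptions, not an existing API: the class name SystemMetricsMonitor, its attach/start/stop methods and the tracker argument are all hypothetical, and attach() assumes an Ignite-style engine with add_event_handler and Events.STARTED / Events.COMPLETED. The mlflow calls are the same ones used in the snippet above, plus end_run().

```python
from typing import Any, Optional


class SystemMetricsMonitor:
    """Hypothetical helper: ties an MLflow system-metrics run to engine events."""

    def __init__(self, tracking_uri: str, experiment_name: str,
                 interval: float = 2, tracker: Optional[Any] = None) -> None:
        if tracker is None:
            # Imported lazily; assumes an MLflow release with system metrics support.
            import mlflow
            tracker = mlflow
        self.tracker = tracker
        self.tracking_uri = tracking_uri
        self.experiment_name = experiment_name
        self.interval = interval
        self.run_name: Optional[str] = None
        self._active = False

    def attach(self, engine: Any) -> None:
        # Assumption: an Ignite-style engine. Events.COMPLETED only fires on a
        # normal finish, so a failed run would need extra handling.
        from ignite.engine import Events
        engine.add_event_handler(Events.STARTED, self.start)
        engine.add_event_handler(Events.COMPLETED, self.stop)

    def start(self, engine: Any = None) -> None:
        if self._active:
            return  # already started, e.g. called eagerly before the engine ran
        t = self.tracker
        t.set_tracking_uri(self.tracking_uri)
        t.set_experiment(experiment_name=self.experiment_name)
        t.set_system_metrics_sampling_interval(interval=self.interval)
        t.start_run(log_system_metrics=True)
        self.run_name = t.active_run().info.run_name
        self._active = True

    def stop(self, engine: Any = None) -> None:
        if not self._active:
            return  # nothing to end
        self.tracker.end_run()
        self._active = False
```

One caveat with this sketch: run_name is only known after start(), so if it has to be passed to MLFlowHandler at construction time (as in my snippet), start() would still need to be called before the handler is built. In that case only the shutdown becomes event-driven, which is why start() is written to be safe to call twice.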

Thanks

PS: It might make sense to include this in the MLFlow integration tutorials.

Replies: 0 comments

Select a reply

Uh oh!

Monitoring system resources during training using MLFlow #7404

Uh oh!

kavmar Jan 18, 2024

Replies: 0 comments

kavmar
Jan 18, 2024