Hi,
I found a cool feature in the recent MLflow release that lets us monitor and log system resources (GPU, CPU, memory, network, disk, ...) during training. I am using it in Engine-based training as follows:
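import mlflow as resource_monitor

# mlflow_uri and exp_name are defined elsewhere in the training script
resource_monitor.set_tracking_uri(mlflow_uri)
resource_monitor.set_experiment(experiment_name=exp_name)
# sample system metrics every 2 seconds
resource_monitor.set_system_metrics_sampling_interval(interval=2)
resource_monitor.start_run(log_system_metrics=True)
# reuse this run name so the training logs land in the same run
run_name = resource_monitor.active_run().info.run_name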
and then, for both validation and training, similarly:
mlflow_handler = MLFlowHandler(tracking_uri=mlflow_uri, experiment_name=exp_name, run_name=run_name, ....)
resource_monitor.end_run()  # MLflow closes the active run with end_run()
This way, both the resource metrics and the training logs go to the same experiment and run. In a way this suffices, but the resource_monitor part in particular follows a linear start/stop approach rather than the Engine/Event paradigm.
I would love to hear whether it makes sense to think about enhancing this approach.
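To make the idea concrete, here is a rough sketch of what an event-driven version might look like. It is only an illustration under stated assumptions: `SystemMetricsHandler` and its `backend` parameter are hypothetical names (not part of MLflow or of any existing handler library), and `attach` assumes the Engine is a PyTorch Ignite `Engine` with the usual `Events.STARTED` / `Events.COMPLETED` events.

```python
class SystemMetricsHandler:
    """Hypothetical handler: ties MLflow system-metrics logging to Engine events."""

    def __init__(self, tracking_uri, experiment_name, sampling_interval=2, backend=None):
        if backend is None:
            import mlflow  # imported lazily so a stand-in backend can be injected
            backend = mlflow
        self.backend = backend
        self.tracking_uri = tracking_uri
        self.experiment_name = experiment_name
        self.sampling_interval = sampling_interval
        self.run_name = None  # filled in once the run has started

    def attach(self, engine):
        # Assumption: `engine` is a PyTorch Ignite Engine.
        from ignite.engine import Events
        engine.add_event_handler(Events.STARTED, self.start)
        engine.add_event_handler(Events.COMPLETED, self.stop)

    def start(self, engine):
        # Same calls as the linear version, but triggered by the Engine.
        self.backend.set_tracking_uri(self.tracking_uri)
        self.backend.set_experiment(experiment_name=self.experiment_name)
        self.backend.set_system_metrics_sampling_interval(interval=self.sampling_interval)
        run = self.backend.start_run(log_system_metrics=True)
        self.run_name = run.info.run_name

    def stop(self, engine):
        # Ends the active run, which also stops system-metrics sampling.
        self.backend.end_run()


# Usage (assuming `trainer` is an Ignite Engine, and mlflow_uri / exp_name exist):
# SystemMetricsHandler(mlflow_uri, exp_name).attach(trainer)
```

One open design question with this sketch: `run_name` is only known after the Engine has started, whereas `MLFlowHandler` takes `run_name` at construction time, so the two would still need some coordination. A run that ends with an exception would also need its own cleanup hook.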
Thanks
PS: It might make sense to include this in the MLflow integration tutorials.