4 changes: 4 additions & 0 deletions env/common/templates/galaxy/config/tpv/defaults.yaml.j2
@@ -21,6 +21,10 @@ tools:
time: null
default_time: "36:00:00"
xdg_cache_home: null
params:
Member

That ends up in the Galaxy database, doesn't it?
I would not recommend this. Does it offer anything the cgroup metrics don't?

Member Author

Yes, the goal is to know how much we allocated originally and compare it to the cgroup metrics to determine wastage. I don't think there's an easy way to figure that out from the cgroup. And won't the cgroup metrics eventually end up in the database as well?

Member

I suppose that depends on what you record in the cgroup metrics. We have both allocated and peak memory, as well as allocated CPU and runtime. Yes, cgroup metrics are already in the database, but we should not add more if it's redundant information. I think we already record everything we need for regression analysis; I made a start in https://github.com/mvdbeek/tpv-regression.

Member Author

That's great! This should fit in nicely at some point with: galaxyproject/tpv-shared-database#64 and https://github.com/nuwang/tpv-db-optimizer
How are you getting allocated CPU and memory? Which cgroup metrics are you using?

Member Author

@nuwang, Sep 8, 2025

Member

Ultimately I would use api/jobs/{job_id}/metrics, but as a quick-and-dirty approach I just use gxadmin. The metrics are not customized; it's just the standard set: galaxy_memory_mb for memory and galaxy_slots for CPUs.
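
To make the comparison concrete, here is a rough sketch of how memory wastage could be derived from the standard metrics alone. It assumes the metrics arrive as a list of entries with `name` and `raw_value` fields, and that peak memory is recorded in bytes under a cgroup metric such as `memory.peak` or `memory.max_usage_in_bytes`; both the record shape and the peak-memory names are assumptions to check against your own instance. The helper names `metric_value` and `memory_wastage` are hypothetical.

```python
from typing import Mapping, Optional, Sequence

# galaxy_memory_mb is the allocated-memory metric named in this thread.
# The peak-memory metric names below are ASSUMPTIONS; they depend on the
# cgroup version and on what the instance actually records.
ALLOCATED_MEMORY_NAME = "galaxy_memory_mb"
PEAK_MEMORY_NAMES = ("memory.peak", "memory.max_usage_in_bytes")

Metric = Mapping[str, str]


def metric_value(metrics: Sequence[Metric], name: str) -> Optional[float]:
    """Return the raw value of the first metric called `name`, or None."""
    for metric in metrics:
        if metric.get("name") == name:
            return float(metric["raw_value"])
    return None


def memory_wastage(metrics: Sequence[Metric]) -> Optional[float]:
    """Fraction of allocated memory never used at peak (0.0 to 1.0).

    Returns None when either the allocation or the peak is missing.
    """
    allocated_mb = metric_value(metrics, ALLOCATED_MEMORY_NAME)
    peak_bytes = None
    for name in PEAK_MEMORY_NAMES:
        peak_bytes = metric_value(metrics, name)
        if peak_bytes is not None:
            break
    if allocated_mb is None or peak_bytes is None or allocated_mb <= 0:
        return None
    peak_mb = peak_bytes / 1024**2
    # Clamp at zero: a job can briefly exceed its nominal allocation.
    return max(0.0, 1.0 - peak_mb / allocated_mb)


# Hypothetical sample: 4096 MB allocated, 1 GiB peak usage.
sample = [
    {"name": "galaxy_slots", "raw_value": "2.0000000"},
    {"name": "galaxy_memory_mb", "raw_value": "4096.0000000"},
    {"name": "memory.peak", "raw_value": "1073741824"},
]
print(memory_wastage(sample))
```

If this holds for a given deployment, the allocation is already recoverable from the standard metric set, which is the redundancy concern raised above.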

tpv_cores: '{cores}'
tpv_gpus: '{gpus}'
tpv_mem: '{mem}'
env:
- execute: ulimit -c 0
# tools in the shared DB override this, which then breaks some tools, so set on the destinations instead for