
Use PSS to monitor memory usage #23

Open
boesr wants to merge 3 commits into chpc-uofu:main from boesr:main

Conversation

@boesr (Contributor) commented Jan 20, 2026

This PR introduces a new configuration option, CGROUP_WARDEN_IGNORE_CACHE, which changes how memory usage is calculated for cgroups. When enabled, the warden calculates memory usage as the sum of PSS (Proportional Set Size) across all processes within the cgroup, instead of relying on the default cgroup memory statistics, which include the filesystem cache.

On systems with high I/O (e.g. export nodes for file transfers), the Linux kernel's page cache can grow significantly. Since cgroup v2 includes this cache in the memory.current statistics, the cgroup-warden might report high memory usage and trigger limits, even if the actual application memory (RSS/PSS) is well within limits.

Previously, users often received violation emails for memory usage that included the filesystem cache. However, since the cache was not explicitly shown or broken down in the attached usage diagrams, it was confusing for users to understand why they were flagged for a policy violation. By switching to PSS-based reporting, we ensure that the metrics and alerts align with the actual memory pressure caused by the user's processes.

@jay-mckay jay-mckay self-requested a review January 21, 2026 22:27
@jay-mckay (Contributor):

Thanks for the PR. We had not realized that the file cache could become as large as you describe. After some discussion, we believe that removing the filesystem cache from reported memory usage is the preferred way to account for memory, as the cache is unpredictable and outside the user's direct control. Because of this, we should modify the collectors for both v1 and v2 to use a summed PSS for the cgroup_warden_memory_usage_bytes value, much like you have done.

This will not need to be configured, so we can remove the extra configuration variable you have added. We can also streamline the summation of the process PSS by reusing the already collected process information, i.e.

var totalPSS float64 // new
for name, p := range procs {
	totalPSS += float64(p.memoryPSSTotal) // new
	ch <- prometheus.MustNewConstMetric(c.procCPU, prometheus.CounterValue, float64(p.cpuSecondsTotal), cg, info.Username, name)
	ch <- prometheus.MustNewConstMetric(c.procMemory, prometheus.GaugeValue, float64(p.memoryBytesTotal), cg, info.Username, name)
	ch <- prometheus.MustNewConstMetric(c.procPSS, prometheus.GaugeValue, float64(p.memoryPSSTotal), cg, info.Username, name)
	ch <- prometheus.MustNewConstMetric(c.procCount, prometheus.GaugeValue, float64(p.count), cg, info.Username, name)
}
// name is scoped to the loop above; the cgroup-level usage metric only
// needs the cgroup and username labels. // new
ch <- prometheus.MustNewConstMetric(c.memoryUsage, prometheus.GaugeValue, totalPSS, cg, info.Username) // new

We can make these changes as well, if you would prefer.


@boesr (Contributor, Author) commented Jan 22, 2026

I reverted the previous changes and adjusted the collector to calculate the total PSS. I still need to test it, but this includes the changes you requested.

@boesr (Contributor, Author) commented Jan 22, 2026

I have just tested it on one of our export nodes. Everything seems to be working, and the cache is no longer included.

@jay-mckay (Contributor):

Changes look good to me. I will do some testing on our systems as well to double-check, and then we can merge this in.

@jay-mckay changed the title from "Add support for ignoring memory cache via PSS summation (optionally)" to "Use PSS to monitor memory usage" on Jan 23, 2026