
Mcollina perf improvements migration #738

Open

jdmarshall wants to merge 5 commits into siimon:master from jdmarshall:mcollinaPerf

Conversation

@jdmarshall (Contributor) commented Jan 3, 2026

This represents a cherry-pick of the changes in #727, with a few modifications to separate the testing changes and to account for PRs of mine that have already merged.

In most cases I preserved @mcollina's commits, in some cases with deletions to split problem code from otherwise serviceable code.

Unfortunately this has removed:

  • The commit to migrate the tests. Dropping it turned out to be illuminating, because doing so turned up multiple regressions.
  • CLAUDE.md
  • the tdigest changes have a net negative effect on the most commonly called function: push() - false economy
  • the findBounds() changes do not satisfy the unit tests on trunk, and are setting the wrong fields (fencepost error?)
  • the keyFrom changes are generating '|undefined|' values
    • this undermines the memory improvements of the previous code in favor of microbenchmarks (see the sketch after this list)
    • I'm only seeing a few percent performance improvement in the microbenchmarks, and the space tradeoff will eat those gains
  • osMemoryHeapLinux.js is ignoring the inline comment claiming this is a non-blocking operation; it also depends on the order in which the stats are evaluated and thus may be brittle
  • The LabelMap key sorting change is breaking a test.
    • If this class is not exposed in the public API, that may be okay; I need to double-check. If so, this should be in its own PR. But I'm unclear how much it's actually helping.
    • It does improve constructor time, but constructors are not actually fired that often, and we trade off a potential data-sharing problem.
    • Removing list comprehensions is not a substantial performance improvement, and it is a substantial reduction in code legibility. I dropped or chopped several commits which simply changed them to for loops.
  • The async optimization includes code that assumes that collect() calls are made in a certain order. The rest of this could also be a separate PR but I didn't want to tangle with it.
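
For the keyFrom item above, a hypothetical sketch (not the actual prom-client code) of how a delimiter-joined key picks up 'undefined' segments when it iterates the declared label names rather than the labels actually supplied:

function keyFrom(labelNames, labels) {
  let key = '';
  for (const name of labelNames) {
    key += '|' + labels[name]; // a missing label stringifies as 'undefined'
  }
  return key;
}

console.log(keyFrom(['method', 'code'], { method: 'GET' }));
// => '|GET|undefined'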

Conclusion

I think there are two or three commits here that warrant their own pull requests, to discuss the merits of each separately. Migrating out of Jest is interesting, but not something that should be done by Claude, as it is clearly not doing any red-green-refactoring on the tests to validate that they are still testing anything. Also, there are 5 more months of Node 20 to deal with before describe() works. And asserts have the worst DX of any matcher library; I would use chai.expect instead (and have on other projects).
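
To illustrate the matcher-DX point (illustrative snippet only, not code from this branch):

const assert = require('node:assert/strict');

const actual = { name: 'up', value: 1 };
assert.deepEqual(actual, { name: 'up', value: 1 }); // terse, reads as a bare statement

// The chai equivalent reads as a sentence:
//   const { expect } = require('chai');
//   expect(actual).to.deep.equal({ name: 'up', value: 1 });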

What is retained here are the things that have no obvious bugs and improve the code quality, regardless of whether they have much net effect on performance.

Special mention:

optimizing for empty labels.

This is a bad plan.

Stats with no labels are already many times faster than stats with labels (the benchmarks misreport it as 600x; it's much smaller), and before I did the label code rework that multiplier was much higher. Since Prometheus really is fairly useless with zero-label stats (not even server name or app version??), there is absolutely no value in my mind in making 0-labels 25% faster if it makes stats with labels 1% slower. You will have thousands of the latter and a handful of the former.

I looked at this several times and opted to go the opposite direction and speed up labels, which in the release version are IIRC more like 900x slower. Matteo did not quite have all of these changes when he started his.

Net effect

Once you factor in the changes Matteo noticed that were already in my other pull requests (since merged), remove the code with bugs, and, importantly, switch to using trunk as the baseline to eliminate cross-branch confusion, the net effect here is small. But these are the changes that seemed like they might help in real-world scenarios instead of just in microbenchmarks.

Final outcome, with 'latest' removed from the comparison to reduce confusion. Mostly in the noise floor, with a few possible percent here and there.

counter ⇒ new

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████─▕ 927,137 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 947,534 ops/sec | 14 samples (1.02x faster)

counter ⇒ inc

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 15,241,474 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████─▕ 14,653,285 ops/sec | 12 samples (1.04x slower)

counter ⇒ inc with labels

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 25,284 ops/sec |  9 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████▌▕ 25,071 ops/sec | 10 samples (1.01x slower)

gauge ⇒ inc

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏███████████████████████▌─▕ 16,650,650 ops/sec |  8 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 17,447,399 ops/sec | 13 samples (1.05x faster)

gauge ⇒ inc with labels

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 140,215 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████▌▕ 139,564 ops/sec | 11 samples (1.00x slower)

histogram ⇒ observe#1 with 64

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏███████████████████████▌─▕ 124,410 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 130,695 ops/sec | 12 samples (1.05x faster)

histogram ⇒ observe#2 with 8

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 97,870 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████─▕ 94,061 ops/sec | 10 samples (1.04x slower)

histogram ⇒ observe#2 with 4 and 2 with 2

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 51,122 ops/sec |  9 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 51,717 ops/sec | 11 samples (1.01x faster)

histogram ⇒ observe#2 with 2 and 2 with 4

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 50,296 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 50,696 ops/sec | 10 samples (1.01x faster)

histogram ⇒ observe#6 with 2

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 39,667 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████▌▕ 39,603 ops/sec | 10 samples (1.00x slower)

histogram ⇒ startTimer#1 with 64

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 60,210 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 61,351 ops/sec | 10 samples (1.02x faster)

histogram ⇒ startTimer#2 with 8

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 52,742 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 53,690 ops/sec | 10 samples (1.02x faster)

histogram ⇒ startTimer#2 with 4 and 2 with 2

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 34,852 ops/sec |  9 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 35,461 ops/sec | 11 samples (1.02x faster)

histogram ⇒ startTimer#2 with 2 and 2 with 4

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 34,517 ops/sec |  9 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 34,907 ops/sec | 11 samples (1.01x faster)

histogram ⇒ startTimer#6 with 2

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████─▕ 29,454 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 30,463 ops/sec |  9 samples (1.03x faster)

util ⇒ hashObject

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 4,261,424 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████▌▕ 4,208,851 ops/sec |  8 samples (1.01x slower)

util ⇒ LabelMap.validate()

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 16,797,115 ops/sec | 12 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████▌▕ 16,512,785 ops/sec | 10 samples (1.02x slower)

util ⇒ LabelMap.keyFrom()

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 7,209,639 ops/sec | 13 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 7,227,734 ops/sec | 12 samples (1.00x faster)

summary ⇒ observe#1 with 64

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 105,703 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████▌▕ 103,763 ops/sec | 10 samples (1.02x slower)

summary ⇒ observe#2 with 8

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 80,878 ops/sec |  9 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 81,213 ops/sec | 10 samples (1.00x faster)

summary ⇒ observe#2 with 4 and 2 with 2

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 46,491 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████▌▕ 45,926 ops/sec | 10 samples (1.01x slower)

summary ⇒ observe#2 with 2 and 2 with 4

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 44,956 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████▌▕ 44,600 ops/sec | 11 samples (1.01x slower)

summary ⇒ observe#6 with 2

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 36,361 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 36,595 ops/sec | 10 samples (1.01x faster)

registry ⇒ getMetricsAsJSON() no labels

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 427,999 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████─▕ 416,444 ops/sec | 11 samples (1.03x slower)

registry ⇒ getMetricsAsJSON() 1 x 64

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 8,138 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 8,211 ops/sec |  9 samples (1.01x faster)

registry ⇒ getMetricsAsJSON() 2 x 4

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 25,853 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████▌▕ 25,838 ops/sec |  9 samples (1.00x slower)

registry ⇒ getMetricsAsJSON() 2 x 8

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 7,343 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████▌▕ 7,286 ops/sec | 10 samples (1.01x slower)

registry ⇒ getMetricsAsJSON() 6 x 2

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 4,646 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 4,689 ops/sec | 10 samples (1.01x faster)

registry ⇒ getMetricsAsJSON() 2 x 4, 2 defaults

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 16,838 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 17,069 ops/sec | 10 samples (1.01x faster)

registry ⇒ getMetricsAsJSON() 2 x 2, 4 defaults

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 66,602 ops/sec |  9 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████▌▕ 66,161 ops/sec | 11 samples (1.01x slower)

registry ⇒ metrics() no labels

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 166,510 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 169,287 ops/sec | 10 samples (1.02x faster)

registry ⇒ metrics() no labels and openMetrics

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████─▕ 161,218 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 165,202 ops/sec | 12 samples (1.02x faster)

registry ⇒ metrics() 1 x 64

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████─▕ 2,840 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 2,921 ops/sec |  9 samples (1.03x faster)

registry ⇒ metrics() 1 x 64 and openMetrics

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████─▕ 2,833 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 2,923 ops/sec |  9 samples (1.03x faster)

registry ⇒ metrics() 2 x 4

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 10,901 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 11,082 ops/sec | 10 samples (1.02x faster)

registry ⇒ metrics() 2 x 4 and openMetrics

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏█████████████████████████▕ 10,829 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏████████████████████████▌▕ 10,662 ops/sec |  9 samples (1.02x slower)

registry ⇒ metrics() 2 x 8

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████─▕ 3,110 ops/sec |  9 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 3,218 ops/sec | 10 samples (1.03x faster)

registry ⇒ metrics() 2 x 8 and openMetrics

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 3,139 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 3,157 ops/sec |  9 samples (1.01x faster)

registry ⇒ metrics() 6 x 2

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████─▕ 2,449 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 2,544 ops/sec | 10 samples (1.04x faster)

registry ⇒ metrics() 6 x 2 and openMetrics

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████─▕ 2,443 ops/sec |  9 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 2,528 ops/sec | 10 samples (1.03x faster)

registry ⇒ metrics() 2 x 4, 2 defaults

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████▌▕ 6,914 ops/sec |  9 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 6,961 ops/sec | 10 samples (1.01x faster)

registry ⇒ metrics() 2 x 4, 2 defaults and openMetrics

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████─▕ 6,906 ops/sec | 10 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 7,102 ops/sec | 10 samples (1.03x faster)

registry ⇒ metrics() 2 x 2, 4 defaults

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████─▕ 26,211 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 26,875 ops/sec |  8 samples (1.03x faster)

registry ⇒ metrics() 2 x 2, 4 defaults and openMetrics

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████─▕ 26,328 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 27,376 ops/sec |  9 samples (1.04x faster)

cluster ⇒ aggregate()

Summary (vs. baseline):
 ⇒ prom-client@trunk         ▏████████████████████████─▕ 8.46 ops/sec | 11 samples (baseline)
 ⇒ prom-client@current       ▏█████████████████████████▕ 8.64 ops/sec | 11 samples (1.02x faster)


No significant regressions found.

jdmarshall force-pushed the mcollinaPerf branch 2 times, most recently from a3f2ce7 to 39af207 on January 3, 2026 07:29
@jdmarshall (Contributor, Author) commented Jan 3, 2026

Additional notes:

TDigest.push() consistently tests slower, but all of the other functions are faster. I don't know if that means the change should be in, or out.

Eliminating the getMetricsAsArray() call doesn't show a measurable performance improvement, but it does reduce allocations, so I retained it.

Empty Set checks also tangle with another PR I already got merged, so that gets watered down too.

Update: After further consideration, push() is the hot path and making it slower is a bad idea. Dropped.

jdmarshall and others added 3 commits January 3, 2026 01:13
continuing to attempt to merge Matteo's commits.

Remove kludge used in benchmark-regression code to see into the utils.
Faceoff has a better way to handle this.
- Refactor escapeLabelValue() and escapeString() to single-pass traversal
- Eliminate multiple .replace() calls in string escaping

Performance improvements:
- ~4% improvement in registry.metrics() serialization (3,160 -> 3,298 calls/sec)
- Reduced average time per call from 0.316ms to 0.303ms
- More efficient string processing with switch-based character escaping

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Modified by Jason Marshall <jdmarshall@users.noreply.github.com>
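
For reference, a minimal sketch of the single-pass pattern the commit describes (the escape set here is assumed from the Prometheus text format, not copied from the commit):

function escapeLabelValue(value) {
  let result = '';
  for (let i = 0; i < value.length; i++) {
    const char = value[i];
    switch (char) {
      case '\\': result += '\\\\'; break;
      case '\n': result += '\\n'; break;
      case '"': result += '\\"'; break;
      default: result += char;
    }
  }
  return result;
}

// versus the multi-pass style it replaces:
//   value.replace(/\\/g, '\\\\').replace(/\n/g, '\\n').replace(/"/g, '\\"')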
Replace this.getMetricsAsArray() with direct iteration over
this._metrics.values() to eliminate unnecessary Array.from() conversion.

Performance improvement: ~1.3% faster (3,216 -> 3,259 calls/sec)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
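
A minimal sketch of that commit's change, assuming the registry keeps its metrics in a Map (the commit names the field this._metrics):

const metrics = new Map([['up', { name: 'up' }]]);

// Before: const arr = Array.from(metrics.values()); // allocates per call
// After: iterate the Map's values directly, no intermediate array.
for (const metric of metrics.values()) {
  console.log(metric.name);
}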
jdmarshall marked this pull request as ready for review on January 3, 2026 09:29
@jdmarshall (Contributor, Author) commented:

A silver lining of this effort is that, while trying to preserve the bintree and tdigest benchmarks @mcollina wrote, I ported them to faceoff so that we could get deltas between branches. That effort surfaced a rather substantial ergonomics problem with subsuites, and I've opened a ticket to make that better.

cobblers-children/faceoff#32

…st migration

Comprehensive changelog update including:
- Performance optimizations (promise allocation, array operations, histogram, tdigest)
- Various bug fixes and refactoring

Covers commits from f6dc1a3 to 4d589c6 (17 commits total).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Modified by: Jason Marshall <jdmarshall@users.noreply.github.com>
@mcollina commented Jan 3, 2026

  • the tdigest changes have a net negative effect on the most commonly called function: push() - false economy

I do not recall exactly, but I would need to analyze again.

  • the findBounds() changes do not satisfy the unit tests on trunk, and are setting the wrong fields (fencepost error?)

This is actually one of the most significant optimizations.

  • the keyFrom changes are generating '|undefined|' values

Yes. This is critical to improve performance.

  • this undermines the memory improvements of the previous code in favor of microbenchmarks

Most of those changes were measured in real apps as well.

  • osMemoryHeapLinux.js is ignoring the inline comment claiming this is a non-blocking operation; it also depends on the order in which the stats are evaluated and thus may be brittle

It was showing up in the flamegraph as a hot operation. Moving it to async made it disappear. Easy choice.

  • The LabelMap key sorting change is breaking a test.

This was needed to fix the keyFrom changes.

  • Removing list comprehensions is not a substantial performance improvement, and it is a substantial reduction in code legibility. I dropped or chopped several commits which simply changed them to for loops.

We are not in agreement.

  • The async optimization includes code that assumes that collect() calls are made in a certain order. The rest of this could also be a separate PR but I didn't want to tangle with it.

This is a critical optimization.

@jdmarshall (Contributor, Author) commented Jan 3, 2026

I would encourage you to file findBounds() as its own PR, but with pinning tests and fixes for the issues with the existing tests, which you didn’t seem to be running as you went along. 61 files is too much for a single PR. And red builds cannot be merged regardless.

I agree that it’s an important change. I wrote #671.

It was showing up in the flamegraph as a hot operation. Moving it to async made it disappear. Easy choice.

I will have to test this and get back to you. You also have to be careful you aren’t triggering dead code elimination in benchmarks but that’s probably not what is happening here.

Yes. This is critical to improve performance [of keyFrom]

keyFrom() as you see it was written by me. The changes I made were to answer complaints of high memory usage by prom-client. Yes, it reduces string interpolation operations and so is a net performance improvement. Expanding the key names again undoes my most important contribution to this code base, and really makes the whole change about as bad as, if not worse than, the original. You are missing that I had already optimized the empty check in another PR that has now landed, and that erodes most of the gains made in the rest of the function.

for loop

I’ve tested for loops versus list comprehensions extensively in NodeJS: under benchmark.js, in production code, in production batch processing, and with tinybench. Yes, they used to be quite slow, especially in hot paths. Part of that has been fixed in later v8 versions; part of it is fixed by using arrow functions; part of it is fixed by not touching arguments. It ends up being swamped by the complexity of the operation inside the loop, especially if it ends up touching a lot of data, and particularly if you limit your discussion to Node 20 and later. It’s true that under 16 and 14 this gap was wider. But V8 is pretty good at this stuff.
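
For what it's worth, the shape of the comparison being discussed (a sketch; absolute numbers depend heavily on the Node version and on the loop body):

const data = Array.from({ length: 1000 }, (_, i) => i);

function viaMap() {
  return data.map(x => x * 2);
}

function viaForOf() {
  const out = new Array(data.length);
  let i = 0;
  for (const x of data) out[i++] = x * 2;
  return out;
}

// On modern V8 the difference is usually swamped by the work inside
// the loop body, which is the point being made above.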

If the loop code were having a large effect you would see it show up under the aggregate tests like summary and registry, and I’m just not seeing it under bench-node. If you use trunk as the baseline instead of latest, it’s easier to see what’s new versus what’s already landed.

My MO is to make all of the performance optimizations that also improve the code quality, regardless of how well or poorly they benchmark. A 2% gain done a dozen times is still a ~25% improvement, and I will take that with few caveats. Nobody can really argue with better code, even if the 2% seems like a waste of my time. What matters, though, is how much QA is necessary to validate the changes, and for that you restrict your changes to one functional area at a time instead of the entire codebase.

As a general rule I avoid reduce() and forEach(). Without the transformed array as an intermediate step, the performance gap starts to matter more, and the legibility advantage becomes suspect. Avoiding reduce() is for legibility reasons: the noun of the reduce verb is lost at the end of the action, and I’ve seen that lead to reading-comprehension problems. Except in Elixir, where the argument order is reversed.
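
A made-up example of the lost-noun problem:

const requests = [{ bytes: 120 }, { bytes: 80 }];

// With reduce, what is being built only gets a name at the assignment:
const total = requests.reduce((acc, r) => acc + r.bytes, 0);

// With a loop, the accumulator is named before the action starts:
let totalBytes = 0;
for (const r of requests) totalBytes += r.bytes;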

In batch processing, unrolling a map() to a for...of can improve throughput and throttle fan-out in async functions. But that’s much more about network latency and congestion; it’s fundamentally changing the order of operations.

https://benchmarklab.azurewebsites.net/Benchmarks/Show/34274/1/for-in-loop-vs-map#latest_results_block Be sure to look only at the Chrome results, since that’s what’s relevant to NodeJS performance. But I’m seeing similar results in Safari, which would cover Bun.

Stats should ideally be 5% of your application workload and no more, so 2% of 5% is lost in the noise floor. I had a large code base where the cross-cutting concerns, including stats, were closer to 20%, and I did enough work to find 90 ms in response time. I didn’t have to murder any map() calls to get there. Well, one, but that was merging two maps into one. I did have to upgrade NodeJS for that to be true, though.

@jdmarshall (Contributor, Author) commented:

As far as I’m aware prom-client makes no guarantees about what order stats will be evaluated in, and in fact #692 alters the order in which they are collected, achieving a 2.4x improvement in aggregation time as part of that. The intermittent event queue stalls are perhaps more troubling to users.

@jdmarshall (Contributor, Author) commented Jan 4, 2026

On the topic of osMemoryHeap:

The problem with readFileSync is that it blocks the event queue. There's a big comment in the code about why that's okay.

The alternative is to call fs/promises.readFile, which doesn't block but is also orders of magnitude slower. When you're trying to juggle a bunch of operations in a busy NodeJS container, that's generally a good thing, but the penalty for doing so is extremely high latency.
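
Sketching the trade-off (this assumes the collector reads /proc/self/status, as prom-client's Linux memory collector does; everything else is simplified):

const fs = require('fs');
const fsp = require('fs/promises');

// Blocking: low latency per call, but stalls the event loop during the read.
function collectSync() {
  return fs.readFileSync('/proc/self/status', 'utf8');
}

// Non-blocking: yields the event loop, at the cost of the much higher
// per-call latency measured below (roughly 12 µs vs 227 µs).
async function collectAsync() {
  return fsp.readFile('/proc/self/status', 'utf8');
}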

Running the benchmark in a docker container:

Summary (vs. baseline):
default metrics ⇒ osMemoryHeap ⇒ collect ⇒ prom-client@trunk (baseline)
default metrics ⇒ osMemoryHeap ⇒ collect ⇒ prom-client@current (17.78x slower)

So the wall-clock time goes from 12 µs to 227 µs, which is not terrible if it reduces blocking for other operations. The user and system time are both about the same, which suggests the amount of effort is about even; it's just that benchmarks don't work well for this sort of async code.

What I'm going to do is include the benchmarks I wrote in this branch, but we can discuss the osMemoryHeapLinux changes as their own PR. It might be worth doing that at the same time as the async collect changes, to see whether it helps or exacerbates the situation.

Setup to see if the osMemoryHeap is a hot path as reported.