Mcollina perf improvements migration #738
Force-pushed a3f2ce7 to 39af207
Additional notes: TDigest.push() consistently benchmarks slower, while all of the other functions are faster; I don't know if that means the change should be in or out. Eliminating the getMetricsAsArray() call doesn't have a measurable improvement in performance, but it does reduce allocations, so I retained it. The empty-Set checks also tangle with another PR I already got merged, so that gets watered down too. Update: after further consideration, push() is the hot path and making it slower is a bad idea. Dropped.
Continuing to attempt to merge Matteo's commits. Remove the kludge used in the benchmark-regression code to see into the utils; faceoff has a better way to handle this.
- Refactor escapeLabelValue() and escapeString() to single-pass traversal
- Eliminate multiple .replace() calls in string escaping

Performance improvements:
- ~4% improvement in registry.metrics() serialization (3,160 -> 3,298 calls/sec)
- Reduced average time per call from 0.316 ms to 0.303 ms
- More efficient string processing with switch-based character escaping

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Modified by Jason Marshall <jdmarshall@users.noreply.github.com>
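For context, here is a minimal sketch of the single-pass, switch-based approach the commit describes. It is illustrative only, not the exact prom-client implementation; the escape set (backslash, newline, double quote) is the standard Prometheus label-value set:

```js
// Illustrative single-pass escape; shown to contrast with chained .replace() calls.
function escapeLabelValue(value) {
  let escaped = '';
  let start = 0;
  for (let i = 0; i < value.length; i++) {
    let replacement;
    switch (value[i]) {
      case '\\': replacement = '\\\\'; break;
      case '\n': replacement = '\\n'; break;
      case '"': replacement = '\\"'; break;
      default: continue; // common case: no escaping needed at this index
    }
    escaped += value.slice(start, i) + replacement;
    start = i + 1;
  }
  // Fast path: nothing matched, return the original string unchanged.
  return start === 0 ? value : escaped + value.slice(start);
}
```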
Replace this.getMetricsAsArray() with direct iteration over this._metrics.values() to eliminate an unnecessary Array.from() conversion.

Performance improvement: ~1.3% faster (3,216 -> 3,259 calls/sec)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
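The shape of the change, sketched under the assumption that `this._metrics` is a Map of metric instances; the `collect()` call is a hypothetical stand-in for the per-metric serialization step:

```js
// Before (illustrative): getMetricsAsArray() materializes an intermediate array.
// const metrics = Array.from(this._metrics.values());
// for (const metric of metrics) { collect(metric); }

// After: iterate the Map's values directly, skipping the Array.from() allocation.
for (const metric of this._metrics.values()) {
  collect(metric); // hypothetical per-metric serialization step
}
```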
Force-pushed 57177de to 925fb43
A silver lining of this effort: while trying to preserve the bintree and tdigest benchmarks @mcollina wrote, I ported them to faceoff so that we could get deltas between branches. That effort surfaced a rather substantial ergonomics problem with subsuites, and I've opened a ticket to make that better.
…st migration

Comprehensive changelog update including:
- Performance optimizations (promise allocation, array operations, histogram, tdigest)
- Various bug fixes and refactoring

Covers commits from f6dc1a3 to 4d589c6 (17 commits total).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Modified by: Jason Marshall <jdmarshall@users.noreply.github.com>
Force-pushed 925fb43 to 34fd4af
I do not recall exactly, but I would need to analyze again.
This is actually one of the most significant optimizations.
Yes. This is critical to improve performance.
Most of those changes were measured in real apps as well.
It was showing up in the flamegraph as a hot operation. Moving it to async made it disappear. Easy choice.
This was needed to fix the keyFrom changes.
We are not in agreement.
This is a critical optimization.
I would encourage you to file findBounds() as its own PR, but with pinning tests and with fixes for the issues in the existing tests, which you didn't seem to be running as you went along? 61 files is too much for a single PR, and red builds cannot be merged regardless. I agree that it's an important change. I wrote #671.
I will have to test this and get back to you. You also have to be careful you aren't triggering dead code elimination in benchmarks, but that's probably not what is happening here.
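For readers unfamiliar with that failure mode, a minimal sketch of guarding a benchmark body against dead code elimination, assuming a hypothetical `expensiveComputation()` under test:

```js
// If nothing observes the result, the JIT may prove the work unused and
// delete it, making the benchmark report a meaninglessly fast number.
let sink = 0; // module-level escape hatch the optimizer cannot discard

function benchmarkBody() {
  const result = expensiveComputation(); // hypothetical function under test
  sink += result.length; // consume the result so it stays live
}
```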
I've tested for loops versus list comprehensions extensively in Node.js, particularly under benchmark.js, in production code, in production batch processing, and with tinybench. Yes, they used to be quite slow, especially in hot paths. Part of that has been fixed in later V8 versions; part of it is fixed by using arrow functions; part of it is fixed by not touching […].

If the loop code were having a large effect you would see it show up under the aggregate tests like summary and registry, and I'm just not seeing it under bench-node. If you use trunk as the baseline instead of latest, it's easier to see what's new versus what's already landed.

My MO is to make all of the performance optimizations that also improve the code quality, regardless of how well or poorly they benchmark. 2% gains done a dozen times is still a 25% improvement, and I will take that with few caveats. Nobody can really argue with better code, even if the 2% seems like a waste of my time. What matters, though, is how much QA is necessary to validate the changes, and for that you restrict your changes to one functional area at a time instead of the entire codebase.

As a general rule I avoid unrolling a map() to a for loop in batch processing: https://benchmarklab.azurewebsites.net/Benchmarks/Show/34274/1/for-in-loop-vs-map#latest_results_block. Be sure to look at only the Chrome results, since that's what's relevant to Node.js performance. But I'm seeing similar results in Safari, which would cover Bun.

Stats should ideally be 5% of your application workload and no more, so 2% of 5% is lost in the noise floor. I had a large code base where the cross-cutting concerns, including stats, were closer to 20%, and I did enough work to find 90 ms in response time. I didn't have to murder any map() calls to do it.
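A sketch of the kind of comparison described above, using tinybench; the workload and names are illustrative, not the actual prom-client benchmarks:

```js
// Compare map() against a hand-rolled for loop on a small batch workload.
import { Bench } from 'tinybench';

const input = Array.from({ length: 1_000 }, (_, i) => i);
const bench = new Bench({ time: 500 });

bench
  .add('map()', () => {
    return input.map((x) => x * 2);
  })
  .add('for loop', () => {
    const out = new Array(input.length);
    for (let i = 0; i < input.length; i++) out[i] = input[i] * 2;
    return out;
  });

await bench.run();
console.table(bench.table()); // relative ops/sec for the two variants
```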
As far as I'm aware, prom-client makes no guarantees about what order stats will be evaluated in, and in fact #692 alters the order they are collected and achieves a 2.4x improvement in aggregation time as part of that. The intermittent event queue stalls are perhaps more troubling to users.
On the topic of osMemoryHeap: the problem with readFile is that it blocks the event queue. There's a big comment in the code about why that's okay. The alternative is to call the async version. Running the benchmark in a docker container:
So the wall clock time goes from 12 µs to 227 µs, which is not terrible if it reduces blocking for other operations. The user and system time are both about the same, which suggests the amount of effort is about even; it's just that benchmarks don't work well for this sort of async code. What I'm going to do is include the benchmarks I wrote in this branch, but we can discuss the osMemoryHeapLinux changes as their own PR. Might be worth doing that at the same time as the async collect changes, to see if it helps or exacerbates the situation.
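A minimal sketch of the non-blocking variant under discussion, assuming the same /proc/self/status source that osMemoryHeapLinux reads; the field parsing here is illustrative rather than the library's exact code:

```js
import { readFile } from 'node:fs/promises';

// Read process memory stats without blocking the event queue.
async function readMemoryStatus() {
  const status = await readFile('/proc/self/status', 'utf8');
  const values = {};
  for (const line of status.split('\n')) {
    // Lines look like: "VmRSS:     12345 kB"
    const match = /^(VmSize|VmRSS|VmData):\s+(\d+)\s+kB/.exec(line);
    if (match) values[match[1]] = Number(match[2]) * 1024; // bytes
  }
  return values;
}
```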
Setup to see if osMemoryHeap is a hot path, as reported.
This represents a cherry-pick of the changes in #727, with a few modifications to separate out the testing changes and to account for PRs of mine that have already merged.
In most cases I preserved @mcollina's commits, in some cases with deletions to split problem code from otherwise serviceable code.
Unfortunately this has removed:
- push(): false economy

Conclusion
I think there are two or three commits that warrant their own pull requests, to discuss the merits of each as a separate change. Migrating out of Jest is interesting, but not something that should be done by Claude, as it is clearly not doing any red-green-refactoring on the tests to validate that they are still testing anything. Also, there are 5 more months of Node 20 to deal with before describe() works. And asserts have the worst DX of any matcher library; I would use chai.expect instead (and have on other projects).

What is retained here are the things that have no obvious bugs and that improved the code quality, regardless of whether they had much net effect on performance.
Special mention:
optimizing for empty labels.
This is a bad plan.
Stats with no labels are already many times faster than stats with labels (the benchmarks misreport it as 600x; the real multiplier is much smaller), and before I did the label code rework that multiplier was much higher. Since Prometheus really is fairly useless with zero-label stats (not even server name or app version??), there is absolutely no value in my mind in making 0-labels 25% faster if it makes stats with labels 1% slower. You will have thousands of the latter and a handful of the former.
I looked at this several times and opted to go the opposite direction to speed up labels; in the release version the gap is IIRC more like 900x. Matteo did not quite have all of these changes when he started his work.
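For concreteness, standard prom-client usage for both cases; the metric names here are made up:

```js
import client from 'prom-client';

// The common case: a labeled counter, one time series per label combination.
const labeled = new client.Counter({
  name: 'http_requests_total',
  help: 'HTTP requests served',
  labelNames: ['method', 'status'],
});
labeled.inc({ method: 'GET', status: '200' });

// The rare case the dropped optimization targeted: no labels at all.
const unlabeled = new client.Counter({
  name: 'process_restarts_total',
  help: 'Process restarts',
});
unlabeled.inc();
```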
Net effect
Once you factor in changes that Matteo noticed but which were already in my other, since-merged pull requests, remove the code with bugs, and, importantly, switch to using trunk as the baseline to eliminate cross-branch confusion, the net effect here is small. But these are the changes that seemed like they might help in real-world scenarios instead of just in microbenchmarks.
Final outcome, with latest removed to reduce confusion: mostly in the noise floor, with a few possible percentage points here and there.