Techempower benchmark spends 4% of time in volatile field access? #30473

jfrantzius · 2023-01-19T10:35:03Z

Hi @franz1981 , in the hope that this won't be another public exercise of a flawed test setup, here is no. 4 of the hotspots reported by JProfiler async profiling (on a Linux x86 host) of the Techempower benchmark:

Something similar shows up with async-profiler -e cycles :

This is actually a field access of a volatile field, i.e. it's not cached in CPU registers or L1 - L3 caches. Do you perhaps see something similar in your flamegraphs?

This looks like some kind of memory cleanup (Entry.recycle()) happening in threads that write responses. First thing I'd wonder is whether this could be offloaded to happen asynchronously, so the writing threads aren't blocked on this.

franz1981 · 2023-01-20T12:26:12Z

Yes, I am actually one of the JCTools developers, and it's due to skidding. The MPSC offer makes use of compare-and-set operations that, on x86, act as full barriers (in the form of lock-prefixed instructions); these are rather costly and interact with the CPU's store buffer.
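
To make the compare-and-set offer concrete, here is a minimal sketch of a bounded multi-producer/single-consumer queue. It is not the JCTools implementation: the class name `CasOfferSketch`, its fields, and its layout are assumptions chosen for illustration. The point is only that every successful `offer` goes through `compareAndSet`, which HotSpot typically compiles to `lock cmpxchg` on x86.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Illustrative sketch only, NOT the JCTools code. All names are assumptions.
final class CasOfferSketch<E> {
    private final AtomicReferenceArray<E> buffer;
    private final int mask;
    private final AtomicLong producerIndex = new AtomicLong(); // claimed by producers via CAS
    private final AtomicLong consumerIndex = new AtomicLong(); // advanced by the single consumer

    CasOfferSketch(int capacityPowerOfTwo) {
        if (Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        buffer = new AtomicReferenceArray<>(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
    }

    // May be called from any number of producer threads.
    boolean offer(E e) {
        long pIndex;
        do {
            pIndex = producerIndex.get();                  // volatile load
            if (pIndex - consumerIndex.get() > mask) {     // size has reached capacity
                return false;
            }
            // CAS claims slot pIndex; this is the expensive, barrier-like step
        } while (!producerIndex.compareAndSet(pIndex, pIndex + 1));
        buffer.lazySet((int) (pIndex & mask), e);          // ordered store publishes the element
        return true;
    }

    // Must be called from a single consumer thread only.
    E poll() {
        long cIndex = consumerIndex.get();
        int offset = (int) (cIndex & mask);
        E e = buffer.get(offset);
        if (e == null) {
            return null; // empty, or a producer claimed the slot but has not stored yet
        }
        buffer.lazySet(offset, null);                      // free the slot before advancing
        consumerIndex.lazySet(cIndex + 1);
        return e;
    }

    public static void main(String[] args) {
        CasOfferSketch<String> q = new CasOfferSketch<>(2);
        System.out.println(q.offer("a") + " " + q.offer("b") + " " + q.offer("c"));
        System.out.println(q.poll() + " " + q.poll() + " " + q.poll());
    }
}
```

In a sampled profile of code like this, the time spent on the CAS can be reported against whatever instruction follows it, which is how a cheap neighbouring read can end up looking expensive.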

You can use perf itself and/or Intel VTune (now freely available), and they will show that the cost is "near" such instructions (right after them). Sadly, the profiler can only blame the nearest bytecode index (BCI) it finds by looking forward. (perf with a Java map file will likely blame the previous one; it doesn't use AsyncGetCallTrace but the real frame pointer, and may still miss the right instruction to blame.)

A similar issue happened here too: JCTools/JCTools#288

In async-profiler you can have some fun with finer-grained investigation (still subject to the skid and inlining limitations of AsyncGetCallTrace) by using the JFR output format and converting it with async-profiler's jfr2flame converter and its --lines option. That produces a flamegraph using the relevant lines of code instead of just methods (which can still be wrong, as said above).

0 replies

jfrantzius · 2023-01-20T16:51:23Z

I was a little confused because there is a getter method lvProducerIndex(), but without the usual get prefix. It only returns the volatile field:
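
A minimal sketch of what such a getter looks like follows. It is not the JCTools source: the class `ProducerIndexSketch` and the setter are assumptions added so the example can be run. In JCTools naming, the `lv` prefix is usually read as "load volatile".

```java
// Illustrative sketch, not the JCTools source. Class name and setter are assumptions.
final class ProducerIndexSketch {
    private volatile long producerIndex;

    // "lv" = load volatile: the method body is nothing but the volatile read
    long lvProducerIndex() {
        return producerIndex;
    }

    // "sv" = store volatile; present only so the sketch can be exercised
    void svProducerIndex(long newValue) {
        producerIndex = newValue;
    }

    public static void main(String[] args) {
        ProducerIndexSketch s = new ProducerIndexSketch();
        s.svProducerIndex(42L);
        System.out.println(s.lvProducerIndex());
    }
}
```

On x86 a volatile read like this is typically compiled to an ordinary load instruction, so the getter by itself is an unlikely place for several percent of CPU time.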

Do you think async-profiler and JProfiler are both wrong in attributing so much CPU time to lvProducerIndex() ? It made sense to me, as the getter is accessing a volatile field, and if that happens frequently enough it might well have an impact.

At any rate, if Entry.recycle() has an impact as reported (whatever skidding and BCI confusion might happen further down in the call chain), then my suggestion would be to offload that line 377 from MemoryRegionCache.allocate() onto some separate thread, so it doesn't block the current thread that is writing the HTTP response. WDYT?

1 reply

franz1981 Jan 21, 2023

I think they are both wrong and I don't think using other threads will make any difference because:

such threads will likely be idle most of the time, so handing them work makes both sides pay the wake-up cost
handing work to them still implies some form of publishing that uses the same concurrent atomic primitives as the JCTools queue (in short: no difference)

In some old versions of Netty we used a different mechanism to recycle objects, based on batched thread-local offers. It was far less costly but had wider effects on GC behaviour, and was later abandoned.
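
The idea of batched thread-local offers can be sketched roughly as follows. This is a hypothetical illustration, not Netty's former Recycler code: `BatchedRecycleSketch`, `recycle`, `pollBatch`, and the batch size are all assumed names and parameters. Each releasing thread collects objects in a plain list and pays for one concurrent offer per full batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of batching recycled objects per thread before publishing them.
// Not Netty code; every name here is an assumption for illustration.
final class BatchedRecycleSketch<T> {
    private final ConcurrentLinkedQueue<List<T>> shared = new ConcurrentLinkedQueue<>();
    private final ThreadLocal<List<T>> localBatch = ThreadLocal.withInitial(ArrayList::new);
    private final int batchSize;

    BatchedRecycleSketch(int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
    }

    // Called by the releasing thread: usually a plain ArrayList.add, no atomic instruction.
    void recycle(T obj) {
        List<T> batch = localBatch.get();
        batch.add(obj);
        if (batch.size() >= batchSize) {
            shared.offer(batch);                 // one concurrent offer per batchSize objects
            localBatch.set(new ArrayList<>());   // start a fresh batch for this thread
        }
    }

    // Called by the owning thread: returns one published batch, or null if none is ready.
    List<T> pollBatch() {
        return shared.poll();
    }

    public static void main(String[] args) {
        BatchedRecycleSketch<Integer> r = new BatchedRecycleSketch<>(3);
        r.recycle(1);
        r.recycle(2);
        System.out.println(r.pollBatch()); // batch not yet full, nothing published
        r.recycle(3);
        System.out.println(r.pollBatch());
    }
}
```

One plausible reading of the GC remark is visible in the sketch: objects sit in a half-filled batch, reachable but unusable, until the batch is published.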

jfrantzius · 2023-01-25T18:14:31Z

Hi @franz1981 , with the TSC clocksource now working, I re-ran with async-profiler a couple of times. In each flamegraph, asm_sysvec_apic_timer_interrupt turns up in a different place, i.e. seemingly being invoked by a different Java method with ~3% time attributed to a different method in each run. I guess that's some interrupt code that async-profiler regards as part of thread execution, while it isn't?

So I guess if I find anything that contains asm_sysvec_apic_timer_interrupt, it can be ignored. WDYT?

1 reply

franz1981 Jan 26, 2023

The timer interrupts come from APIC IRQ handling (which is per CPU). They "seem" to be on CPU (they may or may not be, really; with skidding, nobody knows), but they happen regardless of what the program is doing in user space: they are not "provoked" by the recorded user call stack.
If the APIC timer (also called the local timer) fires within a specific method, that is for statistical reasons: a "slow" or frequently called method has a higher chance of being the one running when the APIC timer IRQ handler interrupts user code. That's why you see it there.
APIC timers should be used by the hrtimer set up by perf (by the async-profiler engine, actually) to collect call-chain perf events. I usually don't notice them unless the collection frequency is high, there are quirks related to CPU frequency variance, and/or the number of threads sampled/profiled is "too high". I suggest opening an issue on async-profiler asking why you see them and whether it means anything in terms of profiling overhead.

jfrantzius · 2023-01-26T09:29:17Z

Alright, I asked them here: asm_sysvec_apic_timer_interrupt attributed wrongly to arbitrary stack trace? #710

1 reply

jfrantzius Jan 26, 2023
Author

Alright, it turns out I was confused by the flamegraph always showing details of asm_sysvec_apic_timer_interrupt, while the HTML dump truncates them.

Please see the comment below for continued discussion.

jfrantzius · 2023-01-26T09:39:15Z

@franz1981 concerning Netty #13153, did you consider that my original observation here might be wrong, as asm_sysvec_apic_timer_interrupt turns up in the native stack trace below lvProducerLimit? Or did you also observe it yourself?

1 reply

franz1981 Jan 26, 2023

lock cmpxchg is the likely real cause of the cost and, as said, its caller doesn't show up due to skidding (that's why domain knowledge is both a bias and a useful skill here). My point on the Netty issue is to avoid expensive lock-free offers when they are not required, i.e. when the event loop both acquires and releases a pooled resource. Sadly, it won't help your case, where the cost probably arises because the resource is acquired in the blocking thread pool and released in the event loop (hence 2 different threads).
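
The distinction between same-thread and cross-thread release can be sketched as below. This is a hypothetical illustration, not Netty's Recycler or PoolThreadCache: `OwnerFastPathPool` and its methods are assumed names. A release from the owning thread goes to a plain deque with no atomic instruction, and only a release from another thread pays for a concurrent offer.

```java
import java.util.ArrayDeque;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: skip the concurrent queue when the releasing thread owns the pool.
// Not Netty code; all names are assumptions for illustration.
final class OwnerFastPathPool<T> {
    private final Thread owner = Thread.currentThread();                    // creating thread owns the pool
    private final ArrayDeque<T> local = new ArrayDeque<>();                 // touched by the owner only
    private final ConcurrentLinkedQueue<T> remote = new ConcurrentLinkedQueue<>(); // filled by other threads

    void release(T obj) {
        if (Thread.currentThread() == owner) {
            local.push(obj);      // same thread acquired and released: plain, cheap path
        } else {
            remote.offer(obj);    // cross-thread release: needs the lock-free offer
        }
    }

    // Must be called by the owner thread only.
    T acquire() {
        T obj = local.poll();
        return obj != null ? obj : remote.poll();
    }

    // Owner-thread-only helper for inspecting the sketch.
    int localCount() {
        return local.size();
    }

    int remoteCount() {
        return remote.size();
    }

    public static void main(String[] args) throws InterruptedException {
        OwnerFastPathPool<String> pool = new OwnerFastPathPool<>();
        pool.release("same-thread");
        Thread other = new Thread(() -> pool.release("other-thread"));
        other.start();
        other.join();
        System.out.println(pool.localCount() + " local, " + pool.remoteCount() + " remote");
    }
}
```

In the blocking-pool scenario described above, every release lands on the second branch, which is why this kind of fast path would not remove the cost there.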

jfrantzius · 2023-01-26T10:27:59Z

With TSC and async profiling, these are now the hotspots as reported by JProfiler (replacing the first screenshot I posted here):

2 replies

franz1981 Jan 26, 2023

I cannot see any apic related stack trace, or am I reading it wrong?

jfrantzius Jan 26, 2023
Author

It's in the second screenshot of my first post, where I tried to show the related observation from the async-profiler HTML dump.

jfrantzius · 2023-01-26T20:21:07Z

Hi @franz1981 , across a good number of flamegraphs and HTML dumps that I took, io/netty/buffer/PoolThreadCache$MemoryRegionCache.allocate consistently gets 3% - 7% of time. Is that time that async-profiler just can't attribute exactly within the call trees below?

0 replies

jfrantzius · 2023-01-30T18:22:26Z

Hi @franz1981 , as pointed out in the async-profiler discussion, all three findings are around volatile field access (for details please see there).

This is the flamegraph that found two of them at the same time, amounting to 7.18% in MemoryRegionCache.allocate():

1 reply

franz1981 Jan 31, 2023

As said, I wouldn't put much trust in the exact identification of the guilty functions here, given that there is a lot of inlining and there are full-memory-barrier-like operations in place.
You can use a compiler directive to disable inlining of specific methods (being aware of the performance impact) to make life easier for the profiler, or go deeper down the rabbit hole by using https://builds.shipilev.net/perfasm/ to produce a report with annotated assembly, which makes it easier to find the guilty methods (and makes life sad, because assembly is rarely a pleasure ;)).
That said, there is nothing unexpected, and one of the Netty maintainers will look into how to improve the Recycler for common usages (from/to the event loop).
In your case it won't help, but blocking applications are usually not easy to make mechanically sympathetic, especially when mixed with non-blocking ones: e.g. the problems of capacity planning, sizing threads, sizing the queues of the blocking executor, and cache misses while moving references between event loop <-> blocking threads.
