@orlp
Contributor

@orlp orlp commented Jul 5, 2025

This is an extension of the performance improvements seen from #141685. I noticed that the non-const TLS still didn't have the #[cold] attribute for the uninit/panic path, and I also realized that neither implementation should have the initialization or panic path inlined, ever.

These paths are taken either only once per thread (init) or never (panic, in a well-behaving Rust program), thus they don't deserve to litter the code generated each time you access a thread-local variable. So in addition to #[cold] I added the more aggressive #[inline(never)] to both cold paths as well.
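As a rough illustration of the hot/cold split described above, here is a minimal sketch. `LazySlot`, `get`, `initialize`, and `init_calls` are hypothetical names made up for this example; this is not the actual `std` thread-local storage code, only the general shape of an inlinable fast path with an out-of-line `#[cold]` fallback.

```rust
use std::cell::Cell;

// Hypothetical lazily initialized slot, standing in for a TLS storage cell.
struct LazySlot {
    value: Cell<Option<u64>>,
    // Counts how often the cold path ran (for demonstration only).
    init_calls: Cell<u32>,
}

impl LazySlot {
    const fn new() -> Self {
        LazySlot { value: Cell::new(None), init_calls: Cell::new(0) }
    }

    // Hot path: small and inlinable, one check for the common case.
    #[inline]
    fn get(&self, init: fn() -> u64) -> u64 {
        match self.value.get() {
            Some(v) => v,
            None => self.initialize(init),
        }
    }

    // Cold path: expected to run once per slot, so it is hinted as
    // unlikely and kept out of line.
    #[cold]
    #[inline(never)]
    fn initialize(&self, init: fn() -> u64) -> u64 {
        self.init_calls.set(self.init_calls.get() + 1);
        let v = init();
        self.value.set(Some(v));
        v
    }
}

fn main() {
    let slot = LazySlot::new();
    assert_eq!(slot.get(|| 42), 42); // first access takes the cold path
    assert_eq!(slot.get(|| 7), 42);  // later accesses stay on the hot path
    assert_eq!(slot.init_calls.get(), 1);
}
```

Whether `#[inline(never)]` on the fallback helps or hurts depends on how the caller uses the result, which is what the perf runs in this thread try to measure.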

@rustbot
Collaborator

rustbot commented Jul 5, 2025

r? @workingjubilee

rustbot has assigned @workingjubilee.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jul 5, 2025
@compiler-errors
Member

Not sure if this will show up at all on perf but 🤷

@bors2 try @rust-timer queue

Do you have any local benchmarks?

rust-bors bot added a commit that referenced this pull request Jul 5, 2025
Improve TLS codegen by marking the panic/init path as cold

This is an extension of the performance improvements seen from <#141685>. I noticed that the non-`const` TLS still didn't have the `#[cold]` attribute for the uninit/panic path, and I also realized that neither implementation should have the initialization or panic path inlined, ever.

These paths are taken either only once per thread (`init`) or never (`panic`, in a well-behaving Rust program), thus they don't deserve to litter the code generated each time you access a thread-local variable. So in addition to `#[cold]` I added the more aggressive `#[inline(never)]` to both cold paths as well.
@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 5, 2025
@rust-bors

rust-bors bot commented Jul 5, 2025

⌛ Trying commit db7b096 with merge 9f2c18a

To cancel the try build, run the command @bors2 try cancel.

@orlp
Contributor Author

orlp commented Jul 5, 2025

@compiler-errors No, I don't have any local benchmarks. But I look at assembly output a lot, and trust me when I say these code paths should never get inlined.

Could you restart the benchmark with my second commit included?

@compiler-errors
Member

@bors2 try @rust-timer queue

@rust-bors

rust-bors bot commented Jul 5, 2025

⌛ Trying commit cf4669e with merge 8b17150

(The previously running try build was automatically cancelled.)

To cancel the try build, run the command @bors2 try cancel.

@rust-bors

rust-bors bot commented Jul 6, 2025

☀️ Try build successful (CI)
Build commit: 8b17150 (8b17150009e237f23856ea93eb9b208049d8a621, parent: 175e04331be56c5b4bdf77478434b1a5e0556770)

@rust-timer
Collaborator

Finished benchmarking commit (8b17150): comparison URL.

Overall result: ❌✅ regressions and improvements - no action needed

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

| | mean | range | count |
|:--|:--:|:--:|:--:|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 0.0% | [0.0%, 0.0%] | 1 |
| Improvements ✅ (primary) | -0.3% | [-0.3%, -0.3%] | 1 |
| Improvements ✅ (secondary) | -0.3% | [-0.3%, -0.3%] | 1 |
| All ❌✅ (primary) | -0.3% | [-0.3%, -0.3%] | 1 |

Max RSS (memory usage)

Results (primary 5.4%, secondary 2.4%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|:--|:--:|:--:|:--:|
| Regressions ❌ (primary) | 5.4% | [4.3%, 7.1%] | 3 |
| Regressions ❌ (secondary) | 2.4% | [2.4%, 2.4%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 5.4% | [4.3%, 7.1%] | 3 |

Cycles

Results (primary 2.6%, secondary -2.8%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|:--|:--:|:--:|:--:|
| Regressions ❌ (primary) | 2.6% | [2.6%, 2.6%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -2.8% | [-2.8%, -2.8%] | 1 |
| All ❌✅ (primary) | 2.6% | [2.6%, 2.6%] | 1 |

Binary size

Results (primary 0.0%, secondary 0.1%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|:--|:--:|:--:|:--:|
| Regressions ❌ (primary) | 0.1% | [0.0%, 0.5%] | 15 |
| Regressions ❌ (secondary) | 0.1% | [0.0%, 0.1%] | 37 |
| Improvements ✅ (primary) | -0.2% | [-0.7%, -0.0%] | 5 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.0% | [-0.7%, 0.5%] | 20 |

Bootstrap: 459.09s -> 461.518s (0.53%)
Artifact size: 372.18 MiB -> 372.13 MiB (-0.01%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 6, 2025
@orlp
Contributor Author

orlp commented Jul 6, 2025

I removed some of the #[inline(never)] attributes because they pessimized codegen. I had forgotten that the get() call, which returns the TLS pointer, is still wrapped inside LocalKey and checked again to see whether a panic is required. Now this PR only adds hot paths, with #[cold] on the fallback.

Codegen is still nicer just from the addition of #[cold]: it at least moves the initialization out of the hot path (and the compiler may still decide not to inline it).
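To make the "wrapped and checked again" point concrete, here is a simplified sketch of the pattern. `Slot`, `with`, and `panic_access_error` are hypothetical stand-ins written for this example, not the real `std` internals: an accessor returns a possibly null pointer, and a `with`-style wrapper checks that pointer again before deciding whether to panic.

```rust
use std::cell::Cell;

// Hypothetical storage: `alive` stands in for the TLS state flag.
struct Slot {
    alive: Cell<bool>,
    val: u32,
}

impl Slot {
    // Returns a pointer to the value, or null if the slot is not usable.
    #[inline]
    fn get(&self) -> *const u32 {
        if self.alive.get() {
            &self.val as *const u32
        } else {
            std::ptr::null()
        }
    }
}

// Panic path: should never run in a well-behaved program, so it is cold.
#[cold]
fn panic_access_error() -> ! {
    panic!("cannot access a hypothetical slot after destruction");
}

// Wrapper in the style of `LocalKey::with`: checks the pointer a second time.
#[inline]
fn with<R>(slot: &Slot, f: impl FnOnce(&u32) -> R) -> R {
    let ptr = slot.get();
    // SAFETY: `ptr` is either null or points to `slot.val`, which outlives this call.
    match unsafe { ptr.as_ref() } {
        Some(v) => f(v),
        None => panic_access_error(),
    }
}

fn main() {
    let slot = Slot { alive: Cell::new(true), val: 5 };
    assert_eq!(with(&slot, |v| *v + 1), 6);
}
```

When `get` is inlined into `with`, the optimizer has a chance to fold the state check and the null check together; forcing part of that chain out of line can remove the opportunity, which is one plausible reading of the pessimization mentioned above.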

@lqd
Member

lqd commented Jul 6, 2025

@bors2 try @rust-timer queue

@rust-bors

rust-bors bot commented Jul 6, 2025

⌛ Trying commit 92fa8e8 with merge 9782d0a

To cancel the try build, run the command @bors2 try cancel.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 6, 2025
@rust-bors

rust-bors bot commented Jul 6, 2025

☀️ Try build successful (CI)
Build commit: 9782d0a (9782d0a1d99759de86b20e0863061637a0a3c245, parent: c83e217d268d25960a0c79c6941bcb3917a6a0af)

@rust-timer
Collaborator

Finished benchmarking commit (9782d0a): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

| | mean | range | count |
|:--|:--:|:--:|:--:|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.3% | [-0.3%, -0.3%] | 2 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results (primary 0.0%, secondary 0.0%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|:--|:--:|:--:|:--:|
| Regressions ❌ (primary) | 0.0% | [0.0%, 0.0%] | 1 |
| Regressions ❌ (secondary) | 0.0% | [0.0%, 0.0%] | 9 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.0% | [-0.0%, -0.0%] | 1 |
| All ❌✅ (primary) | 0.0% | [0.0%, 0.0%] | 1 |

Bootstrap: 461.809s -> 462.209s (0.09%)
Artifact size: 372.19 MiB -> 372.13 MiB (-0.02%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 6, 2025
}

#[cold]
unsafe fn initialize(&self) -> *const T {
Member

@ibraheemdev ibraheemdev Jul 22, 2025

Should this be marked as #[inline(never)]? I've noticed destructors::register calls in the hot-path before, for const thread-locals with destructors.

Contributor Author

I had that in an earlier PR, see #143511 (comment).
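For readers unfamiliar with the case raised in this thread, the following is a small example of a `const`-initialized thread-local whose type has a destructor, using only the public `thread_local!` API. `Counter`, `COUNTER`, `DROPS`, and `bump` are names invented for this sketch. Because `Counter` implements `Drop`, the first access on each thread has extra one-time work to do so that the destructor runs at thread exit, and that one-time work is the kind of code the discussion wants kept off the hot path.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Records destructor runs; not asserted on in this sketch.
static DROPS: AtomicUsize = AtomicUsize::new(0);

struct Counter(Cell<u32>);

impl Drop for Counter {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

thread_local! {
    // `const { ... }` selects the const-initialized flavour of thread-local.
    static COUNTER: Counter = const { Counter(Cell::new(0)) };
}

// Every call goes through `LocalKey::with`, so this is the path whose
// generated code the PR is concerned with.
fn bump() -> u32 {
    COUNTER.with(|c| {
        let n = c.0.get() + 1;
        c.0.set(n);
        n
    })
}

fn main() {
    let handle = std::thread::spawn(|| {
        bump();
        bump()
    });
    assert_eq!(handle.join().unwrap(), 2);
}
```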

#[derive(Clone, Copy)]
enum State {
    Initial,
    Uninitialized,
Member

Uninitialized isn't a great name: the value itself is very much initialised. It's just the destructor that hasn't been registered yet.

/// The resulting pointer may not be used after reentrant initialization
/// or thread destruction has occurred.
#[inline]
pub fn get(&'static self, i: Option<&mut Option<T>>, f: impl FnOnce() -> T) -> *const T {
Member

While you're at it, I think it might be beneficial to inline the ptr.addr() == 1 case into this function, as that might yield more optimized LocalKey::withs.

Comment on lines +33 to +36
if let State::Alive = self.state.get() {
    self.val.get()
} else {
    unsafe { self.get_or_init_slow() }
Member

I don't think this is beneficial – the returned pointer is later compared against null in LocalKey::with anyway, so the optimiser should be able to merge the state comparison into that.

@workingjubilee
Member

whoops, didn't mean for this to slip, but I espy a nominee

r? @joboet

@rustbot rustbot assigned joboet and unassigned workingjubilee Sep 9, 2025
@joboet joboet added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 11, 2025