kernel: speed up z_stack_space_get #80391

felixturgeonmeta · 2024-10-24T18:30:50Z

The z_stack_space_get call currently checks for free space at the top of the stack by checking each byte individually. This can introduce significant runtime overhead for threads which have large, mostly unused stacks. This change updates the check to first count the free space by word, and then check the sub-word unused bytes.

andyross

Looks correct, but devil's advocacy: do we actually care about this API being fast? I mean, generally this is going to be some kind of debug or auditing hook, etc... No one should have this kind of analysis on a hot path. I guess my gut says that this is just wasting code bytes for no value?

felixturgeonmeta · 2024-10-24T18:53:37Z

> No one should have this kind of analysis on a hot path. I guess my gut says that this is just wasting code bytes for no value?

We do, which is the impetus for this change. Our systems monitor thread stack usage at a relatively high frequency with a lot of threads present.

npitre

This could be improved further for 64-bit architectures simply by using
an unsigned long rather than a uint32_t. The marker should then be
(unsigned long)0xaaaaaaaaaaaaaaaa with the explicit cast so as not to cause
a compiler warning on 32-bit systems.
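
To illustrate the cast suggested above, here is a minimal sketch. The helper names `fill_word_from_bytes` and `fill_word_from_cast` are hypothetical and not part of the Zephyr tree; the only assumption is that unused stack bytes are filled with 0xaa, as in the existing byte-wise check.

```c
#include <string.h>

/* Hypothetical helper: build a native word out of repeated 0xaa bytes. */
static unsigned long fill_word_from_bytes(void)
{
	unsigned long w;

	memset(&w, 0xaa, sizeof(w));
	return w;
}

/*
 * Hypothetical helper: the same word via the explicit cast. Where
 * unsigned long is 32 bits wide the constant is truncated to 0xaaaaaaaa,
 * and the cast keeps that truncation explicit.
 */
static unsigned long fill_word_from_cast(void)
{
	return (unsigned long)0xaaaaaaaaaaaaaaaa;
}
```

Both helpers should return the same value whether `unsigned long` is 32 or 64 bits wide, which is what lets a single comparison loop serve both word sizes.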

TaiJuWu · 2024-10-24T22:43:12Z

kernel/thread.c

Should we abstract it out as a macro?

peter-mitsis · 2024-10-28T22:57:26Z

kernel/thread.c

Since it was indicated that speed is important, I would suggest doing the unused calculation at the end of the loop. That is, do it once rather than every iteration of the loop. Maybe the compiler will do that for us, but I tend to be somewhat pessimistic about compiler smarts.

peter-mitsis · 2024-10-28T23:11:08Z

kernel/thread.c

The stack on a 64-bit platform has at least 8-byte memory alignment. Adding 4 bytes to checked_stack in the CONFIG_STACK_SENTINEL case in the preceding code block will change checked_stack to be 4-byte aligned. This has the potential to generate an alignment exception when reading a 64-bit value from a 32-bit aligned memory address. Whether such an exception would occur would be both architecture and compiler dependent.
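
The alignment concern above can be sketched as follows. `bytes_to_alignment` is a hypothetical helper, not an existing Zephyr API, and it assumes `align` is a power of two. It reports how many leading bytes have to be handled one at a time before word-sized reads become safe.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes between p and the next address
 * that is a multiple of align (align must be a power of two).
 * Returns 0 when p is already aligned.
 */
static size_t bytes_to_alignment(const uint8_t *p, size_t align)
{
	uintptr_t addr = (uintptr_t)p;

	return (size_t)((align - (addr & (align - 1))) & (align - 1));
}
```

With an 8-byte aligned stack and a 4-byte sentinel skipped, this would return 4 for an 8-byte word, so four bytes need a byte-wise check before the word loop starts. The diff later in this thread uses `ROUND_UP` for the same purpose.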

npitre · 2024-10-29T21:24:56Z

I'm somewhat perplexed by the latest push.

What is this CONFIG_STACK_ALIGN_DOUBLE_WORD about here? This appears to
be an ARM32-only config symbol that has nothing to do with the CPU word
size.

Then, if CONFIG_STACK_SENTINEL is enabled, you advance checked_stack but
not checked_stack_word, meaning that the first checked_stack_word item
will still contain the sentinel value. On the first loop iteration with i
equal to 0 the if condition will fail and this line will be executed:

        unused += ((i - 1) * STACK_WORD_SIZE);

Why the -1 here? Especially with i == 0 and size_t being unsigned,
you'll end up with a gigantic unused value.

In other words, this is doubly broken code.

Here's what I initially suggested instead:

diff --git a/kernel/thread.c b/kernel/thread.c
index 69728a403d9..3d40537bcd7 100644
--- a/kernel/thread.c
+++ b/kernel/thread.c
@@ -853,6 +853,23 @@ int z_stack_space_get(const uint8_t *stack_start, size_t size, size_t *unused_pt
 		size -= 4;
 	}
 
+	/* align to word boundary */
+	const unsigned long *checked_stack_word =
+		(const unsigned long *)ROUND_UP(checked_stack, sizeof(unsigned long));
+	size -= ((const uint8_t *)checked_stack_word - checked_stack);
+
+	/* compare using native machine word size */
+	for (size_t i = 0; i < size/sizeof(unsigned long); i++) {
+		if (checked_stack_word[i] == (unsigned long)0xaaaaaaaaaaaaaaaa) {
+			unused += sizeof(unsigned long);
+		} else {
+			break;
+		}
+	}
+	checked_stack = (const uint8_t *)checked_stack_word;
+	size -= unused;
+
+	/* compare remaining bytes */
 	for (size_t i = 0; i < size; i++) {
 		if ((checked_stack[i]) == 0xaaU) {
 			unused++;
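
The `(i - 1)` problem raised earlier in this comment can be reproduced in isolation. This is only a sketch: `STACK_WORD_SIZE` is assumed here to be `sizeof(unsigned long)`, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed definition, for illustration only. */
#define STACK_WORD_SIZE sizeof(unsigned long)

/* The expression from the review: wraps around when i == 0. */
static size_t unused_with_minus_one(size_t i)
{
	return (i - 1) * STACK_WORD_SIZE;
}

/* Zero-based index times word size: 0 when the first word is in use. */
static size_t unused_zero_based(size_t i)
{
	return i * STACK_WORD_SIZE;
}
```

Because `size_t` arithmetic is modular, `unused_with_minus_one(0)` yields a value just below `SIZE_MAX` rather than a negative number, while `unused_zero_based(0)` is 0.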

felixturgeonmeta · 2024-10-29T22:03:22Z

@npitre I tried to do 3 things in one shot and it just scrambled everything.

I updated the loop to be as you suggested.
I fixed the off-by-one error in the "unused" math. Since i is zero-based, the -1 is incorrect. This should address @peter-mitsis' comment about optimizing further for speed.

npitre · 2024-10-30T03:24:54Z

Still not right.

First, this:

    /* If no usage is detected, then the whole stack is unused, set unused to size */
    if (unused == 0) {
        unused = size;
    }

I don't understand the logic behind the above.

Then you do:

    /* Continue checking from last used word to find remaining unused bytes*/
    size -= unused;
    checked_stack = stack_start + unused;

This is wrong. You are ignoring the sentinel flag presence.
It should be as I suggested earlier.

felixturgeonmeta · 2024-10-30T16:07:18Z

@npitre I copy-pasted your exact comment and updated.

> Still not right.
>
> First, this:
>
>     /* If no usage is detected, then the whole stack is unused, set unused to size */
>     if (unused == 0) {
>         unused = size;
>     }

This is to account for a completely empty stack, since the loop does not increment "unused" one word at a time and only calculates it based on where the first "used" word is found. If a stack contains no used words, then we need to update unused to be the full size of the stack.

> I don't understand the logic behind the above.
>
> Then you do:
>
>     /* Continue checking from last used word to find remaining unused bytes*/
>     size -= unused;
>     checked_stack = stack_start + unused;
>
> This is wrong. You are ignoring the sentinel flag presence. It should be as I suggested earlier.

I am not sure what you mean here; the sentinel is accounted for above the changed code.

npitre · 2024-10-30T16:35:04Z

> This is to account for a completely empty stack since the loop does not
> increment "unused" 1-word at a time and only calculates it based on
> where the first "used" word is found. If a stack contains no used words,
> then we need to update unused to be the full size of the stack.

Sorry, this makes no sense.

felixturgeonmeta · 2024-10-30T18:45:05Z

@peter-mitsis said:

> Since it was indicated that speed is important, I would suggest doing the unused calculation at the end of the loop. That is, do it once rather than every iteration of the loop. Maybe the compiler will do that for us, but I tend to be somewhat pessimistic about compiler smarts.

So I changed the loop from:

    for (size_t i = 0; i < word_size; i++) {
        if ((checked_stack_word[i]) == unused_pattern) {
            unused += unused_pattern_size;
        }
    }

to:

    for (size_t i = 0; i < word_size; i++) {
        if (checked_stack_word[i] != (unsigned long)0xaaaaaaaaaaaaaaaa) {
            unused += (i * sizeof(unsigned long));
            break;
        }
    }

In this case, if the whole stack is unused, then unused will not be updated during the check, and must be updated after the loop to be the full size of the stack.

npitre · 2024-10-30T19:14:25Z

> In this case, if the whole stack is unused, then unused will not be
> updated during the check, and must be updated after the loop to be
> the full size of the stack.

But if the entire stack is used, unused will legitimately be 0, yet
you'll still return a result meaning the full size of the stack is unused.

The best idiomatic way to do this is:

    size_t i;
    for (i = 0; i < word_size; i++) {
        if (checked_stack_word[i] != (unsigned long)0xaaaaaaaaaaaaaaaa) {
            break;
        }
    }
    unused += i * sizeof(unsigned long);

About the sentinel issue, this is wrong:

    checked_stack = stack_start + unused;

Instead, you need:

    checked_stack = (const uint8_t *)checked_stack_word + unused;

or:

    checked_stack = (const uint8_t *)&checked_stack_word[i];

as stack_start doesn't reflect the sentinel adjustment.
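
Putting the pieces from this comment together, a self-contained sketch might look like the following. It is not the code from the pull request: `count_unused` and its `skip` parameter (standing in for the bytes reserved by the sentinel, 4 in this thread, 0 when disabled) are hypothetical. Bytes before the first word boundary are checked one at a time here, which is one possible way to handle the alignment concern.

```c
#include <stddef.h>
#include <stdint.h>

#define FILL_BYTE 0xaaU
#define FILL_WORD ((unsigned long)0xaaaaaaaaaaaaaaaa)

/* Hypothetical stand-in for the word-then-byte scan discussed above. */
static size_t count_unused(const uint8_t *stack_start, size_t size, size_t skip)
{
	size_t unused = 0;
	size_t i;

	if (skip > size) {
		return 0;
	}

	const uint8_t *p = stack_start + skip; /* past the sentinel, if any */
	const uint8_t *end = stack_start + size;

	/* Check bytes one at a time until p is word aligned. */
	while (p < end && ((uintptr_t)p % sizeof(unsigned long)) != 0) {
		if (*p != FILL_BYTE) {
			return unused;
		}
		p++;
		unused++;
	}

	/* Compare whole words; i ends up as the count of untouched words. */
	const unsigned long *w = (const unsigned long *)p;
	size_t nwords = (size_t)(end - p) / sizeof(unsigned long);

	for (i = 0; i < nwords; i++) {
		if (w[i] != FILL_WORD) {
			break;
		}
	}
	unused += i * sizeof(unsigned long);

	/* Resume from the first word that did not match, not from stack_start. */
	p = (const uint8_t *)&w[i];
	while (p < end && *p == FILL_BYTE) {
		p++;
		unused++;
	}

	return unused;
}
```

Because `i` is read after the loop, a fully used region gives 0 and a fully unused region gives its whole length, with no special case for `unused == 0`.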

cfriedt · 2024-11-01T00:12:47Z

Since there is a lot of talk about speed optimization here, it would be good to measure and report numbers.

github-actions · 2024-12-31T00:33:01Z

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

cfriedt · 2024-12-31T10:52:43Z

The thing I find confusing is that almost all of the algorithms mentioned here are still O(n) - i.e. sub-optimal.

If we're really concerned with speed, why not perform a binary search of the stack space (minus the stack sentinel) for the first non-pattern word, potentially making adjustments for a few bytes at the end? That reduces the time-complexity down to O(log n) which is at least closer to optimal.

Edit: I guess because the pattern used to fill unused stack space could theoretically end up anywhere in the stack.

npitre · 2025-01-01T17:59:44Z

> If we're really concerned with speed, why not perform a binary search of the stack space (minus the stack sentinel) for the first non-pattern word, potentially making adjustments for a few bytes at the end? That reduces the time-complexity down to `O(log n)` which is at least closer to optimal.

Can't do. The stack is not used fully in a contiguous manner due to alignment requirements, etc. This means you're likely to land on still untouched spots here and there with the original pattern in the middle of the actually used stack space.
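
A small sketch of why the binary search can go wrong, assuming a byte-granular scan for simplicity. Both function names are hypothetical. `binary_unused` is only correct if every byte after the first used one differs from the fill pattern, which is exactly the property described above as not holding.

```c
#include <stddef.h>
#include <stdint.h>

/* Linear scan: counts leading fill bytes and stops at the first used one. */
static size_t linear_unused(const uint8_t *s, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (s[i] != 0xaaU) {
			break;
		}
	}
	return i;
}

/*
 * Binary search for the boundary. It assumes the region is "all fill,
 * then all non-fill", so a fill-valued hole inside the used part can
 * steer it to the wrong side.
 */
static size_t binary_unused(const uint8_t *s, size_t n)
{
	size_t lo = 0;
	size_t hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (s[mid] == 0xaaU) {
			lo = mid + 1;
		} else {
			hi = mid;
		}
	}
	return lo;
}
```

For a 16-byte region where bytes 0 to 3 are untouched and bytes 8 to 11 are an untouched hole inside the used part, the linear scan reports 4 unused bytes while the binary search reports 12, overstating the free space.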

github-actions · 2025-03-03T00:36:09Z

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

zephyrbot added the area: Kernel label Oct 24, 2024

zephyrbot requested review from TaiJuWu, andyross, ceolin, cfriedt, dcpleung, nashif, npitre and peter-mitsis October 24, 2024 18:31

zephyrbot assigned andyross and peter-mitsis Oct 24, 2024

andyross reviewed Oct 24, 2024

npitre requested changes Oct 24, 2024

felixturgeonmeta force-pushed the fturgeon/z_stack_space_update branch 2 times, most recently from 3262637 to babfb68 Compare October 24, 2024 20:25

felixturgeonmeta requested a review from npitre October 24, 2024 20:29

TaiJuWu reviewed Oct 24, 2024

peter-mitsis requested changes Oct 28, 2024

felixturgeonmeta force-pushed the fturgeon/z_stack_space_update branch from babfb68 to b1bb1c6 Compare October 29, 2024 18:35

felixturgeonmeta requested a review from peter-mitsis October 29, 2024 20:59

felixturgeonmeta force-pushed the fturgeon/z_stack_space_update branch from b1bb1c6 to 8962635 Compare October 29, 2024 22:01

felixturgeonmeta force-pushed the fturgeon/z_stack_space_update branch from 8962635 to f46d088 Compare October 31, 2024 18:23

felixturgeonmeta force-pushed the fturgeon/z_stack_space_update branch from f46d088 to 2074c00 Compare October 31, 2024 20:26

github-actions bot added the Stale label Dec 31, 2024

github-actions bot removed the Stale label Jan 1, 2025

github-actions bot added the Stale label Mar 3, 2025

github-actions bot closed this Mar 17, 2025
