Skip to content

Conversation

@felixturgeonmeta
Copy link
Contributor

The z_stack_space_get call currently checks for free space at the top of the stack by checking each byte individually. This can introduce significant runtime overhead for threads which have large, mostly unused stacks. This change updates the check to first count the free space by word, and then check the sub-word unused bytes.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct, but devil's advocacy: do we actually care about this API being fast? I mean, generally this is going to be some kind of debug or auditing hook, etc... No one should have this kind of analysis on a hot path. I guess my gut says that this is just wasting code bytes for no value?

@felixturgeonmeta
Copy link
Contributor Author

No one should have this kind of analysis on a hot path. I guess my gut says that this is just wasting code bytes for no value?

We do, which is the impetus for this change. Our systems monitor thread stack usage at a relatively high frequency with a lot of threads present.

Copy link

@npitre npitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be improved further for 64-bit architectures simply by using
an unsigned long rather than a uint32_t. The marker should then be
(unsigned long)0xaaaaaaaaaaaaaaaa with the explicit cast so not to cause
a compiler warning on 32-bit systems.

@felixturgeonmeta felixturgeonmeta force-pushed the fturgeon/z_stack_space_update branch 2 times, most recently from 3262637 to babfb68 Compare October 24, 2024 20:25
kernel/thread.c Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we abstract it out as macro?

kernel/thread.c Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it was indicated that speed is important, I would suggest doing the unused calculation at the end of the loop. That is, do it once rather than every iteration of the loop. Maybe the compiler will do that for us, but I tend to be somewhat pessimistic about compiler smarts.

kernel/thread.c Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stack on a 64-bit platform has at least 8-byte memory alignment. Adding 4 bytes to checked_stack in the CONFIG_STACK_SENTINEL case in the preceding code block will change checked_stack to be 4-byte aligned. This has the potential to generate an alignment exception when reading a 64-bit value from a 32-bit aligned memory address. Whether such an exception would occur would be both architecture and compiler dependent.

@npitre
Copy link

npitre commented Oct 29, 2024

I'm somewhat perplexed by the latest push.

What is this CONFIG_STACK_ALIGN_DOUBLE_WORD about here? This appears to
be an ARM32-only config symbol that has nothing to do with the CPU word
size.

Then, if CONFIG_STACK_SENTINEL is enabled, you advance checked_stack but
not checked_stack_word, meaning that the first checked_stack_word item
will still contain the sentinel value. On the first loop iteration with i
equal to 0 the if condition will fail and this line will be executed:

        unused += ((i - 1) * STACK_WORD_SIZE);

Whi the -1 here? Especially with i == 0 and size_t being unsigned,
you'll end up with a gigantic unused value.

In other words, this is doubly broken code.

Here's what I initially suggested instead:

diff --git a/kernel/thread.c b/kernel/thread.c
index 69728a403d9..3d40537bcd7 100644
--- a/kernel/thread.c
+++ b/kernel/thread.c
@@ -853,6 +853,23 @@ int z_stack_space_get(const uint8_t *stack_start, size_t size, size_t *unused_pt
 		size -= 4;
 	}
 
+	/* align to word boundary */
+	const unsigned long *checked_stack_word =
+		(const unsigned long *)ROUND_UP(checked_stack, sizeof(unsigned long));
+	size -= ((const uint8_t *)checked_stack_word - checked_stack);
+
+	/* compare using native machine word size */
+	for (size_t i = 0; i < size/sizeof(unsigned long); i++) {
+		if (checked_stack_word[i] == (unsigned long)0xaaaaaaaaaaaaaaaa) {
+			unused += sizeof(unsigned long);
+		} else {
+			break;
+		}
+	}
+	checked_stack = (const uint8_t *)checked_stack_word;
+	size -= unused;
+
+	/* compare remaining bytes */
 	for (size_t i = 0; i < size; i++) {
 		if ((checked_stack[i]) == 0xaaU) {
 			unused++;

@felixturgeonmeta felixturgeonmeta force-pushed the fturgeon/z_stack_space_update branch from b1bb1c6 to 8962635 Compare October 29, 2024 22:01
@felixturgeonmeta
Copy link
Contributor Author

felixturgeonmeta commented Oct 29, 2024

@npitre I tried to do 3 things in one shot and it just scrambled everything.

I update the loop to be as you suggested.
I update the off-by-one error in the "unused" math. Since i is zero based, the -1 is incorrect. This should address @peter-mitsis' comment about optimizing further for speed.

@npitre
Copy link

npitre commented Oct 30, 2024

Still not right.

First, this:

    /* If no usage is detected, then the whole stack is unused, set unused to size */
    if (unused == 0) {
        unused = size;
    }

I don't understand the logic behind the above.

Then you do:

    /* Continue checking from last used word to find remaining unused bytes*/
    size -= unused;
    checked_stack = stack_start + unused;

This is wrong. You are ignoring the sentinel flag presence.
It should be as I suggested earlier.

@felixturgeonmeta
Copy link
Contributor Author

felixturgeonmeta commented Oct 30, 2024

@npitre I copy-pasted your exact comment and updated.

Still not right.

First, this:

    /* If no usage is detected, then the whole stack is unused, set unused to size */
    if (unused == 0) {
        unused = size;
    }

This is to account for a completely empty stack since the loop does not increment "unused" 1-word at a time and only calculates it based on where the first "used" word is found. If a stack contains no used words, then we need to update unused to be the full size of the stack.

I don't understand the logic behind the above.

Then you do:

    /* Continue checking from last used word to find remaining unused bytes*/
    size -= unused;
    checked_stack = stack_start + unused;

This is wrong. You are ignoring the sentinel flag presence. It should be as I suggested earlier.

I am not sure what you mean here, the sentinel is accounted for above the changed.

@npitre
Copy link

npitre commented Oct 30, 2024

This is to account for a completely empty stack since the loop does not
increment "unused" 1-word at a time and only calculates it based on
where the first "used" word is found. If a stack contains no used words, then we need to update unused to be
the full size of the stack.

Sorry, this makes no sense.

@felixturgeonmeta
Copy link
Contributor Author

@peter-mitsis said:

Since it was indicated that speed is important, I would suggest doing the unused calculation at the end of the loop. That is, do it once rather than every iteration of the loop. Maybe the compiler will do that for us, but I tend to be somewhat pessimistic about compiler smarts.

So I changed the loop from:

for (size_t i = 0; i < word_size; i++) {
		if ((checked_stack_word[i]) == unused_pattern) {
			unused += unused_pattern_size;
		}
}

to:

for (size_t i = 0; i < word_size; i++) {
		if (checked_stack_word[i] != (unsigned long)0xaaaaaaaaaaaaaaaa) {
			unused += (i * sizeof(unsigned long));
			break;
		}
}

In this case, if the whole stack is unused, then unused will not be updated during the check, and must be updated after the loop to be the full size of the stack.

@npitre
Copy link

npitre commented Oct 30, 2024

In this case, if the whole stack is unused, then unused will not be
updated during the check, and must be updated after the loop to be
the full size of the stack.

But if the entire stack is used, unusedwill be legitimately be 0, yet
you'll still return a result meaning the full size of the stack is unused.

The best idiomatic way to do this is:

    size_t i;
    for (i = 0; i < word_size; i++) {
        if (checked_stack_word[i] != (unsigned long)0xaaaaaaaaaaaaaaaa) {
            break;
        }
    }
    unused += i * sizeof(unsigned long);

About the sentinel issue, this is wrong:

    checked_stack = stack_start + unused;

Instead, you need:

    check_stack = (const uint8_t *)checked_stack_word + unused;

or:

    check_stack = (const uint8_t *)&checked_stack_word[i];

as stack_start doesn't reflect the sentinel adjustment.

@felixturgeonmeta felixturgeonmeta force-pushed the fturgeon/z_stack_space_update branch from 8962635 to f46d088 Compare October 31, 2024 18:23
The z_stack_space_get call currently checks for free space at the
top of the stack by checking each byte individually. This can
introduce significant runtime overhead for threads which have large,
mostly unused stacks. This change updates the check to first count
the free space by word, and then check the sub-word unused bytes.

Signed-off-by: Félix Turgeon <[email protected]>
@felixturgeonmeta felixturgeonmeta force-pushed the fturgeon/z_stack_space_update branch from f46d088 to 2074c00 Compare October 31, 2024 20:26
@cfriedt
Copy link
Member

cfriedt commented Nov 1, 2024

Since there is a lot of talk about speed optimization here, it would be good to measure and report numbers.

@github-actions
Copy link

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Dec 31, 2024
@cfriedt
Copy link
Member

cfriedt commented Dec 31, 2024

The thing I find confusing is that mostly all of the algorithms mentioned here are still O(n) - i.e. sub-optimal.

If we're really concerned with speed, why not perform a binary search of the stack space (minus the stack sentinel) for the first non-pattern word, potentially making adjustments for a few bytes at the end? That reduces the time-complexity down to O(log n) which is at least closer to optimal.

Edit: I guess because the pattern used to fill unused stack space could theoretically end up anywhere in the stack.

@github-actions github-actions bot removed the Stale label Jan 1, 2025
@npitre
Copy link

npitre commented Jan 1, 2025 via email

@github-actions
Copy link

github-actions bot commented Mar 3, 2025

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Mar 3, 2025
@github-actions github-actions bot closed this Mar 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants