Skip to content

Conversation

@wentongwu
Copy link
Contributor

@wentongwu wentongwu commented Mar 18, 2019

tests/drivers/ipm: support ipm test case in SMP

Current design of ipm test doesn't consider SMP, there is the
possibility that offload_irq and rx_thread run at different
cpu core at the same time, it will cause state(enabled and
channel_disable) disorder if two different context access them concurrently.

Fixes #12478.

Signed-off-by: Wentong Wu <[email protected]>

@zephyrbot
Copy link

zephyrbot commented Mar 18, 2019

All checks are passing now.

Review history of this comment for details about previous failed status.
Note that some checks might have not completed yet.

@codecov-io
Copy link

codecov-io commented Mar 18, 2019

Codecov Report

Merging #14619 into master will increase coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #14619      +/-   ##
==========================================
+ Coverage   52.93%   52.94%   +0.01%     
==========================================
  Files         309      309              
  Lines       45251    45251              
  Branches    10447    10447              
==========================================
+ Hits        23953    23958       +5     
+ Misses      16533    16531       -2     
+ Partials     4765     4762       -3
Impacted Files Coverage Δ
subsys/testsuite/ztest/src/ztest.c 72.72% <0%> (+4.13%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf91f8c...bbb6d41. Read the comment docs.

Copy link
Contributor

@andrewboie andrewboie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to my comments, the git history needs to be cleaned up, we have a merge commit and 2 patches with the same commit message here. You need to interactively rebase.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nitpicks. Broadly though, I don't understand why this fixes what it says it fixes. How exactly was this broken in SMP? The commit message should have an explanation.

@wentongwu
Copy link
Contributor Author

@andyross @andrewboie could you please help review the new code? thanks

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nitpick about volatile, one spot in the SPIN_VALIDATE predication that definitely needs to be changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the wrong spot. There are plenty of uses for a ring buffer that don't involve contending with SMP CPUs. Instead, I think you probably want to tag the pointer to the struct volatile in the app instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, Thanks

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hate to stick my nose in here, but is volatile even the right thing here? If the issue is that other CPUs might change the value, then I'm not convinced that we don't also need explicit storage fences. Probably not a problem on x86, but on other arches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On architectures where it matters, the fence will be provided by the synchronization primitive where we are using one, which we aren't here. It's true that IPM is sort of a special case, as it was intended for a asymmetric multiprocessing architecture (Quark SE -- one x86 and one ARC on the same memory bus) with no cache. But that's not really something we can fix with this patch.

@aescolar aescolar added area: Networking area: Console area: Tests Issues related to a particular existing or missing test area: Kernel area: SMP Symmetric multiprocessing and removed area: Networking labels Mar 24, 2019
@wentongwu wentongwu added this to the v1.14.0 milestone Mar 25, 2019
@wentongwu wentongwu force-pushed the smp_12478 branch 2 times, most recently from 9466d2a to bbb6d41 Compare March 28, 2019 08:17
@galak galak added the bug The issue is a bug, or the PR is fixing a bug label Mar 28, 2019
@wentongwu
Copy link
Contributor Author

@andyross could you please review the new changes?

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, the fact that this is hanging on x86_64 in the CI test implies that this probably doesn't fix the bug it claims to fix. :)

Beyond that, the meat of the changes seem to make sense: you need to have locking around the ring buffer in SMP where you don't in UP, and you need to add a sleep in the test because the other CPU may try to mutate the thing before your test code is prepared for the next step. Those seem fine.

But I don't understand the removal of the (fairly clearly explained) buffer full handling at all. Why is this not introducing a failure case when the buffer fills up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why it's OK to remove this. That condition (an interrupt delivering data that fills the buffer) seems like it can still happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to use the flags(enabled and channel_disable), there is possibility that the flags are accessed by ISR and receiver thread at the same time if SMP enabled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The volatile is incorrect here. This delcares a volatile pointer (i.e. it tells the compiler that the memory where the local driver_data variable is stored may change and must not be optimized into a register, which doesn't make sense because no other thread is going to be mutating our stack like that). You want a pointer to a volatile struct instead:

volatile struct ipm_console_receiver_runtime_data *driver_data;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, I didn't get the right usage of volatile in that place, and above change should be the cause that make the CI test fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as before: why is it OK to remove this? The condition and the comment explaining it (a RX interrupt is out of space in the buffer) makes sense, and I don't see anything in the patch that addresses it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sender thread will wait until there is space in ring buffer to do next send.

  •   while (ring_buf_space_get(&driver_data->rb) == 0) {
    
  •   }
    

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this removed? It sounds like the original test expects the busy flag to be cleared by an ISR before returning, and now it's going to return synchronously before the interrupt (or whatever) reports success.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, sender thread will wait until there is at least one free buffer for the ring buffer.

@wentongwu wentongwu force-pushed the smp_12478 branch 3 times, most recently from eb588f5 to 61b3c95 Compare April 3, 2019 10:20
@nashif nashif requested a review from andyross April 3, 2019 14:45
@andyross
Copy link
Contributor

andyross commented Apr 3, 2019

OK, I see the spinning behavior and sort of understand. And to be fair, IPM is a mostly unused, oddball subsystem. But... that's a fairly major API change! You're changing the behavior of the code from "disable and return a synchronous error" to "spin forever if the other side doesn't read". That's not really in the spirit of "bug fix for SMP".

And I don't understand why you have to do it. The same tricks that work with this code as written (it was originally for a multiprocessing setup!) should be able to work when there are extra cores on one side, with a little locking.

Current design of ipm test doesn't consider SMP, there is the
possibility that offload_irq and rx_thread run at different
cpu core at the same time, it will cause state(enabled and
channel_disable) disorder if two different context access them
concurrently.

Fixes zephyrproject-rtos#12478.

Signed-off-by: Wentong Wu <[email protected]>
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I took a look today. It really does seem like this is only an SMP safety thing, and we've been working around that by disabling SMP for 1.14. We can come back to this in 1.15 and do it right.

@nashif nashif modified the milestones: v1.14.0, future Apr 12, 2019
@nashif nashif closed this Sep 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: Console area: Kernel area: SMP Symmetric multiprocessing area: Tests Issues related to a particular existing or missing test bug The issue is a bug, or the PR is fixing a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants