Fix segmentation fault when starting too fast.#196
Fix segmentation fault when starting too fast.#196manio wants to merge 2 commits intonisargjhaveri:mainfrom
Conversation
This commit is trying to fix an early segfault. If the bluetooth stack is not ready, then the `Powered` value is null. Calling the `set_value` in this stage is causing `Segmentation fault`. This was leading to problems with automatic AA start on the Pi, so it was often necessary to start it manually afterwards: `/etc/init.d/S93aawgd start` (the process was not running)
The info line was misleading (should be "has been" instead of "was"). The commit is fix this info message and also: - no newline at end of file
|
Hi, This with When I reverted this PR's fix from the binary - It is not starting mostly every time, so I believe this is a proper solution for this problem. |
|
|
||
| // wait up to 10 seconds if BT stack is not ready | ||
| int i = 10; | ||
| while (i-- && !m_adapter->powered->value()) { |
There was a problem hiding this comment.
Thanks a lot for the PR. Though, I don't think I understand this properly. Wouldn't this always be false until we actually call set_value below? Or are we simply adding a 10s delay? What am I missing?
There was a problem hiding this comment.
I think it could be some BT race here as the problem occurs only when Pi (thus BT service) is starting. It is just segfaults immediately when you call set_value() when the BT service is not yet ready. Adding additional value() call until it results something that is not null just makes sure that the whole BT stack is ready for our requests. Then it is not segfaulting.
In my case it is generally breaking the loop after first NULL try, but I gave it 10 seconds (tries) for sure.
I know it is weird - I also think that it should just work right away calling set_value but it is not (segfaults). The fix is just making sure it will not segfault and this is leading to fluent start with every single reboot.
And as I told - the problem is more visible when more processes is starting up (probably Pi's CPU is more utilized)...
There was a problem hiding this comment.
I think I understand the gist on the issue and what you're trying to do.
But AFAIK, m_adapter->powered->value() should return a bool and can never be null. If it is not segfaulting, it should return false always because we've never tried to set powered to true. And in that case, the loop will always run for all 10 iterations before moving forward, no?
Does m_adapter->powered->value() return null on the first iteration, what is the value on later iterations? What is the return type of the value method? I'm very likely still missing something here.
There was a problem hiding this comment.
It is not the m_adapter->powered->value() which is segfaulting but a m_adapter->powered->set_value(on).
To provide you some detailed info I changed the code to the following:
// wait up to 10 seconds if BT stack is not ready
DBus::Variant retval;
int i = 10;
do {
retval = m_adapter->powered->value();
Logger::instance()->info("i=%d retval=%p\n",i , retval);
sleep(1);
} while (i--);And after rebooting I've got in messages:
Jan 1 00:00:05 buildroot user.info aawgd[237]: i=10 retval=0x7e972c08
Jan 1 00:00:06 buildroot user.info aawgd[237]: i=9 retval=0x7e972c08
Jan 1 00:00:07 buildroot user.info aawgd[237]: i=8 retval=0x7e972c08
Jan 1 00:00:08 buildroot user.info aawgd[237]: i=7 retval=0x7e972c08
Jan 1 00:00:09 buildroot user.info aawgd[237]: i=6 retval=0x7e972c08
Jan 1 00:00:10 buildroot user.info aawgd[237]: i=5 retval=0x7e972c08
Jan 1 00:00:11 buildroot user.info aawgd[237]: i=4 retval=0x7e972c08
Jan 1 00:00:12 buildroot user.info aawgd[237]: i=3 retval=0x7e972c08
Jan 1 00:00:13 buildroot user.info aawgd[237]: i=2 retval=0x7e972c08
Jan 1 00:00:14 buildroot user.info aawgd[237]: i=1 retval=0x7e972c08
Then I reverted my whole patch - so it is your main without any changes and it segfaults again right after:
Tcp server listening on 5288.
Lastly I added this:
Logger::instance()->info("re=%p\n", m_adapter->powered->value());
sleep(1);
Logger::instance()->info("re=%p\n", m_adapter->powered->value());
sleep(1);in the same place as my patch and after reboot I've got:
Jan 1 00:00:04 buildroot user.info aawgd[236]: Tcp server listening on 5288
Jan 1 00:00:04 buildroot user.info aawgd[236]: re=(nil)
Jan 1 00:00:05 buildroot user.info aawgd[236]: re=0x1
Jan 1 00:00:06 buildroot user.info aawgd[236]: Bluetooth adapter powered on
Jan 1 00:00:06 buildroot user.info aawgd[236]: Bluetooth adapter is now discoverable and pairable
and it doesn't crash. But look at first line where the pointer is NULL/(nil). After this call all is working fine, but before/without this, the call to m_adapter->powered->set_value(on) resulting a SEGV.
I am starting to wonder if a single call to value() would be sufficient - maybe - in fact my loop is making the same - calling it once.
According to the documentation the returned type is Variant
There was a problem hiding this comment.
Thanks a lot for the explanation and logs! I understand much better now.
Any idea why this would be nil? This is a dbus proxy object as you also figured. And not even the first one we refer to on m_adapter as well. Can other values also be null?
I haven't faced this issue. Do you face this with prebuilt images as well? Any minimal repro you can provide where I can maybe also repro. I just want to make sure we don't have this issue somewhere else as well.
There was a problem hiding this comment.
I started this project from you prebuild image, then I only added my LoRa binary.
I stripped down the binary I am using to have the minimal version which is leading to the problem. All you need is this: https://skyboo.net/temp/aa/lora-gw-rs.tar.bz2
There is a binary + startup file and also a exact sources from which this binary was build - you can crosscompile yourself if you like.
I hope you will also be able to reproduce this...
There was a problem hiding this comment.
I did try to repro this a few days back, unfortunately I was not able to.
I did try removing bluetooth initialization altogether, and as expected that fails the first time we try to access something. Here, it fails specifically on powered property.
Do we need to guard all property set calls with something like this? Is there a dbus or bluetoothd guidance?
There was a problem hiding this comment.
I think I missed the later comment around the org.bluez.Error.Busy error. That might make more sense. I think we might be missing the error handling here.
|
Nisarg, On second try (after a second) it is opened successfully. |
Nisarg,
Since the beginning I had sometimes problems with starting the
aawgd. Fortunately it occurs not every time, but since I installed my additional binary for remote opening my door using LoRa (it was also autostarting), then the problem was even more visible. All the time aawgd was not able to start - I had to add about 6-7 seconds delay before the start in the startup script OR start it manually after ssh log-in.Yesterday I spent a whole day debugging this problem. I found out that the
aawgdsegfaults when starting up. The call which was leading to crash was:m_adapter->powered->set_value(on);I was testing the
m_adapterandpowered- and those was not null.Finally I was able to create a fix for this problem. It is in this PR...
It seems that it is working great now (starts so fast, that even automatic music playing is starting, which was not always the case in my car), but please leave it open as I would like to test it several more days to make sure...