Skip to content

Fix segmentation fault when starting too fast.#196

Open
manio wants to merge 2 commits intonisargjhaveri:mainfrom
manio:segv
Open

Fix segmentation fault when starting too fast.#196
manio wants to merge 2 commits intonisargjhaveri:mainfrom
manio:segv

Conversation

@manio
Copy link
Contributor

@manio manio commented Sep 26, 2024

Nisarg,
Since the beginning I had sometimes problems with starting the aawgd. Fortunately it occurs not every time, but since I installed my additional binary for remote opening my door using LoRa (it was also autostarting), then the problem was even more visible. All the time aawgd was not able to start - I had to add about 6-7 seconds delay before the start in the startup script OR start it manually after ssh log-in.

Yesterday I spent a whole day debugging this problem. I found out that the aawgd segfaults when starting up. The call which was leading to crash was:
m_adapter->powered->set_value(on);
I was testing the m_adapter and powered - and those was not null.

Finally I was able to create a fix for this problem. It is in this PR...

It seems that it is working great now (starts so fast, that even automatic music playing is starting, which was not always the case in my car), but please leave it open as I would like to test it several more days to make sure...

This commit is trying to fix an early segfault.
If the bluetooth stack is not ready, then the `Powered` value is null.
Calling the `set_value` in this stage is causing `Segmentation fault`.

This was leading to problems with automatic AA start on the Pi,
so it was often necessary to start it manually afterwards:
`/etc/init.d/S93aawgd start` (the process was not running)
The info line was misleading (should be "has been" instead of "was").
The commit is fix this info message and also:
- no newline at end of file
@manio
Copy link
Contributor Author

manio commented Sep 29, 2024

Hi,
OK, I tested it and here are my results:
I tested fresh starting the Pi for about 30 times. The daemon started fine without any problems (successfull AA connection) all time besides two cases (which are rather not related to this fix):

This with
hci0: BCM: Reading device address failed (-110)
and this with some
Bluetooth device already connected, disconnecting loop.
One of the above (the first one IIRC) was a segfault and the aawgd was not running, the other case it was running (but cannot connect to my phone and AA; it connects fine after an aawgd restart).

When I reverted this PR's fix from the binary - It is not starting mostly every time, so I believe this is a proper solution for this problem.


// wait up to 10 seconds if BT stack is not ready
int i = 10;
while (i-- && !m_adapter->powered->value()) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR. Though, I don't think I understand this properly. Wouldn't this always be false until we actually call set_value below? Or are we simply adding a 10s delay? What am I missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could be some BT race here as the problem occurs only when Pi (thus BT service) is starting. It is just segfaults immediately when you call set_value() when the BT service is not yet ready. Adding additional value() call until it results something that is not null just makes sure that the whole BT stack is ready for our requests. Then it is not segfaulting.
In my case it is generally breaking the loop after first NULL try, but I gave it 10 seconds (tries) for sure.
I know it is weird - I also think that it should just work right away calling set_value but it is not (segfaults). The fix is just making sure it will not segfault and this is leading to fluent start with every single reboot.

And as I told - the problem is more visible when more processes is starting up (probably Pi's CPU is more utilized)...

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the gist on the issue and what you're trying to do.

But AFAIK, m_adapter->powered->value() should return a bool and can never be null. If it is not segfaulting, it should return false always because we've never tried to set powered to true. And in that case, the loop will always run for all 10 iterations before moving forward, no?

Does m_adapter->powered->value() return null on the first iteration, what is the value on later iterations? What is the return type of the value method? I'm very likely still missing something here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not the m_adapter->powered->value() which is segfaulting but a m_adapter->powered->set_value(on).

To provide you some detailed info I changed the code to the following:

    // wait up to 10 seconds if BT stack is not ready
    DBus::Variant retval;
    int i = 10;
    do {
        retval = m_adapter->powered->value();
        Logger::instance()->info("i=%d retval=%p\n",i , retval);
        sleep(1);
    } while (i--);

And after rebooting I've got in messages:

Jan  1 00:00:05 buildroot user.info aawgd[237]: i=10 retval=0x7e972c08
Jan  1 00:00:06 buildroot user.info aawgd[237]: i=9 retval=0x7e972c08
Jan  1 00:00:07 buildroot user.info aawgd[237]: i=8 retval=0x7e972c08
Jan  1 00:00:08 buildroot user.info aawgd[237]: i=7 retval=0x7e972c08
Jan  1 00:00:09 buildroot user.info aawgd[237]: i=6 retval=0x7e972c08
Jan  1 00:00:10 buildroot user.info aawgd[237]: i=5 retval=0x7e972c08
Jan  1 00:00:11 buildroot user.info aawgd[237]: i=4 retval=0x7e972c08
Jan  1 00:00:12 buildroot user.info aawgd[237]: i=3 retval=0x7e972c08
Jan  1 00:00:13 buildroot user.info aawgd[237]: i=2 retval=0x7e972c08
Jan  1 00:00:14 buildroot user.info aawgd[237]: i=1 retval=0x7e972c08

Then I reverted my whole patch - so it is your main without any changes and it segfaults again right after:
Tcp server listening on 5288.

Lastly I added this:

    Logger::instance()->info("re=%p\n", m_adapter->powered->value());
    sleep(1);
    Logger::instance()->info("re=%p\n", m_adapter->powered->value());
    sleep(1);

in the same place as my patch and after reboot I've got:

Jan  1 00:00:04 buildroot user.info aawgd[236]: Tcp server listening on 5288
Jan  1 00:00:04 buildroot user.info aawgd[236]: re=(nil)
Jan  1 00:00:05 buildroot user.info aawgd[236]: re=0x1
Jan  1 00:00:06 buildroot user.info aawgd[236]: Bluetooth adapter powered on
Jan  1 00:00:06 buildroot user.info aawgd[236]: Bluetooth adapter is now discoverable and pairable

and it doesn't crash. But look at first line where the pointer is NULL/(nil). After this call all is working fine, but before/without this, the call to m_adapter->powered->set_value(on) resulting a SEGV.

I am starting to wonder if a single call to value() would be sufficient - maybe - in fact my loop is making the same - calling it once.

According to the documentation the returned type is Variant

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the explanation and logs! I understand much better now.

Any idea why this would be nil? This is a dbus proxy object as you also figured. And not even the first one we refer to on m_adapter as well. Can other values also be null?

I haven't faced this issue. Do you face this with prebuilt images as well? Any minimal repro you can provide where I can maybe also repro. I just want to make sure we don't have this issue somewhere else as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started this project from you prebuild image, then I only added my LoRa binary.

I stripped down the binary I am using to have the minimal version which is leading to the problem. All you need is this: https://skyboo.net/temp/aa/lora-gw-rs.tar.bz2

There is a binary + startup file and also a exact sources from which this binary was build - you can crosscompile yourself if you like.

I hope you will also be able to reproduce this...

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try to repro this a few days back, unfortunately I was not able to.

I did try removing bluetooth initialization altogether, and as expected that fails the first time we try to access something. Here, it fails specifically on powered property.

Do we need to guard all property set calls with something like this? Is there a dbus or bluetoothd guidance?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I missed the later comment around the org.bluez.Error.Busy error. That might make more sense. I think we might be missing the error handling here.

@manio
Copy link
Contributor Author

manio commented Nov 3, 2024

Nisarg,
Regarding this issue, I am starting to wonder if the problem isn't is that bluetoothd is not ready when staring. In case of my project I have this:

1970-01-01, 00:00:04.928 [INFO]  bluetooth:  🥏 Opened bluetooth adapter hci0 with address B8:27:EB:CD:D6:B2
1970-01-01, 00:00:04.940 [ERROR]  main:  Bluetooth error: internal error: D-Bus error org.bluez.Error.Busy
1970-01-01, 00:00:04.940 [INFO]  main:  Trying to recover...

On second try (after a second) it is opened successfully.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants