Skip to content

[APP-7566]: Add bluetooth provisioning as a default provisioning method which runs parallel to hotspot provisioning.#70

Closed
maxhorowitz wants to merge 0 commit intomainfrom
interface
Closed

[APP-7566]: Add bluetooth provisioning as a default provisioning method which runs parallel to hotspot provisioning.#70
maxhorowitz wants to merge 0 commit intomainfrom
interface

Conversation

@maxhorowitz
Copy link

@maxhorowitz maxhorowitz commented Feb 20, 2025

Summary

This PR encompasses everything needed for adding bluetooth provisioning as a default provisioning method in the viam-agent (and is designed to run in parallel to the existing WiFi hotspot provisioning method).

Setup

I want you to test as if you are a customer who has just received a machine in the mail. The machine will just have a viam-agent binary installation. It will not have WiFi credentials (and thus no internet connectivity).

Machine setup

Using a Linux machine/laptop (that you are comfortable getting your /opt/viam/*, /etc/viam/* and other files tampered with) , pull down this branch. Because you will need to remove connectivity to emulate the user perspective, please do not SSH into a machine over WiFi (and don't use Ethernet because it can interfere with provisioning). Instead, test directly on the device (keyboard + mouse) needed.

Uninstall Viam

First, you will need to remove any Viam-related stuff from your computer (to emulate the "newly-out-of-the-box" scenario our customers will be in). Please run the following from the repository root:
sudo ./uninstall.sh

Remove local networks

You will then need to remove local networks from your machine to emulate the "offline-ness" that our customers will be dealing with. I've been running nmcli con show and subsequently sudo nmcli con delete <name> (for every network that could "interfere" with the provisioning flow).

Preinstall the Viam Agent

Then, run the following (again from the repository root):

  1. make
  2. sudo ./bin/viam-agent-custom-aarch64 --install
  3. sudo /opt/viam/bin/viam-agent --debug

I've been running the commands altogether as such:
make && sudo ./bin/viam-agent-custom-aarch64 --install && sudo /opt/viam/bin/viam-agent --debug

Phone setup

Download the LightBlue app (available on both Android and iPhone). This is what we will use to communicate with the bluetooth service that is being advertised from the Linux machine.

Testing Procedure

At this point, we are ready to test.

LightBlue

Pairing

In the app, use the search bar to find the name of the Linux machine that you are currently testing against. It should be there. Pair with your machine (one of the tests is to see if the pairing request is automatically accepted on the machine side, so I am hoping this part works!).

If it does not pair, check your screen for a six-digit pairing code and accept the request manually. This is a known limitation that I am working on fixing. The TLDR is it's complicated because of reliance on systemd bus signals which may be picked up and discarded as negligible elsewhere in the viam-agent.

Bluetooth service used to "transmit" credentials

Once paired, look at the bluetooth service and nested characteristics available to us. You should be able to take the provided UUIDs and map them to the logs on your viam-agent machine:

There is an encoding that will be helpful for you to know here. Characteristics whose last 4 characters of their first 8 character sequence (preceding the -) will always be the following:

  • xxxx1111-... is the encompassing service
  • xxxx2222-... is the write-only characteristic for SSID
  • xxxx3333-... is the write-only characteristic for passkey
  • xxxx4444-... is the write-only characteristic for part ID
  • xxxx5555-... is the write-only characteristic for part secret
  • xxxx6666-... is the write-only characteristic for app address
  • xxxx7777-... is the read-only characteristic for nearby available WiFi networks that the machine has detected
2025-02-25T20:41:24.245Z	DEBUG	viam-agent	networking/bluetooth.go:507	BlueZ version (5.66) meets the requirement (5.66 or later)
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:43	Bluetooth peripheral service UUID: 79ff1111-4f38-44b9-b3b5-78fb7e14757e
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:45	WiFi SSID can be written to the following bluetooth characteristic: db8f2222-baea-452f-bad0-62b440f98161
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:47	WiFi passkey can be written to the following bluetooth characteristic: 06af3333-4442-42e5-9e6a-1729ffa1378d
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:49	Robot part key ID can be written to the following bluetooth characteristic: 26d24444-161f-4474-a4a5-e83c563ac52b
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:51	Robot part key can be written to the following bluetooth characteristic: 12db5555-ca13-4a0b-879a-c76fb1542fe1
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:53	Viam app address can be written to the following bluetooth characteristic: 9eaa6666-14f0-47a9-b69f-a093f8358b18
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:55	Available WiFi networks can be read from the following bluetooth characteristic: a8ee7777-2496-485a-b0ea-dc63a1122f1

The LightBlue interface is pretty easy to follow. You may need to change LightBlue messages from hex or binary to utf-8 string. Once confirmed you're in utf-8 string mode, you can write individual messages to the SSID, passkey, part ID, part secret, and app address "characteristics" by clicking "write value." Similarly, you can read from the available WiFi networks "characteristic" by clicking "read value." Any value written will get "sucked in" to the viam-agent provisioning flow. You can check this in the viam-agent logs. Once all values are submitted, the provisioning loop should close out the BT connection, connect to WiFi, retrieve its cloud config, and should start up a viam-server (thus ending the provisioning loop).

Cases

  1. Hotspot provisioning flow (i.e. check the captive portal flow still works)
  2. Bluetooth provisioning flow (i.e. check the LightBlue flow works)

Comment on lines +226 to +231
userInput, err := n.bluetoothService.waitForCredentials(ctx, true, true) // Background goroutine ultimately cancelled by context.
if err != nil {
n.logger.Errorw("failed to wait for user input of credentials", "err", err)
return
}
inputChan <- *userInput
Copy link
Author

@maxhorowitz maxhorowitz Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would people like it more if I pass the inputChan into the waitForCredentials call itself? Then, I could get rid of the goroutine here and make the waitForCredentials call do nonblocking listening. In that case I would rename it to listenForCredentials.

Signal int32
Connected bool
LastError string
Type string `json:"type,omitempty"`
Copy link
Author

@maxhorowitz maxhorowitz Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json:"xxx" tags are shorthand on purpose to minimize bytes wasted as uninformative JSON keys.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our coding standards are not to use shorthand for json. The json should reflect the exact spelling (though lower/snake cased) of the object to avoid confussion.

Also, changing this may break the existing mobile provisioning, as this would now get marshalled differently. If you need a new/more compact syntax, create a new struct.

@maxhorowitz maxhorowitz changed the title [APP-7566]: Bluetooth Provisioning [APP-7566]: Add bluetooth provisioning as a default provisioning method which runs parallel to hotspot provisioning. Feb 24, 2025
)

// bluetoothService provides an interface for retrieving cloud config and/or WiFi credentials for a robot over bluetooth.
type bluetoothService interface {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The private interface lets us

  1. protect all linux related BLE behaviors from the provisioning code path, and
  2. allow us to make modular unit tests that validate state behaviors in our provisioning flow in the event of various (mocked) bluetooth errors

@@ -0,0 +1,630 @@
package networking
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking having a difficult time enforcing behaviors via unit testing the functions/methods in this file because many of them are low level BT commands.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same problem with wifi provisioning... unit tests for stuff that interacts with a complex system are just nearly impossible to create in a useful way. So don't worry too much about that. We just have to do human-in-the-loop testing.

return err
}
return errors.Join(err, err2)
n.connState.setProvisioning(false)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this from top of function (it should only be false after clean shutdown, right?).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's the exact opposite. You set state when you're TRYING to close down that state. If things fail halfway through, then it doesn't think it's in a WORKING provisioning mode, and will restart from the ground up.

For STARTUP, the opposite is true. Mark it at the end, when it's fully started/good.

Copy link
Contributor

@Otterverse Otterverse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize in advance for how long this review is. I'm leaving a summary here at the top of (most) of the important things, so it's easier to discuss (and check off as you go) hopefully. But it will overlap a lot with the inline comments too.

As always, hit me up in slack to discuss things whenever. Plenty of this is non-obvious, as you're deep in the guts of the most complicated part of Agent here.

  1. Background go routines/threads need to be used only when/where required, and must be fully managable. Remember that calls to start/stop can come in at any point in the flow, so everything must be able to exit cleanly and quickly. No need to background threads if you're just going to wait for them in the same fuction. Just handle things linearly to keep it simpler.

  2. Need to be careful about data races, and protect any data that can be changed from another thread with locks. Note that even if it's not explicitly in a thread, outside calls from start/stop/update can happen at any point, so everything in those paths has to be race-safe.

  3. BT characteristics (and their associated UUIDs) are a bunch of indivudual variables resulting in ~1/3 of bluetooth.go being very repetitive boilerplate. Should get them arranged into a simpler data structure, with a map to use directly, or a true structure with (generalized) getter/setter methods. E.g. calls should be able to look like val, err := bt.GetCharacteristic("ssid") or bt.SetCharacteristic("ssid", myVal) and be human readable. Look at networkState and connState for hints.

  4. As discussed in slack, the 20 character limit for charcteristic could be a real problem. SSIDs and PSKs can be longer than that by themselves. And scan results may contains dozens of networks.

  5. Need a new config setting to disable bluetooth.

  6. If agent starts on a device without bluetooth (or otherwise unsupport version/etc.), it needs to detect this and quietly do nothing (beyond an initial log.) Likely best to integrate with above, and just have a "temporary" flag that disables it until a full restart.

  7. BLE needs to integrate with the health checking. Any backgrounded routines need to report health status regularly using health.healthySleep() (add a new one on Networking{} to track ble health.)

  8. Ideally, this should be abstracted just like the portal is. E.g. whenever start/stopPortal() is called, start/stopBLE() is called, and otherwise works the same. They should even use the same inputChan, so the outer provisioning code only has to listen to that one channel.

  9. Network scans (and other info) are updated in real time. This is why they're managed in their own structures like n.connState(). BLE needs to use those and not just pass in a static list one time at startup.

Signal int32
Connected bool
LastError string
Type string `json:"type,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our coding standards are not to use shorthand for json. The json should reflect the exact spelling (though lower/snake cased) of the object to avoid confussion.

Also, changing this may break the existing mobile provisioning, as this would now get marshalled differently. If you need a new/more compact syntax, create a new struct.

webServer *http.Server
grpcServer *grpc.Server
portalData *portalData
hotspotIsActive bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplicates the functionality of connState.GetProvisioning() (which properly mutex locks things and records timestamps.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool I will incorporate this there. We need a way to distinguish whether both BT and/or Hotspot methods are active so we know which need to get shut down cleanly on close.

err = errors.Join(err, n.deactivateConnection(n.Config().HotspotInterface, n.Config().HotspotSSID))
return errw.Wrap(err, "starting web/grpc portal")
// Simultaneously start both the hotspot captive portal and the bluetooth service provisioning methods.
wg := sync.WaitGroup{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
wg := sync.WaitGroup{}
var wg sync.WaitGroup

Empty vars should be declared with "var" when possible, not an empty struct

)
n.bluetoothIsActive = true
}, wg.Done)
wg.Wait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand why you're backgrounding functions above just to wait here. WaitGroups are for async thread management. You're firing off async threads then waiting in the same function that just fired them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goal was to allow BT and Hotspot to start up in parallel. I can do linear in place of this. In that case, I will have to be a little more clever about error handling as to not entirely return early if the first of the two methods doesn't start up properly.

bluetoothErr = err
return
}
goutils.ManagedGo( // Listen for user input asynchronously. How should we be handling errors here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is firing a new thread from within another backgrounded thread? This one doesn't look like it's being tracked/managed by a waitgroup, so it'll just be orphaned as far as I can tell.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is. It is not tracked by the waitgroup because it should outlive the life of this current function (being that it waits for user input). It will get cancelled if the context in this function is canceled. Perhaps to handle this situation we should have a child context that is 1<>1 with provisioning being "live" and we can cancel it once we exit the provisioning loop? The only scenario I could think of is if this thread is alive, the overall context is alive, and we've used hotspot provisioning instead of BT. I think the solution here is to have a provisioning context - that way both the hotspot and the BT can "cancel one another out" if the other is first to be used.

Comment on lines +266 to +268
return fmt.Errorf("failed to stop advertising: %w", err)
}
bsl.advActive = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this fails and returns an error, but is still marked active?

}

// getBlueZVersion retrieves the installed BlueZ version and extracts the numeric value correctly.
func getBlueZVersion() (float64, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should get/check the version via dbus (which is how the rest of this interacts) not via exec calls (if at all possible.) Userspace utilities may not actually be what's running, or may not even be installed. If version isn't available directly, then check that the properties needed exist perhaps. Should allow wider compatibility too.

Comment on lines +438 to +443
func checkOS() error {
if runtime.GOOS != "linux" {
return fmt.Errorf("this program requires Linux, detected: %s", runtime.GOOS)
}
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This likely shouldn't exist. Use build flags to guard code that simply can't run elsewhere at compile time, not checking at runtime.

}

// Not ready to return (do not have the minimum required set of credentials), so sleep and try again.
time.Sleep(time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to integrate with healthchecks. Use the healthySleep functions there. May need to extend if the health of bluetooth is separate from the general subsystems loops.

goutils.ManagedGo(func() {
if n.bluetoothService == nil {
bt, err := newBluetoothService(
n.logger, fmt.Sprintf("%s.%s.%s", n.cfg.Manufacturer, n.cfg.Model, n.cfg.FragmentID), n.getVisibleNetworks())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the config values should be n.Config() otherwise you can hit a race.

Suggested change
n.logger, fmt.Sprintf("%s.%s.%s", n.cfg.Manufacturer, n.cfg.Model, n.cfg.FragmentID), n.getVisibleNetworks())
n.logger, fmt.Sprintf("%s.%s.%s", n.Config().Manufacturer, n.Config().Model, n.Config().FragmentID), n.getVisibleNetworks())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n.getVisibleNetworks() is a one-time call here, but the list of visible networks changes frequently, especially if someone is trying to get a new network to show up. This needs to be handled like portal does, and updated when things change, not passed in once here only at startup.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. n.Config() - will use thank you for catching, silly mistake.
  2. n.getVisibleNetworks() - I initially tried to have a channel to update the current value stored in the "available wifi network" characteristic, but this didn't work properly. I need to look into whether it is easily possible to tear down an active characteristic and post a new one while the BT service is already being advertised. The issue here is I don't want to tear down the BT service so that I can update available WiFi networks while the client could be inputting their Robot Part ID (i.e. trying to input information that has nothing to do with the WiFi networks)

@maxhorowitz
Copy link
Author

@ale7714 @abe-winter Took you both off until I've addressed James' comments so we're not duplicating reviews here! Thanks

@ale7714
Copy link
Member

ale7714 commented Feb 27, 2025

@maxhorowitz @Otterverse I did a very quick googling about the 20 character limit because I'm sure other folks want to send data over BT that is larger than 20 chars and the recommended pattern to follow is to chunk your data in 20 byte slices. Something like

        var data []byte
        for _, net := range networks {
            data = append(data, []byte(netw.SSID)...)
            data = append(data, byte(net.RSSI))
        }

        chunkSize := 20 // BLE characteristic size limit.
        for i := 0; i < len(data); i += chunkSize {
            end := i + chunkSize
            if end > len(data) {
                end = len(data)
            }
            ble.Write(data[i:end])
        }

Not actual working code just to illustrate what i mean. Instead of trying to shorten the json payload, I suspect this approach will be a bit more resilient long term.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants