[APP-7566]: Add bluetooth provisioning as a default provisioning method which runs parallel to hotspot provisioning.#70
[APP-7566]: Add bluetooth provisioning as a default provisioning method which runs parallel to hotspot provisioning.#70maxhorowitz wants to merge 0 commit intomainfrom
Conversation
| userInput, err := n.bluetoothService.waitForCredentials(ctx, true, true) // Background goroutine ultimately cancelled by context. | ||
| if err != nil { | ||
| n.logger.Errorw("failed to wait for user input of credentials", "err", err) | ||
| return | ||
| } | ||
| inputChan <- *userInput |
There was a problem hiding this comment.
Would people like it more if I pass the inputChan into the waitForCredentials call itself? Then, I could get rid of the goroutine here and make the waitForCredentials call do nonblocking listening. In that case I would rename it to listenForCredentials.
subsystems/networking/definitions.go
Outdated
| Signal int32 | ||
| Connected bool | ||
| LastError string | ||
| Type string `json:"type,omitempty"` |
There was a problem hiding this comment.
json:"xxx" tags are shorthand on purpose to minimize bytes wasted as uninformative JSON keys.
There was a problem hiding this comment.
Our coding standards are not to use shorthand for json. The json should reflect the exact spelling (though lower/snake cased) of the object to avoid confussion.
Also, changing this may break the existing mobile provisioning, as this would now get marshalled differently. If you need a new/more compact syntax, create a new struct.
subsystems/networking/bluetooth.go
Outdated
| ) | ||
|
|
||
| // bluetoothService provides an interface for retrieving cloud config and/or WiFi credentials for a robot over bluetooth. | ||
| type bluetoothService interface { |
There was a problem hiding this comment.
The private interface lets us
- protect all linux related BLE behaviors from the provisioning code path, and
- allow us to make modular unit tests that validate state behaviors in our provisioning flow in the event of various (mocked) bluetooth errors
subsystems/networking/bluetooth.go
Outdated
| @@ -0,0 +1,630 @@ | |||
| package networking | |||
There was a problem hiding this comment.
Generally speaking having a difficult time enforcing behaviors via unit testing the functions/methods in this file because many of them are low level BT commands.
There was a problem hiding this comment.
Same problem with wifi provisioning... unit tests for stuff that interacts with a complex system are just nearly impossible to create in a useful way. So don't worry too much about that. We just have to do human-in-the-loop testing.
| return err | ||
| } | ||
| return errors.Join(err, err2) | ||
| n.connState.setProvisioning(false) |
There was a problem hiding this comment.
Moved this from top of function (it should only be false after clean shutdown, right?).
There was a problem hiding this comment.
No, that's the exact opposite. You set state when you're TRYING to close down that state. If things fail halfway through, then it doesn't think it's in a WORKING provisioning mode, and will restart from the ground up.
For STARTUP, the opposite is true. Mark it at the end, when it's fully started/good.
Otterverse
left a comment
There was a problem hiding this comment.
I apologize in advance for how long this review is. I'm leaving a summary here at the top of (most) of the important things, so it's easier to discuss (and check off as you go) hopefully. But it will overlap a lot with the inline comments too.
As always, hit me up in slack to discuss things whenever. Plenty of this is non-obvious, as you're deep in the guts of the most complicated part of Agent here.
-
Background go routines/threads need to be used only when/where required, and must be fully managable. Remember that calls to start/stop can come in at any point in the flow, so everything must be able to exit cleanly and quickly. No need to background threads if you're just going to wait for them in the same fuction. Just handle things linearly to keep it simpler.
-
Need to be careful about data races, and protect any data that can be changed from another thread with locks. Note that even if it's not explicitly in a thread, outside calls from start/stop/update can happen at any point, so everything in those paths has to be race-safe.
-
BT characteristics (and their associated UUIDs) are a bunch of indivudual variables resulting in ~1/3 of bluetooth.go being very repetitive boilerplate. Should get them arranged into a simpler data structure, with a map to use directly, or a true structure with (generalized) getter/setter methods. E.g. calls should be able to look like
val, err := bt.GetCharacteristic("ssid")orbt.SetCharacteristic("ssid", myVal)and be human readable. Look at networkState and connState for hints. -
As discussed in slack, the 20 character limit for charcteristic could be a real problem. SSIDs and PSKs can be longer than that by themselves. And scan results may contains dozens of networks.
-
Need a new config setting to disable bluetooth.
-
If agent starts on a device without bluetooth (or otherwise unsupport version/etc.), it needs to detect this and quietly do nothing (beyond an initial log.) Likely best to integrate with above, and just have a "temporary" flag that disables it until a full restart.
-
BLE needs to integrate with the health checking. Any backgrounded routines need to report health status regularly using health.healthySleep() (add a new one on Networking{} to track ble health.)
-
Ideally, this should be abstracted just like the portal is. E.g. whenever start/stopPortal() is called, start/stopBLE() is called, and otherwise works the same. They should even use the same inputChan, so the outer provisioning code only has to listen to that one channel.
-
Network scans (and other info) are updated in real time. This is why they're managed in their own structures like n.connState(). BLE needs to use those and not just pass in a static list one time at startup.
subsystems/networking/definitions.go
Outdated
| Signal int32 | ||
| Connected bool | ||
| LastError string | ||
| Type string `json:"type,omitempty"` |
There was a problem hiding this comment.
Our coding standards are not to use shorthand for json. The json should reflect the exact spelling (though lower/snake cased) of the object to avoid confussion.
Also, changing this may break the existing mobile provisioning, as this would now get marshalled differently. If you need a new/more compact syntax, create a new struct.
subsystems/networking/networking.go
Outdated
| webServer *http.Server | ||
| grpcServer *grpc.Server | ||
| portalData *portalData | ||
| hotspotIsActive bool |
There was a problem hiding this comment.
This duplicates the functionality of connState.GetProvisioning() (which properly mutex locks things and records timestamps.)
There was a problem hiding this comment.
Cool I will incorporate this there. We need a way to distinguish whether both BT and/or Hotspot methods are active so we know which need to get shut down cleanly on close.
| err = errors.Join(err, n.deactivateConnection(n.Config().HotspotInterface, n.Config().HotspotSSID)) | ||
| return errw.Wrap(err, "starting web/grpc portal") | ||
| // Simultaneously start both the hotspot captive portal and the bluetooth service provisioning methods. | ||
| wg := sync.WaitGroup{} |
There was a problem hiding this comment.
| wg := sync.WaitGroup{} | |
| var wg sync.WaitGroup |
Empty vars should be declared with "var" when possible, not an empty struct
| ) | ||
| n.bluetoothIsActive = true | ||
| }, wg.Done) | ||
| wg.Wait() |
There was a problem hiding this comment.
I do not understand why you're backgrounding functions above just to wait here. WaitGroups are for async thread management. You're firing off async threads then waiting in the same function that just fired them.
There was a problem hiding this comment.
Goal was to allow BT and Hotspot to start up in parallel. I can do linear in place of this. In that case, I will have to be a little more clever about error handling as to not entirely return early if the first of the two methods doesn't start up properly.
| bluetoothErr = err | ||
| return | ||
| } | ||
| goutils.ManagedGo( // Listen for user input asynchronously. How should we be handling errors here? |
There was a problem hiding this comment.
This is firing a new thread from within another backgrounded thread? This one doesn't look like it's being tracked/managed by a waitgroup, so it'll just be orphaned as far as I can tell.
There was a problem hiding this comment.
Yes, it is. It is not tracked by the waitgroup because it should outlive the life of this current function (being that it waits for user input). It will get cancelled if the context in this function is canceled. Perhaps to handle this situation we should have a child context that is 1<>1 with provisioning being "live" and we can cancel it once we exit the provisioning loop? The only scenario I could think of is if this thread is alive, the overall context is alive, and we've used hotspot provisioning instead of BT. I think the solution here is to have a provisioning context - that way both the hotspot and the BT can "cancel one another out" if the other is first to be used.
subsystems/networking/bluetooth.go
Outdated
| return fmt.Errorf("failed to stop advertising: %w", err) | ||
| } | ||
| bsl.advActive = false |
There was a problem hiding this comment.
What happens if this fails and returns an error, but is still marked active?
subsystems/networking/bluetooth.go
Outdated
| } | ||
|
|
||
| // getBlueZVersion retrieves the installed BlueZ version and extracts the numeric value correctly. | ||
| func getBlueZVersion() (float64, error) { |
There was a problem hiding this comment.
This should get/check the version via dbus (which is how the rest of this interacts) not via exec calls (if at all possible.) Userspace utilities may not actually be what's running, or may not even be installed. If version isn't available directly, then check that the properties needed exist perhaps. Should allow wider compatibility too.
subsystems/networking/bluetooth.go
Outdated
| func checkOS() error { | ||
| if runtime.GOOS != "linux" { | ||
| return fmt.Errorf("this program requires Linux, detected: %s", runtime.GOOS) | ||
| } | ||
| return nil | ||
| } |
There was a problem hiding this comment.
This likely shouldn't exist. Use build flags to guard code that simply can't run elsewhere at compile time, not checking at runtime.
subsystems/networking/bluetooth.go
Outdated
| } | ||
|
|
||
| // Not ready to return (do not have the minimum required set of credentials), so sleep and try again. | ||
| time.Sleep(time.Second) |
There was a problem hiding this comment.
This needs to integrate with healthchecks. Use the healthySleep functions there. May need to extend if the health of bluetooth is separate from the general subsystems loops.
| goutils.ManagedGo(func() { | ||
| if n.bluetoothService == nil { | ||
| bt, err := newBluetoothService( | ||
| n.logger, fmt.Sprintf("%s.%s.%s", n.cfg.Manufacturer, n.cfg.Model, n.cfg.FragmentID), n.getVisibleNetworks()) |
There was a problem hiding this comment.
the config values should be n.Config() otherwise you can hit a race.
| n.logger, fmt.Sprintf("%s.%s.%s", n.cfg.Manufacturer, n.cfg.Model, n.cfg.FragmentID), n.getVisibleNetworks()) | |
| n.logger, fmt.Sprintf("%s.%s.%s", n.Config().Manufacturer, n.Config().Model, n.Config().FragmentID), n.getVisibleNetworks()) |
There was a problem hiding this comment.
n.getVisibleNetworks() is a one-time call here, but the list of visible networks changes frequently, especially if someone is trying to get a new network to show up. This needs to be handled like portal does, and updated when things change, not passed in once here only at startup.
There was a problem hiding this comment.
n.Config()- will use thank you for catching, silly mistake.n.getVisibleNetworks()- I initially tried to have a channel to update the current value stored in the "available wifi network" characteristic, but this didn't work properly. I need to look into whether it is easily possible to tear down an active characteristic and post a new one while the BT service is already being advertised. The issue here is I don't want to tear down the BT service so that I can update available WiFi networks while the client could be inputting their Robot Part ID (i.e. trying to input information that has nothing to do with the WiFi networks)
|
@ale7714 @abe-winter Took you both off until I've addressed James' comments so we're not duplicating reviews here! Thanks |
|
@maxhorowitz @Otterverse I did a very quick googling about the 20 character limit because I'm sure other folks want to send data over BT that is larger than 20 chars and the recommended pattern to follow is to chunk your data in 20 byte slices. Something like Not actual working code just to illustrate what i mean. Instead of trying to shorten the json payload, I suspect this approach will be a bit more resilient long term. |
Summary
This PR encompasses everything needed for adding bluetooth provisioning as a default provisioning method in the
viam-agent(and is designed to run in parallel to the existing WiFi hotspot provisioning method).Setup
I want you to test as if you are a customer who has just received a machine in the mail. The machine will just have a
viam-agentbinary installation. It will not have WiFi credentials (and thus no internet connectivity).Machine setup
Using a Linux machine/laptop (that you are comfortable getting your
/opt/viam/*,/etc/viam/*and other files tampered with) , pull down this branch. Because you will need to remove connectivity to emulate the user perspective, please do not SSH into a machine over WiFi (and don't use Ethernet because it can interfere with provisioning). Instead, test directly on the device (keyboard + mouse) needed.Uninstall Viam
First, you will need to remove any Viam-related stuff from your computer (to emulate the "newly-out-of-the-box" scenario our customers will be in). Please run the following from the repository root:
sudo ./uninstall.shRemove local networks
You will then need to remove local networks from your machine to emulate the "offline-ness" that our customers will be dealing with. I've been running
nmcli con showand subsequentlysudo nmcli con delete <name>(for every network that could "interfere" with the provisioning flow).Preinstall the Viam Agent
Then, run the following (again from the repository root):
makesudo ./bin/viam-agent-custom-aarch64 --installsudo /opt/viam/bin/viam-agent --debugI've been running the commands altogether as such:
make && sudo ./bin/viam-agent-custom-aarch64 --install && sudo /opt/viam/bin/viam-agent --debugPhone setup
Download the LightBlue app (available on both Android and iPhone). This is what we will use to communicate with the bluetooth service that is being advertised from the Linux machine.
Testing Procedure
At this point, we are ready to test.
LightBlue
Pairing
In the app, use the search bar to find the name of the Linux machine that you are currently testing against. It should be there. Pair with your machine (one of the tests is to see if the pairing request is automatically accepted on the machine side, so I am hoping this part works!).
If it does not pair, check your screen for a six-digit pairing code and accept the request manually. This is a known limitation that I am working on fixing. The TLDR is it's complicated because of reliance on systemd bus signals which may be picked up and discarded as negligible elsewhere in the
viam-agent.Bluetooth service used to "transmit" credentials
Once paired, look at the bluetooth service and nested characteristics available to us. You should be able to take the provided UUIDs and map them to the logs on your
viam-agentmachine:There is an encoding that will be helpful for you to know here. Characteristics whose last 4 characters of their first 8 character sequence (preceding the
-) will always be the following:xxxx1111-...is the encompassing servicexxxx2222-...is the write-only characteristic for SSIDxxxx3333-...is the write-only characteristic for passkeyxxxx4444-...is the write-only characteristic for part IDxxxx5555-...is the write-only characteristic for part secretxxxx6666-...is the write-only characteristic for app addressxxxx7777-...is the read-only characteristic for nearby available WiFi networks that the machine has detectedThe LightBlue interface is pretty easy to follow. You may need to change LightBlue messages from hex or binary to utf-8 string. Once confirmed you're in utf-8 string mode, you can write individual messages to the SSID, passkey, part ID, part secret, and app address "characteristics" by clicking "write value." Similarly, you can read from the available WiFi networks "characteristic" by clicking "read value." Any value written will get "sucked in" to the
viam-agentprovisioning flow. You can check this in theviam-agentlogs. Once all values are submitted, the provisioning loop should close out the BT connection, connect to WiFi, retrieve its cloud config, and should start up aviam-server(thus ending the provisioning loop).Cases