# MSC3898: Native Matrix VoIP signalling for cascaded SFUs

[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401)
specifies how full-mesh group calls work in Matrix. While that MSC works well
for small group calls, it does not work so well for large conferences due to
bandwidth (and other) issues.

Selective Forwarding Units (SFUs) are servers which forward WebRTC streams
between peers (which could be clients, SFUs, or both). To make use of them
effectively, peers need to be able to tell the SFU which streams they want to
receive and at what resolutions.

To solve the issue of centralization, SFUs are also allowed to connect to each
other ("cascade"), and therefore peers also need a way to tell an SFU which
other SFUs to connect to.

## Proposal

**TODO: spell out how this works with active speaker detection & associated
signalling**

**TODO: spell out how the DC traffic interacts with application-layer traffic**

**TODO: how do we prove to the SFU that we have the right to subscribe to a
track?**

### Diagrams

Diagrams of how this all fits together can be found in
[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401).

### State events

#### `m.call` state event

This MSC proposes adding an _optional_ `m.foci` field to the `m.call` state
event. It is a list of recommended SFUs that the call initiator can recommend
to users who do not want to use their own SFU (because they don't have one, or
because they would be the only person on their SFU for their call, and so
choose to connect directly to save bandwidth).

For instance:

```json
{
  "type": "m.call",
  "state_key": "cvsiu2893",
  "content": {
    "m.intent": "m.room",
    "m.type": "m.voice",
    "m.name": "Voice room",
    "m.foci": [
      { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" },
      { "user_id": "@sfu-nyc:matrix.org", "device_id": "VT4GA35VS" }
    ]
  }
}
```

#### `m.call.member` state event

This MSC proposes adding an _optional_ `m.foci` field to the `m.call.member`
state event. It is used if the user wants to be contacted via an SFU rather
than called directly (either 1:1 or full mesh).

For instance:

```json
{
  "type": "m.call.member",
  "state_key": "@matthew:matrix.org",
  "content": {
    "m.calls": [
      {
        "m.call_id": "cvsiu2893",
        "m.foci": [
          { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" },
          { "user_id": "@sfu-nyc:matrix.org", "device_id": "VT4GA35VS" }
        ],
        "m.devices": [...]
      }
    ],
    "m.expires_ts": 1654616071686
  }
}
```

### Choosing an SFU

**TODO: How does a client discover SFUs?**

**TODO: Is an SFU identified by just `user_id` or by `(user_id, device_id)`?**

* When initiating a group call, we need to decide which devices to actually
  talk to.
* If the client has no SFU configured, we try to use the `m.foci` in the
  `m.call` event.
  * If there are multiple `m.foci`, we select the closest one based on
    latency, e.g. by trying to connect to all of them simultaneously and
    discarding all but the first call to answer.
  * If there are no `m.foci` in the `m.call` event, then we look at which foci
    in `m.call.member` are already in use by existing participants, and
    select the most common one. (If that focus is overloaded it can reject us
    and we should then try the next most populous one, etc.)
  * If there are no `m.foci` in the `m.call.member` events, then we connect
    full mesh.
  * If `m.foci` are subsequently introduced into the conference, then we
    should transfer the call to them (effectively doing a 1:1->group call
    upgrade).
* If the client does have an SFU configured, then we decide whether to use it.
  * If other conference participants are already using it, then we use it.
  * If there are other users from our homeserver in the conference, then we
    use it (as presumably they should be using it too).
  * If there are no other `m.foci` (either in the `m.call` event or in the
    participant state), then we use it.
  * Otherwise, we save bandwidth on our SFU by not cascading, instead
    behaving as if we had no SFU configured.
* We do not recommend that users utilise an SFU to hide behind for privacy;
  a TURN server, providing only relay candidates, is better suited to this,
  rather than consuming SFU resources and unnecessarily mandating the presence
  of an SFU.

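The "most populous focus" step above can be sketched as follows. This is a
minimal illustration, not part of the MSC: the `Focus`/`MemberCall` shapes and
the `pickMostPopulousFocus` helper are hypothetical, and a real client would
also probe latency and handle rejections by falling back to the next choice.

```typescript
// Hypothetical shapes for illustration; field names mirror the state events above.
interface Focus { user_id: string; device_id: string; }
interface MemberCall { call_id: string; foci: Focus[]; }

// Pick the focus already advertised by the most existing participants,
// falling back to full mesh (null) when no participant advertises any focus.
function pickMostPopulousFocus(memberCalls: MemberCall[]): Focus | null {
  const counts = new Map<string, { focus: Focus; count: number }>();
  for (const call of memberCalls) {
    for (const focus of call.foci) {
      const key = `${focus.user_id}/${focus.device_id}`;
      const entry = counts.get(key) ?? { focus, count: 0 };
      entry.count++;
      counts.set(key, entry);
    }
  }
  let best: { focus: Focus; count: number } | null = null;
  for (const entry of counts.values()) {
    if (!best || entry.count > best.count) best = entry;
  }
  return best?.focus ?? null;
}
```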
### Initial offer/answer dance

During the initial offer/answer dance, the client establishes a data channel
between itself and the SFU to use later for rapid signalling.

### Simulcast

#### RTP munging

#### VP8 munging

### RTCP re-transmission

### Data-channel messaging

The client uses the established data channel to the SFU to perform low-latency
signalling: rapidly (un)subscribing to and (un)publishing streams, sending
keep-alive messages and metadata, cascading, and performing renegotiation.

**TODO: It feels like these ought to be `m.` namespaced**

**TODO: Why `op` instead of `type`?**

**TODO: It feels like these ought to have `content` rather than being on the
same layer**

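The data-channel messages are plain JSON objects discriminated by their `op`
field. A minimal framing sketch, under stated assumptions: the
`DataChannelLike` interface is a stand-in for the send surface of an
`RTCDataChannel` (so the sketch is runtime-agnostic), and the `SfuOp` union
only covers the ops whose payloads this MSC currently shows.

```typescript
// Stand-in for the send surface of an RTCDataChannel, so this sketch does
// not depend on a browser runtime.
interface DataChannelLike {
  send(data: string): void;
}

// Ops with payloads shown in this MSC; the exact shapes are still TODO.
type SfuOp =
  | { op: "subscribe"; start: unknown[] }
  | { op: "unsubscribe"; stop: unknown[] }
  | { op: "metadata"; metadata: unknown }
  | { op: "alive" }
  | { op: "connect" };

// Serialize an op onto the data channel.
function sendOp(channel: DataChannelLike, message: SfuOp): void {
  channel.send(JSON.stringify(message));
}

// Parse an incoming message and check it carries an `op` discriminator.
function parseOp(data: string): SfuOp {
  const message = JSON.parse(data);
  if (typeof message.op !== "string") throw new Error("missing op field");
  return message as SfuOp;
}
```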
#### SDP Stream Metadata extension

The client will be receiving multiple streams from the SFU and will need to be
able to distinguish them; this therefore builds on
[MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077) and
[MSC3291](https://github.com/matrix-org/matrix-spec-proposals/pull/3291) to
provide the client with the necessary metadata. Some of the data-channel
events include a `metadata` field carrying a description of the stream being
sent either from the SFU to the client or from the client to the SFU.

Other than mute information and stream purpose, the metadata includes the
video track resolution. The SFU may not be able to determine the resolution of
a track itself, but it does need to know it for simulcast; therefore, we
include it in the metadata.

```json
{
  "streamId1": {
    "purpose": "m.usermedia",
    "audio_muted": false,
    "video_muted": true,
    "tracks": {
      "trackId1": {
        "width": 1920,
        "height": 1080
      },
      "trackId2": {}
    }
  }
}
```

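The metadata shape above can be captured in TypeScript as follows. These type
names are illustrative (the MSC does not prescribe them); the `purpose` values
are the ones defined by MSC3077.

```typescript
// Illustrative shapes mirroring the metadata JSON above.
interface TrackMetadata {
  // Resolution is only present for video tracks whose size is known.
  width?: number;
  height?: number;
}

interface StreamMetadata {
  purpose: "m.usermedia" | "m.screenshare";
  audio_muted: boolean;
  video_muted: boolean;
  // Keyed by track ID.
  tracks: Record<string, TrackMetadata>;
}

// The `metadata` field is keyed by stream ID.
type SdpStreamMetadata = Record<string, StreamMetadata>;

const example: SdpStreamMetadata = {
  streamId1: {
    purpose: "m.usermedia",
    audio_muted: false,
    video_muted: true,
    tracks: {
      trackId1: { width: 1920, height: 1080 },
      trackId2: {},
    },
  },
};
```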
#### Event types

##### Subscribe

This event is sent by the client to request a set of tracks. In the case of
video tracks, the client can also request a specific resolution for a given
track; this is the resolution the client wishes to receive, but the SFU may
send a lower one due to bandwidth etc.

If the user, for example, switches from "spotlight" (one large tile) to "grid"
(multiple small tiles) view, the client should also send this request to let
the SFU know of the resolution change.

```json
{
  "op": "subscribe",
  "start": [
    {
      "stream_id": "streamId1",
      "track_id": "trackId1",
      "width": 1920,
      "height": 1080
    }
  ]
}
```

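The layout-change case can be sketched as below. The `buildSubscribe` helper
and the particular tile dimensions are illustrative assumptions, not part of
this MSC; the point is that the requested resolution tracks what the layout
will actually render.

```typescript
interface TrackRequest {
  stream_id: string;
  track_id: string;
}

interface TrackSubscription extends TrackRequest {
  width: number;
  height: number;
}

// Build a subscribe op requesting each visible track at the resolution the
// current layout will render it at (illustrative tile sizes).
function buildSubscribe(
  tracks: TrackRequest[],
  layout: "spotlight" | "grid",
): { op: "subscribe"; start: TrackSubscription[] } {
  const [width, height] = layout === "spotlight" ? [1920, 1080] : [320, 180];
  return {
    op: "subscribe",
    start: tracks.map((t) => ({ ...t, width, height })),
  };
}
```

Re-sending this op after a layout switch is how the client lets the SFU drop to
a lower simulcast layer instead of wasting bandwidth on full-resolution video
for a thumbnail-sized tile.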
##### Unsubscribe

```json
{
  "op": "unsubscribe",
  "stop": [
    {
      "stream_id": "streamId1",
      "track_id": "trackId1"
    }
  ]
}
```

##### Publish

##### Unpublish

##### Offer

##### Answer

##### Metadata

```json
{
  "op": "metadata",
  "metadata": {...} // As specified in the Metadata section
}
```

##### Keep-alive

```json
{
  "op": "alive"
}
```

##### Connect

If a user is using their SFU in a call, it will need to know how to connect to
the other SFUs present in order to participate in the full mesh of SFU traffic
(if any). The client is responsible for telling it, using the `connect` op.

```json
{
  "op": "connect"
  // TODO: How should this look?
}
```

### Encryption

When SFUs are on the media path, they will necessarily terminate the SRTP
traffic from the peer, breaking E2EE. To address this, we apply an additional
end-to-end layer of encryption to the media using [WebRTC Encoded
Transform](https://github.com/w3c/webrtc-encoded-transform/blob/main/explainer.md)
(formerly Insertable Streams) via
[SFrame](https://datatracker.ietf.org/doc/draft-omara-sframe/).

In order to provide PFS, the symmetric key used for these streams from a given
participating device is a megolm key. Unlike a normal megolm key, this is
shared via `m.room_key` over Olm to the devices participating in the
conference, including an `m.call_id` and `m.room_id` field on the key to
correlate it to the conference traffic, rather than using the `session_id`
event field to correlate (given the encrypted traffic is SRTP rather than
events, and we don't want to have to send fake events from all senders every
time the megolm session is replaced).

The megolm key is ratcheted forward for every SFrame and shared with new
participants at the current index via `m.room_key` over Olm, as above. When
participants leave, a new megolm session is created and shared with all
participants over Olm. The new session is only used once all participants have
received it.

## Potential issues

The SFUs participating in a conference end up in a full mesh. Rather than
inventing our own spanning-tree system for SFUs, however, we should fix this
for Matrix as a whole (as is happening in the LB work) and use a Pinecone tree
or similar to decide what better-than-full-mesh topology to use. In practice,
full-mesh cascading between SFUs is probably not that bad (especially if SFUs
only request the streams over the trunk that their clients care about), and in
aggregate it will be less obnoxious than all the clients hitting a single SFU.

Too many foci will chew bandwidth due to the full mesh between them. In the
worst case, if every user is on their own homeserver and picks a different
focus, it degenerates into a full-mesh call (just server-side rather than
client-side). Hopefully this shouldn't happen, as users will converge on the
SFU with the most clients, but we need to check how this works in practice.

SFrame currently mandates its own ratchet, which is almost the same as megolm
but not quite. Switching it out for megolm seems reasonable for now (at least
until MLS comes along).

## Alternatives

An option would be to treat 1:1 (and full-mesh) calls entirely differently to
SFU-based calling rather than trying to unify them. It is also debatable
whether supporting full mesh is useful at all. In the end, though, it feels
like unifying 1:1 and SFU calling is for the best, as it gives you the ability
to trivially upgrade 1:1 calls to group calls and vice versa, and avoids
maintaining two separate hunks of spec. It also forces 1:1 calls to take
multi-stream calls seriously, which is useful for more exotic capture devices
(stereo cameras, 3D cameras, surround sound, audio fields, etc.).

### Cascading

One option here is for SFUs to act as an AS and sniff the `m.call.member`
traffic of their associated server, automatically calling any other `m.foci`
which appear. (They don't need to make outbound calls to clients, as clients
always dial in.)

## Security considerations

Malicious users could try to DoS SFUs by specifying them as their foci.

SFrame E2EE may go horribly wrong if we can't send the new megolm session fast
enough to all the participants when a participant leaves (and meanwhile, if we
keep using the old session, we're technically leaking call media to the parted
participant until we manage to rotate).

We need to ensure there's no scope for media forwarding loops through SFUs.

In order to ensure that only legitimate users are allowed to subscribe to a
given `conf_id` on an SFU, it would make sense for the SFU to act as an AS and
sniff the `m.call` events on its associated server, and only act on to-device
`m.call.*` events which come from a user who is confirmed to be in the room
for that `m.call`. (In practice, if the conference is E2EE then it's of
limited use to connect to the SFU without having the keys to decrypt the
traffic, but this feature is desirable for non-E2EE conferences and to stop
bandwidth DoS.)

## Unstable prefixes

We probably don't need these for the data-channel messages?