# MSC3898: Native Matrix VoIP signalling for cascaded SFUs

[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401)
specifies how full-mesh group calls work in Matrix. While that MSC works well
for small group calls, it does not work so well for large conferences due to
bandwidth (and other) issues.

Selective Forwarding Units (SFUs) are servers which forward WebRTC streams
between peers (which could be clients, SFUs, or both). To make use of them
effectively, peers need to be able to tell the SFU which streams they want to
receive and at what resolutions.

To solve the issue of centralization, SFUs are also allowed to connect to each
other ("cascade"), and therefore peers also need a way to tell an SFU which
other SFUs to connect to.

## Proposal

**TODO: spell out how this works with active speaker detection & associated
signalling**

**TODO: spell out how the DC traffic interacts with application-layer traffic**

**TODO: how do we prove to the SFU that we have the right to subscribe to a
track?**

### Diagrams

Diagrams of how this all fits together can be found in
[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401).

### State events

#### `m.call` state event

This MSC proposes adding an _optional_ `m.foci` field to the `m.call` state
event. It is a list of recommended SFUs that the call initiator can recommend
to users who do not want to use their own SFU (because they don't have one, or
because they would be the only person on their SFU for their call, and so
choose to connect directly to save bandwidth).

For instance:

```json
{
  "type": "m.call",
  "state_key": "cvsiu2893",
  "content": {
    "m.intent": "m.room",
    "m.type": "m.voice",
    "m.name": "Voice room",
    "m.foci": [
      { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" },
      { "user_id": "@sfu-nyc:matrix.org", "device_id": "VT4GA35VS" }
    ]
  }
}
```

#### `m.call.member` state event

This MSC proposes adding an _optional_ `m.foci` field to the `m.call.member`
state event. It is used if the user wants to be contacted via an SFU rather
than called directly (either 1:1 or full mesh).

For instance:

```json
{
  "type": "m.call.member",
  "state_key": "@matthew:matrix.org",
  "content": {
    "m.calls": [
      {
        "m.call_id": "cvsiu2893",
        "m.foci": [
          { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" },
          { "user_id": "@sfu-nyc:matrix.org", "device_id": "VT4GA35VS" }
        ],
        "m.devices": [...]
      }
    ],
    "m.expires_ts": 1654616071686
  }
}
```

### Choosing an SFU

**TODO: How does a client discover SFUs?**

**TODO: Is an SFU identified by just `user_id` or by `(user_id, device_id)`?**

* When initiating a group call, we need to decide which devices to actually
  talk to.
* If the client has no SFU configured, we try to use the `m.foci` in the
  `m.call` event.
  * If there are multiple `m.foci`, we select the closest one based on
    latency, e.g. by trying to connect to all of them simultaneously and
    discarding all but the first call to answer.
  * If there are no `m.foci` in the `m.call` event, then we look at which foci
    in `m.call.member` are already in use by existing participants, and
    select the most common one. (If that focus is overloaded it can reject us
    and we should then try the next most populous one, etc.)
  * If there are no `m.foci` in the `m.call.member` events, then we connect
    full mesh.
  * If `m.foci` are subsequently introduced into the conference, then we
    should transfer the call to them (effectively doing a 1:1->group call
    upgrade).
* If the client does have an SFU configured, then we decide whether to use it.
  * If other conference participants are already using it, then we use it.
  * If there are other users from our homeserver in the conference, then we
    use it (as presumably they should be using it too).
  * If there are no other `m.foci` (either in the `m.call` event or in the
    participant state), then we use it.
  * Otherwise, we save bandwidth on our SFU by not cascading, instead
    behaving as if we had no SFU configured.
* We do not recommend that users utilise an SFU to hide behind for privacy;
  a TURN server, providing only relay candidates, is better suited to this,
  rather than consuming SFU resources and unnecessarily mandating the presence
  of an SFU.

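The "most populous focus" step above can be sketched as follows. This is a
minimal illustration, not part of the MSC: the `Focus`/`MemberCall` shapes and
the `pickMostPopulousFocus` helper are hypothetical, and a real client would
also probe latency and handle rejections by falling back to the next choice.

```typescript
// Hypothetical shapes for illustration; field names mirror the state events above.
interface Focus { user_id: string; device_id: string; }
interface MemberCall { call_id: string; foci: Focus[]; }

// Pick the focus already advertised by the most existing participants,
// falling back to full mesh (null) when no participant advertises any focus.
function pickMostPopulousFocus(memberCalls: MemberCall[]): Focus | null {
  const counts = new Map<string, { focus: Focus; count: number }>();
  for (const call of memberCalls) {
    for (const focus of call.foci) {
      const key = `${focus.user_id}/${focus.device_id}`;
      const entry = counts.get(key) ?? { focus, count: 0 };
      entry.count++;
      counts.set(key, entry);
    }
  }
  let best: { focus: Focus; count: number } | null = null;
  for (const entry of counts.values()) {
    if (!best || entry.count > best.count) best = entry;
  }
  return best?.focus ?? null;
}
```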
### Initial offer/answer dance

During the initial offer/answer dance, the client establishes a data channel
between itself and the SFU to use later for rapid signalling.

### Simulcast

#### RTP munging

#### VP8 munging

### RTCP re-transmission

### Data-channel messaging

The client uses the established data channel to the SFU to perform low-latency
signalling: rapidly (un)subscribing to and (un)publishing streams, sending
keep-alive messages and metadata, cascading, and performing renegotiation.

**TODO: It feels like these ought to be `m.` namespaced**

**TODO: Why `op` instead of `type`?**

**TODO: It feels like these ought to have `content` rather than being on the
same layer**

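The data-channel messages are plain JSON objects discriminated by their `op`
field. A minimal framing sketch, under stated assumptions: the
`DataChannelLike` interface is a stand-in for the send surface of an
`RTCDataChannel` (so the sketch is runtime-agnostic), and the `SfuOp` union
only covers the ops whose payloads this MSC currently shows.

```typescript
// Stand-in for the send surface of an RTCDataChannel, so this sketch does
// not depend on a browser runtime.
interface DataChannelLike {
  send(data: string): void;
}

// Ops with payloads shown in this MSC; the exact shapes are still TODO.
type SfuOp =
  | { op: "subscribe"; start: unknown[] }
  | { op: "unsubscribe"; stop: unknown[] }
  | { op: "metadata"; metadata: unknown }
  | { op: "alive" }
  | { op: "connect" };

// Serialize an op onto the data channel.
function sendOp(channel: DataChannelLike, message: SfuOp): void {
  channel.send(JSON.stringify(message));
}

// Parse an incoming message and check it carries an `op` discriminator.
function parseOp(data: string): SfuOp {
  const message = JSON.parse(data);
  if (typeof message.op !== "string") throw new Error("missing op field");
  return message as SfuOp;
}
```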
#### SDP Stream Metadata extension

The client will be receiving multiple streams from the SFU and will need to be
able to distinguish them; this therefore builds on
[MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077) and
[MSC3291](https://github.com/matrix-org/matrix-spec-proposals/pull/3291) to
provide the client with the necessary metadata. Some of the data-channel
events include a `metadata` field carrying a description of the stream being
sent either from the SFU to the client or from the client to the SFU.

Other than mute information and stream purpose, the metadata includes the
video track resolution. The SFU may not be able to determine the resolution of
a track itself, but it does need to know it for simulcast; therefore, we
include it in the metadata.

```json
{
  "streamId1": {
    "purpose": "m.usermedia",
    "audio_muted": false,
    "video_muted": true,
    "tracks": {
      "trackId1": {
        "width": 1920,
        "height": 1080
      },
      "trackId2": {}
    }
  }
}
```

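The metadata shape above can be captured in TypeScript as follows. These type
names are illustrative (the MSC does not prescribe them); the `purpose` values
are the ones defined by MSC3077.

```typescript
// Illustrative shapes mirroring the metadata JSON above.
interface TrackMetadata {
  // Resolution is only present for video tracks whose size is known.
  width?: number;
  height?: number;
}

interface StreamMetadata {
  purpose: "m.usermedia" | "m.screenshare";
  audio_muted: boolean;
  video_muted: boolean;
  // Keyed by track ID.
  tracks: Record<string, TrackMetadata>;
}

// The `metadata` field is keyed by stream ID.
type SdpStreamMetadata = Record<string, StreamMetadata>;

const example: SdpStreamMetadata = {
  streamId1: {
    purpose: "m.usermedia",
    audio_muted: false,
    video_muted: true,
    tracks: {
      trackId1: { width: 1920, height: 1080 },
      trackId2: {},
    },
  },
};
```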
#### Event types

##### Subscribe

This event is sent by the client to request a set of tracks. In the case of
video tracks, the client can also request a specific resolution for a given
track; this is the resolution the client wishes to receive, but the SFU may
send a lower one due to bandwidth etc.

If the user, for example, switches from "spotlight" (one large tile) to "grid"
(multiple small tiles) view, the client should also send this request to let
the SFU know of the resolution change.

```json
{
  "op": "subscribe",
  "start": [
    {
      "stream_id": "streamId1",
      "track_id": "trackId1",
      "width": 1920,
      "height": 1080
    }
  ]
}
```

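The layout-change case can be sketched as below. The `buildSubscribe` helper
and the particular tile dimensions are illustrative assumptions, not part of
this MSC; the point is that the requested resolution tracks what the layout
will actually render.

```typescript
interface TrackRequest {
  stream_id: string;
  track_id: string;
}

interface TrackSubscription extends TrackRequest {
  width: number;
  height: number;
}

// Build a subscribe op requesting each visible track at the resolution the
// current layout will render it at (illustrative tile sizes).
function buildSubscribe(
  tracks: TrackRequest[],
  layout: "spotlight" | "grid",
): { op: "subscribe"; start: TrackSubscription[] } {
  const [width, height] = layout === "spotlight" ? [1920, 1080] : [320, 180];
  return {
    op: "subscribe",
    start: tracks.map((t) => ({ ...t, width, height })),
  };
}
```

Re-sending this op after a layout switch is how the client lets the SFU drop to
a lower simulcast layer instead of wasting bandwidth on full-resolution video
for a thumbnail-sized tile.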
##### Unsubscribe

```json
{
  "op": "unsubscribe",
  "stop": [
    {
      "stream_id": "streamId1",
      "track_id": "trackId1"
    }
  ]
}
```

##### Publish

##### Unpublish

##### Offer

##### Answer

##### Metadata

```json
{
  "op": "metadata",
  "metadata": {...} // As specified in the Metadata section
}
```

##### Keep-alive

```json
{
  "op": "alive"
}
```

##### Connect

If a user is using their SFU in a call, it will need to know how to connect to
the other SFUs present in order to participate in the full mesh of SFU traffic
(if any). The client is responsible for telling it, using the `connect` op.

```json
{
  "op": "connect"
  // TODO: How should this look?
}
```

### Encryption

When SFUs are on the media path, they will necessarily terminate the SRTP
traffic from the peer, breaking E2EE. To address this, we apply an additional
end-to-end layer of encryption to the media using [WebRTC Encoded
Transform](https://github.com/w3c/webrtc-encoded-transform/blob/main/explainer.md)
(formerly Insertable Streams) via
[SFrame](https://datatracker.ietf.org/doc/draft-omara-sframe/).

In order to provide PFS, the symmetric key used for these streams from a given
participating device is a megolm key. Unlike a normal megolm key, this is
shared via `m.room_key` over Olm to the devices participating in the
conference, including an `m.call_id` and `m.room_id` field on the key to
correlate it to the conference traffic, rather than using the `session_id`
event field to correlate (given the encrypted traffic is SRTP rather than
events, and we don't want to have to send fake events from all senders every
time the megolm session is replaced).

The megolm key is ratcheted forward for every SFrame and shared with new
participants at the current index via `m.room_key` over Olm, as above. When
participants leave, a new megolm session is created and shared with all
participants over Olm. The new session is only used once all participants have
received it.

## Potential issues

The SFUs participating in a conference end up in a full mesh. Rather than
inventing our own spanning-tree system for SFUs, however, we should fix this
for Matrix as a whole (as is happening in the LB work) and use a Pinecone tree
or similar to decide what better-than-full-mesh topology to use. In practice,
full-mesh cascading between SFUs is probably not that bad (especially if SFUs
only request the streams over the trunk that their clients care about), and in
aggregate it will be less obnoxious than all the clients hitting a single SFU.

Too many foci will chew bandwidth due to the full mesh between them. In the
worst case, if every user is on their own homeserver and picks a different
focus, it degenerates into a full-mesh call (just server-side rather than
client-side). Hopefully this shouldn't happen, as users will converge on the
SFU with the most clients, but we need to check how this works in practice.

SFrame currently mandates its own ratchet, which is almost the same as megolm
but not quite. Switching it out for megolm seems reasonable for now (at least
until MLS comes along).

## Alternatives

An option would be to treat 1:1 (and full-mesh) calls entirely differently to
SFU-based calling rather than trying to unify them. It is also debatable
whether supporting full mesh is useful at all. In the end, though, it feels
like unifying 1:1 and SFU calling is for the best, as it gives you the ability
to trivially upgrade 1:1 calls to group calls and vice versa, and avoids
maintaining two separate hunks of spec. It also forces 1:1 calls to take
multi-stream calls seriously, which is useful for more exotic capture devices
(stereo cameras, 3D cameras, surround sound, audio fields, etc.).

### Cascading

One option here is for SFUs to act as an AS and sniff the `m.call.member`
traffic of their associated server, automatically calling any other `m.foci`
which appear. (They don't need to make outbound calls to clients, as clients
always dial in.)

## Security considerations

Malicious users could try to DoS SFUs by specifying them as their foci.

SFrame E2EE may go horribly wrong if we can't send the new megolm session fast
enough to all the participants when a participant leaves (and meanwhile, if we
keep using the old session, we're technically leaking call media to the parted
participant until we manage to rotate).

We need to ensure there's no scope for media forwarding loops through SFUs.

In order to ensure that only legitimate users are allowed to subscribe to a
given `conf_id` on an SFU, it would make sense for the SFU to act as an AS and
sniff the `m.call` events on its associated server, and only act on to-device
`m.call.*` events which come from a user who is confirmed to be in the room
for that `m.call`. (In practice, if the conference is E2EE then it's of
limited use to connect to the SFU without having the keys to decrypt the
traffic, but this feature is desirable for non-E2EE conferences and to stop
bandwidth DoS.)

## Unstable prefixes

We probably don't need these for the data-channel messages?