Reviewed file: `proposals/4354-sticky-events.md` (331 additions, 0 deletions)

**Review comments on this file**

**@turt2live (Member), Sep 16, 2025:**

Implementation requirements:

* Client (usage)
* Client (receive/handle)
* Server

**Contributor:**

The implementations for the client do not yet implement the latest versions of this MSC. This is currently in progress.

**Member:**

@Half-Shot, which parts are those specifically? A review of the implementations appears to show things being set up in a mostly-correct way. (I have no context on what transpired on this MSC between proto-MSC and now.)

**Contributor:**

There were changes around moving from 4-tuples to 3-tuples in the key mapping, and then removing the requirement for the key mapping altogether. This is now implemented in matrix-org/matrix-js-sdk#5028. I'm happier that the SDK side is plausible now.

We have tested local calls with this MSC and it seems to work fine, but not federated calls. I don't actually see the need to block on federated calls myself; the application layer should be happy.

**Contributor:**

The 3-tuples have become 4-tuples again, so I'll need to check that this all still works.

**Contributor:**

Failed to check in sooner, but this still works.

# MSC4354: Sticky Events

MatrixRTC currently depends on [MSC3757](https://github.com/matrix-org/matrix-spec-proposals/pull/3757)
for sending per-user per-device state. MatrixRTC wants to be able to share temporary state with all
users in a room to indicate whether a given client is in the call or not.

The concerns with MSC3757 and using it for MatrixRTC are mainly:

1. In order to ensure other users are unable to modify each other’s state, it proposes using
string packing for authorization which feels wrong, given the structured nature of events.
2. Allowing unprivileged users to send arbitrary amounts of state into the room is a potential
abuse vector: these states can pile up and can never be cleaned up, because the DAG is append-only.
3. State resolution can cause rollbacks. These rollbacks may inadvertently affect per-user per-device state.

Other proposals have similar problems, such as live location sharing, which uses state events when it
really just wants per-user last-write-wins behaviour.

There currently exists no good communication primitive in Matrix to send this kind of data. EDUs are
almost the right primitive, but:

* They can’t be sent via clients (there is no concept of EDUs in the Client-Server API!
[MSC2477](https://github.com/matrix-org/matrix-spec-proposals/pull/2477) tries to change that)
* They aren’t extensible.
* They do not guarantee delivery. Each EDU type has slightly different persistence/delivery guarantees,
all of which currently fall short of guaranteeing delivery.

This proposal adds such a primitive, called Sticky Events, which provides the following guarantees:

* Eventual delivery (with timeouts) and convergence.
* Access control tied to the joined members in the room.
* Extensible, able to be sent by clients.

This new primitive can be used to implement MatrixRTC participation and live location sharing, among other functionality.

## Proposal

Message events can be annotated with a new top-level `sticky` key, which MUST contain a `duration_ms` field:
the number of milliseconds for which the event is sticky. The presence of `sticky.duration_ms`
with a valid value makes the event “sticky”[^stickyobj]. Valid values are integers in the range 0-3600000 (1 hour).

```json
{
  "type": "m.rtc.member",
  "sticky": {
    "duration_ms": 600000
  },
  "sender": "@alice:example.com",
  "room_id": "!foo",
  "origin_server_ts": 1757920344000,
  "content": { ... }
}
```

This key can be set by clients in the CS API via a new query parameter, `stick_duration_ms`, which is
added to the following endpoints (see the example request below):

* `PUT /_matrix/client/v3/rooms/{roomId}/send/{eventType}/{txnId}`
* `PUT /_matrix/client/v3/rooms/{roomId}/state/{eventType}/{stateKey}`
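
For illustration, a client might send a sticky event with a 10-minute duration like this (the room ID, event type, transaction ID and body are placeholders; the homeserver would then include the corresponding `sticky.duration_ms` in the resulting event):

```
PUT /_matrix/client/v3/rooms/!foo:example.com/send/m.foo/txn123?stick_duration_ms=600000
Content-Type: application/json

{
  "some_key": "some_value"
}
```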

To calculate whether any sticky event is still sticky (see the sketch below):

* Calculate the start time:
  * The start time is `min(now, origin_server_ts)`. This ensures that malicious origin timestamps cannot
    specify start times in the future.
  * If the event is pushed via `/send`, servers MAY use the current time as the start time. This minimises
    the risk of clock skew causing the start time to be too far in the past. See “Potential issues > Time”.
* Calculate the end time as `start_time + min(sticky.duration_ms, 3600000)`.
* If the end time is in the future, the event remains sticky.
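
A minimal TypeScript sketch of this check, written from the perspective of a server or client evaluating a received event. The names are illustrative, not part of this proposal; `receivedAtMs` models the optional use of the `/send` arrival time as the start time:

```typescript
const MAX_STICKY_DURATION_MS = 3_600_000; // hard cap of 1 hour

interface StickyCandidate {
    origin_server_ts: number;
    sticky?: { duration_ms?: number };
}

function isSticky(event: StickyCandidate, nowMs: number, receivedAtMs?: number): boolean {
    const duration = event.sticky?.duration_ms;
    if (typeof duration !== "number" || duration < 0) {
        return false; // no valid sticky.duration_ms means the event is not sticky
    }
    // Start time is bounded by `now` so future origin timestamps gain nothing.
    // Servers MAY substitute the time the event arrived via /send.
    const start = Math.min(nowMs, receivedAtMs ?? event.origin_server_ts);
    // Durations above the cap are clamped, per the calculation above.
    const end = start + Math.min(duration, MAX_STICKY_DURATION_MS);
    return end > nowMs;
}
```

A server could run such a check both when deciding what to place in the `sticky` section of `/sync` and when deciding whether to keep retrying delivery over federation.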

Sticky events are like normal message events and are authorised using normal PDU checks. They have the
following _additional_ properties:

* They are eagerly synchronised with all other servers.[^partial]
* They must appear in the `/sync` response.[^sync]
* The soft-failure checks MUST be re-evaluated when the membership state changes for a user with unexpired sticky events.[^softfail]

To implement these properties, servers MUST:

* Attempt to send all sticky events to all joined servers, whilst respecting per-server backoff times.
A large volume of outbound events MUST NOT cause sticky events to be dropped from the server’s send queue
(a sketch of such a queue follows this list).
* Ensure all sticky events are delivered to clients via `/sync` in a new section of the sync response,
regardless of whether the sticky event falls within the timeline limit of the request.
* When a new server joins the room, the server MUST attempt delivery of all sticky events immediately.
* Remember sticky events per-user, per-room such that the soft-failure checks can be re-evaluated.
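
As a sketch of the first requirement, a per-destination federation send queue might retain unexpired sticky events across failures and back off between attempts. This is illustrative only: the queue structure and the `sendTransaction` helper are assumptions, not part of this proposal, and it reuses the `isSticky` helper from the earlier sketch.

```typescript
type PDU = { origin_server_ts: number; sticky?: { duration_ms?: number }; [k: string]: unknown };

class DestinationQueue {
    private pending: PDU[] = [];
    private backoffMs = 1_000;

    constructor(
        private readonly destination: string,
        // Hypothetical helper that PUTs a federation transaction to the destination.
        private readonly sendTransaction: (dest: string, pdus: PDU[]) => Promise<void>,
    ) {}

    enqueue(pdu: PDU): void {
        this.pending.push(pdu);
    }

    async flush(): Promise<void> {
        while (this.pending.length > 0) {
            const batch = this.pending.slice(0, 50); // federation transactions carry at most 50 PDUs
            try {
                await this.sendTransaction(this.destination, batch);
                this.pending.splice(0, batch.length);
                this.backoffMs = 1_000; // reset backoff on success
            } catch {
                // Delivery failed (e.g. the destination is down or rate-limiting us).
                // Drop sticky events that have since expired; keep everything else
                // queued so unexpired sticky events are eventually delivered.
                this.pending = this.pending.filter(p => !p.sticky || isSticky(p, Date.now()));
                await new Promise(resolve => setTimeout(resolve, this.backoffMs));
                this.backoffMs = Math.min(this.backoffMs * 2, 60_000); // exponential backoff, capped
            }
        }
    }
}
```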

When an event loses its stickiness, these properties disappear with the stickiness. Servers SHOULD NOT
eagerly synchronise such events anymore, nor send them down `/sync`, nor re-evaluate their soft-failure status.
Note: policy servers and other similar antispam techniques still apply to these events.

The new sync section looks like:

```json
{
  "rooms": {
    "join": {
      "!726s6s6q:example.com": {
        "account_data": { ... },
        "ephemeral": { ... },
        "state": { ... },
        "timeline": { ... },
        "sticky": {
          "events": [
            {
              "sender": "@bob:example.com",
              "type": "m.foo",
              "sticky": {
                "duration_ms": 300000
              },
              "origin_server_ts": 1757920344000,
              "content": { ... }
            },
            {
              "sender": "@alice:example.com",
              "type": "m.foo",
              "sticky": {
                "duration_ms": 300000
              },
              "origin_server_ts": 1757920311020,
              "content": { ... }
            }
          ]
        }
      }
    }
  }
}
```

Over Simplified Sliding Sync, Sticky Events have their own extension `sticky_events`, which has the following response shape:

```json
{
  "rooms": {
    "!726s6s6q:example.com": {
      "events": [{
        "sender": "@bob:example.com",
        "type": "m.foo",
        "sticky": {
          "duration_ms": 300000
        },
        "origin_server_ts": 1757920344000,
        "content": { ... }
      }]
    }
  }
}
```

Sticky messages MAY be sent in the timeline section of the `/sync` response, regardless of whether
or not they exceed the timeline limit[^ordering]. If a sticky event is in the timeline, it MAY be
omitted from the `sticky.events` section. This ensures we minimise duplication in the `/sync` response JSON.
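
Because a sticky event may therefore be delivered in either the `timeline` or the `sticky.events` section (and need not appear in both), a client assembling the current set of sticky events for a room might scan both and deduplicate. A minimal sketch, with field names taken from the examples above and `event_id` assumed to be present on events delivered to clients:

```typescript
interface SyncStickyEvent {
    event_id: string;
    sender: string;
    type: string;
    origin_server_ts: number;
    sticky?: { duration_ms?: number };
    content: Record<string, unknown>;
}

interface JoinedRoomSync {
    timeline?: { events: SyncStickyEvent[] };
    sticky?: { events: SyncStickyEvent[] };
}

function collectStickyEvents(room: JoinedRoomSync, now: number): SyncStickyEvent[] {
    const byId = new Map<string, SyncStickyEvent>();
    for (const ev of [...(room.timeline?.events ?? []), ...(room.sticky?.events ?? [])]) {
        const duration = ev.sticky?.duration_ms;
        if (typeof duration !== "number" || duration < 0) continue; // not a sticky event
        // Apply the stickiness calculation from earlier in this proposal.
        const end = Math.min(now, ev.origin_server_ts) + Math.min(duration, 3_600_000);
        if (end > now) byId.set(ev.event_id, ev); // still sticky; deduplicate by event ID
    }
    return [...byId.values()];
}
```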

Servers SHOULD rate limit sticky events over federation. If the rate limit kicks in, servers MUST
return a non-2xx status code from `/send` such that the sending server *retries the request* in order
to guarantee that the sticky event is eventually delivered. Servers MUST NOT silently drop sticky events
and return 200 OK from `/send`, as this breaks the eventual delivery guarantee.

These messages may be combined with [MSC4140: Delayed Events](https://github.com/matrix-org/matrix-spec-proposals/pull/4140)
to provide heartbeat semantics (e.g. required for MatrixRTC). Note that the sticky duration in this proposal
is distinct from that of delayed events. The purpose of the sticky duration in this proposal is to ensure sticky events are cleaned up.

### Implementing a map

MatrixRTC relies on a per-user, per-device map of RTC member events. To implement this, this MSC proposes
a standardised mechanism for determining keys on sticky events, the `content.sticky_key` property:

```json
{
  "type": "m.rtc.member",
  "sticky": {
    "duration_ms": 300000
  },
  "sender": "@alice:example.com",
  "room_id": "!foo",
  "origin_server_ts": 1757920344000,
  "content": {
    "sticky_key": "LAPTOPXX123",
    ...
  }
}
```

`content.sticky_key` is ignored server-side[^encryption] and is purely informational. Clients which
receive a sticky event with a sticky key SHOULD keep a map with keys determined via the 4-tuple
`(room_id, sender, type, content.sticky_key)` to track the current values in the map. Nothing stops
users sending multiple events with the same `sticky_key`. To deterministically tie-break, clients which
implement this behaviour MUST (see the sketch after this list):

- pick the event with the highest `origin_server_ts`,
- tie-break on the event with the highest lexicographical event ID (A < Z).
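
A minimal sketch of this client-side map, keyed on a serialisation of the 4-tuple and applying the tie-breaking rules above. The names and the use of `JSON.stringify` as a key encoding are illustrative, not part of this proposal:

```typescript
interface StickyMapEvent {
    event_id: string;
    room_id: string;
    sender: string;
    type: string;
    origin_server_ts: number;
    content: { sticky_key?: string; [k: string]: unknown };
}

// Current winner for each (room_id, sender, type, sticky_key) 4-tuple.
const stickyMap = new Map<string, StickyMapEvent>();

function mapKey(ev: StickyMapEvent): string | null {
    if (ev.content.sticky_key === undefined) return null;
    return JSON.stringify([ev.room_id, ev.sender, ev.type, ev.content.sticky_key]);
}

function onStickyEvent(ev: StickyMapEvent): void {
    const key = mapKey(ev);
    if (key === null) return; // no sticky_key: the event is not part of the map
    const existing = stickyMap.get(key);
    const wins =
        existing === undefined ||
        ev.origin_server_ts > existing.origin_server_ts ||
        (ev.origin_server_ts === existing.origin_server_ts && ev.event_id > existing.event_id);
    if (wins) stickyMap.set(key, ev);
    // Entries should also be evicted once their event is no longer sticky,
    // e.g. by re-checking the expiry rules described earlier on a timer.
}
```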

When overwriting keys, clients SHOULD use the same sticky duration as the previous sticky event to avoid clients diverging.
Divergence can happen when a client sends a sticky event K with a long timeout, then overwrites the same key with a new event K’
that has a short timeout. If the sticky event K’ fails to be sent to all servers before the short timeout is hit,
some clients will believe the state is K and others will have no state. This will only resolve once the long timeout is hit.

Note that encrypted sticky events will encrypt some parts of the 4-tuple. An encrypted sticky event only exposes the room ID and sender to the server:

```json
{
  "content": {
    "algorithm": "m.megolm.v1.aes-sha2",
    "ciphertext": "AwgCEqABubgx7p8AThCNreFNHqo2XJCG8cMUxwVepsuXAfrIKpdo8UjxyAsA50IOYK6T5cDL4s/OaiUQdyrSGoK5uFnn52vrjMI/+rr8isPzl7+NK3hk1Tm5QEKgqbDJROI7/8rX7I/dK2SfqN08ZUEhatAVxznUeDUH3kJkn+8Onx5E0PmQLSzPokFEi0Z0Zp1RgASX27kGVDl1D4E0vb9EzVMRW1PrbdVkFlGIFM8FE8j3yhNWaWE342eaj24NqnnWJ5VG9l2kT/hlNwUenoGJFMzozjaUlyjRIMpQXqbodjgyQkGacTEdhBuwAQ",
    "device_id": "AAvTvsyf5F",
    "sender_key": "KVMNIv/HyP0QMT11EQW0X8qB7U817CUbqrZZCsDgeFE",
    "session_id": "c4+O+eXPf0qze1bUlH4Etf6ifzpbG3YeDEreTVm+JZU"
  },
  "origin_server_ts": 1757948616527,
  "sender": "@alice:example.com",
  "type": "m.room.encrypted",
  "sticky": {
    "duration_ms": 600000
  },
  "event_id": "$lsFIWE9JcIMWUrY3ZTOKAxT_lIddFWLdK6mqwLxBchk",
  "room_id": "!ffCSThQTiVQJiqvZjY:matrix.org"
}
```

The decrypted event would contain the `type` and `content.sticky_key`.
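
A sketch of how a client might feed such an event into the map above, reusing `onStickyEvent` from the previous sketch. `decryptEvent` is a hypothetical helper (in practice, an SDK's Megolm decryption) returning the cleartext `type` and `content`:

```typescript
interface EncryptedEnvelope {
    event_id: string;
    room_id: string;
    sender: string;
    type: "m.room.encrypted";
    origin_server_ts: number;
    sticky?: { duration_ms: number };
    content: Record<string, unknown>;
}

// Hypothetical: decrypts the envelope and returns the cleartext event payload.
declare function decryptEvent(
    ev: EncryptedEnvelope,
): Promise<{ type: string; content: { sticky_key?: string; [k: string]: unknown } }>;

async function handleEncryptedSticky(ev: EncryptedEnvelope): Promise<void> {
    const clear = await decryptEvent(ev);
    // The map key combines cleartext fields (type, sticky_key) with envelope
    // fields the server could already see (room_id, sender).
    onStickyEvent({
        event_id: ev.event_id,
        room_id: ev.room_id,
        sender: ev.sender,
        type: clear.type,
        origin_server_ts: ev.origin_server_ts,
        content: clear.content,
    });
}
```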

## Potential issues

### Time

Servers that can’t maintain a correct clock frequency may expire sticky events at a slightly slower or faster rate
than other servers. As the maximum timeout is relatively low, the total deviation is also reasonably low,
making this less problematic. The alternative of explicitly sending an expiration event would likely cause
more deviation (due to retries) than clock drift does.

Servers with significant clock skew may set `origin_server_ts` too far in the past or future. If the value
is too far in the past this will cause sticky events to expire quicker than they should, or to always be
treated as expired. If the value is too far in the future, this has no effect as it is bounded by the current time.
As such, this proposal relies somewhat on NTP to ensure clocks over federation are roughly in sync.
As a consequence of this, the sticky duration SHOULD NOT be set to below 5 minutes.[^ttl]

### Encryption

Encrypted sticky events reduce reliability: for a sticky event to be visible to the end user, it requires *both*
the sending client to think the receiver is joined (so it encrypts for their devices) and the
receiving server to think the sender is joined (so the event passes auth checks). Unencrypted events only strictly
require the receiving server to think the sender is joined.

The lack of historical room key sharing may make some encrypted sticky events undecryptable when new users join the room.

### Spam

Servers may send every event as a sticky event, causing a larger number of events to be sent eagerly over federation
and down `/sync` to clients. The former is already an issue, as servers can simply `/send` many events.
The latter is a new abuse vector, as up until this point the `timeline_limit` would restrict the amount of events
that arrive on client devices (only state events are unbounded and setting state is a privileged operation).
This proposal has the following protections in place:

* All sticky events expire, with a hard limit of 1 hour. The hard limit ensures that servers cannot set years-long expiry times.
This ensures that the data in the `/sync` response can go down and not grow unbounded.
* All sticky events are subject to normal PDU checks, meaning that the sender must be authorised to send events into the room.
* Servers sending lots of sticky events may be asked to try again later as a form of rate-limiting.
Due to data expiring, subsequent requests will gradually have less data.

**Review comment from @MadLittleMods (Contributor), on the `timeline_limit` paragraph above:**

**Unbounded number of sticky events in Sliding Sync response**

In the current Simplified Sliding Sync extension implementation in Synapse, it will return all unexpired sticky events in the room on initial sync, and is also unbounded for incremental syncs.

For example, it doesn't seem fine to return 100k+ sticky events. The problem is the amount of work and time needed to come up with the giant 100k response and the amount of network effort to get that back to the client. It's the same reason why we have a `timeline_limit` for timeline events. One may say that this won't happen because of rate limits, etc., but it seems possible for a large public event room. Anyone can send a sticky event and we already have examples of large rooms.

**Adding a limit**

At a minimum, I think we need some limit and a new endpoint to paginate further.

**Going further**

We're tackling similar problems across a few MSCs now. This feels like the exact same problem as the threads extension. In both of these cases, we are trying to tackle the same problem of gathering a subset of the events in the timeline. Here is what I previously wrote about it:

> I think the better way to go about it is to apply the same pattern that we just worked through with thread subscriptions. We would need a few things:
>
> * A way to return the new sticky events in `/sync`
> * A dedicated `sticky_events_limited` flag when there are other sticky events that we didn't return
> * A pagination endpoint for the sticky events
>
> To be more detailed, these could be normal timeline events. If there are more sticky events in between the given sync token and the current position, we set the `sticky_events_limited` flag. For the pagination endpoint, we could overload `/messages` with a new filter.
>
> -- @MadLittleMods, internal room

> With some more thinking on the subject, the dedicated Sliding Sync extension worked well for `thread_subscriptions` because thread subscriptions aren't part of the response already, whereas sticky events and thread updates are already part of the timeline.
>
> To break down the list of things needed (as listed above) and my current thinking:
>
> * "A way to return the new sticky events in `/sync`": in the case of threads and sticky events, these events are already included as part of the timeline, and `/sync` already keeps you up to date with all of the events.
> * "A pagination endpoint for the sticky events": the dedicated [pagination endpoint] makes sense to back-paginate the gaps (whenever the timeline is `limited: true`).
> * "Dedicated `sticky_events_limited` flag when there are other sticky events that we didn't return": with more thinking, I think this one is optional. We could have an extension that indicates whether [sticky_events_limited], but this only saves a few extra requests in the cases where the timeline is `limited: true` but there weren't actually any new thread updates in the gap.
>
> -- @MadLittleMods, #4360 (comment)

Previous piece-meal discussions:

## Alternatives

### Use state events

We could do [MSC3757](https://github.com/matrix-org/matrix-spec-proposals/pull/3757), but for the
reasons mentioned at the start we don’t really want to do so.

### Make stickiness persistent not ephemeral

There are arguments that, at least for some use cases, we don’t want these sticky events to time out.
However, that opens the possibility of bloating the `/sync` response with sticky events.

Suggestions for minimizing that have been to have a hard limit on the number of sticky events a user can have per room,
instead of a timeout. However, this has two drawbacks: a) you still may end up with substantial bloat as stale data doesn’t
automatically get reaped (even if the amount of bloat is limited), and b) what do clients do if there are already too many
sticky events? The latter is tricky, as deleting the oldest may not be what the user wants if it happens to be not-stale data,
and asking the user what data it wants to delete vs keep is unergonomic.

Non-expiring sticky events could be added later if the above issues are resolved.

### Have a dedicated ‘ephemeral user state’ section

Early prototypes of this proposal devised a key-value map with timeouts maintained over EDUs rather than PDUs.
This early proposal had much the same feature set as this proposal but with one major difference: equivocation.
Servers could broadcast different values for the same key to different servers, causing the map to not converge:
the Byzantine Broadcast problem. Matrix already has a data structure to agree on shared state: the room DAG.
As such, this led from the prototype to the current proposal. By putting the data into the DAG, other servers
can talk to each other via it to see if they have been told different values. When combined with a simple
conflict resolution algorithm (which works because there is [no need for coordination](https://arxiv.org/abs/1901.01930)),
this provides a way for clients to agree on the same values. Note that in practice this needs servers to *eagerly*
share forward extremities so servers aren’t reliant on unrelated events being sent in order to check for equivocation.
Currently, there is no mechanism for servers to express “these are my latest events, what are yours?” without actually sending another event.

## Security Considerations

Servers may equivocate over federation and send different events to different servers in an attempt to cause
the key-value map maintained by clients to not converge. Alternatively, servers may fail to send sticky events
to their own clients to produce the same outcome. Federation equivocation is mitigated by the events being
persisted in the DAG, as servers can talk to each other to fetch all events. There is no way to protect against
dropped updates for the latter scenario.

## Unstable Prefix

- The `stick_duration_ms` query param is `msc4354_stick_duration_ms`.
- The `sticky` key in the PDU is `msc4354_sticky`.
- The `/sync` response section is `msc4354_sticky_events`.
- The sticky key in the `content` of the PDU is `msc4354_sticky_key`.

[^stickyobj]: The presence of the `sticky` object alone is insufficient.
[^partial]: Over federation, servers are not required to send all timeline events to every other server.
Servers mostly lazy load timeline events, and will rely on clients hitting `/messages` which in turn
hits `/backfill` to request events from federated servers.
[^sync]: Normal timeline events do not always appear in the sync response if the event is more than `timeline_limit` events away.
[^softfail]: Not all servers will agree on soft-failure status due to the check considering the “current state” of the room.
To ensure all servers agree on which events are sticky, we need to re-evaluate this rule when the current room state changes.
This becomes particularly important when room state is rolled back. For example, if Charlie sends some sticky event E and
then Bob kicks Charlie, but concurrently Alice kicks Bob, then whether or not a receiving server would accept E would depend
on whether they saw “Alice kicks Bob” or “Bob kicks Charlie”. If they saw “Alice kicks Bob” then E would be accepted. If they
saw “Bob kicks Charlie” then E would be rejected, and would need to be rolled back when they see “Alice kicks Bob”.
[^ordering]: Sticky events expose gaps in the timeline which cannot be expressed using the current sync API. If sync used
something like [stitched ordering](https://codeberg.org/andybalaam/stitched-order)
or [MSC3871](https://github.com/matrix-org/matrix-spec-proposals/pull/3871) then sticky events could be inserted straight
into the timeline without any additional section, hence “MAY” would enable this behaviour in the future.
[^encryption]: Previous versions of this proposal had the key be at the top-level of the event JSON so servers could
implement map-like semantics on clients’ behalf. However, this would force the key to remain visible to the server and
thus leak metadata. As a result, the key now falls within the encrypted `content` payload, and clients are expected to
implement the map-like semantics should they wish to.
[^ttl]: Earlier designs had servers inject a new `unsigned.ttl_ms` field into the PDU to say how many milliseconds were left.
This was problematic because it would have to be modified every time the server attempted delivery of the event to another server.
Furthermore, it didn’t really add any more protection because it assumed servers honestly set the value.
Malicious servers could set the TTL to anything between 0 and `sticky.duration_ms`, ensuring maximum divergence
on whether or not an event was sticky. In contrast, using `origin_server_ts` is a consistent reference point
that all servers are guaranteed to see, limiting the ability for malicious servers to cause divergence as all
servers approximately track NTP.