# MSC4186: Simplified Sliding Sync

The current `/sync` endpoint scales badly as the number of rooms on an account increases. It scales badly because all
rooms are returned to the client, incremental syncs are unbounded and slow down based on how long the user has been
offline, and clients cannot opt out of a large amount of extraneous data such as receipts. On large accounts with
thousands of rooms, the initial sync operation can take tens of minutes to perform. This significantly delays the
initial login to Matrix clients, and also makes incremental sync very heavy when resuming after any significant pause
in usage.

> Can't data be removed with filters? (See

> Specifically, I believe you can opt-out of receipts with

> One of the difficulties with v3 sync filtering in general is that performance can vary wildly depending on server
> implementation details. The spec doesn't indicate which fast-paths should exist, and whether or not you hit a
> fast-path can also affect how many events are returned in the sync window (matrix-org/matrix-spec#1887).

> I guess this doesn't need to block the MSC, but indeed, "clients cannot opt out of ... receipts" appears to be
> factually incorrect.

Note: this is a “simplified” version of the sliding sync API proposed in
[MSC3575](https://github.com/matrix-org/matrix-spec-proposals/pull/3575), paring that API back based on real-world use
cases and usage.

# Goals

This improved `/sync` mechanism has a number of goals:

- Sync time should be independent of the number of rooms you are in.
- Time from opening the app (when already logged in) to confident usability should be as low as possible.
- Time from login on existing accounts to usability should be as low as possible.
- Bandwidth should be minimized.
- Support lazy-loading of things like read receipts (and avoid sending unnecessary data to the client).
- Support informing the client when room state changes from under it, due to state resolution.
- Clients should be able to work correctly without ever syncing in the full set of rooms they’re in.
- Don’t incrementally sync rooms you don’t care about.
- Servers should not need to store all past `since` tokens. If a `since` token has been discarded we should gracefully
  degrade to initial sync.

These goals shaped the design of this proposal.

# Proposal

> How does SSS interact with the ignored user list? In `/v3/sync`, the server omits events sent by ignored users,
> requiring the client to perform a second initial sync if they want to retroactively see messages when unignoring.
> The choice to do server-side ignored user filtering in sync also doesn't really simplify client implementation,
> because you still need client-side filtering for events received before the user was ignored. My preference would be
> to drop this requirement entirely for SSS, now that we have a chance to do it in a backwards-compatible way.

> [This thread needs resolution]

The core differences between sync v2 and simplified sliding sync are:

- The server initially only sends the most recent N rooms to the client (where N is specified by the client), which can
  then paginate in older rooms in subsequent requests.
- The client can configure which information the server will return for different sets of rooms (e.g. a smaller
  timeline limit for older rooms).
- The client can filter what rooms it is interested in.
- The client can maintain multiple sync loops (with some caveats).
  - This is useful for e.g. iOS clients which have a separate process to deal with notifications, as well as allowing
    the app to split handling of things like encryption entirely from room data.

The basic operation is similar between sync v2 and simplified sliding sync: both use long-polling with tokens to fetch
updates from the server. That is, the basic operation of both APIs is to do an “initial” request and then repeatedly
call the API, supplying the token returned in the previous response in each subsequent “incremental” request.

## Lists and room subscriptions

The core component of a sliding sync request is “lists”, which specify what information to return about which rooms.
Each list specifies some filters on rooms (e.g. ignore spaces), the range of filtered rooms to select (e.g. the most
recent 20 filtered rooms), and the config for the data to return for those rooms (e.g. the required state, timeline
limit, etc.). Rooms are always ordered by when the server received their most recent event.

The client can also specify config for specific rooms if it has their room IDs; these are known as room subscriptions.

Multiple lists and subscriptions can be specified in a request. If a room matches multiple lists/subscriptions then the
config is “combined” to be the superset of all configs (e.g. take the maximum timeline limit). See below for the exact
algorithm.

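As an illustrative sketch only (the normative combination algorithm is still a TODO below), combining two matching
configs might look like this:

```javascript
// Illustrative sketch only: combine two room configs that both matched a
// room into their superset. The normative algorithm is a TODO below.
function combineRoomConfig(a, b) {
  return {
    // Take the larger of the two timeline limits.
    timeline_limit: Math.max(a.timeline_limit, b.timeline_limit),
    // Union the (event type, state key) tuples, deduplicating exact matches.
    required_state: [...new Set(
      [...a.required_state, ...b.required_state].map((t) => JSON.stringify(t))
    )].map((s) => JSON.parse(s)),
  };
}
```
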
The server tracks what data has been sent to the client for which rooms. If a room matches a list or subscription that
hasn’t been sent down before, then the server will respond with the full metadata about the room, indicated by
`initial: true`. If a room stops matching a list (i.e. it falls out of range) then no further updates will be sent
until it starts matching a list again, at which point the missing updates (limited by the `timeline_limit`) will be
sent down. However, as clients are now expected to paginate all rooms in the room list in the background (in order to
correctly order and search them), a room falling out of range is a temporary edge case.

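For illustration, a room being sent down for the first time on a connection might look like the following in the
response. Only `initial` is taken from the text above; the surrounding field names are assumptions, not normative:

```javascript
// Sketch of a per-room entry in the response. `initial: true` signals that
// this is the full metadata for the room on this connection; the other
// fields shown here are assumptions.
"!abc:example.com": {
  "initial": true,
  "required_state": [ ... ],
  "timeline": [ ... ]
}
```
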
## Pagination

Pagination is achieved by the client increasing the ranges of one (or more) lists.

For example, an initial request might have a list called `all_rooms` specifying a range of `0..20`, and the server will
respond with the top 20 rooms (by most recently updated). On the second request the client may change the range to
`0..100`, at which point the server will respond with the top 100 rooms that either a) weren’t sent down in the first
request, or b) have updates since the first request.

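Sketched as request bodies (using the request format defined below; the `timeline_limit`/`required_state` values here
are arbitrary):

```javascript
// First request: ask for the top of the list.
{
  "lists": {
    "all_rooms": { "ranges": [[0, 20]], "timeline_limit": 1, "required_state": [] }
  }
}

// Second request (with `since` set): grow the window. The server only sends
// rooms in the range that are new to this connection or have updates.
{
  "lists": {
    "all_rooms": { "ranges": [[0, 100]], "timeline_limit": 1, "required_state": [] }
  }
}
```
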
Clients can increase and decrease the ranges as they see fit. A common approach would be to start with a small window
and grow it until the range covers all the rooms. After some threshold of the app being offline, the client may reduce
the range back down and incrementally grow it again. This ensures that a limited amount of data is requested at once,
improving response times.

## Connections

Clients can have multiple “connections” (i.e. sync loops) with the server, so long as each connection has a different
`conn_id` set in the request.

Clients must only have a single request in flight at any time per connection (clients can have multiple connections by
specifying a unique `conn_id`). If a client needs to send another request before receiving a response to an in-flight
request (e.g. for retries or to change parameters), the client *must* cancel the in-flight request (at the HTTP level)
and *not* process any response it receives for it.

In particular, a client must use the returned `pos` value in a response as the `since` param in exactly one request
whose response the client will process. Clients must be careful to ensure that when processing a response any new
requests use the new `pos`, and any in-flight requests using an old `pos` are canceled.

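A minimal sketch of a compliant client loop, assuming a hypothetical `sendSlidingSync` helper that POSTs the body and a
`processResponse` callback. It shows each `pos` being consumed as `since` exactly once, with stale in-flight requests
aborted:

```javascript
let pos;        // undefined on the first (initial) request
let inflight;   // AbortController for the current in-flight request

async function syncOnce(body) {
  // Never have two requests in flight: cancel the old one at the HTTP
  // level and make sure its response is never processed.
  inflight?.abort();
  inflight = new AbortController();

  const res = await sendSlidingSync(
    { ...body, conn_id: "main" },   // one sync loop == one conn_id
    { since: pos, signal: inflight.signal },
  );

  processResponse(res);  // apply updates before advancing the token
  pos = res.pos;         // this pos is used as `since` exactly once
}
```
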
The server cannot assume that a client has received a response until it receives a new request with the `since` token
set to the `pos` it returned in that response. The server must ensure that any per-connection state it tracks correctly
handles receiving multiple requests with the same `since` token (e.g. when the client retries the request, or decides
to cancel and resend a request with different parameters).

A server may decide to “expire” connections, either to free resources or because the server thinks it would be faster
for the client to start from scratch (e.g. because there are many updates to send down). This is done by responding
with a 400 HTTP status and an error code of `M_UNKNOWN_POS`.

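Using the standard Matrix error shape, that response might look like the following (the `error` message text is
illustrative). On receiving it, the client restarts the connection with an initial request, i.e. without a `since`
token:

```javascript
// HTTP 400
{
  "errcode": "M_UNKNOWN_POS",
  "error": "Unknown or expired `pos`; start again without a `since` token"
}
```
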
## List configuration

**TODO**, these are the same as in [MSC3575](https://github.com/matrix-org/matrix-spec-proposals/pull/3575):

- Required state format
- The filters
- Lazy loading of members
- Combining room config

## Room config changes

When a room comes in and out of different lists or subscriptions, the effective `timeline_limit` and `required_state`
parameters may change. This section outlines how the server should handle these cases.

If the `timeline_limit` *increases* then the server *may* choose to send down more historic data. This is to support
the ability to get more history for certain rooms, e.g. when subscribing to the currently visible rooms in the list to
precache their history. This is done by setting `unstable_expanded_timeline` to true and sending down the last N events
(which may include events that have already been sent down). The server may choose not to do this if it believes it has
already sent down the appropriate number of events.

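For illustration, a room entry whose effective `timeline_limit` grew might be flagged like this (field names other than
`unstable_expanded_timeline` are assumptions):

```javascript
// Sketch: the flag tells the client to replace its current timeline chunk
// for this room with the events below, which may repeat events already sent.
"!abc:example.com": {
  "unstable_expanded_timeline": true,
  "timeline": [ ... ]
}
```
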
If new entries are added to `required_state` then the server must send down matching current state events.

## Extensions

> My understanding is that this MSC doesn't have any support for timeline filtering. The SS MSC called this out
> explicitly, and it would probably be good to do that here as well. Timeline filtering on `/v3/sync` is used heavily
> by some bot and tool clients.

We anticipate that as more features land in Matrix, different kinds of data will also want to be synced to clients.
Sync v2 did not have any first-class support to opt in to new data. Sliding sync does have support for this via
“extensions”. Extensions also allow this proposal to be broken up into more manageable sections. Extensions are
requested by the client in a dedicated extensions block.

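Sketched below with a hypothetical extension name; the real names and per-extension options are defined in the MSCs
listed after this paragraph, and the `enabled` flag is an assumption here:

```javascript
{
  "lists": { ... },
  // Dedicated extensions block; each key opts in to one extension.
  "extensions": {
    // Hypothetical name and options, for illustration only.
    "org.example.receipts": {
      "enabled": true
    }
  }
}
```
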
In an effort to reduce the size of this proposal, extensions will be specified in separate MSCs. There will be
extensions for:

- To-Device Messaging - MSC3885
- End-to-End Encryption - MSC3884
- Typing Notifications - MSC3961
- Receipts - MSC3960
- Presence - presence in sync v2: spec
- Account Data - account_data in sync v2: MSC3959
- Threads

**TODO** explain how these interact with the room lists; this is the same as in
[MSC3575](https://github.com/matrix-org/matrix-spec-proposals/pull/3575).

## Request format

```javascript
{
  "conn_id": "<conn_id>", // Client-chosen ID of the connection, c.f. "Connections"

  // The set of room lists
  "lists": {
    // An arbitrary string which the client is using to refer to this list for this connection.
    "<list-name>": {
      // Sliding window ranges, c.f. "Lists and room subscriptions"
      "ranges": [[0, 10]],

      // Filters to apply to the list.
      "filters": {
        // Flag which only returns rooms present (or not) in the DM section of account data.
        // If unset, both DM rooms and non-DM rooms are returned. If false, only non-DM rooms
        // are returned. If true, only DM rooms are returned.
        "is_dm": true|false|null,

        // Flag which only returns rooms which have an `m.room.encryption` state event. If unset,
        // both encrypted and unencrypted rooms are returned. If false, only unencrypted rooms
        // are returned. If true, only encrypted rooms are returned.
        "is_encrypted": true|false|null,

        // Flag which only returns rooms the user is currently invited to. If unset, both invited
        // and joined rooms are returned. If false, no invited rooms are returned. If true, only
        // invited rooms are returned.
        "is_invite": true|false|null,

        // If specified, only rooms where the `m.room.create` event has a `type` matching one
        // of the strings in this array will be returned. If this field is unset, all rooms are
        // returned regardless of type. This can be used to get the initial set of spaces for an account.
        // For rooms which do not have a room type, use 'null' to include them.
        "room_types": [ ... ],

        // Same as "room_types" but inverted. This can be used to filter out spaces from the room list.
        // If a type is in both room_types and not_room_types, then not_room_types wins and they are
        // not included in the result.
        "not_room_types": [ ... ],
      },

      // The maximum number of timeline events to return per response.
      "timeline_limit": 10,

      // Required state for each room returned. An array of event type and state key tuples.
      // Elements in this array are ORed together to produce the final set of state events
      // to return. One unique exception is when you request all state events via ["*", "*"]. When used,
      // all state events are returned by default, and additional entries FILTER OUT the returned set
      // of state events. These additional entries cannot use '*' themselves.
      // For example, ["*", "*"], ["m.room.member", "@alice:example.com"] will _exclude_ every m.room.member
      // event _except_ for @alice:example.com, and include every other state event.
      // In addition, ["*", "*"], ["m.space.child", "*"] is an error: the m.space.child filter is not
      // required as it would have been returned anyway.
      "required_state": [ ... ],
    }
  },

  // The set of room subscriptions
  "room_subscriptions": {
    // The key is the room ID to subscribe to. Each subscription takes the same
    // per-room config as a list entry (e.g. "timeline_limit", "required_state").
    "!foo:example.com": {
      ...
    }
  }
}
```

A suggested addition to the proposal:

> The server MUST ensure that the user has permission to see any information the server returns. However, the user
> need not be in the room, e.g. clients can specify room IDs for world-readable rooms and they would be returned.

This doesn't completely address the other questions though.

The `world_readable` part of the MSC was just removed as well. Everything needs clarification again.

I think there was an effort to standardize pagination terminology?

Currently the SCT is leaning towards: `from` and `next_batch` instead of `pos` everywhere.

Having thought about it a little more, I'm not sure that `next_batch` as a name really fits with how the API works. It
does match pagination APIs, which is a plus, but I'm not totally convinced about it.

For reference, the relevant section of the spec is here. It's worth noting that the same section of the spec mandates
that `next_batch` be a query parameter, whilst apparently we have deliberately made `pos` be a body parameter, for CORS
reasons.

More generally, I feel that this is sufficiently different from regular pagination that the arguments for using the
same terminology are weak. In essence: with "pagination" the expectation is that the client is working through a
relatively static set of data; here the reason that the server did not return more data was not that there was too much
of it, but rather that the data did not exist yet.

Considering a change to `from` and `next_batch`: I am -0. I'd not block such a change, but I don't think it's
necessary.

Do the `heroes` membership state events need to be included in the `required_state` response when requesting
`required_state: ["m.room.member", "$LAZY"]` (lazy-loading room members)?

Sync v2 says this:

> When lazy-loading room members is enabled, the membership events for the heroes MUST be included in the `state`,
> unless they are redundant. When the list of users changes, the server notifies the client by sending a fresh list of
> heroes. If there are no changes since the last sync, this field may be omitted.

But I think that's because `m.heroes` in Sync v2 is only a list of user IDs. Whereas, in the sliding sync response
here, we already have all of the info necessary in these stripped events.

One alternative is to match Sync v2 and only list the user IDs in `heroes` and include the membership events in
`required_state`.

One weird corner with the current spec: if the room doesn't have a name set and the user is invited to or knocking on
the room, we don't have any `heroes` to provide here. This is because `heroes` membership isn't included in the
stripped state that the server receives when they are invited/knocked over federation.

Related to matrix-org/matrix-spec#380

This was solved for partial joins via MSC3943

Not necessarily a blocker, but just calling out something that won't work completely until that spec issue is solved.

Cross-link #4319 (comment)

There should also be a `knock_servers` field, copying #4233

A suggested change to the `bump_stamp` description: replace

> // An opaque integer that can be used to sort the rooms by "Bump Stamp"

with

> // An opaque unsigned 64-bit integer that can be used to sort the rooms by "Bump Stamp"

We might also specify that `bump_stamp` has a total order.

u64 would be a problem, or at least an inconvenience, for JavaScript clients. Should this be an unsigned 53-bit integer
instead?

Should unread and highlight count even be here? Notification/highlight counts can't even be done correctly for
encrypted rooms, and it's just extra work for us to calculate if clients ignore our flawed values anyway.

We've just left them as dummy values in Synapse, see
https://github.com/element-hq/synapse/blob/0e3c0aeee833e52121b3167de486dff34018ab27/synapse/handlers/sliding_sync/__init__.py#L1332-L1336

Related discussion: element-hq/synapse#17546 (comment)

But they can be done in unencrypted rooms, so would save a bunch of processing time?

I believe it is misleading to keep these values here. A client has to compute these values for encrypted rooms anyway.
There are no extra exotic costs to do it for unencrypted rooms client-side. It would save computation for the server
though, which is a good thing.

Consistent responses. No misleading values that can be misused. I think removing these from the server is a good thing.

Conversation also continued at #4186 (comment)

We should also describe how to handle the scenario where you're requesting rooms in the range `0..20` for an
incremental sync and there are more than 20 rooms with updates:

Clients can tell whether there is more to paginate if they compare the `bump_stamp` of the last room in the window to
what they previously saw. They would need to expand the window until they see an unchanged room or a room they haven't
seen before. Instead of using `bump_stamp`, clients could use the last `event_id` in the timeline to compare, if they
want to make sure they've seen all events and not just the bumped things.

Related info added to the MSC (proposals/4186-simplified-sliding-sync.md, lines 560 to 561 in 985e71f):

> The client can keep increasing the list range in increments to pull in the full list of rooms. The client uses the
> returned `count` for the list to know when to stop expanding the list.

This sounds a lot simpler. Why do clients want to backfill multiple rooms at once? Would adding a flag to `/messages`
to not backfill be useful?

I think this is a UX perf thing: when a user opens the app they will likely click one of the top 10 or 20 rooms, so
having those 10 or 20 rooms have pre-fetched history as quickly as possible reduces the number of times that a user
will see a blank room when opened.

It's hard for e.g. web to do multiple concurrent `/messages` requests due to max in-flight request limits. Though
certainly adding a flag to `/messages` to make it return quicker would help a bunch (and I think is being proposed in
an MSC somewhere already).

👍 Previous discussion: #4186 (comment)

> Would adding a flag to `/messages` to not backfill be useful?

Yes, although it may be slightly unrelated to Sliding Sync itself. I think the original point of this comment is just
trying to hint at being able to get more messages without delay (from backfilling events over federation).

👍 This also plays nicely with other ideas being explored in MSC3871 (gaps) and MSC4282 (conditional backfill in
`/messages`), see #4282 (comment) which points to this idea:

> For the hack day, I'll be tackling the backfill problem with this strategy:
>
> - Default to fast responses with gaps: As a default, we can always respond quickly and indicate gaps (MSC3871) for
>   clients to paginate at their leisure.
> - Fast back-pagination: Clients back-paginate with `/messages?dir=b&backfill=false`, and Synapse skips backfilling
>   entirely, returning only local history with gaps as necessary.
> - Explicit gap filling: To fill in gaps, clients use `/messages?dir=b&backfill=true` which works just like today to
>   do a best-effort backfill.
>
> This allows the client to back-paginate the history we already have without delay, and fill in the gaps as they see
> fit.
>
> Gaps can be represented in the UI with a message like "We failed to get some messages in this gap, try again 🗘.",
> giving users clear feedback. Regardless of clients trying to fill in gaps automatically, I would still suggest
> displaying gaps so people can tell what's happening.
>
> This is basically a simplified version of MSC4282 leveraging MSC3871: Gappy timelines to get proper client feedback
> to indicate where the gaps are so we can skip backfill without worrying. For reference, skipping backfill without
> letting clients know where the gaps are just means they won't ever know that they are missing messages.
>
> -- @MadLittleMods, https://github.com/element-hq/how-hard-can-it-be-2025/issues/47#issuecomment-3234339497

Connection exhaustion makes sense, and I guess having a separate API that is bulk `/messages` is essentially this API
so...

Also see #4186 (comment) which describes another alternative for removing `expanded_timeline`. We don't even need to
add a new bulk `/messages` endpoint. The client can just do another initial Sliding Sync request with the increased
`timeline_limit` to get all of the messages that they want.

> Would adding a flag to `/messages` to not backfill be useful?

#4282 discusses this idea, fwiw.

As described in element-hq/synapse#17579 (comment), another alternative would be to just use another initial sync
request with an increased `timeline_limit`. The linked comment describes more details for client implementations as
well.

Cross-link to the main discussion about moving away from `expanded_timeline` -> #4186 (comment)

This sounds like it needs to be incorporated into the MSC? What are the downsides of this? It seems fairly elegant.

I've added it as an alternative. I think the main downsides are: a) duplication of data (e.g. state), and b) it only
supports the one use case.

> a) duplication of data (e.g. state)

The goal is to get `timeline` events, so you would just make an initial Sliding Sync request without any
`required_state`.

Example initial sync request body:

```javascript
{
  "conn_id": "fetch_timeline",
  "lists": {
    "foo-list": {
      "ranges": [[0, 99]],
      "required_state": [],
      "timeline_limit": 20,
      "filters": {
        "is_dm": true
      }
    }
  },
  "room_subscriptions": {
    "!sub1:bar": {
      "required_state": [],
      "timeline_limit": 20
    }
  }
}
```

> only supports the one use case.

What are the other use cases not covered? `expanded_timeline`'s goal is to get more `timeline` messages.

**Seamless timeline**

But doing an initial sync later means having to try and tie timelines together, e.g. if you do an initial sync for a
room at the same time as another event comes down your sync loop, you need to make sure that that event does appear in
the timeline and doesn't get "overwritten" by the initial sync (which may not include it). This doesn't feel
particularly feasible to do reliably.

I already have notes about how to do this in element-hq/synapse#17579 (comment). While this describes a potential
implementation for Element X, which doesn't have offline support and throws away events, I think the same thing works
generally:

> For the timeline stitching logic, the client can store the latest event in the timeline before our initial sync,
> then find that event in the initial sync `timeline` events and spread backwards from that point. That way, the
> ongoing sync loop can still append to the end of the timeline and continue seamlessly.
>
> So if we have a timeline `[103]` already on the client, we store `latest_event_id = 103`, do our initial sync which
> returns `[100, 101, 102, 103, 104]`, and we splice/spread in only `[100, 101, 102, 103]` accordingly (drop any new
> events after the `latest_event_id` from the initial sync response). This makes it so that even if the ongoing sync
> loop sends `104` before or after our initial sync does, it still appends like normal and everything is in seamless
> order.
>
> If there are so many new messages sent in the time between us storing the `latest_event_id` and the initial sync
> responding that we now have a gap, we can just throw away our initial sync events because we have enough events to
> fill up the timeline just from our normal ongoing sync loop.

To be clear, the client doesn't need to be fancy about stitching:

If the client had more timeline like `[98, 99, 100, 101, 102, 103]`, we store `latest_event_id = 103` and start the
initial sync; our ongoing sync loop races us and returns `104`, which makes our timeline look like
`[98, 99, 100, 101, 102, 103, 104]`. Then our initial sync responds with `[100, 101, 102, 103, 104]`; we find the
`103` spot in the response to slice at and place it at the `103` spot in the client timeline, leaving us with
`[100, 101, 102, 103, 104]`.

Pseudo code:

```python
latest_event_id = 103

# do initial sync request
initial_sync_timeline = [100, 101, 102, 103, 104]

# Skip if we can't find the `latest_event_id` in the response.
# This means there have been so many messages sent between the time we
# initially made the initial sync and the response that this is no longer
# relevant. We already have enough events to fill up the timeline from the
# normal ongoing sync loop.
if latest_event_id not in initial_sync_timeline:
    return

event_index_in_response = initial_sync_timeline.index(latest_event_id)
event_index_in_client_timeline = client_timeline.index(latest_event_id)

# Update the timeline: take everything up to and including `latest_event_id`
# from the initial sync response, then everything after it from the existing
# client timeline.
client_timeline = (
    initial_sync_timeline[: event_index_in_response + 1]
    + client_timeline[event_index_in_client_timeline + 1 :]
)
```

By the nature of how `expanded_timeline` works ("send down the latest N events, even if some of those events have
previously been sent"), duplicate events are an expected scenario for clients to handle. So handling seamless timelines
is a pre-requisite for clients either way.

If you're referring to the problems that come from differing orders from `/sync` (`stream_ordering`) and `/messages`
(topological ordering), this is the exact same problem regardless of how we accomplish things (especially for
offline-first clients that persist events).

**`expanded_timeline` complication**

Are there any major downsides to timeline expansion? There is a lot of talk about how you could get nearly the same
sort of effect via other mechanisms, but I don't see many concrete arguments against. Certainly it does complicate
servers, but clients can entirely opt out of using the mechanism.

Basically, these same arguments were made in favor of timeline trickling even when it was twice as bad, trying to
overload the `limited` and `initial` fields. We now have a dedicated `expanded_timeline` field to turn this behavior
from evil to baseline acceptable.

But we also have alternatives to avoid the complication altogether.

Clients can't opt out. With `expanded_timeline`, a client who wants to increase `timeline_limit` for whatever reason
must take into account `expanded_timeline` to know whether they are seeing historical messages (it is no longer
possible to be a dumb naive client).

Even after re-reading the text of the MSC a few times, I don't understand the goal of the `expanded_timeline` flag;
does it mean that there might be duplicates? As noted in a previous comment in this thread, clients have to handle
duplicates anyway (because of the `timeline_limit` increasing use case to start with, and then to deduplicate results
from `/messages` which used previously persisted pagination tokens). Maybe the `expanded_timeline` flag is a hint?

Overall, I'd love to see more guidance for clients with respect to when it is better to expand the `timeline_limit`,
and when it is better to use `room/ROOM_ID/messages` for back-pagination. For instance, if a client wants to compute
the unread status of a room, then, because of e2ee rooms, they may have to back-paginate up to the previous read
receipt target event (if there's a gap in the timeline). In this case: should we rather use `room/ROOM_ID/messages`
(and send one request for each room), or use custom lists with an expanded `timeline_limit` (the naivest way being
increasing the `timeline_limit` for all rooms)?

(I want to note that a cross-room `/messages` endpoint, which could take multiple per-room pagination tokens, would be
a nice addition, but I do understand it's a big change and would rather see it become its own MSC.)

> For the timeline stitching logic, the client can store the latest event in the timeline before our initial sync,
> then find that event in the initial sync `timeline` events and spread backwards from that point. That way, the
> ongoing sync loop can still append to the end of the timeline and continue seamlessly.

That's not safe though, given that sync orderings can be different depending on which worker you hit.

> If you're referring to the problems that come from differing orders from `/sync` (`stream_ordering`) and `/messages`
> (topological ordering), this is the exact same problem regardless of how we accomplish things (especially for
> offline-first clients that persist events).

That's why we have tokens, and clients should be using both `from` and `to` tokens in `/messages` to ensure that they
get all the events within gaps. Once you start trying to overlay different orderings (via multiple `/sync`) you're
vulnerable to subtle races and missing events.

Though there is a flaw here in that we don't have a good way of handling clients that cache events from previous sync
connections. Ideally they'd deduplicate any events and always call `/messages` even if the timelines overlapped.

The expanded timeline and double initial sync both work as solutions because the timelines get fully replaced and
future syncs will always return deltas based on what you got. Doing an initial sync afterwards means you have a chunk
of timeline that future syncs are not returning deltas from.

> Clients can't opt out. With `expanded_timeline`, a client who wants to increase `timeline_limit` for whatever reason
> must take into account `expanded_timeline` to know whether they are seeing historical messages (it is no longer
> possible to be a dumb naive client).

That's fair for clients that use multiple timeline limits, though if we don't expand the timeline limit then it is
also a bit odd: the client isn't getting the data it might expect in all cases. Either way, if clients don't want to
deal with the "complexity" of the `expanded_timeline` flag then they can avoid using multiple different timeline
limits. They can either do the double initial sync, or wait for potential future MSCs to allow bulk `/messages`.

> Basically, these same arguments were made in favor of timeline trickling even when it was twice as bad, trying to
> overload the `limited` and `initial` fields. We now have a dedicated `expanded_timeline` field to turn this behavior
> from evil to baseline acceptable. But we also have alternatives to avoid the complication altogether.

That linked thread is about the fact that the proxy set `initial=True` and Synapse used `limited=True`, and it was
resolved by introducing the `expanded_timeline` flag. Yes, the fact that the proxy used `initial=True` (which we think
was a bug?) was pretty bad, but I wouldn't have used the phrase "evil".

Still, as I see it there are two main issues with the current proposal:

- You duplicate some of the events. This is annoying, and it'd be nice if we could instead change the API shape so
  that sync could say "these events go at the back of this timeline chunk". This is still a problem with using
  multiple initial syncs.
- Potential complexity for clients. This is a valid concern, though conceptually it's relatively straightforward: if
  you see `expanded_timeline` set you replace (rather than add to) the current timeline chunk with what is returned.
  However this obviously may not be as straightforward in practice depending on how the clients are designed.

But as you say, we have a baseline acceptable solution that is implemented and tested. We have some alternatives,
though all with some downsides (e.g. doesn't support all use cases, is racy, or isn't specced/implemented yet). From
my PoV it feels like the right thing to do is accept what we have and land it; the alternatives are stripping out the
feature we have or further indefinitely delaying the MSC.

> Even after re-reading the text of the MSC a few times, I don't understand the goal of the `expanded_timeline` flag;
> does it mean that there might be duplicates? ... Maybe the `expanded_timeline` flag is a hint?

The idea is that you can tell the server to (try and) ensure the client has an appropriate amount of history
automatically. The main example use is to have it so that the top 20 rooms automatically have more history, so that
when users open a room they'll always have a screenful of messages, while only fetching one or two messages for the
older rooms which are less likely to be opened. This behaviour can be mimicked by doing `/messages`, though that has
generally been slower to call (especially if you try and do 20 rooms at once), so the main advantage is speed.

The `expanded_timeline` flag is there to tell clients they should replace the current timeline chunk with what has come
down sync, essentially. For example, if a room has three events `[A, B, C]` and initially the room is in a list with
`timeline_limit=1`, then the client would receive just `[C]`; if a new event `D` came in and caused it to enter a list
with `timeline_limit=20`, then the server would return the room with `expanded_timeline=true` and events
`[A, B, C, D]`.

> Overall, I'd love to see more guidance for clients with respect to when it is better to expand the `timeline_limit`,
> and when it is better to use `room/ROOM_ID/messages` for back-pagination. ...

Yeah, I think the guidance would basically be to use different `timeline_limit`s with different lists, but otherwise if
you want to paginate you should be using `/messages`. While you could keep increasing the `timeline_limit`, you'd
repeatedly get the same events down after each increase.

Please kill `unstable_expanded_timeline` with one of these alternatives. It was implemented in Simplified Sliding Sync
solely to match what the Sliding Sync proxy did, and the proxy behavior was a bug, not a feature (self-described).

Previous conversation:

Cross-link to another alternative about using another initial sync request with an increased `timeline_limit` ->
#4186 (comment)

> And the proxy behavior was a bug, not a feature (self-described)

Worth noting that the timeline trickling was a desired feature, albeit one that was regretted. The "bug" there was that
the proxy set `initial: true` when it shouldn't have.

We can also add a flag to disable attempting to backfill over pagination (to match the behaviour of the sync timeline).

👍 This also plays nicely with other ideas being explored in MSC3871 (gaps) and MSC4282 (conditional backfill in
`/messages`), see #4282 (comment)

Related conversation: #4186 (comment)

This should be expanded to explain what this means.

In terms of state:

- `join`: Access to all of the room state.
- `invite`/`knock`: Can only derive state from the stripped state events in the event itself (`invite_room_state`,
  `knock_room_state`)
- `leave`/`ban`: Depends on the previous membership.
  - If the previous membership was `join`, access to the state at the time of the `leave`/`ban` (historical state)
  - If the previous membership was `invite`, fall back to the stripped state on the invite event
  - If the previous membership was `leave`/`ban`, repeat this logic
  - If no previous membership, no access to anything.

In terms of timeline events, this probably depends solely on `m.room.history_visibility`

I think this information is in the "Room results" section.

I think the information is there now for the most part ✅

We have to assume that timeline events adhere to normal history visibility rules.

Implementation requirements:

Current implementations to my knowledge:

**Synapse server implementation**

Maybe it's easiest to point to the main directories where the code lives:

- element-hq/synapse -> `synapse/rest/client/sync.py#L782`
- element-hq/synapse -> `synapse/handlers/sliding_sync/`
- element-hq/synapse -> `tests/rest/client/sliding_sync/`

The PRs between @MadLittleMods and @erikjohnston with the ~A-Sync label are also a pretty good encapsulation.

conduwuit has implemented simplified sliding sync in https://github.com/girlbossceo/conduwuit/pull/666