# MSC4248: Pull-based presence

Currently, presence in Matrix imposes a considerable burden on all participating servers.
Matrix presence works by having the client notify its homeserver when a user changes their
presence (online, unavailable, or offline). The homeserver then delivers this information
to every server that might be interested, as described in the
[specification's presence section](https://spec.matrix.org/v1.13/server-server-api/#presence).

However, this approach is highly inefficient and wasteful, requiring significant resources
from all involved parties. Many servers have therefore disabled federated presence, and many
clients have consequently chosen not to implement presence at all.

This MSC proposes a new pull-based model for presence that replaces the current "push-based"
EDU presence mechanism. The aim is to save bandwidth and CPU usage for all servers, and to
reduce superfluous data exchanged between uninterested servers and clients.

## Proposal

Today, when a user's presence is updated, their homeserver receives the update and decides
which remote servers might need it. It then sends an EDU to those servers. Each remote
server processes and relays the data to its interested clients. This creates substantial
bandwidth usage and duplication of effort.

In contrast, this MSC suggests a pull-based approach:

1. When the user updates their presence, their homeserver stores the new status without
   pushing it to other servers.
2. Other servers periodically query that homeserver for presence updates, in bulk, for the
   users they track.
3. The homeserver returns only presence information that has changed since the last query.

Clients continue to request presence as before (e.g. `/sync` and
`/presence/{userId}/status`). No client-side changes are strictly required.

Servers instead calculate which users they are interested in and query the homeservers of
those users at intervals. The new proposed federation endpoint is
`/federation/v1/query/presence`. This allows servers to request presence data in bulk for
the relevant users on that homeserver.

### New flow

1. User 1 updates their presence on server A.
2. Server A stores the new presence and timestamp.
3. Server B queries server A about users 1, 2, and 3, including the time it last observed
   their presence changes.
4. Server A checks its data for these users and responds only with updated presence info.
5. Server B updates its local records and informs any interested clients.
6. Server B repeats the query at the next interval.

By pulling presence only when needed, each server can maintain accurate user status without
excessive data broadcasts. This is significantly more efficient than pushing updates to
every server that might be interested.
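The flow above, as seen from the querying side (server B), could be sketched as follows. This is an illustrative Python sketch, not part of the proposal: the `post` callback, `last_seen` store, and `notify_local_clients` helper are all hypothetical, and it assumes the querying server records its own observation time against each user.

```python
import time

# Hypothetical in-memory state on server B: the Unix-ms timestamp at which we
# last observed each tracked user's presence change; 0 means "never seen".
last_seen = {
    "@user1:server.a": 1735324578000,
    "@user2:server.a": 0,
}


def poll_presence(post):
    """One poll cycle against server A.

    `post` stands in for a signed federation POST to
    /federation/v1/query/presence on server A; it takes the request body
    (user ID -> timestamp) and returns the decoded JSON response.
    """
    updates = post(dict(last_seen))
    now_ms = int(time.time() * 1000)
    for user_id, presence in updates.items():
        # Record when we observed the change, then fan out locally.
        last_seen[user_id] = now_ms
        notify_local_clients(user_id, presence)


def notify_local_clients(user_id, presence):
    # Placeholder for delivering the update to interested local clients.
    print(user_id, "is now", presence.get("presence"))
```

A scheduler would then invoke `poll_presence` once per configured interval for each remote homeserver of interest.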

#### New federation endpoint: `/federation/v1/query/presence`

**Servers must implement:**

`POST /federation/v1/query/presence`

**Request body example:**

```json
{
  "@user1:server.a": 1735324578000,
  "@user2:server.a": 0
}
```

Here, `@user1:server.a` was last updated at `1735324578000` (Unix milliseconds) as seen by
the querying server. For `@user2:server.a`, the querying server has no stored timestamp.

Homeservers **must not** proxy requests for presence: only users on the homeserver being
queried should appear in the request. Likewise, the responding server must only provide
presence data for its own users.
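A responding server would enforce this restriction before answering. A minimal sketch, with a hypothetical `LOCAL_SERVER_NAME` constant (the MSC does not prescribe how validation is implemented):

```python
LOCAL_SERVER_NAME = "server.a"  # hypothetical: the responding server's own name


def validate_query(body):
    """Reject queries that mention users this server is not authoritative for,
    per the 'no proxying' rule, and sanity-check the supplied timestamps."""
    for user_id, last_ts in body.items():
        if not user_id.endswith(":" + LOCAL_SERVER_NAME):
            raise ValueError("not a local user: " + user_id)
        if not isinstance(last_ts, int) or last_ts < 0:
            raise ValueError("invalid timestamp for " + user_id)
    return body
```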

#### 200 OK response

If successful, the response is a JSON object mapping user IDs to
[`m.presence` data](https://spec.matrix.org/v1.13/client-server-api/#mpresence). For example:

```json
{
  "@user1:server.a": {
    "presence": "online",
    "last_active_ago": 300
  },
  "@user2:server.a": {
    "presence": "unavailable",
    "status_msg": "Busy, try again in 5 minutes",
    "last_active_ago": 0
  }
}
```

Users whose presence has not changed since the last time the querying server checked should
not appear in the response. An empty response body is valid if no updates exist.
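On the responding side, this filtering amounts to comparing stored change timestamps against the ones the caller supplied. A sketch with a hypothetical `presence_store` (the storage layout is an assumption, not specified by the MSC):

```python
# Hypothetical store on server A: for each local user, the m.presence content
# plus the Unix-ms timestamp of its last change.
presence_store = {
    "@user1:server.a": {"changed_at": 1735324578000,
                        "content": {"presence": "online", "last_active_ago": 300}},
    "@user2:server.a": {"changed_at": 1700000000000,
                        "content": {"presence": "unavailable", "last_active_ago": 0}},
}


def answer_query(body):
    """Return only users whose presence changed after the caller's timestamp.

    A timestamp of 0 means the caller has never seen this user's presence,
    so any stored presence counts as new. An empty dict is a valid answer.
    """
    response = {}
    for user_id, last_ts in body.items():
        entry = presence_store.get(user_id)
        if entry is not None and entry["changed_at"] > last_ts:
            response[user_id] = entry["content"]
    return response
```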

#### 403 Forbidden response

If the remote server does not federate presence or explicitly blocks the querying server, it
should respond with
[HTTP 403 Forbidden](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403):

```json
{
  "errcode": "M_FORBIDDEN",
  "error": "Federation disabled for presence",
  "reason": "This server does not federate presence information"
}
```

#### 413 Content too large response

To avoid large payloads and timeouts, servers should cap the number of presence queries in a
single request. A recommended default limit is 500 users. If a request exceeds this limit,
respond with [HTTP 413 Payload Too Large](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/413):

```json
{
  "errcode": "M_TOO_LARGE",
  "error": "Too many users requested",
  "max_users": 500
}
```
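A querying server that tracks more users on one homeserver than the responder's cap can split its query into batches, learning the limit from the `max_users` field of a 413 response. A sketch (the batching helpers are illustrative, not mandated):

```python
def chunked(items, size):
    """Split a dict of user_id -> timestamp into dicts of at most `size` entries."""
    batch = {}
    for user_id, ts in items.items():
        batch[user_id] = ts
        if len(batch) == size:
            yield batch
            batch = {}
    if batch:
        yield batch


def query_in_batches(tracked, post, limit=500):
    """Merge the responses of several capped /query/presence calls.

    `post` stands in for one federation POST; `limit` defaults to the
    recommended 500-user cap above.
    """
    merged = {}
    for batch in chunked(tracked, limit):
        merged.update(post(batch))
    return merged
```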

## Potential issues

1. **Stale data**: If a server's polling interval is long, clients may see outdated status.
   However, this trade-off is often preferable to constant pushing of updates, which wastes
   bandwidth and CPU.
2. **Performance bursts**: Polling in bulk might cause periodic spikes in traffic. In
   practice, scheduling queries reduces overhead compared to perpetual push notifications.
3. **Server downtime**: If a homeserver is unavailable, remote servers cannot retrieve
   updates. This is still simpler to handle than a push-based system that continually
   retries.
4. **Partial coverage**: Each server must poll multiple homeservers if users span many
   domains. This is still more controlled than blindly receiving all presence EDUs from
   across the federation.
5. **Implementation complexity**: Homeservers must track timestamps for each user's presence
   changes. Despite this, the overall load and bandwidth consumption should be lower than
   with the push-based approach.

## Alternatives

1. **Optimising push-based EDUs**: Servers could throttle or batch outgoing presence. While
   this reduces the raw volume of messages, uninterested servers might still receive
   unwanted data.
2. **Hybrid push-pull**: Pushing for high-profile users while polling for others can reduce
   traffic, but complicates implementation. It also risks partially reverting to old,
   inefficient patterns.
3. **Deprecating presence**: Servers could disable presence entirely. This has already
   happened in some deployments, but removes a key real-time user activity feature.
4. **Posting presence in rooms**: Embedding presence as timeline events could leverage
   existing distribution. However, this would complicate large, high-traffic rooms and let
   presence be tracked indefinitely. The added data overhead and privacy impact are worse
   than poll-based federation for many use cases.

## Security considerations

1. **Data visibility**: Because presence can reveal user activity times, queries and
   responses must be restricted to legitimate servers. Proper ACLs and rate-limiting are
   advised.
2. **Query abuse**: A malicious server could repeatedly query for large user lists to track
   patterns or overload a homeserver. Bulk requests limit overhead more effectively than
   repeated push, but the server should still implement protections.
3. **Privacy**: Even pull-based presence shares user status and activity times. Operators
   should minimise leakage and evaluate whether presence is necessary for all users.
4. **Server authentication**: Proper federation checks remain critical to prevent
   impersonation or man-in-the-middle attacks.

## Unstable prefix

If this proposal is adopted prior to finalisation, implementers must ensure they can migrate
to the final version. This typically involves using `/unstable` endpoints and vendor
prefixes, as per [MSC2324](https://github.com/matrix-org/matrix-doc/pull/2324).

The vendor prefix for this MSC should be `uk.co.nexy7574.pull_based_presence`.

## Dependencies

This MSC does not depend on any other MSCs.
I wonder if we could instead have servers signal that they don't want presence updates (for those that turn it off), as well as not sending presence to servers we haven't recently interacted with (i.e. we don't have a message in the last 50 messages in a room's timeline).

I worry that making it pull-based would ruin performance, as you'll be dealing with a large incoming (continuous) request volume rather than the spurious outgoing burst.
For Synapse, it might be worthwhile to limit outgoing presence to the result of `select * from destinations where failure_ts is null;` (AKA servers we know are online)?
In this model, servers can respond with a 403 to indicate that they do not federate their presence, and remote servers should not request it again (at least, not for a very long time). This avoids the need for such signalling in the first place.

As for performance, the requests would not be continuous. Servers would configure how often they request presence, how long it's cached locally for, and so on, and as such would distribute the requests over time.

That "spurious outgoing burst" is more like a constant flat line for lower-end servers (often single-user or cloud-based), since they will be continuously sending presence updates to potentially tens of thousands of other servers, most of which will not be interested in the slightest. This is, as noted, a waste of bandwidth, CPU, and other resources, meaning it's usually futile for lower-end/smaller servers to enable it, and just a waste of resources for higher-end/larger servers.

At least with a pull-based model, the ability to bulk-fetch presence would be much lighter on the origin server than constantly hammering out new EDUs, especially when the homeserver can return as little as an empty object when there have been no presence changes.

I've yet to have a chance to test anything similar to this, but I know that my servers can handle hundreds of thousands of inbound federation requests per minute just fine. I'm sure a few thousand extra presence requests would do no harm (compared to the literally devastating effect of sending out thousands of EDUs instead).
This could indeed be an optimisation, but then what about when dead servers come back? They will have missed the previous presence updates, by the nature of EDUs. Pull-based presence would mean they can request it when they come back and have the most up-to-date presence immediately, rather than needing to wait for the next presence update.
Additionally, load-distribution-wise, presence doesn't need to be sent immediately: you could have a background task that does a small amount of concurrent pushing (i.e. try 10 servers at a time), instead of trying to send it to all servers at once?
The "small background task" doesn't scale in any desirable way here, and if we're scheduling outgoing sending, what's the point of even having the EDU anyway? Because at that point, there's even less sense of urgency in keeping presence up to date.