181 changes: 181 additions & 0 deletions proposals/4248-pull-based-presence.md

I wonder if we could instead have servers signal that they don't want presence updates (for those that turn it off), as well as not sending presence to servers we haven't recently interacted with (e.g. servers with no message in the last 50 messages of a room's timeline).

I worry that making it pull-based would ruin performance, as you'd be dealing with a large, continuous incoming request volume rather than the spurious outgoing burst.

For synapse, it might be worthwhile to limit outgoing presence to the result of `select * from destinations where failure_ts is null;` (i.e. servers we know are online)?

@nexy7574 (Author) · Dec 27, 2024

In this model, servers can respond with 403 to indicate that they do not federate their presence, and remote servers should not request it again (at least, not for a very long time). This obviates the need for a separate signalling mechanism in the first place.

As for performance, the requests would not be continuous. Servers would configure how often they request presence, how long it is cached locally, and so on, and would therefore distribute the requests over time.
That "spurious outgoing burst" is more like a constant flat line for lower-end servers (often single-user or cloud-based), since they continuously send presence updates to potentially tens of thousands of other servers, most of which are not interested in the slightest. This is, as noted, a waste of bandwidth, CPU, and other resources, meaning it is usually futile for lower-end/smaller servers to enable presence, and just a drain on higher-end/larger servers.

At least with a pull-based model, the ability to bulk-fetch presence would be much lighter on the origin server than constantly hammering out new EDUs, especially since the homeserver can return as little as an empty object when there have been no presence changes.

I have not yet had a chance to load-test anything similar to this, but I know that my servers handle hundreds of thousands of inbound federation requests per minute just fine. I'm sure a few thousand extra presence requests would do no harm, compared to the genuinely devastating effect of sending out thousands of EDUs instead.

@nexy7574 (Author)

> For synapse, it might be worthwhile to limit outgoing presence to the result of `select * from destinations where failure_ts is null;` (i.e. servers we know are online)?

This could indeed be an optimisation, but then what about when dead servers come back? By the nature of EDUs, they will have missed the earlier presence updates. Pull-based presence would mean they can request presence when they come back and immediately have the most up-to-date state, rather than needing to wait for the next presence update.

> What about dead servers coming back

Wouldn't this be solved by them being marked as online when they process a new PDU?

Additionally, for load distribution, presence doesn't need to be sent immediately: you could have a background task that does a small amount of concurrent pushing (e.g. trying 10 servers at a time), instead of trying to send to all servers at once?

@nexy7574 (Author)

The "small background task" doesn't scale in any desirable way here, and if we're scheduling outgoing sending anyway, what's the point of even having the EDU? At that point there's even less urgency around keeping presence up to date.

@@ -0,0 +1,181 @@
# MSC4248: Pull-based presence

_TODO: MSC number may change_

Currently, presence in Matrix imposes a considerable burden on all participating servers.
Matrix presence works by having the client notify its homeserver when a user changes their
presence (online, unavailable, or offline). The homeserver then delivers this information
to every server that might be interested, as described in the
[specification's presence section](https://spec.matrix.org/v1.13/server-server-api/#presence).

However, this approach is highly inefficient and wasteful, requiring significant resources
for all involved parties. Many servers have therefore disabled federated presence, and many
clients have consequently chosen not to implement presence at all.

This MSC proposes a new pull-based model for presence that replaces the current "push-based"
EDU presence mechanism. The aim is to save bandwidth and CPU usage for all servers, and to
reduce superfluous data exchanged between uninterested servers and clients.

## Proposal

Today, when a user's presence is updated, their homeserver receives the update and decides
which remote servers might need it. It then sends an EDU to those servers. Each remote
server processes and relays the data to its interested clients. This creates substantial
bandwidth usage and duplication of effort.

In contrast, this MSC suggests a pull-based approach:

1. When the user updates their presence, their homeserver stores the new status without
pushing it to other servers.
2. Other servers periodically query that homeserver for presence updates, in bulk, for the
users they track.
3. The homeserver returns only presence information that has changed since the last query.

Clients continue to request presence as before (e.g. `/sync` and
`/presence/{userId}/status`). No client-side changes are strictly required.

Servers instead calculate which users they are interested in and query the homeservers of
those users at intervals. The new proposed federation endpoint is
`/federation/v1/query/presence`. This allows servers to request presence data in bulk for
the relevant users on that homeserver.

### New flow

1. User 1 updates their presence on server A.
2. Server A stores the new presence and timestamp.
3. Server B queries server A about users 1, 2, and 3, including the time it last observed
their presence changes.
4. Server A checks its data for these users and responds only with updated presence info.
5. Server B updates its local records and informs any interested clients.
6. Server B repeats the query at the next interval.
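
Steps 3 and 5 above can be sketched as follows. This is a minimal, illustrative sketch, not a normative implementation; the helper names are hypothetical:

```python
def build_query(tracked_users, last_seen):
    """Step 3: map each tracked user ID to the timestamp of the last
    presence change this server observed (0 when none is stored)."""
    return {user: last_seen.get(user, 0) for user in tracked_users}

def merge_presence(cache, last_seen, response, now_ms):
    """Step 5: users absent from the response are unchanged, so only
    the returned entries are merged into local records."""
    for user, presence in response.items():
        cache[user] = presence
        last_seen[user] = now_ms
```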

By pulling presence only when needed, each server can maintain accurate user status without
excessive data broadcasts. This is significantly more efficient than pushing updates to
every server that might be interested.

#### New federation endpoint: `/federation/v1/query/presence`

**Servers must implement:**

`POST /federation/v1/query/presence`

**Request body example:**

```json
{
  "@user1:server.a": 1735324578000,
  "@user2:server.a": 0
}
```

Here, `@user1:server.a` was last updated at `1735324578000` (Unix milliseconds) as seen by
the querying server. For `@user2:server.a`, the querying server has no stored timestamp.

Homeservers **must not** proxy requests for presence: only users on the homeserver being
queried should appear in the request. Likewise, the responding server must only provide
presence data for its own users.
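
Because requests must not be proxied, a querying server would partition its tracked users by homeserver before issuing any requests. A minimal sketch (the function name is illustrative only):

```python
from collections import defaultdict

def group_by_homeserver(tracked):
    """Split tracked user IDs by server name so that each
    /query/presence request only names users local to its target."""
    by_server = defaultdict(dict)
    for user_id, last_ts in tracked.items():
        # The server name is everything after the first colon in an MXID.
        server_name = user_id.split(":", 1)[1]
        by_server[server_name][user_id] = last_ts
    return dict(by_server)
```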

#### 200 OK response

If successful, the response is a JSON object mapping user IDs to
[`m.presence` data](https://spec.matrix.org/v1.13/client-server-api/#mpresence). For example:

```json
{
  "@user1:server.a": {
    "presence": "online",
    "last_active_ago": 300
  },
  "@user2:server.a": {
    "presence": "unavailable",
    "status_msg": "Busy, try again in 5 minutes",
    "last_active_ago": 0
  }
}
```

Users whose presence has not changed since the last time the querying server checked should
not appear in the response. An empty response body is valid if no updates exist.
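
The responding server's filtering can be sketched as below, assuming it stores a `last_updated` timestamp alongside each user's presence (the storage shape is an assumption of this sketch, not part of the proposal):

```python
def changed_presence(store, query):
    """Return presence only for users whose state changed after the
    timestamp the querying server supplied; {} is a valid response."""
    response = {}
    for user_id, seen_ts in query.items():
        record = store.get(user_id)
        if record is not None and record["last_updated"] > seen_ts:
            response[user_id] = record["presence"]
    return response
```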

#### 403 Forbidden response

If the remote server does not federate presence or explicitly blocks the querying server, it
should respond with
[HTTP 403 Forbidden](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403):

```json
{
  "errcode": "M_FORBIDDEN",
  "error": "Federation disabled for presence",
  "reason": "This server does not federate presence information"
}
```
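
A querying server that receives a 403 should avoid retrying for a long time. A sketch of that bookkeeping (the one-week window is an arbitrary assumption of this sketch, not a value the proposal mandates):

```python
FORBIDDEN_BACKOFF_MS = 7 * 24 * 3600 * 1000  # assumed retry window: one week

def should_query(server_name, forbidden_at, now_ms):
    """Skip servers that previously answered 403 until the backoff
    window has elapsed."""
    blocked_at = forbidden_at.get(server_name)
    return blocked_at is None or now_ms - blocked_at >= FORBIDDEN_BACKOFF_MS
```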

#### 413 Content too large response

To avoid large payloads and timeouts, servers should cap the number of presence queries in a
single request. A recommended default limit is 500 users. If a request exceeds this limit,
respond with [HTTP 413 Payload Too Large](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/413):

```json
{
  "errcode": "M_TOO_LARGE",
  "error": "Too many users requested",
  "max_users": 500
}
```
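
To stay under the cap, a querying server can split a large query into batches before sending. A minimal sketch (the function name is illustrative):

```python
def chunk_query(query, max_users=500):
    """Split one presence query into request bodies that each respect
    the responding server's per-request user cap."""
    items = list(query.items())
    return [dict(items[i:i + max_users])
            for i in range(0, len(items), max_users)]
```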

## Potential issues

1. **Stale data**: If a server's polling interval is long, clients may see outdated status.
However, this trade-off is often preferable to constant pushing of updates, which wastes
bandwidth and CPU.
2. **Performance bursts**: Polling in bulk might cause periodic spikes in traffic. In
practice, scheduling queries reduces overhead compared to perpetual push notifications.
3. **Server downtime**: If a homeserver is unavailable, remote servers cannot retrieve
updates. This is still simpler to handle than a push-based system that continually retries.
4. **Partial coverage**: Each server must poll multiple homeservers if users span many
domains. This is still more controlled than blindly receiving all presence EDUs from
across the federation.
5. **Implementation complexity**: Homeservers must track timestamps for each user's presence
changes. Despite this, the overall load and bandwidth consumption should be lower than the
push-based approach.

## Alternatives

1. **Optimising push-based EDUs**: Servers could throttle or batch outgoing presence. While
it reduces the raw volume of messages, uninterested servers might still receive unwanted
data.
2. **Hybrid push-pull**: Pushing for high-profile users while polling for others can reduce
traffic but complicates implementation. It also risks partially reverting to old,
inefficient patterns.
3. **Deprecating presence**: Servers could disable presence entirely. This has already
happened in some deployments but removes a key real-time user activity feature.
4. **Posting presence in rooms**: Embedding presence as timeline events could leverage
existing distribution. However, this would complicate large, high-traffic rooms and let
presence be tracked indefinitely. The added data overhead and privacy impact are worse
than poll-based federation for many use cases.

## Security considerations

1. **Data visibility**: Because presence can reveal user activity times, queries and responses
must be restricted to legitimate servers. Proper ACLs and rate-limiting are advised.
2. **Query abuse**: A malicious server could repeatedly query for large user lists to track
patterns or overload a homeserver. Bulk requests limit overhead more effectively than
repeated push, but the server should still implement protections.
3. **Privacy**: Even pull-based presence shares user status and activity times. Operators
should minimise leakages and evaluate if presence is necessary for all users.
4. **Server authentication**: Proper federation checks remain critical to prevent
impersonation or man-in-the-middle attacks.

## Unstable prefix

If this proposal is adopted prior to finalisation, implementers must ensure they can migrate
to the final version. This typically involves using `/unstable` endpoints and vendor prefixes,
as per [MSC2324](https://github.com/matrix-org/matrix-doc/pull/2324).

The vendor prefix for this MSC should be `uk.co.nexy7574.pull_based_presence`.

## Dependencies

This MSC does not depend on any other MSCs.