MSC2326: Label based filtering #2326
| Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -0,0 +1,175 @@ | ||
| # Label based filtering | ||
|  | ||
| ## Problem | ||
|  | ||
| Rooms often contain overlapping conversations, which Matrix should help users | ||
| navigate. | ||
|  | ||
| ## Context | ||
|  | ||
| We already have the concept of 'Replies' to define which messages are responses | ||
| to which. [MSC1849](https://github.com/matrix-org/matrix-doc/pull/1849) | ||
| proposes extending this into a generic mechanism for defining threads which could (in | ||
| future) be paginated both depth-wise and breadth-wise. Meanwhile, | ||
| [MSC1198](https://github.com/matrix-org/matrix-doc/issues/1198) is an alternate | ||
| proposal for threading, which separates conversations into high-level "swim | ||
| lanes" with a new `POST /rooms/{roomId}/thread` API. | ||
|  | ||
| However, fully generic threading (which could be used to implement forum or | ||
| email style semantics) runs a risk of being overly complicated to specify and | ||
| implement and could result in feature creep. This is doubly true if you try to | ||
| implement retrospective threading (e.g. to allow moderators to split off | ||
| messages into their own thread, as you might do in a forum or to help manage | ||
| conversation in a busy chatroom). | ||
|  | ||
| Therefore, this is a simpler proposal to allow messages in a room to be filtered | ||
| based on a given label in order to give basic one-layer-deep threading | ||
| functionality. | ||
| Review comment: Still, the 'reply to an existing message' is an important use case. A message cannot be replied to unless it is already labelled. Regular users cannot add labels to messages they did not author. Should a (unique) label to otherwise unlabelled messages be required? | ||
| Review comment: Isn't that like replying to the message based on message ID? | ||
| Review comment: #1849 might be better suited for this kind of thing since a client could directly request the message that is referenced... Not sure though. | ||
| Review comment: @ptman I was thinking subsequent messages could reuse the existing label. Basically avoid nested threads. AFAICT #1849 proposes a strictly one-way relation (i.e. child points to parent). Wouldn't that make it expensive to list the thread starting from a particular message? Each subsequent message would have to be determined by reverse look-up. | ||
| Review comment: I personally prefer hierarchical threads, but either way, regardless of how the relationships are recorded, they can be shown flat, just like Apple Mail and Gmail do. | ||
| Review comment: Agreed, but I was under the impression that this proposal specifically aims for flat threads for the sake of simplicity. | ||
|  | ||
| ## Proposal | ||
|  | ||
| We let users specify an optional `m.labels` field on events. This field | ||
| lists freeform text labels: | ||
|  | ||
| ```json | ||
| { | ||
| // ... | ||
| "m.labels": [ "somelabel" ] | ||
| } | ||
| ``` | ||
|  | ||
| The labels are expected to be case-insensitive, so clients are expected to | ||
| lowercase them before sending them to servers. | ||
|  | ||
| Labels which are prefixed with # are expected to be user-visible and exposed to | ||
| the user by clients as a hashtag, letting the user filter their current room by | ||
| the various hashtags present within it. Labels which are not prefixed with # are | ||
| expected to be hidden from the user by clients (so that they can be used as | ||
| e.g. thread IDs bridged from another platform). | ||
|  | ||
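The two client-side conventions above (lowercasing, and treating `#`-prefixed labels as user-visible) could be sketched roughly as follows. The function names are hypothetical, and dropping duplicates after lowercasing is an assumption rather than something the proposal requires; the proposal also does not say which case-folding rules apply to non-ASCII text, so plain `toLowerCase()` is used here only for illustration.

```javascript
// Lowercase labels before sending them to the server.
// Assumption: duplicates that differ only by case are collapsed.
function normaliseLabels(labels) {
  return [...new Set(labels.map((label) => label.toLowerCase()))];
}

// Labels prefixed with "#" are meant to be shown to the user as hashtags;
// anything else is treated as hidden (e.g. a bridged thread ID).
function visibleLabels(labels) {
  return labels.filter((label) => label.startsWith("#"));
}

const labels = normaliseLabels(["#Fun", "#fun", "thread-1234"]);
console.log(labels);                // [ '#fun', 'thread-1234' ]
console.log(visibleLabels(labels)); // [ '#fun' ]
```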
| Clients can use these to filter the overlapping conversations in a room into | ||
| different topics. The labels could also be used when bridging as a hashtag to | ||
| help manage the disconnect which can happen when bridging a threaded room to an | ||
| unthreaded one. | ||
| Review comment: A concern overall with this proposal is that once someone uses | ||
|  | ||
| Clients are expected to explicitly set the label on a message if the user's | ||
| intention is to respond as part of a given labelled topic. For instance, if the | ||
| user is currently filtered to only view messages with a given label, then new | ||
| messages sent should use the same label. Similarly if the user sends a reply to | ||
| a given message, that reply should typically use the same labels as the message | ||
| being replied to. | ||
| Review comment (on "set the label on a message"): "the" implies there's only one you have to worry about - shouldn't you be copying the whole set? | ||
| Review comment: My remaining doubt is how we're expecting clients to expose adding a label initially, e.g. if there would be another button in the composer or similar to add a label to a new message or whether you'd just let them be added retrospectively. Likewise, would we expect to show the labels on each message / show on hover etc. | ||
| Review comment: I've added some clarifications on that point in 4b7ca52, but designing UI isn't my area of expertise so I'm happy to discuss it. | ||
|  | ||
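As a rough illustration of the behaviour described above, a client might derive the labels for an outgoing message from the user's active filter and from the event being replied to. This is only a sketch: `labelsForNewMessage` and its parameters are hypothetical names, and merging both sources into one de-duplicated list is an assumption about how a client could reconcile them.

```javascript
// Decide which labels a new message should carry.
// activeFilterLabels: labels the user is currently filtering the room by.
// repliedToEvent: the event being replied to, or null if this is not a reply.
function labelsForNewMessage(activeFilterLabels, repliedToEvent) {
  // Copy the whole label set of the parent event, if there is one.
  const inherited = repliedToEvent?.content?.["m.labels"] ?? [];
  return [...new Set([...inherited, ...activeFilterLabels])];
}

const parent = {
  type: "m.room.message",
  content: {
    body: "who wants to go down the pub?",
    msgtype: "m.text",
    "m.labels": ["#fun"],
  },
};

const replyContent = {
  body: "me!",
  msgtype: "m.text",
  "m.labels": labelsForNewMessage([], parent),
};
console.log(replyContent["m.labels"]); // [ '#fun' ]
```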
| When a user wants to filter a room to given label(s), their client defines a | ||
| filter for use with `/sync` or `/messages` to limit the returned events | ||
| appropriately. This is done by adding new `labels` and `not_labels` fields to | ||
| the `EventFilter` object, which specify lists of labels to include or exclude | ||
| in the given filter. | ||
|  | ||
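A filter using the proposed fields might be built along these lines. This is a sketch under assumptions: the helper names are hypothetical, and placing the fields under `room.timeline` (for `/sync`) or passing the event filter as a JSON-encoded `filter` query parameter (for `/messages`) follows how existing Matrix filters are supplied, not anything this proposal spells out.

```javascript
// Build a /sync-style filter limiting timeline events by label.
// Assumption: the new fields sit on the room timeline's event filter.
function buildLabelFilter(labels, notLabels = []) {
  const timeline = {};
  if (labels.length > 0) timeline.labels = labels;
  if (notLabels.length > 0) timeline.not_labels = notLabels;
  return { room: { timeline } };
}

// Build the query string for a backwards-paginating /messages request,
// passing the event filter as JSON in the `filter` parameter.
function messagesQuery(labels, notLabels = []) {
  const { timeline } = buildLabelFilter(labels, notLabels).room;
  return new URLSearchParams({
    dir: "b",
    filter: JSON.stringify(timeline),
  }).toString();
}

console.log(JSON.stringify(buildLabelFilter(["#fun"], ["#work"])));
// {"room":{"timeline":{"labels":["#fun"],"not_labels":["#work"]}}}
```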
| ### Encrypted rooms | ||
|  | ||
| In encrypted rooms, the `m.labels` field of `m.room.encrypted` events contains, | ||
| for each label of the event that's being encrypted, a SHA256 hash of the | ||
| concatenation of the text label and the ID of the room the event is being sent | ||
| to, i.e. `hash = SHA256(label_text + room_id)`. | ||
|  | ||
| The reason behind using a hash built from the text label and the ID of the room | ||
| here instead of e.g. a random opaque string or a peppered hash is to maintain | ||
| consistency of the key without having access to the entire history of the room | ||
| or exposing the actual text of the label to the server, so that e.g. a new | ||
| client joining the room would be able to use the same key for the same label as | ||
| any other client. See the ["Alternative solutions"](#alternative-solutions) for | ||
| more information on this point. | ||
|  | ||
| When filtering events based on their label(s), clients are expected to use the | ||
| hash of the label(s) to filter in or out instead of the actual label text. | ||
|  | ||
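For encrypted rooms, the filter would then carry hashes rather than label text. The sketch below assumes Node.js's built-in `crypto` module, UTF-8 encoding of the concatenated string, and hex output (as in the proposal's own example); `hashLabel` and `buildEncryptedLabelFilter` are hypothetical names.

```javascript
const crypto = require("crypto");

// hash = SHA256(label_text + room_id), hex-encoded.
function hashLabel(labelText, roomId) {
  return crypto
    .createHash("sha256")
    .update(labelText + roomId, "utf8")
    .digest("hex");
}

// Same shape as a plaintext label filter, but with hashes in place of text,
// so the server never sees the label itself.
function buildEncryptedLabelFilter(roomId, labels, notLabels = []) {
  return {
    labels: labels.map((label) => hashLabel(label, roomId)),
    not_labels: notLabels.map((label) => hashLabel(label, roomId)),
  };
}

const filter = buildEncryptedLabelFilter("!someroom:example.com", ["#fun"]);
console.log(filter.labels[0].length); // 64 hex characters
```

Because the room ID is part of the hashed string, the same label yields a different identifier in each room, while every client in one room derives the same identifier independently.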
| #### Example | ||
|  | ||
| Consider a label `#fun` on a message sent to a room whose ID is | ||
| `!someroom:example.com`. Before encryption, the message would be: | ||
|  | ||
| ```json | ||
| { | ||
| "type": "m.room.message", | ||
| "content": { | ||
| "body": "who wants to go down the pub?", | ||
| "msgtype": "m.text", | ||
| "m.labels": [ "#fun" ] | ||
| } | ||
| } | ||
| ``` | ||
|  | ||
| `3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a` is the SHA256 | ||
| hash of the string `#fun!someroom:example.com`. Here's some example JavaScript | ||
| code to compute it: | ||
|  | ||
| ```javascript | ||
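| const crypto = require("crypto"); | ||
|  | ||
| // Concatenate the label text and the room ID, then hash the result. | ||
| const label_key_unhashed = "#fun" + "!someroom:example.com"; | ||
| const hash = crypto.createHash("sha256"); | ||
| hash.update(label_key_unhashed); | ||
| const label_key = hash.digest("hex"); // 3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a | ||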
| ``` | ||
|  | ||
| Once encrypted, the event would become: | ||
|  | ||
| ```json | ||
| { | ||
| "type": "m.room.encrypted", | ||
| "content": { | ||
| "algorithm": "m.megolm.v1.aes-sha2", | ||
| "ciphertext": "AwgAEpABm6.......", | ||
| "device_id": "SOLZHNGTZT", | ||
| "sender_key": "FRlkQA1enABuOH4xipzJJ/oD8fxiQHj6jrAyyrvzSTY", | ||
| "session_id": "JPWczbhnAivenK3qRwqLLBQu4W13fz1lqQpXDlpZzCg", | ||
| "m.labels": [ | ||
| "3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a" | ||
| ] | ||
| } | ||
| } | ||
| ``` | ||
|  | ||
| ## Problems | ||
|  | ||
| Do we care about internationalising hashtags? | ||
| Review comment: Generally the room will be using a specific language, so probably not. | ||
|  | ||
| Too many threading APIs? | ||
|  | ||
| Using hashes means that servers could be inclined to compute rainbow tables to | ||
| read labels on encrypted messages. However, since we're using the room ID as | ||
| a kind of salt, this becomes much more expensive because the server would have | ||
| to maintain one rainbow table for each encrypted room it's in, which would | ||
| probably make it not worth the trouble. | ||
|  | ||
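To make the rainbow-table concern concrete, the sketch below shows how anyone who sees the hashes and knows the room ID could test a list of candidate labels. It is an illustration only, with hypothetical function names; it also shows why a lookup table built for one room is useless for another.

```javascript
const crypto = require("crypto");

// hash = SHA256(label_text + room_id), hex-encoded.
function hashLabel(labelText, roomId) {
  return crypto.createHash("sha256").update(labelText + roomId, "utf8").digest("hex");
}

// Build a per-room lookup table from candidate labels, then map each
// observed hash back to its label text (or null if no candidate matches).
function guessLabels(observedHashes, roomId, candidates) {
  const table = new Map(candidates.map((c) => [hashLabel(c, roomId), c]));
  return observedHashes.map((h) => table.get(h) ?? null);
}

const roomA = "!someroom:example.com";
const roomB = "!otherroom:example.com";
const observed = [hashLabel("#fun", roomA)];

console.log(guessLabels(observed, roomA, ["#work", "#fun"])); // [ '#fun' ]
console.log(guessLabels(observed, roomB, ["#work", "#fun"])); // [ null ]
```

The cost per room is one hash per candidate label, so short or common labels remain guessable even though tables cannot be shared across rooms.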
| ## Alternative solutions | ||
| Review comment: I was just rereading this proposal, I think @dkasak's points on the main thread are legitimate: that hashing the labels gives a very false sense of security here. Given how strong our e2ee is, folks will assume opaque labels are actually encrypted, rather than just obfuscated by a hash which can be easily rainbow-tabled. Personally, I think it'd be fine to add a pepper to the hashed events, and require at the application level that for labels to work in encrypted rooms, the new user must be brought up to speed on the pepper (e.g. by the inviter sharing the pepper in an encrypted message, possibly to-device, after having invited them). This is simpler than using opaque IDs for the unencrypted event headers, as there's only one pepper that needs to be shared to new users, rather than the whole set of opaque->real label mappings. | ||
|  | ||
| Instead of using hashes to identify labels in encrypted messages, using random | ||
| opaque strings was also considered. Bearing in mind that we need to be able to | ||
| use the label identifiers to filter the history of the room server-side (because | ||
| we're not expecting clients to know about the whole history of the room), this | ||
| solution had the following downsides, all originating from the fact that nothing | ||
| would prevent 1000 clients from each using a different identifier: | ||
|  | ||
| * filtering would have serious performance issues in E2EE rooms, as the server | ||
| would need to return all events it knows about whose label identifier is any | ||
| of the 1000 identifiers provided by the client, which is quite expensive to | ||
| do. | ||
|  | ||
| * it would be impossible for a filtered `/messages` (or `/sync`) request to | ||
| include every event matching the desired label because we can't expect a | ||
| client to know about every identifier that has been used in the whole history | ||
| of the room, or about the fact that another client might suddenly decide to | ||
| use another identifier for the same label text, and include those identifiers | ||
| in its filtered request. | ||
|  | ||
| Another proposed solution would be to use peppered hashes, and to store the | ||
| pepper in the encrypted event. However, this solution would have the same | ||
| downsides as described above. | ||
|  | ||
| ## Unstable prefix | ||
|  | ||
| Unstable implementations should use `org.matrix.labels` rather than | ||
| `m.labels`. When defining filters, they should also use `org.matrix.labels` and | ||
| `org.matrix.not_labels` in the `EventFilter` object. | ||
|  | ||
| Additionally, servers implementing this feature should advertise that they do so | ||
| by exposing a `label_based_filtering` flag in the `unstable_features` part of | ||
| the `/versions` response. | ||
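A client could check for this flag before using the unstable field names, roughly as sketched below. The function name and the returned object shape are hypothetical, the flag name is taken from the text as written, and returning `null` when the flag is absent is an assumption about how a client might signal "not supported".

```javascript
// Given a parsed /versions response, return the unstable field names to use,
// or null if the server does not advertise label based filtering.
function unstableLabelFields(versionsResponse) {
  const unstable = versionsResponse.unstable_features ?? {};
  if (unstable["label_based_filtering"] !== true) {
    return null;
  }
  return {
    eventField: "org.matrix.labels",
    filterInclude: "org.matrix.labels",
    filterExclude: "org.matrix.not_labels",
  };
}

const response = {
  versions: ["r0.6.0"],
  unstable_features: { label_based_filtering: true },
};
console.log(unstableLabelFields(response).filterExclude); // org.matrix.not_labels
```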
Review comment: @Bubu Agreed that this would be a handy feature, and that listing all labels in a room is a nice thing to have generally. I'm thinking of one possible way to do this, which would be to add an endpoint that exposes the list of labels the server knows have been used in the room (which should be fairly easy given the server will probably already store `(event_id, label)` tuples in its database for efficiency). For encrypted rooms, this would return a list of hashes (which is what the server considers as a list of labels for that room, since it doesn't know about the actual labels), and clients would then be able to resolve those hashes by calling `/messages` with a filter containing the labels to resolve, and extracting the labels from the response (which contains events that the client should be able to decrypt). This would allow such a feature to work well without having to leak more metadata. wdyt?

Review comment: Sounds good!
As for metadata concerns, these are still there, as we are working on a set of ~6 fixed labels for this use case. But this is basically already covered in the "Security Considerations" section here.
Whether or not the actual usage of these tags for images/links/etc. will become optional in E2EE chats is not part of this MSC I believe.

Review comment: can we resolve this thread?

Review comment: I'll do that once I've updated the MSC to describe this solution, which I haven't got time to do yet.