Conversation
MadLittleMods
left a comment
There was a problem hiding this comment.
Partial review
Reviewed up to TestMSC4242OnSendJoinSJ04
Haven't looked at the utils in tests/msc4242/msc4242_api_test.go yet
| PrevEvents: []string{room.CurrentState(spec.MRoomMember, sender).EventID()}, | ||
| }, | ||
| // no event in the state DAG has this event ID. | ||
| PrevStateEvents: []string{"$4PRgaFIMcD9z4vzgkUUm0YI5CZHYORUPzWGJac6guAo"}, |
There was a problem hiding this comment.
We can name this more obviously
| PrevStateEvents: []string{"$4PRgaFIMcD9z4vzgkUUm0YI5CZHYORUPzWGJac6guAo"}, | |
| PrevStateEvents: []string{"$eventIdDoesNotExist"}, |
(also applies below)
There was a problem hiding this comment.
That's not the same because it isn't a valid event ID.
There was a problem hiding this comment.
What's not valid about it?
$opaque_idHowever, the precise format depends upon the room version specification: early room versions included a domain component, whereas more recent versions omit the domain and use a base64-encoded hash instead.
It seems like the only requirement maybe is that it should be base64 decodable but there isn't any hash comparison to do so I don't think that's a requirement either.
3.3 Event IDs
The event ID is the reference hash of the event encoded using a variation of Unpadded Base64 which replaces the 62nd and 63rd characters with
-and_instead of using+and/. This matches RFC4648’s definition of URL-safe base64.Event IDs are still prefixed with $ and might result in looking like
$Rqnc-F-dvnEYJTyHq_iKxU2bZ1CI92-kuZq3a5lr5Zg.
There was a problem hiding this comment.
No, I want this to look like a normal event ID, else HSes may reject early because it isn't a valid SHA256 (length).
We can pad it out however long it needs to be. But seems like servers should treat these opaquely.
| PrevStateEvents: []string{"$4PRgaFIMcD9z4vzgkUUm0YI5CZHYORUPzWGJac6guAo"}, | |
| PrevStateEvents: []string{"$eventIdDoesNotExist000000000000000000000000"}, |
| } | ||
|
|
||
| /* | ||
| func TestMSC4242SendJoinFasterSJ03Inbound(t *testing.T) { |
There was a problem hiding this comment.
Because the Synapse implementation does not support faster room joins yet.
| defer cancel() | ||
| bob := srv.UserID("bob") | ||
| testCases := faultyEventTestCases | ||
| var mu sync.Mutex |
There was a problem hiding this comment.
What's the point of this lock? (comment)
There was a problem hiding this comment.
This was added:
var mu sync.Mutex // protects faultyEvents
We should name it more obviously
Co-authored-by: Eric Eastwood <erice@element.io>
| if endIndex > len(eventIDsPerRequest) { | ||
| endIndex = len(eventIDsPerRequest) | ||
| } | ||
| wantEventIDs := eventIDsPerRequest[startIndex:endIndex] |
There was a problem hiding this comment.
We might as well move this lower down to the actual comparison
| } | ||
| // Repeatedly calls /get_missing_events on hs1 starting from fromEvent, in batches of batchSize. | ||
| // If stateDAG is set, asks the server to walk the state DAG. | ||
| // Asserts that events are returned (over multiple requests) in the order provided by eventIDsPerRequest. |
There was a problem hiding this comment.
We should document which order eventIDsPerRequest should be in
| if !slices.Equal(got, wantEventIDs) { | ||
| ct.Errorf(t, "failed to see correct event IDs in test '%s'. \nGot %v \nWant %v", name, got, wantEventIDs) | ||
| return | ||
| } |
There was a problem hiding this comment.
To get a nicer diff, it would be nice to re-use the assertEventsInOrder utility from tests/csapi/room_messages_test.go
| ct.Errorf(t, "failed to see correct event IDs in test '%s'. \nGot %v \nWant %v", name, got, wantEventIDs) | ||
| return | ||
| } | ||
| if endIndex == len(eventIDsPerRequest) { |
There was a problem hiding this comment.
We have the explicit endIndex = len(eventIDsPerRequest) above to prevent over-runs but seems sketchy as code evolves over time.
Perhaps we should adjust the condition to be more forgiving or have an additional assert(endIndex <= eventIDsPerRequest, "programming error")
| if endIndex == len(eventIDsPerRequest) { | |
| if endIndex >= len(eventIDsPerRequest) { |
| must.Equal(t, lastEvent.Exists(), true, "last timeline entry for room does not exist") | ||
| expectedJoinPrevStateEvents := prevStateEvents(t, lastEvent) | ||
| bob := srv.UserID("bob") | ||
| _, sendJoinResp := MustJoinRoom(t, srv, deployment, spec.ServerName("hs1"), roomID, bob) |
There was a problem hiding this comment.
| _, sendJoinResp := MustJoinRoom(t, srv, deployment, spec.ServerName("hs1"), roomID, bob) | |
| _, sendJoinResp := MustJoinRoom(t, srv, deployment, deployment.GetFullyQualifiedHomeserverName(t, "hs1"), roomID, bob) |
| return room.Timeline[i].EventID() | ||
| } | ||
| } | ||
| t.Logf("%s: failed to find any state event in %d timeline events, no prev_state_events will be set!", room.RoomID, len(room.Timeline)) |
There was a problem hiding this comment.
Is this ever desired? (or programming error)
| serversInRoom := room.ServersInRoom() | ||
| for _, srvName := range serversInRoom { | ||
| res.ServersInRoom = append(res.ServersInRoom, string(srvName)) | ||
| } |
There was a problem hiding this comment.
Looks like this should be gated by omitServersInRoom
| // LoadStateDAG loads all the events under state_dag and puts their event_id: prev_state_events into a map. | ||
| // Performs no verification that the DAG is connected. | ||
| func LoadStateDAG(t ct.TestLike, sendJoinResp fclient.RespSendJoin) (joinEventID string, stateDAG map[string][]string) { | ||
| roomVer := gomatrixserverlib.MustGetRoomVersion(roomVersion) | ||
| joinEvent, err := roomVer.NewEventFromTrustedJSON(sendJoinResp.Event, false) | ||
| must.NotError(t, "failed to load join event", err) | ||
| stateDAG = make(map[string][]string) // event_id => prev_state_events | ||
| for _, stateEventJSON := range sendJoinResp.StateDAG { | ||
| stateEvent, err := roomVer.NewEventFromTrustedJSON(stateEventJSON, false) | ||
| must.NotError(t, "failed to load event json", err) | ||
| stateEventGJSON := gjson.ParseBytes(stateEventJSON) | ||
| prevStateEvents := prevStateEvents(t, stateEventGJSON) | ||
| stateDAG[stateEvent.EventID()] = prevStateEvents | ||
| } | ||
|
|
||
| // Include the prev state events of the join event itself as that is what we care about | ||
| joinEventPrevStateEvents := prevStateEvents(t, gjson.ParseBytes(sendJoinResp.Event)) | ||
| stateDAG[joinEvent.EventID()] = joinEventPrevStateEvents | ||
| return joinEvent.EventID(), stateDAG | ||
| } |
There was a problem hiding this comment.
LoadStateDAG is unused
| return joinEvent.EventID(), stateDAG | ||
| } | ||
|
|
||
| // We have to implement our own /make|send_join dance here as gmsl is unaware of state DAGs. |
|
|
||
| func TestMain(m *testing.M) { | ||
| complement.TestMain(m, "msc4242") | ||
| } |
There was a problem hiding this comment.
Note: While I have read the entire MSC and this PR, I don't think I've onboarded enough to recognize if there are missing test cases. It's my first time with my own hand in the fire when it comes to State DAG's.
This PR is so large, it's probably worth having an LLM also sanity check this against the MSC for completeness.
There was a problem hiding this comment.
I've shoved this through Claude and it has made some obvious suggestions:
Missing: G: has >0 prev_state_events and it is an m.room.create event
STATE03: XXX receiving a state event via /send which does NOT alter the current state (say it's a concurrent topic change which does not win state res) should | should not go down /sync and be present in the timeline.
Faster remote room joins is commented out.
And some less obvious ones:
No e2e soft-failure convergence test
The spec dedicates a section to the specific failure mode where soft-failing an event prevents a later server from seeing it and converging on correct state. SJ04C covers the fix (including concurrent events in /send_join), but there's no multi-server e2e test (msc4242_e2e_test.go only has the basic TestMSC4242FederationSimple). The spec's scenario: "Bystander soft-fails C; long-dead server rejoins and never learns C's effect on state" is not exercised.
- No test for prev_state_events cross-room rejection
The auth rules section adds:
If there are entries which do not belong in the same room, reject.
None of the faultyEventTestCases generates a prev_state_events entry pointing to a state event from a different room. This is distinct from "bogus event ID" (CodeSuffix A/B) as those are unknown IDs, not known-but-wrong-room IDs.
When probed for more missing test cases, it produced:
earliest_events never exercised with non-empty values
Every call to /get_missing_events across all tests passes EarliestEvents: []string{}. The spec explicitly defines the algorithm as mapping earliest_events to an initial "seen" set to bound the walk. There's no test verifying the server actually stops walking when it hits a known event ID. This matters because the recursive pagination in assertGetMissingEventsRecursively works around this by updating latest_events instead, so a broken earliest_events implementation would never be caught.
Outbound prev_state_events on non-state events never checked
The spec states all events — including messages — must carry prev_state_events. The tests verify the HS handles incoming events with various prev_state_events configurations, and TestMSC4242SendJoinSJ01Inbound checks prev_state_events on the join event specifically. But there's no test inspecting events sent by hs1 (messages, profile changes, etc.) to confirm they actually include prev_state_events pointing to the correct state DAG forward extremities. The E2E test only checks event IDs arrive, not event structure.
Duplicate event IDs in prev_state_events not tested
The spec removes the check for duplicate (type, state_key) pairs because prev_state_events can legitimately reference two events with the same tuple on different branches. But there's no faulty event test case for a prev_state_events list containing the same event ID twice. Whether this should be rejected (malformed) or tolerated (treated as deduplicated) isn't tested either way.
No test that /state, /state_ids, /event_auth are not called during normal operation
The spec explicitly obsoletes these three endpoints. While srv.UnexpectedRequestsAreErrors being true (the default) would catch this in some tests, several tests set it to false, and no test specifically asserts the HS does not fall back to these endpoints when processing events or calculating room state. A targeted test with a Complement server that returns 404 for all three endpoints and verifies the HS still converges correctly would strengthen this.
I'll consider these and resolve this issue when done.
MSC4242: State DAGs
Pull Request Checklist