Skip to content

MSC4242: State DAGs#841

Open
kegsay wants to merge 5 commits intomainfrom
kegan/4242
Open

MSC4242: State DAGs#841
kegsay wants to merge 5 commits intomainfrom
kegan/4242

Conversation

@kegsay
Copy link
Member

@kegsay kegsay commented Feb 2, 2026

MSC4242: State DAGs

Pull Request Checklist

Copy link
Collaborator

@MadLittleMods MadLittleMods left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review

Reviewed up to TestMSC4242OnSendJoinSJ04

Haven't looked at the utils in tests/msc4242/msc4242_api_test.go yet

PrevEvents: []string{room.CurrentState(spec.MRoomMember, sender).EventID()},
},
// no event in the state DAG has this event ID.
PrevStateEvents: []string{"$4PRgaFIMcD9z4vzgkUUm0YI5CZHYORUPzWGJac6guAo"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can name this more obviously

Suggested change
PrevStateEvents: []string{"$4PRgaFIMcD9z4vzgkUUm0YI5CZHYORUPzWGJac6guAo"},
PrevStateEvents: []string{"$eventIdDoesNotExist"},

(also applies below)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not the same because it isn't a valid event ID.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's not valid about it?

$opaque_id

However, the precise format depends upon the room version specification: early room versions included a domain component, whereas more recent versions omit the domain and use a base64-encoded hash instead.

-- https://spec.matrix.org/v1.16/appendices/#event-ids

It seems like the only requirement maybe is that it should be base64 decodable but there isn't any hash comparison to do so I don't think that's a requirement either.

3.3 Event IDs

The event ID is the reference hash of the event encoded using a variation of Unpadded Base64 which replaces the 62nd and 63rd characters with - and _ instead of using + and /. This matches RFC4648’s definition of URL-safe base64.

Event IDs are still prefixed with $ and might result in looking like $Rqnc-F-dvnEYJTyHq_iKxU2bZ1CI92-kuZq3a5lr5Zg.

-- https://spec.matrix.org/v1.16/rooms/v12/#event-ids

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I want this to look like a normal event ID, else HSes may reject early because it isn't a valid SHA256 (length).

-- @kegsay, #841 (comment)

We can pad it out however long it needs to be. But seems like servers should treat these opaquely.

Suggested change
PrevStateEvents: []string{"$4PRgaFIMcD9z4vzgkUUm0YI5CZHYORUPzWGJac6guAo"},
PrevStateEvents: []string{"$eventIdDoesNotExist000000000000000000000000"},

}

/*
func TestMSC4242SendJoinFasterSJ03Inbound(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why commented out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the Synapse implementation does not support faster room joins yet.

defer cancel()
bob := srv.UserID("bob")
testCases := faultyEventTestCases
var mu sync.Mutex
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of this lock? (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added:

	var mu sync.Mutex // protects faultyEvents

We should name it more obviously

if endIndex > len(eventIDsPerRequest) {
endIndex = len(eventIDsPerRequest)
}
wantEventIDs := eventIDsPerRequest[startIndex:endIndex]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might as well move this lower down to the actual comparison

}
// Repeatedly calls /get_missing_events on hs1 starting from fromEvent, in batches of batchSize.
// If stateDAG is set, asks the server to walk the state DAG.
// Asserts that events are returned (over multiple requests) in the order provided by eventIDsPerRequest.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should document which order eventIDsPerRequest should be in

Comment on lines +121 to +124
if !slices.Equal(got, wantEventIDs) {
ct.Errorf(t, "failed to see correct event IDs in test '%s'. \nGot %v \nWant %v", name, got, wantEventIDs)
return
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get a nicer diff, it would be nice to re-use the assertEventsInOrder utility from tests/csapi/room_messages_test.go

ct.Errorf(t, "failed to see correct event IDs in test '%s'. \nGot %v \nWant %v", name, got, wantEventIDs)
return
}
if endIndex == len(eventIDsPerRequest) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the explicit endIndex = len(eventIDsPerRequest) above to prevent over-runs but seems sketchy as code evolves over time.

Perhaps we should adjust the condition to be more forgiving or have an additional assert(endIndex <= eventIDsPerRequest, "programming error")

Suggested change
if endIndex == len(eventIDsPerRequest) {
if endIndex >= len(eventIDsPerRequest) {

must.Equal(t, lastEvent.Exists(), true, "last timeline entry for room does not exist")
expectedJoinPrevStateEvents := prevStateEvents(t, lastEvent)
bob := srv.UserID("bob")
_, sendJoinResp := MustJoinRoom(t, srv, deployment, spec.ServerName("hs1"), roomID, bob)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_, sendJoinResp := MustJoinRoom(t, srv, deployment, spec.ServerName("hs1"), roomID, bob)
_, sendJoinResp := MustJoinRoom(t, srv, deployment, deployment.GetFullyQualifiedHomeserverName(t, "hs1"), roomID, bob)

return room.Timeline[i].EventID()
}
}
t.Logf("%s: failed to find any state event in %d timeline events, no prev_state_events will be set!", room.RoomID, len(room.Timeline))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ever desired? (or programming error)

Comment on lines +164 to +167
serversInRoom := room.ServersInRoom()
for _, srvName := range serversInRoom {
res.ServersInRoom = append(res.ServersInRoom, string(srvName))
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this should be gated by omitServersInRoom

Comment on lines +411 to +430
// LoadStateDAG loads all the events under state_dag and puts their event_id: prev_state_events into a map.
// Performs no verification that the DAG is connected.
func LoadStateDAG(t ct.TestLike, sendJoinResp fclient.RespSendJoin) (joinEventID string, stateDAG map[string][]string) {
roomVer := gomatrixserverlib.MustGetRoomVersion(roomVersion)
joinEvent, err := roomVer.NewEventFromTrustedJSON(sendJoinResp.Event, false)
must.NotError(t, "failed to load join event", err)
stateDAG = make(map[string][]string) // event_id => prev_state_events
for _, stateEventJSON := range sendJoinResp.StateDAG {
stateEvent, err := roomVer.NewEventFromTrustedJSON(stateEventJSON, false)
must.NotError(t, "failed to load event json", err)
stateEventGJSON := gjson.ParseBytes(stateEventJSON)
prevStateEvents := prevStateEvents(t, stateEventGJSON)
stateDAG[stateEvent.EventID()] = prevStateEvents
}

// Include the prev state events of the join event itself as that is what we care about
joinEventPrevStateEvents := prevStateEvents(t, gjson.ParseBytes(sendJoinResp.Event))
stateDAG[joinEvent.EventID()] = joinEventPrevStateEvents
return joinEvent.EventID(), stateDAG
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoadStateDAG is unused

return joinEvent.EventID(), stateDAG
}

// We have to implement our own /make|send_join dance here as gmsl is unaware of state DAGs.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gmsl ?


func TestMain(m *testing.M) {
complement.TestMain(m, "msc4242")
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: While I have read the entire MSC and this PR, I don't think I've onboarded enough to recognize if there are missing test cases. It's my first time with my own hand in the fire when it comes to State DAG's.

This PR is so large, it's probably worth having an LLM also sanity check this against the MSC for completeness.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've shoved this through Claude and it has made some obvious suggestions:

Missing: G: has >0 prev_state_events and it is an m.room.create event

STATE03: XXX receiving a state event via /send which does NOT alter the current state (say it's a concurrent topic change which does not win state res) should | should not go down /sync and be present in the timeline.

Faster remote room joins is commented out.

And some less obvious ones:

No e2e soft-failure convergence test

The spec dedicates a section to the specific failure mode where soft-failing an event prevents a later server from seeing it and converging on correct state. SJ04C covers the fix (including concurrent events in /send_join), but there's no multi-server e2e test (msc4242_e2e_test.go only has the basic TestMSC4242FederationSimple). The spec's scenario: "Bystander soft-fails C; long-dead server rejoins and never learns C's effect on state" is not exercised.

  1. No test for prev_state_events cross-room rejection

The auth rules section adds:

If there are entries which do not belong in the same room, reject.

None of the faultyEventTestCases generates a prev_state_events entry pointing to a state event from a different room. This is distinct from "bogus event ID" (CodeSuffix A/B) as those are unknown IDs, not known-but-wrong-room IDs.

When probed for more missing test cases, it produced:

earliest_events never exercised with non-empty values

Every call to /get_missing_events across all tests passes EarliestEvents: []string{}. The spec explicitly defines the algorithm as mapping earliest_events to an initial "seen" set to bound the walk. There's no test verifying the server actually stops walking when it hits a known event ID. This matters because the recursive pagination in assertGetMissingEventsRecursively works around this by updating latest_events instead, so a broken earliest_events implementation would never be caught.

Outbound prev_state_events on non-state events never checked

The spec states all events — including messages — must carry prev_state_events. The tests verify the HS handles incoming events with various prev_state_events configurations, and TestMSC4242SendJoinSJ01Inbound checks prev_state_events on the join event specifically. But there's no test inspecting events sent by hs1 (messages, profile changes, etc.) to confirm they actually include prev_state_events pointing to the correct state DAG forward extremities. The E2E test only checks event IDs arrive, not event structure.

Duplicate event IDs in prev_state_events not tested

The spec removes the check for duplicate (type, state_key) pairs because prev_state_events can legitimately reference two events with the same tuple on different branches. But there's no faulty event test case for a prev_state_events list containing the same event ID twice. Whether this should be rejected (malformed) or tolerated (treated as deduplicated) isn't tested either way.

No test that /state, /state_ids, /event_auth are not called during normal operation

The spec explicitly obsoletes these three endpoints. While srv.UnexpectedRequestsAreErrors being true (the default) would catch this in some tests, several tests set it to false, and no test specifically asserts the HS does not fall back to these endpoints when processing events or calculating room state. A targeted test with a Complement server that returns 404 for all three endpoints and verifies the HS still converges correctly would strengthen this.

I'll consider these and resolve this issue when done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants