
Latest Synapse 1.70.0rc1 (2022-10-19) release seems to have regressed /messages response time #14250

@MadLittleMods

Description


I think we rolled out the latest Synapse 1.70.0rc1 (2022-10-19) to matrix.org this morning. Looking at the graphs (where the spike on the right is), I noticed there seems to be a big up-tick in /messages response times.

We should check to see whether this persists tomorrow or is just a blip.

https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&from=now-4d&to=now
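For reference, the same numbers can be pulled straight from Prometheus rather than the dashboard. This is a rough sketch assuming the standard Synapse response-time histogram (synapse_http_server_response_time_seconds with a servlet label) and a placeholder Prometheus address; adjust both to whatever your deployment actually exposes.

```python
# Rough sketch: query Prometheus directly for the mean /messages response time.
# Assumes the standard Synapse histogram synapse_http_server_response_time_seconds
# with a servlet label; the Prometheus address is a placeholder.
import requests

PROMETHEUS = "http://localhost:9090"  # placeholder Prometheus address

QUERY = (
    'rate(synapse_http_server_response_time_seconds_sum{servlet="RoomMessageListRestServlet"}[5m])'
    " / "
    'rate(synapse_http_server_response_time_seconds_count{servlet="RoomMessageListRestServlet"}[5m])'
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    instance = result["metric"].get("instance", "?")
    _timestamp, value = result["value"]
    print(f"{instance}: mean /messages response time over 5m = {float(value):.2f}s")
```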

It looks like we're seeing a major regression in the before backfill processing time (synapse/handlers/federation.py#L197-L503), and some regression in the after backfill processing time (synapse/handlers/federation_event.py#L643-L657).

What changed?

See https://github.com/matrix-org/synapse/blob/develop/CHANGES.md#synapse-1700rc1-2022-10-19

Is it #13816?

One obvious culprit is #13816, but I would expect everything it touches to show up in the after backfill processing timing. And its goal is to save time, since we bail early if we are trying to process an event we know will fail.
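For context, the bail-early idea from that PR looks roughly like this. It is only an illustrative sketch, not the code from #13816; the store helper have_failed_pull_attempts and the processor object are hypothetical stand-ins for the real Synapse internals.

```python
# Illustrative sketch only, not the actual code from #13816: skip pulled events
# that previous pull attempts already failed on, instead of re-validating them
# on every backfill. store.have_failed_pull_attempts and processor are
# hypothetical stand-ins for the real Synapse internals.
async def process_pulled_events(store, processor, origin, events, backfilled):
    failed_ids = await store.have_failed_pull_attempts(
        [event.event_id for event in events]
    )
    for event in events:
        if event.event_id in failed_ids:
            # Bail early: we already know this event fails to process, so
            # spending time on it again would only slow the backfill down.
            continue
        await processor.process_pulled_event(origin, event, backfilled=backfilled)
```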

My /messages?dir=b&limit=500 benchmark request to Matrix HQ still takes 24s to complete (expected). The one thing that stood out to me was a new recursively fetching redactions span under _process_pulled_events in the trace.
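For anyone who wants to reproduce the benchmark, it is just a single paginated /messages call, e.g. something like the script below. The homeserver URL, room ID, and access token are placeholders to swap for your own.

```python
# Rough reproduction of the /messages?dir=b&limit=500 benchmark.
# Homeserver URL, room ID, and access token are placeholders.
import time

import requests

HOMESERVER = "https://matrix-client.matrix.org"
ROOM_ID = "!OGEhHVWSdvArJzumhm:matrix.org"  # placeholder room ID for Matrix HQ
ACCESS_TOKEN = "..."  # access token for the benchmarking account

url = f"{HOMESERVER}/_matrix/client/v3/rooms/{ROOM_ID}/messages"
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

start = time.monotonic()
resp = requests.get(
    url, params={"dir": "b", "limit": 500}, headers=headers, timeout=300
)
elapsed = time.monotonic() - start

resp.raise_for_status()
print(f"fetched {len(resp.json().get('chunk', []))} events in {elapsed:.1f}s")
```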

Is it #14164?

This one seems suspicious, given the new recursively fetching redactions span I saw under _process_pulled_events in the Jaeger trace, which comes from _get_events_from_db (changed in this PR), and given that we also removed a cache.

This would account for the after backfill processing time. Just smoke though; I don't know how it actually performs. We do end up running _process_pulled_event less, since we only do it for new events.
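For anyone unfamiliar with the span name: "recursively fetching redactions" means that for each event loaded from the database we also fetch the events that redact it, and then any redactions of those redactions, and so on. A simplified sketch of that shape is below; the function and the row layout are illustrative, not Synapse's actual _get_events_from_db code.

```python
# Simplified illustration of the "recursively fetching redactions" step that
# _get_events_from_db performs; the function and the row shape (a "redacted_by"
# list) are illustrative, not Synapse's actual schema.
async def fetch_with_redactions(event_ids, fetch_event_rows):
    """Fetch events plus the full chain of redactions pointing at them."""
    events = await fetch_event_rows(list(event_ids))

    # An event can be redacted by an event that is itself redacted, so keep
    # following "redacted_by" references until no new event IDs turn up.
    # Each iteration is another database round-trip.
    pending = {
        redaction_id
        for row in events.values()
        for redaction_id in row.get("redacted_by", [])
        if redaction_id not in events
    }
    while pending:
        fetched = await fetch_event_rows(list(pending))
        events.update(fetched)
        pending = {
            redaction_id
            for row in fetched.values()
            for redaction_id in row.get("redacted_by", [])
            if redaction_id not in events
        }
    return events
```

Without a cache in front of that lookup, every backfilled page pays the extra database round-trips again, which is part of why the removed cache seems suspicious.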

What else?

What about the before backfill processing time?

Needs investigation.


I wish I had a trace that represented one of these long requests; it should be something longer than 180s.

But we only recorded two traces over 120s in the past 2 days: https://jaeger.proxy.matrix.org/search?end=1666298780924000&limit=20&lookback=2d&maxDuration&minDuration=120s&operation=RoomMessageListRestServlet&service=matrix.org%20client_reader&start=1666125980924000

And they seem more related to outgoing-federation-request calls for /backfill, which we don't control (RequestSendFailed), and wouldn't account for the before/after backfill processing times.
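In case it helps with hunting for more traces, the same search can be scripted against Jaeger's query API instead of the UI. Note that /api/traces is the internal API backing the Jaeger UI, so the parameters below simply mirror the search URL above and may change between Jaeger versions.

```python
# Sketch: pull long-running RoomMessageListRestServlet traces via Jaeger's
# query API. /api/traces is the internal API behind the Jaeger UI, so these
# parameters simply mirror the search URL above and may change between versions.
import requests

JAEGER = "https://jaeger.proxy.matrix.org"

params = {
    "service": "matrix.org client_reader",
    "operation": "RoomMessageListRestServlet",
    "minDuration": "120s",
    "lookback": "2d",
    "limit": 20,
}
resp = requests.get(f"{JAEGER}/api/traces", params=params, timeout=30)
resp.raise_for_status()

for trace in resp.json()["data"]:
    longest = max(trace["spans"], key=lambda span: span["duration"])
    # Jaeger reports span durations in microseconds.
    print(
        f"{trace['traceID']}: longest span {longest['operationName']} "
        f"took {longest['duration'] / 1_000_000:.1f}s"
    )
```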


Labels

- A-Messages-Endpoint: /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill)
- A-Performance: Performance, both client-facing and admin-facing
- O-Occasional: Affects or can be seen by some users regularly or most users rarely
- S-Minor: Blocks non-critical functionality, workarounds exist.
- T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
- X-Regression: Something broke which worked on a previous release
