This repository was archived by the owner on Apr 26, 2024. It is now read-only.
Draft: /messages investigation scratch pad1
#13440 (Closed)
MadLittleMods wants to merge 25 commits into madlittlemods/11850-migrate-to-opentelemetry from madlittlemods/13356-messages-investigation-scratch-v1
Conversation
…lemods/13356-messages-investigation-scratch-v1 Conflicts: synapse/api/auth.py
MadLittleMods commented on Aug 3, 2022
Comment on lines +201 to +203:

# It does not seem like the agent can keep up with the massive UDP load
# (1065 spans in one trace) so lets just use the HTTP collector endpoint
# instead which seems to work.
I wonder why this is the case? I was seeing this same behavior with the Jaeger opentracing stuff. Is the UDP connection being oversaturated? Can the Jaeger agent in Docker not keep up? We see some spans come over, but never the main overarching servlet span, which is probably the last to be exported.
But using the HTTP Jaeger collector endpoint seems to work fine for getting the whole trace.
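For reference, here is a minimal sketch of what that switch looks like with the OpenTelemetry Python SDK's Jaeger Thrift exporter, assuming the default Jaeger ports (6831/udp for the agent, 14268 for the HTTP collector). This is not the exact configuration from this branch, just an illustration of the two export paths being compared.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# UDP agent export: in practice, parts of very large traces (~1000+ spans) got dropped.
# exporter = JaegerExporter(agent_host_name="localhost", agent_port=6831)

# HTTP collector export, which reliably delivered the whole trace.
exporter = JaegerExporter(collector_endpoint="http://localhost:14268/api/traces")

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```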
…ittlemods/13356-messages-investigation-scratch-v1
MadLittleMods commented on Aug 8, 2022
…ittlemods/13356-messages-investigation-scratch-v1 Conflicts: pyproject.toml synapse/logging/tracing.py
MadLittleMods added a commit that referenced this pull request on Aug 16, 2022
…ittlemods/13356-messages-investigation-scratch-v1 Conflicts: synapse/federation/federation_client.py synapse/handlers/federation.py synapse/handlers/federation_event.py synapse/logging/tracing.py synapse/storage/controllers/persist_events.py synapse/storage/controllers/state.py synapse/storage/databases/main/events_worker.py synapse/util/ratelimitutils.py
…ittlemods/13356-messages-investigation-scratch-v1 Conflicts: poetry.lock synapse/handlers/federation.py
…ittlemods/13356-messages-investigation-scratch-v1
@MadLittleMods Is this useful or have you gleaned everything you can from it?
…ittlemods/13356-messages-investigation-scratch-v1 Conflicts: synapse/handlers/federation.py synapse/handlers/relations.py
Labels
A-Messages-Endpoint
/messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill)
T-Task
Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Part of #13356
Combine:
- `/messages` `@trace` decorations
- Instrument `/messages` for understandable traces in Jaeger (#13368)

So that I can run against the Complement federation tests and see if there is more to add `@trace` to in the federation stack of things when `/messages` happens.

Optimization ideas
We load a lot of state (from 2. in #13356)

In #matrixhq there are 40k current members and I assume `get_current_state` is the root cause of why we `Loaded 79277 events` (seems like that took 17s too). We only call `get_current_state` in order to get a list of likely domains to backfill from.

We could optimize this by:
- Caching `get_domains_from_state` so we don't have to call `get_current_state` as much
- Running `get_domains_from_state` in the background so it's ready by the time we fail with the first couple of domains (see the sketch after this list)
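A minimal asyncio sketch of the second idea, using hypothetical stand-ins (`load_current_state`, `try_backfill_from`, and a toy `get_domains_from_state`) rather than Synapse's real helpers: the expensive domain lookup is kicked off immediately, but we only wait for it once the cheap first attempts have failed.

```python
import asyncio

# Hypothetical stand-ins for the real Synapse helpers in synapse/handlers/federation.py.
async def load_current_state(room_id: str) -> dict:
    await asyncio.sleep(1)  # stands in for loading ~80k membership events
    return {("m.room.member", "@alice:remote.example"): "join"}

def get_domains_from_state(state: dict) -> list[str]:
    # State keys are (event_type, state_key); for memberships the state_key is a user ID.
    return sorted({user_id.split(":", 1)[1] for typ, user_id in state if typ == "m.room.member"})

async def try_backfill_from(domain: str) -> bool:
    return False  # stands in for an actual /backfill attempt against one server

async def likely_domains(room_id: str) -> list[str]:
    return get_domains_from_state(await load_current_state(room_id))

async def backfill(room_id: str, preferred_domains: list[str]) -> None:
    # Start the expensive domain calculation immediately, without blocking on it.
    domains_task = asyncio.create_task(likely_domains(room_id))
    for domain in preferred_domains:
        if await try_backfill_from(domain):
            domains_task.cancel()
            return
    # Only now do we need the full list; by this point it is usually ready.
    for domain in await domains_task:
        if await try_backfill_from(domain):
            return

asyncio.run(backfill("!room:example.org", ["matrix.org"]))
```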
Skip backfill

Skip backfill, or kick it off in the background, if it's not our first time and we have enough events.
We don't want to get stuck on the same unfetchable event over and over.
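A rough sketch of that heuristic, with hypothetical helper names and threshold (not Synapse's actual code): if we've tried this gap before and already have enough events to serve, run the backfill in the background instead of blocking the `/messages` response on it.

```python
import asyncio

async def do_backfill(room_id: str) -> None:
    await asyncio.sleep(0)  # stands in for the real backfill attempt

async def maybe_backfill(room_id: str, num_events_found: int, attempted_before: bool) -> None:
    ENOUGH_EVENTS = 10  # hypothetical threshold, not Synapse's actual limit
    if attempted_before and num_events_found >= ENOUGH_EVENTS:
        # Don't block the /messages response on a backfill that may be stuck
        # chasing the same unfetchable event; run it in the background instead.
        asyncio.create_task(do_backfill(room_id))
        return
    await do_backfill(room_id)
```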
Why is `/state_ids` slow to respond?

We can't control every bad network effect, but maybe Synapse is slow to assemble a `/state_ids` response 🤔 Need to investigate `FederationStateIdsServlet` (FederationStateIdsServlet - `/state_ids`, #13499).
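For context, a `/state_ids` response (per the Matrix server-server API) is just two flat lists of event IDs, but in a room the size of #matrixhq the state list alone runs to tens of thousands of entries, so both building and handling it are non-trivial. This is a shape-only sketch with made-up IDs.

```python
# Shape of GET /_matrix/federation/v1/state_ids/{roomId}?event_id={eventId}
# (illustrative only; real responses for a room like #matrixhq contain tens of thousands of IDs)
state_ids_response = {
    # IDs of every state event in the room at the requested event
    "pdu_ids": ["$event_id_1", "$event_id_2"],
    # IDs of the full auth chain for that state
    "auth_chain_ids": ["$auth_event_id_1", "$auth_event_id_2"],
}
```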
We should only care about `auth_event_ids`

We should only care about getting the `event_id` and `auth_event_ids` in `_get_state_ids_after_missing_prev_event(...)`. We shouldn't factor `state_event_ids` into whether …

Dev notes
Jaeger max duration spans: 213503982d 8h, see #13440 (comment)

Pull Request Checklist
- Pull request is based on the develop branch
- Pull request includes a changelog file. The entry should be a short description of the change which makes sense to users (e.g. not "Moved X method from `EventStore` to `EventWorkerStore`."), using markdown where necessary, mostly for code blocks.
- Pull request includes a sign off
- Code style is correct (run the linters)