Skip to content

feat(spans): add option to rollout process_segment task#119003

Open
lvthanh03 wants to merge 2 commits into
masterfrom
tony/rollout-task-segment
Open

feat(spans): add option to rollout process_segment task#119003
lvthanh03 wants to merge 2 commits into
masterfrom
tony/rollout-task-segment

Conversation

@lvthanh03

@lvthanh03 lvthanh03 commented Jul 3, 2026

Copy link
Copy Markdown
Member

Refs STREAM-1308

Adds a rollout path for the spans buffer flusher to enqueue flushed segments directly to the process_segment task.

When spans.buffer.process-segments-task-rollout-rate is enabled, segment bytes are sent to the new task instead of being produced to the buffered-segments topic.

@lvthanh03 lvthanh03 requested review from a team as code owners July 3, 2026 13:29
@github-actions github-actions Bot added the Scope: Backend Automatically applied to PRs that change backend components label Jul 3, 2026
flushed_segment.project_id,
kafka_payload,
len(message["spans"]),
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redis cleared before task enqueue

Medium Severity

When the rollout sends a segment via process_segment_task.apply_async, that call is not tied to the existing producer_futures / wait_produce path. The flusher still calls done_flush_segments for the whole batch right after, so Redis cleanup can run before task delivery is confirmed, unlike the Kafka produce path that waits on futures first.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4d899e7. Configure here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is legit. i wonder if we should change taskbroker-client so that apply_async returns a future, and we can track it

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is legit. Taskbroker client has wait_for_delivery, I think this would be a good use case?

@lvthanh03 lvthanh03 Jul 3, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/getsentry/taskbroker/blob/main/clients/python/src/taskbroker_client/registry.py#L207-L211

seems like with wait_for_delivery the taskbroker client catches delivery failure and only logs it, so we still move on to done_flush_segment even if delivery fails and not crash the spans buffer. This matches the current behaviour, the difference is that we also track invalid outcomes if the current produce fails.

We should find a way for apply_async to return the producer future, then flusher appends that future to the current producer_futures if we want to track negative outcomes, but with outcomes being tracked in EAP instead, I doubt this is a concern?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should return the future. with wait_for_delivery the problem is that we have many futures we want to await on in parallel, not block the flusher per-produce.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread src/sentry/spans/consumers/process/flusher.py

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 6579d20. Configure here.

flushed_segment.project_id,
kafka_payload,
len(message["spans"]),
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Task enqueue errors crash flusher

High Severity

When the rollout path enqueues via process_segment_task.apply_async, delivery or broker errors propagate uncaught out of the flush iteration and terminate the flusher process. Kafka handoffs catch produce failures in wait_produce, record outcomes, and still run cleanup, so the two paths behave inconsistently and a mixed batch can leave in-flight Kafka produces unawaited.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 6579d20. Configure here.

@linear-code

linear-code Bot commented Jul 3, 2026

Copy link
Copy Markdown

STREAM-1308

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Scope: Backend Automatically applied to PRs that change backend components

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants