@xurui-c commented Dec 10, 2025

https://linear.app/getsentry/issue/EAP-320/data-export-endpoint

While working on this PR, we also found that the query pipeline transforms the columns in the ORDER BY clause. For EAP, at least, this is unintended: it defeats the ClickHouse (CH) sort-key optimizations and makes pagination ineffective. My fix is to pass a flag that tells the query pipeline whether to apply the transformation to the ORDER BY clause. I chose this as the safest, simplest, and fastest solution that is still clean, since the query pipeline is also used by other parts of Snuba that I'm unfamiliar with.

  • I will fix the get trace endpoint in a follow-up PR, in the interest of keeping this PR small.
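
To illustrate the flag described above, here is a minimal sketch, not Snuba's actual implementation. The names `transform_expressions` and `skip_transform_order_by` come from this PR; `ToyQuery` and `wrap_timestamp` are hypothetical stand-ins invented for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class ToyQuery:
    """Hypothetical stand-in for a query: expressions are plain strings."""
    selected: List[str]
    order_by: List[str]


def transform_expressions(
    query: ToyQuery,
    func: Callable[[str], str],
    skip_transform_order_by: bool = False,
) -> ToyQuery:
    """Apply func to the selected expressions, and to ORDER BY unless skipped."""
    new_selected = [func(expr) for expr in query.selected]
    if skip_transform_order_by:
        # Leave the ORDER BY columns untouched so they still match the sort key.
        new_order_by = list(query.order_by)
    else:
        new_order_by = [func(expr) for expr in query.order_by]
    return replace(query, selected=new_selected, order_by=new_order_by)


def wrap_timestamp(expr: str) -> str:
    """Hypothetical rewrite, standing in for a type-converter processor."""
    return f"toDateTime({expr})" if expr == "timestamp" else expr


query = ToyQuery(selected=["timestamp", "item_id"], order_by=["timestamp", "item_id"])
default = transform_expressions(query, wrap_timestamp)
preserved = transform_expressions(query, wrap_timestamp, skip_transform_order_by=True)
```

With the flag set, `preserved.order_by` keeps the raw column names while the selected expressions are still rewritten, which is the behavior the description asks for.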

@xurui-c changed the title from "Rachel/gdpr" to "feat(eap): gdpr export endpoint" Dec 10, 2025
literal(page_token.last_seen_timestamp),
),
f.greater(
f.reinterpretAsUInt128(f.reverse(f.unhex(column("item_id")))),
Member Author

took this from get trace pagination
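
As a rough illustration of what the `reinterpretAsUInt128(reverse(unhex(...)))` expression above computes, here is a Python sketch. It assumes that `item_id` is a 32-character hex string and that ClickHouse's `reinterpretAsUInt128` reads its input bytes as little-endian; `item_id_sort_key` is a hypothetical helper name, not something from the PR.

```python
def item_id_sort_key(hex_item_id: str) -> int:
    """Mirror reinterpretAsUInt128(reverse(unhex(item_id))) under the stated assumptions."""
    raw = bytes.fromhex(hex_item_id)  # unhex: hex string -> raw bytes
    reversed_raw = raw[::-1]          # reverse: flip the byte order
    # reinterpretAsUInt128: read the bytes as a little-endian unsigned integer (assumed)
    return int.from_bytes(reversed_raw, "little")


smaller = "000000000000000000000000000000ff"
larger = "00000000000000000000000000000100"

# Reversing and then reading little-endian equals reading the hex as one big number.
assert item_id_sort_key(smaller) == int(smaller, 16)
assert item_id_sort_key(smaller) < item_id_sort_key(larger)
```

Under those assumptions, comparing the converted values orders item IDs numerically, which is what a `greater` comparison against the last seen item ID needs.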

c
almost done

get rid of some stuff

clean

item id

item id

page token done?

pagination

cleanup

cleanup

pagination

strings

remove comment

index

optimize

idk

smth up with item_id

fixed

cleanup

c
@xurui-c marked this pull request as ready for review December 22, 2025 21:28
@xurui-c requested review from a team as code owners December 22, 2025 21:28
Copilot AI review requested due to automatic review settings December 29, 2025 20:31
Copilot AI left a comment

Pull request overview

This PR implements a GDPR export endpoint for EAP (Event Analytics Platform) and fixes an issue where the query pipeline was transforming columns in the ORDER BY clause, which was defeating ClickHouse optimizations and making pagination ineffective.

Key Changes:

  • Added new EndpointExportTraceItems RPC endpoint with pagination support for exporting trace items
  • Introduced skip_transform_order_by flag in query settings to preserve ORDER BY clause column names for optimal ClickHouse performance
  • Refactored common array processing functions (process_arrays, transform_array_value) and BUCKET_COUNT constant to shared utilities
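
The cursor-based pagination mentioned above can be sketched over in-memory rows as follows. This is an assumption-laden toy, not the endpoint's code: the field names `last_seen_timestamp` and `end_pagination` appear in this PR, while `last_seen_item_id`, `PageToken` as a dataclass, `fetch_page`, `export_all`, and the rule "a short page ends pagination" are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Row = Tuple[int, int]  # (timestamp, item_id), both as integers for simplicity


@dataclass
class PageToken:
    last_seen_timestamp: int = 0
    last_seen_item_id: int = 0
    end_pagination: bool = False


def fetch_page(
    rows: List[Row], limit: int, token: Optional[PageToken] = None
) -> Tuple[List[Row], PageToken]:
    """Return the next page ordered by (timestamp, item_id), plus a new token."""
    ordered = sorted(rows)
    if token is not None:
        cursor = (token.last_seen_timestamp, token.last_seen_item_id)
        # Tuple comparison is lexicographic: a later timestamp, or the same
        # timestamp with a greater item_id.
        ordered = [row for row in ordered if row > cursor]
    page = ordered[:limit]
    if len(page) < limit:
        # Assumed rule: a short (or empty) page means nothing is left.
        return page, PageToken(end_pagination=True)
    last_timestamp, last_item_id = page[-1]
    return page, PageToken(last_timestamp, last_item_id, end_pagination=False)


def export_all(rows: List[Row], limit: int) -> List[Row]:
    """Follow page tokens until end_pagination is set."""
    collected: List[Row] = []
    token: Optional[PageToken] = None
    while True:
        page, token = fetch_page(rows, limit, token)
        collected.extend(page)
        if token.end_pagination:
            return collected


rows = [(2, 7), (1, 11), (3, 1), (1, 10), (2, 5)]
exported = export_all(rows, limit=2)
```

Because each page filters on the last seen `(timestamp, item_id)` pair instead of an offset, every row is returned exactly once as long as the ordering is stable, which is why the ORDER BY columns need to stay aligned with the sort key.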

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| `uv.lock` | Updated sentry-protos dependency from 0.4.8 to 0.4.9 to support new export endpoint protobuf definitions |
| `pyproject.toml` | Updated sentry-protos version requirement to match lock file |
| `tests/web/rpc/v1/test_utils.py` | Added BASE_TIME constant definition and timezone import for test consistency |
| `tests/web/rpc/v1/test_endpoint_export_trace_items.py` | Added comprehensive tests for the new export endpoint including pagination and order by transformation validation |
| `snuba/web/rpc/v1/endpoint_export_trace_items.py` | Implemented new GDPR export endpoint with cursor-based pagination and proper attribute handling |
| `snuba/web/rpc/common/common.py` | Extracted shared array processing utilities and BUCKET_COUNT constant for reuse across endpoints |
| `snuba/web/rpc/v1/endpoint_trace_item_details.py` | Updated to use shared BUCKET_COUNT constant from common module |
| `snuba/web/rpc/v1/endpoint_get_trace.py` | Refactored to use shared process_arrays function from common module |
| `snuba/query/query_settings.py` | Added skip_transform_order_by setting to control ORDER BY clause transformation |
| `snuba/query/__init__.py` | Added skip_transform_order_by parameter to transform_expressions method |
| `snuba/query/processors/physical/type_converters.py` | Updated to respect skip_transform_order_by setting during query processing |
| `snuba/pipeline/stages/query_processing.py` | Modified storage processing stage to pass skip_transform_order_by setting through transformation pipeline |

Comment on lines +122 to +124
    assert response.page_token.end_pagination == False
else:
    assert response.page_token.end_pagination == True
Copilot AI Dec 29, 2025

Use `is False` instead of `== False` when asserting on a boolean. `== False` also passes for falsy values that compare equal to `False`, such as `0`, whereas `is False` passes only for the boolean singleton itself, so the assertion states its intent more explicitly.

Suggested change
-     assert response.page_token.end_pagination == False
- else:
-     assert response.page_token.end_pagination == True
+     assert response.page_token.end_pagination is False
+ else:
+     assert response.page_token.end_pagination is True
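
A small sketch of the difference between the two comparisons follows. The helper names `loosely_false` and `strictly_false` are hypothetical and exist only to make the contrast testable.

```python
def loosely_false(value: object) -> bool:
    """Equality check: true for anything that compares equal to False."""
    return value == False  # noqa: E712 (deliberate, to show the loose behavior)


def strictly_false(value: object) -> bool:
    """Identity check: true only for the False singleton itself."""
    return value is False


zero = 0

# The integer 0 compares equal to False, so the loose check accepts it.
assert loosely_false(zero)
# The identity check rejects it, because 0 is not the bool object False.
assert not strictly_false(zero)
# A real boolean passes both checks.
assert loosely_false(False) and strictly_false(False)
```

In a test, the identity form would therefore also catch a field that unexpectedly holds `0` rather than a real boolean.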

ts = row.pop("timestamp")
client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
sampling_factor = row.pop("sampling_factor", 1.0) # noqa: F841
Copilot AI Dec 29, 2025

Variable sampling_factor is not used.

Suggested change
- sampling_factor = row.pop("sampling_factor", 1.0) # noqa: F841
+ row.pop("sampling_factor", None)

integers = row.pop("attributes_int", {}) or {}
floats = row.pop("attributes_float", {}) or {}

breakpoint()

This comment was marked as outdated.

Comment on lines 327 to 328
client_sample_rate = float(1.0 / row.pop("client_sample_rate", 1.0))
server_sample_rate = float(1.0 / row.pop("server_sample_rate", 1.0))

This comment was marked as outdated.

item_id = row.pop("id")
item_type = row.pop("item_type")
ts = row.pop("timestamp")
client_sample_rate = float(1.0 / row.pop("client_sample_rate", 1.0))
Member Author

change this

Rachel Chen added 2 commits January 5, 2026 11:37
Comment on lines +186 to +196
*[column(f"attributes_float_{n}") for n in range(BUCKET_COUNT)],
alias="attributes_float",
),
),
SelectedExpression("attributes_int", column("attributes_int", alias="attributes_int")),
SelectedExpression("attributes_bool", column("attributes_bool", alias="attributes_bool")),
SelectedExpression(
"attributes_array",
FunctionCall("attributes_array", "toJSONString", (column("attributes_array"),)),
),
]

This comment was marked as outdated.

Rachel Chen added 2 commits January 5, 2026 11:53
@xurui-c merged commit 9f6c68b into master Jan 5, 2026
34 checks passed
@xurui-c deleted the rachel/gdpr branch January 5, 2026 21:39