feat(eap): gdpr export endpoint #7586
| literal(page_token.last_seen_timestamp), | ||
| ), | ||
| f.greater( | ||
| f.reinterpretAsUInt128(f.reverse(f.unhex(column("item_id")))), |
took this from get trace pagination
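For readers unfamiliar with the expression in the diff, here is a minimal Python sketch of what `reinterpretAsUInt128(reverse(unhex(item_id)))` appears to compute, plus a simplified two-column cursor comparison. It assumes `item_id` is a 32-character hex string and that ClickHouse's `reinterpretAs*` functions read bytes in little-endian order. The helper names `item_id_to_uint128` and `is_after_cursor` are hypothetical and not part of this PR, and the real filter may compare more columns than shown here.

```python
def item_id_to_uint128(item_id_hex: str) -> int:
    """Mimic reinterpretAsUInt128(reverse(unhex(item_id))) in plain Python."""
    raw = bytes.fromhex(item_id_hex)  # unhex: hex string -> raw bytes
    reversed_raw = raw[::-1]  # reverse: flip the byte order
    # Assumption: reinterpretAsUInt128 reads the bytes as little-endian.
    return int.from_bytes(reversed_raw, "little")


def is_after_cursor(
    row_timestamp: int,
    row_item_id_hex: str,
    last_seen_timestamp: int,
    last_seen_item_id_hex: str,
) -> bool:
    """Keyset pagination check: does the row sort strictly after the cursor?"""
    row_key = (row_timestamp, item_id_to_uint128(row_item_id_hex))
    cursor_key = (last_seen_timestamp, item_id_to_uint128(last_seen_item_id_hex))
    # Tuple comparison: timestamp first, item id as the tie-breaker.
    return row_key > cursor_key
```

Under that assumption, reversing the bytes before a little-endian read yields the same number as parsing the hex string directly, so item ids compare in their natural numeric order.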
Pull request overview
This PR implements a GDPR export endpoint for EAP (Event Analytics Platform) and fixes an issue where the query pipeline was transforming columns in the ORDER BY clause, which was defeating ClickHouse optimizations and making pagination ineffective.
Key Changes:
- Added new `EndpointExportTraceItems` RPC endpoint with pagination support for exporting trace items
- Introduced `skip_transform_order_by` flag in query settings to preserve ORDER BY clause column names for optimal ClickHouse performance
- Refactored common array processing functions (`process_arrays`, `transform_array_value`) and `BUCKET_COUNT` constant to shared utilities
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| uv.lock | Updated sentry-protos dependency from 0.4.8 to 0.4.9 to support new export endpoint protobuf definitions |
| pyproject.toml | Updated sentry-protos version requirement to match lock file |
| tests/web/rpc/v1/test_utils.py | Added BASE_TIME constant definition and timezone import for test consistency |
| tests/web/rpc/v1/test_endpoint_export_trace_items.py | Added comprehensive tests for the new export endpoint including pagination and order by transformation validation |
| snuba/web/rpc/v1/endpoint_export_trace_items.py | Implemented new GDPR export endpoint with cursor-based pagination and proper attribute handling |
| snuba/web/rpc/common/common.py | Extracted shared array processing utilities and BUCKET_COUNT constant for reuse across endpoints |
| snuba/web/rpc/v1/endpoint_trace_item_details.py | Updated to use shared BUCKET_COUNT constant from common module |
| snuba/web/rpc/v1/endpoint_get_trace.py | Refactored to use shared process_arrays function from common module |
| snuba/query/query_settings.py | Added skip_transform_order_by setting to control ORDER BY clause transformation |
| snuba/query/__init__.py | Added skip_transform_order_by parameter to transform_expressions method |
| snuba/query/processors/physical/type_converters.py | Updated to respect skip_transform_order_by setting during query processing |
| snuba/pipeline/stages/query_processing.py | Modified storage processing stage to pass skip_transform_order_by setting through transformation pipeline |
| assert response.page_token.end_pagination == False | ||
| else: | ||
| assert response.page_token.end_pagination == True |
Copilot
AI
Dec 29, 2025
Use `is False` instead of `== False` for boolean comparisons. This is a Python best practice that avoids potential issues with truthy/falsy values and is more explicit about comparing with the boolean singleton.
| assert response.page_token.end_pagination == False | |
| else: | |
| assert response.page_token.end_pagination == True | |
| assert response.page_token.end_pagination is False | |
| else: | |
| assert response.page_token.end_pagination is True |
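To make the difference concrete, a small sketch: `==` treats falsy numbers such as `0` as equal to `False`, while `is` only matches the `False` singleton itself. The two helper functions are hypothetical and exist only for illustration.

```python
def passes_equality_check(value: object) -> bool:
    # Equality: any value that compares equal to False passes (e.g. 0, 0.0).
    return value == False  # noqa: E712


def passes_identity_check(value: object) -> bool:
    # Identity: only the False singleton itself passes.
    return value is False


# A real boolean passes both checks; an integer 0 passes only the loose one.
results = {
    "False": (passes_equality_check(False), passes_identity_check(False)),
    "0": (passes_equality_check(0), passes_identity_check(0)),
}
```

In a test, the identity form therefore also asserts that the field is an actual `bool` rather than some other falsy value.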
| ts = row.pop("timestamp") | ||
| client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0)) | ||
| server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0)) | ||
| sampling_factor = row.pop("sampling_factor", 1.0) # noqa: F841 |
Copilot
AI
Dec 29, 2025
Variable `sampling_factor` is not used.
| sampling_factor = row.pop("sampling_factor", 1.0) # noqa: F841 | |
| row.pop("sampling_factor", None) |
| integers = row.pop("attributes_int", {}) or {} | ||
| floats = row.pop("attributes_float", {}) or {} | ||
| breakpoint() |
This comment was marked as outdated.
| client_sample_rate = float(1.0 / row.pop("client_sample_rate", 1.0)) | ||
| server_sample_rate = float(1.0 / row.pop("server_sample_rate", 1.0)) |
This comment was marked as outdated.
| item_id = row.pop("id") | ||
| item_type = row.pop("item_type") | ||
| ts = row.pop("timestamp") | ||
| client_sample_rate = float(1.0 / row.pop("client_sample_rate", 1.0)) |
change this
| *[column(f"attributes_float_{n}") for n in range(BUCKET_COUNT)], | ||
| alias="attributes_float", | ||
| ), | ||
| ), | ||
| SelectedExpression("attributes_int", column("attributes_int", alias="attributes_int")), | ||
| SelectedExpression("attributes_bool", column("attributes_bool", alias="attributes_bool")), | ||
| SelectedExpression( | ||
| "attributes_array", | ||
| FunctionCall("attributes_array", "toJSONString", (column("attributes_array"),)), | ||
| ), | ||
| ] |
This comment was marked as outdated.
https://linear.app/getsentry/issue/EAP-320/data-export-endpoint
While working on this PR, we also found that the query pipeline transforms the columns in the ORDER BY clause. This is unintended, at least for EAP, because it defeats ClickHouse's sort-key optimizations and makes pagination ineffective. My fix is a flag that tells the query pipeline whether or not to transform the ORDER BY clause. This is the safest, simplest, and fastest solution that is also clean (the query pipeline is used by other parts of Snuba that I'm unfamiliar with).
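The idea behind the flag can be sketched with a toy model. This is not Snuba's actual `Query` class or its `transform_expressions` implementation: `ToyQuery`, the string-based columns, and the `wrap_item_id` transformation are all hypothetical stand-ins chosen to show why an untouched ORDER BY matters.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ToyQuery:
    # Columns are plain strings here; the real pipeline uses expression objects.
    select: List[str]
    order_by: List[str]


def transform_expressions(
    query: ToyQuery,
    func: Callable[[str], str],
    skip_transform_order_by: bool = False,
) -> ToyQuery:
    """Apply func to every expression, optionally leaving ORDER BY untouched."""
    new_select = [func(col) for col in query.select]
    if skip_transform_order_by:
        # Keep the raw column names so they still match the table's sort key.
        new_order_by = list(query.order_by)
    else:
        new_order_by = [func(col) for col in query.order_by]
    return ToyQuery(select=new_select, order_by=new_order_by)


def wrap_item_id(col: str) -> str:
    # Hypothetical type-converter-style rewrite that wraps one column.
    return f"lower(hex({col}))" if col == "item_id" else col


query = ToyQuery(select=["timestamp", "item_id"], order_by=["timestamp", "item_id"])
default_result = transform_expressions(query, wrap_item_id)
preserved_result = transform_expressions(query, wrap_item_id, skip_transform_order_by=True)
```

With the flag off, the ORDER BY ends up sorting on a wrapped expression instead of the raw column; with it on, the SELECT is still transformed but the ORDER BY keeps the sort-key columns as written.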