[OPIK-5270] [BE] fix: add trace_id_prefilter to scope trace query CTEs #5928

Merged
andrescrz merged 2 commits into main from andrescrz/OPIK-5270-fix-trace-stream-query-perf on Mar 27, 2026

@andrescrz andrescrz commented Mar 27, 2026

Details

Adds a conditional trace_id_prefilter CTE to the SELECT_BY_PROJECT_ID trace query, matching the span_id_prefilter pattern from PRs #5625 and #5599. When trace-level filters (such as tags) or search text are active, the prefilter narrows all enrichment CTEs (feedback scores, spans, guardrails, comments, annotations, experiments) so they process only data for matching traces instead of scanning the entire project.
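A minimal sketch of the scoping switch described above, assuming the choice is made per enrichment CTE when the SQL is rendered. The class and method names and the boolean parameters are hypothetical; the PR itself does this inside the query template with if/else blocks, not in Java string code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: builds the WHERE fragment that scopes one enrichment CTE. */
final class CteScopeSketch {

    /**
     * With the prefilter active, rows are limited to the matching trace ids.
     * Otherwise the CTE falls back to the UUID range bounds, if any are set.
     * The column name is a parameter because it differs per table (assumption).
     */
    static String scopePredicate(String traceIdColumn, boolean usePrefilter,
                                 boolean hasUuidFrom, boolean hasUuidTo) {
        if (usePrefilter) {
            return "AND " + traceIdColumn + " IN (SELECT id FROM trace_id_prefilter)";
        }
        List<String> bounds = new ArrayList<>();
        if (hasUuidFrom) {
            bounds.add("AND " + traceIdColumn + " >= :uuid_from_time");
        }
        if (hasUuidTo) {
            bounds.add("AND " + traceIdColumn + " <= :uuid_to_time");
        }
        // An empty string means the CTE stays unscoped, as before the change.
        return String.join(" ", bounds);
    }
}
```

Calling `CteScopeSketch.scopePredicate("trace_id", true, true, true)` yields only the `IN (SELECT id FROM trace_id_prefilter)` predicate, since the prefilter already applies the range bounds itself.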

This prevents OOM kills caused by the arrayMap in feedback_scores_final attempting to allocate multi-GiB chunks when processing unscoped feedback scores for large projects.

  • Prefilter CTE: SELECT DISTINCT id FROM traces WHERE <filters>
  • 9 CTEs scoped via IN (SELECT id FROM trace_id_prefilter) with if/else fallback to uuid-range
  • Decision logic in shouldUseTraceIdPrefilter(): activates when narrowing filters exist, guards against feedback score filters and sort-by-feedback-scores
  • Both findTraceStream (streaming) and getTracesByProjectId (paginated) paths covered
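The decision logic in the bullets above can be sketched as follows. This is a simplified illustration, not the PR's code: the real shouldUseTraceIdPrefilter takes the criteria and the template, while the `Criteria` record and its boolean fields here are hypothetical stand-ins.

```java
/** Illustrative only: a flattened version of the prefilter decision. */
final class PrefilterDecisionSketch {

    /** Hypothetical summary of which parts of the query criteria are set. */
    record Criteria(boolean hasTraceFilters, boolean hasSearchText,
                    boolean hasFeedbackScoreFilters) {}

    /** Base decision: some narrowing filter is present and no guard applies. */
    static boolean shouldUseTraceIdPrefilter(Criteria c) {
        boolean hasNarrowingFilters = c.hasTraceFilters() || c.hasSearchText();
        // Feedback score filters need the unscoped CTEs, so they disable the prefilter.
        return hasNarrowingFilters && !c.hasFeedbackScoreFilters();
    }

    /** Paginated path: sorting by feedback scores also disables the prefilter. */
    static boolean usePrefilterForPaginated(Criteria c, boolean sortHasFeedbackScores) {
        return shouldUseTraceIdPrefilter(c) && !sortHasFeedbackScores;
    }
}
```

The streaming path would call only the base decision, because it has no sort to inspect.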

Performance (measured on affected customer instance)

| Metric | Before (killed) | After (prefilter) |
|---|---|---|
| Rows read | 34,995,469 | 273,092 |
| Bytes read | 6.26 GiB | 33.21 MiB |
| Duration | killed / 11s+ | 4.8s |
| Memory | 21+ GiB (OOM at 13.97 GiB limit) | 12.35 GiB |
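To put the rows-read and bytes-read figures on a common scale, the reduction factors can be derived from the measurements above. The helper below is a hypothetical convenience, not part of the PR; it only converts GiB to MiB and divides.

```java
/** Illustrative only: derives reduction factors from the measured figures. */
final class PrefilterGainSketch {

    /** Converts GiB to MiB so both byte figures share one unit. */
    static double gibToMib(double gib) {
        return gib * 1024.0;
    }

    /** How many times smaller the "after" figure is than the "before" figure. */
    static double reductionFactor(double before, double after) {
        if (after <= 0) {
            throw new IllegalArgumentException("after must be positive");
        }
        return before / after;
    }
}
```

With the figures from the table, `reductionFactor(34_995_469, 273_092)` comes out at roughly 128 and `reductionFactor(gibToMib(6.26), 33.21)` at roughly 193.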

The remaining 12 GiB memory is inherent to the project's data volume (confirmed by a real application query without prefilter hitting 13.01 GiB and being killed with the same arrayMap OOM). The prefilter eliminates the 4-8 GiB arrayMap chunk allocation that pushes memory past the limit.

Without narrowing filters active, the query is unchanged (zero overhead).

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-5270

AI-WATERMARK

AI-WATERMARK: yes

  • If yes:
    • Tools: Claude Code
    • Model(s): Claude Opus 4.6
    • Scope: Implementation, analysis, and query optimization
    • Human verification: Tested directly against customer ClickHouse instance with before/after measurements

Testing

  • mvn compile passes
  • Manually tested the rendered SQL query directly against the affected ClickHouse instance:
    • Ran original killed query and prefiltered variant side by side
    • Verified identical result sets (6 rows)
    • Measured rows read, bytes read, memory, and duration via system.query_log
    • Confirmed the query no longer exceeds the 13.97 GiB memory limit
    • Verified the prefilter does not activate when no narrowing filters are present (no regression for default page loads)

Documentation

N/A — internal query optimization, no user-facing API changes.

🤖 Generated with Claude Code

@andrescrz andrescrz requested a review from a team as a code owner March 27, 2026 17:13
@github-actions github-actions bot added the java and Backend labels Mar 27, 2026

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 7

1 238 tests   1 238 ✅  6m 23s ⏱️
   13 suites      0 💤
   13 files        0 ❌

Results for commit 0937597.

♻️ This comment has been updated with latest results.

github-actions bot commented Mar 27, 2026

Backend Tests - Integration Group 4

1 485 tests   1 485 ✅  9m 27s ⏱️
    8 suites      0 💤
    8 files        0 ❌

Results for commit 0937597.

♻️ This comment has been updated with latest results.

Comment on lines +855 to +864
WITH <if(trace_id_prefilter)>trace_id_prefilter AS (
SELECT DISTINCT id
FROM traces
WHERE workspace_id = :workspace_id
AND project_id = :project_id
<if(last_received_id)> AND id \\< :last_received_id <endif>
<if(uuid_from_time)> AND id >= :uuid_from_time <endif>
<if(uuid_to_time)> AND id \\<= :uuid_to_time <endif>
<if(filters)> AND <filters> <endif>
<if(search_text)> AND <search_text> <endif>

Here we need either the ORDER BY ... LIMIT 1 BY id dedup or FINAL; otherwise the filters and search may match stale row versions.

@thiagohora as we only need the ids here, DISTINCT should be enough.

@thiagohora thiagohora Mar 27, 2026

But without the final sorting/dedup, the filters and search_text could match older versions, no?

Member Author

Good catch — I analyzed this thoroughly. The prefilter uses SELECT DISTINCT id without dedup (no FINAL, no LIMIT 1 BY). This means old trace versions from non-merged parts could match the filter ("phantom" IDs). Here's why that's safe:

Every phantom trace ID is neutralized downstream:

  1. Final SELECT LEFT JOINs: All enrichment CTEs (feedback_scores, spans, comments, guardrails, experiments) are LEFT JOINed with traces_final, which comes from traces_deduped. traces_deduped applies the same <filters> with proper ORDER BY ... LIMIT 1 BY id dedup. Phantom traces never appear in traces_final, so their enrichment data is discarded by the JOINs.

  2. CTE-dependent filters in traces_deduped: When feedback_scores_filters, feedback_scores_empty_filters, span_feedback_scores_filters, span_feedback_scores_empty_filters, or guardrails_filters are active, shouldUseTraceIdPrefilter disables the prefilter entirely — so these paths never see phantom data.

  3. Unguarded filters (trace_aggregation_filters, annotation_queue_filters): Even if the span or annotation data of a phantom trace (call it T1) passes these checks in traces_deduped, T1 still fails the <filters> condition (applied with LIMIT 1 BY id) on the traces table itself. The conditions are ANDed, so a phantom cannot survive.

  4. Never under-inclusive: If the latest trace version matches the filter, that row exists in the table and DISTINCT will find it. The prefilter is always a superset, never misses real matches.
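The superset argument in the list above can be checked with a small in-memory model. This is a sketch under stated assumptions: `TraceRow`, its `version` field, and the tag-based filter are hypothetical stand-ins for trace row versions in unmerged parts and for the `<filters>` condition.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Illustrative only: models the prefilter without dedup versus the deduped path. */
final class PhantomIdSketch {

    /** One stored version of a trace; a higher version is more recent. */
    record TraceRow(String id, int version, Set<String> tags) {}

    /** Mirrors SELECT DISTINCT id ... WHERE filter, applied to every stored version. */
    static Set<String> prefilter(List<TraceRow> rows, String tag) {
        return rows.stream()
                .filter(r -> r.tags().contains(tag))
                .map(TraceRow::id)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    /** Mirrors the deduped path: keep the latest version per id, then filter. */
    static Set<String> dedupedMatches(List<TraceRow> rows, String tag) {
        Map<String, TraceRow> latest = new HashMap<>();
        for (TraceRow r : rows) {
            latest.merge(r.id(), r, (a, b) -> a.version() >= b.version() ? a : b);
        }
        return latest.values().stream()
                .filter(r -> r.tags().contains(tag))
                .map(TraceRow::id)
                .collect(Collectors.toCollection(TreeSet::new));
    }
}
```

If trace T1 had the tag in an old version but lost it in the latest one, `prefilter` still returns T1 (a phantom) while `dedupedMatches` does not. Every id returned by `dedupedMatches` is also returned by `prefilter`, which is the "never under-inclusive" property.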

Cost of adding dedup: FINAL or ORDER BY + LIMIT 1 BY on every evaluation across 9 CTE references. Since the prefilter is purely a scoping optimization (not the authoritative filter), this cost has no correctness benefit.

🤖 Reply posted via /address-github-pr-comments

The guardrails filter injects gagg.guardrails_result into the
<filters> template variable, referencing the guardrails_agg CTE
alias. Since trace_id_prefilter only queries FROM traces, this
reference fails with UNKNOWN_IDENTIFIER.

Guard against guardrails_filters in shouldUseTraceIdPrefilter to
disable the prefilter when guardrails filters are active. Renamed
the guard variable to hasCteDependentFilters for clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
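A minimal sketch of the hasCteDependentFilters guard described in this commit message. The attribute names are the ones mentioned in this conversation; representing the active template attributes as a `Set<String>` is an assumption made for illustration and may not match how the DAO inspects the template.

```java
import java.util.Set;

/** Illustrative only: detects filters that reference CTEs other than traces. */
final class CteDependentFilterSketch {

    /** Filters whose SQL refers to enrichment CTE aliases rather than the traces table. */
    static final Set<String> CTE_DEPENDENT = Set.of(
            "feedback_scores_filters",
            "feedback_scores_empty_filters",
            "span_feedback_scores_filters",
            "span_feedback_scores_empty_filters",
            "guardrails_filters");

    /** True when any active attribute would be unresolvable inside the prefilter. */
    static boolean hasCteDependentFilters(Set<String> activeTemplateAttributes) {
        return activeTemplateAttributes.stream().anyMatch(CTE_DEPENDENT::contains);
    }
}
```

When this returns true the prefilter would be skipped, so a fragment such as gagg.guardrails_result is never rendered into a CTE that only selects FROM traces.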
var template = newTraceThreadFindTemplate(SELECT_BY_PROJECT_ID, criteria, TRACE_SEARCH_CLAUSE);
template.add("log_comment", logComment);

if (shouldUseTraceIdPrefilter(criteria, template)) {

I'm assuming we don't use && !sortHasFeedbackScores here like we do in the other shouldUseTraceIdPrefilter usage as findTraceStream has no sorting?

Member Author

Correct — findTraceStream has no sorting, so there's no orderBySql to check. The !sortHasFeedbackScores guard only applies in the paginated path (getTracesByProjectId) where sorting by feedback scores requires the full unscoped feedback_scores CTE for the sort JOIN.

🤖 Reply posted via /address-github-pr-comments

@ldaugusto ldaugusto left a comment

Great improvement!

@andrescrz andrescrz merged commit 3b5bee4 into main Mar 27, 2026
76 checks passed
@andrescrz andrescrz deleted the andrescrz/OPIK-5270-fix-trace-stream-query-perf branch March 27, 2026 17:52
andrescrz added a commit that referenced this pull request Mar 27, 2026
#5928)

* [OPIK-5270] [BE] fix: scope trace stream query CTEs to matching traces

* fix(trace-dao): guard prefilter against guardrails_filters

---------