Conversation
apps/opik-backend/src/main/java/com/comet/opik/domain/TraceDAO.java
Outdated
Show resolved
Hide resolved
Backend Tests - Integration Group 71 238 tests 1 238 ✅ 6m 23s ⏱️ Results for commit 0937597. ♻️ This comment has been updated with latest results. |
Backend Tests - Integration Group 41 485 tests 1 485 ✅ 9m 27s ⏱️ Results for commit 0937597. ♻️ This comment has been updated with latest results. |
| WITH <if(trace_id_prefilter)>trace_id_prefilter AS ( | ||
| SELECT DISTINCT id | ||
| FROM traces | ||
| WHERE workspace_id = :workspace_id | ||
| AND project_id = :project_id | ||
| <if(last_received_id)> AND id \\< :last_received_id <endif> | ||
| <if(uuid_from_time)> AND id >= :uuid_from_time <endif> | ||
| <if(uuid_to_time)> AND id \\<= :uuid_to_time <endif> | ||
| <if(filters)> AND <filters> <endif> | ||
| <if(search_text)> AND <search_text> <endif> |
There was a problem hiding this comment.
Here, we need either the order-by to dedup or final; otherwise, the search may match the wrong rows.
There was a problem hiding this comment.
@thiagohora as we only need the ids here, DISTINCT should be enough.
There was a problem hiding this comment.
But without the final sorting/dedup, the filters and search_text could match older versions. no?
There was a problem hiding this comment.
Good catch — I analyzed this thoroughly. The prefilter uses SELECT DISTINCT id without dedup (no FINAL, no LIMIT 1 BY). This means old trace versions from non-merged parts could match the filter ("phantom" IDs). Here's why that's safe:
Every phantom trace ID is neutralized downstream:
-
Final SELECT LEFT JOINs: All enrichment CTEs (feedback_scores, spans, comments, guardrails, experiments) are LEFT JOINed with
traces_final, which comes fromtraces_deduped.traces_dedupedapplies the same<filters>with properORDER BY ... LIMIT 1 BY iddedup. Phantom traces never appear intraces_final, so their enrichment data is discarded by the JOINs. -
CTE-dependent filters in
traces_deduped: Whenfeedback_scores_filters,feedback_scores_empty_filters,span_feedback_scores_filters,span_feedback_scores_empty_filters, orguardrails_filtersare active,shouldUseTraceIdPrefilterdisables the prefilter entirely — so these paths never see phantom data. -
Unguarded filters (
trace_aggregation_filters,annotation_queue_filters): Even if phantom T1's span/annotation data passes these checks intraces_deduped, T1 still fails the<filters>condition (applied withLIMIT 1 BY id) on the traces table itself. Multiple conditions are ANDed — phantom can't survive. -
Never under-inclusive: If the latest trace version matches the filter, that row exists in the table and
DISTINCTwill find it. The prefilter is always a superset, never misses real matches.
Cost of adding dedup: FINAL or ORDER BY + LIMIT 1 BY on every evaluation across 9 CTE references. Since the prefilter is purely a scoping optimization (not the authoritative filter), this cost has no correctness benefit.
🤖 Reply posted via /address-github-pr-comments
The guardrails filter injects gagg.guardrails_result into the <filters> template variable, referencing the guardrails_agg CTE alias. Since trace_id_prefilter only queries FROM traces, this reference fails with UNKNOWN_IDENTIFIER. Guard against guardrails_filters in shouldUseTraceIdPrefilter to disable the prefilter when guardrails filters are active. Renamed the guard variable to hasCteDependentFilters for clarity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| var template = newTraceThreadFindTemplate(SELECT_BY_PROJECT_ID, criteria, TRACE_SEARCH_CLAUSE); | ||
| template.add("log_comment", logComment); | ||
|
|
||
| if (shouldUseTraceIdPrefilter(criteria, template)) { |
There was a problem hiding this comment.
I'm assuming we don't use && !sortHasFeedbackScores here like we do in the other shouldUseTraceIdPrefilter usage as findTraceStream has no sorting?
There was a problem hiding this comment.
Correct — findTraceStream has no sorting, so there's no orderBySql to check. The !sortHasFeedbackScores guard only applies in the paginated path (getTracesByProjectId) where sorting by feedback scores requires the full unscoped feedback_scores CTE for the sort JOIN.
🤖 Reply posted via /address-github-pr-comments
| WITH <if(trace_id_prefilter)>trace_id_prefilter AS ( | ||
| SELECT DISTINCT id | ||
| FROM traces | ||
| WHERE workspace_id = :workspace_id | ||
| AND project_id = :project_id | ||
| <if(last_received_id)> AND id \\< :last_received_id <endif> | ||
| <if(uuid_from_time)> AND id >= :uuid_from_time <endif> | ||
| <if(uuid_to_time)> AND id \\<= :uuid_to_time <endif> | ||
| <if(filters)> AND <filters> <endif> | ||
| <if(search_text)> AND <search_text> <endif> |
There was a problem hiding this comment.
@thiagohora as we only need the ids here, DISTINCT should be enough.
#5928) * [OPIK-5270] [BE] fix: scope trace stream query CTEs to matching traces * fix(trace-dao): guard prefilter against guardrails_filters The guardrails filter injects gagg.guardrails_result into the <filters> template variable, referencing the guardrails_agg CTE alias. Since trace_id_prefilter only queries FROM traces, this reference fails with UNKNOWN_IDENTIFIER. Guard against guardrails_filters in shouldUseTraceIdPrefilter to disable the prefilter when guardrails filters are active. Renamed the guard variable to hasCteDependentFilters for clarity. ---------
Details
Adds a conditional
trace_id_prefilterCTE to theSELECT_BY_PROJECT_IDtrace query, matching thespan_id_prefilterpattern from PRs #5625 and #5599. When trace-level filters (tags, search text) or search text are active, the prefilter narrows all enrichment CTEs (feedback scores, spans, guardrails, comments, annotations, experiments) to only process data for matching traces instead of scanning the entire project.This prevents OOM kills caused by the
arrayMapinfeedback_scores_finalattempting to allocate multi-GiB chunks when processing unscoped feedback scores for large projects.SELECT DISTINCT id FROM traces WHERE <filters>IN (SELECT id FROM trace_id_prefilter)with if/else fallback to uuid-rangeshouldUseTraceIdPrefilter(): activates when narrowing filters exist, guards against feedback score filters and sort-by-feedback-scoresfindTraceStream(streaming) andgetTracesByProjectId(paginated) paths coveredPerformance (measured on affected customer instance)
The remaining 12 GiB memory is inherent to the project's data volume (confirmed by a real application query without prefilter hitting 13.01 GiB and being killed with the same
arrayMapOOM). The prefilter eliminates the 4-8 GiBarrayMapchunk allocation that pushes memory past the limit.Without narrowing filters active, the query is unchanged (zero overhead).
Change checklist
Issues
AI-WATERMARK
AI-WATERMARK: yes
Testing
mvn compilepassessystem.query_logDocumentation
N/A — internal query optimization, no user-facing API changes.
🤖 Generated with Claude Code