
fix: request aborted shouldn't result in a fetch error #2741

Draft
SkArchon wants to merge 1 commit into main from
milinda/eng-8828-router-request-aborted-shoudnt-result-in-a-fetch-error

Conversation

@SkArchon
Contributor

@SkArchon SkArchon commented Apr 6, 2026

This PR fixes the code paths where spans were incorrectly marked as errors on client disconnects.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Client disconnections are no longer incorrectly marked as server errors in traces and metrics, improving observability and preventing inflated error counts.
    • Error handling for client disconnects now properly distinguishes them from actual server errors across telemetry collection.
  • Tests

    • Added comprehensive test coverage for client disconnect scenarios across telemetry and error handling flows.

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/docs-website.
  • I have read the Contributors Guide.

Open Source AI Manifesto

This project follows the principles of the Open Source AI Manifesto. Please ensure your contribution aligns with its principles.

@github-actions github-actions bot added the router label Apr 6, 2026
@coderabbitai
Contributor

coderabbitai bot commented Apr 6, 2026

Walkthrough

This pull request modifies client disconnection error handling across the router's request lifecycle. Changes ensure that context.Canceled errors are recorded as observability events on spans but do not mark spans as ERROR, do not count as request errors in metrics, and prevent HTTP 500 response writes. The changes span error handling, span/metric recording, transport layer, and comprehensive test coverage.

Changes

Cohort / File(s) Summary
Core Error Handling
router/core/errors.go, router/core/batch.go
Short-circuits writeOperationError for context.Canceled; conditionally records cancellation errors on spans without marking them as ERROR in batch processing.
Request Instrumentation
router/core/engine_loader_hooks.go
Refactors fetch error handling into new recordFetchError helper; treats context.Canceled as non-error for tracing status while still recording events; always executes request count/latency metrics after error handling block.
Operation & Transport Handlers
router/core/graphql_prehandler.go, router/pkg/trace/transport.go
Updates operation handler to record context.Canceled errors on spans without setting ERROR status; sets transport span status to codes.Ok for client disconnections.
Engine Loader Instrumentation Tests
router/core/engine_loader_hooks_test.go
New test file with 427 lines covering OnFinished and recordFetchError behavior; validates non-ERROR span status for cancellations, error event recording, metric invocation patterns, and downstream error code aggregation.
Metrics Recording Tests
router/core/operation_metrics_test.go
Updates spyMetricStore to capture attribute slices passed to MeasureRequestError for verification.
Transport Layer Tests
router/pkg/trace/transport_test.go
Adds test case verifying context.Canceled errors produce codes.Ok span status rather than codes.Error in HTTP transport layer.
End-to-End Telemetry Tests
router-tests/telemetry/span_error_status_test.go
Adds three integration subtests under TestClientDisconnectionBehavior validating non-ERROR spans during client disconnects, persisted-operation fetch timeouts, and batched request disconnects; verifies exception events are recorded without 500 responses.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main fix: preventing client disconnects from marking spans as fetch errors.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions bot commented Apr 6, 2026

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-503d8ea4aa64c63036488db9446059b8031f2c29-nonroot

@codecov

codecov bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 88.33333% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.77%. Comparing base (65e05e3) to head (1e7948c).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
router/core/batch.go 20.00% 2 Missing and 2 partials ⚠️
router/core/engine_loader_hooks.go 95.12% 1 Missing and 1 partial ⚠️
router/core/graphql_prehandler.go 88.88% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2741      +/-   ##
==========================================
- Coverage   63.46%   57.77%   -5.69%     
==========================================
  Files         251      235      -16     
  Lines       26767    26204     -563     
==========================================
- Hits        16987    15140    -1847     
- Misses       8414     9603    +1189     
- Partials     1366     1461      +95     
Files with missing lines Coverage Δ
router/core/errors.go 79.90% <100.00%> (+0.19%) ⬆️
router/pkg/trace/transport.go 100.00% <100.00%> (ø)
router/core/graphql_prehandler.go 84.63% <88.88%> (-0.06%) ⬇️
router/core/engine_loader_hooks.go 91.01% <95.12%> (-0.96%) ⬇️
router/core/batch.go 75.59% <20.00%> (-7.34%) ⬇️

... and 118 files with indirect coverage changes


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/core/batch.go (1)

183-212: ⚠️ Potential issue | 🟠 Major

Return immediately for canceled batch requests.

This branch still falls through to writeRequestErrors, so a batched disconnect will try to serialize a GraphQL error after we've already classified the failure as a client-side cancellation. That can reintroduce write-side noise and mutate the observed status in the batch path.

💡 Proposed fix
```diff
 func processBatchError(w http.ResponseWriter, r *http.Request, err error, requestLogger *zap.Logger) {
-	if errors.Is(err, context.Canceled) {
-		span := trace.SpanFromContext(r.Context())
-		span.RecordError(err)
-	} else {
-		ctrace.AttachErrToSpanFromContext(r.Context(), err)
-	}
+	if errors.Is(err, context.Canceled) {
+		trace.SpanFromContext(r.Context()).RecordError(err)
+		return
+	}
+
+	ctrace.AttachErrToSpanFromContext(r.Context(), err)

 	requestError := graphqlerrors.RequestError{
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/core/batch.go` around lines 183 - 212, In processBatchError, the
context.Canceled branch records the span but then falls through to
writeRequestErrors; modify processBatchError so that when errors.Is(err,
context.Canceled) is true you RecordError on the span (as now) and then
immediately return to avoid calling writeRequestErrors and serializing a GraphQL
error for client cancellations; keep the existing else branch
(ctrace.AttachErrToSpanFromContext) and the subsequent handling for non-canceled
errors intact.
🧹 Nitpick comments (2)
router/core/engine_loader_hooks_test.go (2)

373-392: Assert the merged attrs at the metric-store boundary too.

Right now this only checks the returned slice. The test would still pass if recordFetchError returned the merged attrs but called MeasureRequestError with the old slice.

Suggested assertion
```diff
 		resultSlice, _ := hooks.recordFetchError(ctx, span, fetchErr, rc, nil, metricAddOpt, prePopulated)
 		span.End()
@@
 		require.True(t, hasExisting, "pre-populated attrs should be preserved")
 		require.True(t, hasErrorCodes, "error codes should be appended")
+
+		require.True(t, store.requestErrorCalled, "MeasureRequestError should be called")
+
+		var metricHasExisting, metricHasErrorCodes bool
+		for _, attr := range store.requestErrorSliceAttr {
+			if string(attr.Key) == "existing.attr" {
+				metricHasExisting = true
+			}
+			if string(attr.Key) == "graphql.error.codes" {
+				metricHasErrorCodes = true
+			}
+		}
+		require.True(t, metricHasExisting,
+			"MeasureRequestError should receive the pre-populated attrs")
+		require.True(t, metricHasErrorCodes,
+			"MeasureRequestError should receive the appended error codes")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/core/engine_loader_hooks_test.go` around lines 373 - 392, The test
currently only asserts the returned slice from hooks.recordFetchError contains
merged attributes; extend it to also verify that the metric-store call received
the merged slice: capture the arguments passed to the mock
MetricStore.MeasureRequestError (or the equivalent MeasureRequestError spy used
by hooks), and assert that the captured attrs include both the prePopulated
attribute ("existing.attr") and the "graphql.error.codes" attribute (same checks
used for resultSlice). Update references to the mock/spy used by the hooks setup
(e.g., the MetricStore mock instance) and reuse
metricAddOpt/prePopulated/resultSlice names to ensure MeasureRequestError was
invoked with the merged attrs.

107-130: Assert the exception event for wrapped cancellations too.

This case currently only proves the status/metric behavior. If the wrapped context.Canceled path stops emitting the observability event, the regression would still pass.

Suggested assertion
```diff
 		require.NotEqual(t, codes.Error, spans[0].Status().Code,
 			"wrapped context.Canceled should not set span status to Error")
+		require.Len(t, spans[0].Events(), 1,
+			"wrapped context.Canceled should still be recorded as a span event")
+		require.Equal(t, "exception", spans[0].Events()[0].Name)
 		require.False(t, store.requestErrorCalled,
 			"MeasureRequestError should not be called for wrapped context.Canceled")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/core/engine_loader_hooks_test.go` around lines 107 - 130, The test
"wrapped context.Canceled does not set span ERROR status" currently omits
asserting that an observability exception event is still emitted for wrapped
cancellations; update the test after calling hooks.OnFinished(ctx, ds,
&resolve.ResponseInfo{Err: wrappedErr}) to inspect the recorded span (spans[0])
events and assert that there is an "exception" event (or an event whose
attributes indicate an exception/exception.type/exception.message matching
wrappedErr) so the wrapped context.Canceled path still emits the expected
exception event; use the existing exporter.GetSpans().Snapshots() and
spans[0].Events() APIs to locate and assert the event.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@router-tests/telemetry/span_error_status_test.go`:
- Around line 188-191: Replace the fixed time.Sleep usage with
require.Eventually to avoid races when waiting for async span/log export:
instead of sleeping then calling exporter.GetSpans().Snapshots() and
require.NotEmpty(t, spans), call require.Eventually with a short polling
interval and timeout and inside the poll invoke exporter.GetSpans().Snapshots()
and assert len(spans) > 0 (or use require.NotEmpty within the closure) so the
test waits until all expected spans are exported; update the same pattern found
around the other occurrences that use time.Sleep before checking
exporter.GetSpans().Snapshots() (the instances at the other mentioned blocks
should be changed similarly).

In `@router/core/engine_loader_hooks.go`:
- Around line 274-331: The recordFetchError helper currently records the error
and metrics but never sets the span attribute that marks non-canceled request
failures; update recordFetchError to set the span attribute
rotel.WgRequestError.Bool(true) for real fetch failures (i.e., when fetchErr is
not context.Canceled and not context.DeadlineExceeded or equivalent cancellation
checks) before returning so traces and metrics stay in sync; reference the
recordFetchError function and rotel.WgRequestError to locate where to add
span.SetAttributes(...) alongside the existing metricAttrs append and
otelmetric.WithAttributeSet creation.

---

Outside diff comments:
In `@router/core/batch.go`:
- Around line 183-212: In processBatchError, the context.Canceled branch records
the span but then falls through to writeRequestErrors; modify processBatchError
so that when errors.Is(err, context.Canceled) is true you RecordError on the
span (as now) and then immediately return to avoid calling writeRequestErrors
and serializing a GraphQL error for client cancellations; keep the existing else
branch (ctrace.AttachErrToSpanFromContext) and the subsequent handling for
non-canceled errors intact.

---

Nitpick comments:
In `@router/core/engine_loader_hooks_test.go`:
- Around line 373-392: The test currently only asserts the returned slice from
hooks.recordFetchError contains merged attributes; extend it to also verify that
the metric-store call received the merged slice: capture the arguments passed to
the mock MetricStore.MeasureRequestError (or the equivalent MeasureRequestError
spy used by hooks), and assert that the captured attrs include both the
prePopulated attribute ("existing.attr") and the "graphql.error.codes" attribute
(same checks used for resultSlice). Update references to the mock/spy used by
the hooks setup (e.g., the MetricStore mock instance) and reuse
metricAddOpt/prePopulated/resultSlice names to ensure MeasureRequestError was
invoked with the merged attrs.
- Around line 107-130: The test "wrapped context.Canceled does not set span
ERROR status" currently omits asserting that an observability exception event is
still emitted for wrapped cancellations; update the test after calling
hooks.OnFinished(ctx, ds, &resolve.ResponseInfo{Err: wrappedErr}) to inspect the
recorded span (spans[0]) events and assert that there is an "exception" event
(or an event whose attributes indicate an
exception/exception.type/exception.message matching wrappedErr) so the wrapped
context.Canceled path still emits the expected exception event; use the existing
exporter.GetSpans().Snapshots() and spans[0].Events() APIs to locate and assert
the event.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 95ba86f9-0380-4da8-8e23-5351e8b8d117

📥 Commits

Reviewing files that changed from the base of the PR and between 2326f9c and 1e7948c.

📒 Files selected for processing (9)
  • router-tests/telemetry/span_error_status_test.go
  • router/core/batch.go
  • router/core/engine_loader_hooks.go
  • router/core/engine_loader_hooks_test.go
  • router/core/errors.go
  • router/core/graphql_prehandler.go
  • router/core/operation_metrics_test.go
  • router/pkg/trace/transport.go
  • router/pkg/trace/transport_test.go

Comment on lines +188 to +191
```go
	time.Sleep(500 * time.Millisecond)

	spans := exporter.GetSpans().Snapshots()
	require.NotEmpty(t, spans)
```
Contributor

⚠️ Potential issue | 🟡 Minor

Replace the new fixed sleeps with require.Eventually.

These assertions depend on async span/log export, so time.Sleep(500 * time.Millisecond) is still racy under slower CI and can miss late spans or log entries.

As per coding guidelines, "For periodic exporters, wait for ALL expected items using require.Eventually, not just one sentinel value, to avoid race conditions with export cycles".

Also applies to: 292-295, 361-363

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router-tests/telemetry/span_error_status_test.go` around lines 188 - 191,
Replace the fixed time.Sleep usage with require.Eventually to avoid races when
waiting for async span/log export: instead of sleeping then calling
exporter.GetSpans().Snapshots() and require.NotEmpty(t, spans), call
require.Eventually with a short polling interval and timeout and inside the poll
invoke exporter.GetSpans().Snapshots() and assert len(spans) > 0 (or use
require.NotEmpty within the closure) so the test waits until all expected spans
are exported; update the same pattern found around the other occurrences that
use time.Sleep before checking exporter.GetSpans().Snapshots() (the instances at
the other mentioned blocks should be changed similarly).

Comment on lines +274 to +331
```go
func (f *engineLoaderHooks) recordFetchError(
	ctx context.Context,
	span trace.Span,
	fetchErr error,
	reqContext *requestContext,
	metricAttrs []attribute.KeyValue,
	metricAddOpt otelmetric.AddOption,
	metricSliceAttrs []attribute.KeyValue,
) ([]attribute.KeyValue, otelmetric.MeasurementOption) {
	rtrace.SetSanitizedSpanStatus(span, codes.Error, fetchErr.Error())
	span.RecordError(fetchErr)

	// Extract downstream error codes from subgraph errors
	var errorCodesAttr []string

	if unwrapped, ok := fetchErr.(multiError); ok {
		for _, e := range unwrapped.Unwrap() {
			var subgraphError *resolve.SubgraphError
			if !errors.As(e, &subgraphError) {
				continue
			}

			for i, downstreamError := range subgraphError.DownstreamErrors {
				var errorCode string
				if downstreamError.Extensions != nil {
					if value := downstreamError.Extensions.Get("code"); value != nil {
						errorCode = string(value.GetStringBytes())
					}
				}

				if errorCode == "" {
					continue
				}

				errorCodesAttr = append(errorCodesAttr, errorCode)
				span.AddEvent(fmt.Sprintf("Downstream error %d", i+1),
					trace.WithAttributes(
						rotel.WgSubgraphErrorExtendedCode.String(errorCode),
						rotel.WgSubgraphErrorMessage.String(downstreamError.Message),
					),
				)
			}
		}

		errorCodesAttr = unique.SliceElements(errorCodesAttr)
		// Reduce cardinality of error codes
		slices.Sort(errorCodesAttr)
	}

	// We can't add this earlier because this is done per subgraph response
	if v, ok := reqContext.telemetry.metricSetAttrs[ContextFieldGraphQLErrorCodes]; ok && len(errorCodesAttr) > 0 {
		metricSliceAttrs = append(metricSliceAttrs, attribute.StringSlice(v, errorCodesAttr))
	}

	f.metricStore.MeasureRequestError(ctx, metricSliceAttrs, metricAddOpt)

	metricAttrs = append(metricAttrs, rotel.WgRequestError.Bool(true))
	attrOpt := otelmetric.WithAttributeSet(attribute.NewSet(metricAttrs...))

	return metricSliceAttrs, attrOpt
}
```
Contributor

⚠️ Potential issue | 🟠 Major

Restore wg.request.error=true on real fetch failures.

This helper replaced the previous rtrace.AttachErrToSpan path, but it no longer sets the span attribute that marks non-canceled request failures. The metric attrs still get wg.request.error=true, so traces and metrics will drift apart for the same fetch error.

💡 Proposed fix
```diff
 func (f *engineLoaderHooks) recordFetchError(
 	ctx context.Context,
 	span trace.Span,
 	fetchErr error,
 	reqContext *requestContext,
 	metricAttrs []attribute.KeyValue,
 	metricAddOpt otelmetric.AddOption,
 	metricSliceAttrs []attribute.KeyValue,
 ) ([]attribute.KeyValue, otelmetric.MeasurementOption) {
 	rtrace.SetSanitizedSpanStatus(span, codes.Error, fetchErr.Error())
+	span.SetAttributes(rotel.WgRequestError.Bool(true))
 	span.RecordError(fetchErr)
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (f *engineLoaderHooks) recordFetchError(
ctx context.Context,
span trace.Span,
fetchErr error,
reqContext *requestContext,
metricAttrs []attribute.KeyValue,
metricAddOpt otelmetric.AddOption,
metricSliceAttrs []attribute.KeyValue,
) ([]attribute.KeyValue, otelmetric.MeasurementOption) {
rtrace.SetSanitizedSpanStatus(span, codes.Error, fetchErr.Error())
span.RecordError(fetchErr)
// Extract downstream error codes from subgraph errors
var errorCodesAttr []string
if unwrapped, ok := fetchErr.(multiError); ok {
for _, e := range unwrapped.Unwrap() {
var subgraphError *resolve.SubgraphError
if !errors.As(e, &subgraphError) {
continue
}
for i, downstreamError := range subgraphError.DownstreamErrors {
var errorCode string
if downstreamError.Extensions != nil {
if value := downstreamError.Extensions.Get("code"); value != nil {
errorCode = string(value.GetStringBytes())
}
}
if errorCode == "" {
continue
}
errorCodesAttr = append(errorCodesAttr, errorCode)
span.AddEvent(fmt.Sprintf("Downstream error %d", i+1),
trace.WithAttributes(
rotel.WgSubgraphErrorExtendedCode.String(errorCode),
rotel.WgSubgraphErrorMessage.String(downstreamError.Message),
),
)
}
}
errorCodesAttr = unique.SliceElements(errorCodesAttr)
// Reduce cardinality of error codes
slices.Sort(errorCodesAttr)
}
metricSliceAttrs := *reqContext.telemetry.AcquireAttributes()
defer reqContext.telemetry.ReleaseAttributes(&metricSliceAttrs)
metricSliceAttrs = append(metricSliceAttrs, reqContext.telemetry.metricSliceAttrs...)
// We can't add this earlier because this is done per subgraph response
if v, ok := reqContext.telemetry.metricSetAttrs[ContextFieldGraphQLErrorCodes]; ok && len(errorCodesAttr) > 0 {
metricSliceAttrs = append(metricSliceAttrs, attribute.StringSlice(v, errorCodesAttr))
}
f.metricStore.MeasureRequestError(ctx, metricSliceAttrs, metricAddOpt)
if v, ok := reqContext.telemetry.metricSetAttrs[ContextFieldGraphQLErrorCodes]; ok && len(errorCodesAttr) > 0 {
metricSliceAttrs = append(metricSliceAttrs, attribute.StringSlice(v, errorCodesAttr))
}
metricAttrs = append(metricAttrs, rotel.WgRequestError.Bool(true))
f.metricStore.MeasureRequestError(ctx, metricSliceAttrs, metricAddOpt)
attrOpt := otelmetric.WithAttributeSet(attribute.NewSet(metricAttrs...))
f.metricStore.MeasureRequestCount(ctx, metricSliceAttrs, attrOpt)
f.metricStore.MeasureLatency(ctx, latency, metricSliceAttrs, attrOpt)
} else {
f.metricStore.MeasureRequestCount(ctx, reqContext.telemetry.metricSliceAttrs, metricAddOpt)
f.metricStore.MeasureLatency(ctx, latency, reqContext.telemetry.metricSliceAttrs, metricAddOpt)
}
metricAttrs = append(metricAttrs, rotel.WgRequestError.Bool(true))
attrOpt := otelmetric.WithAttributeSet(attribute.NewSet(metricAttrs...))
span.SetAttributes(traceAttrs...)
return metricSliceAttrs, attrOpt
func (f *engineLoaderHooks) recordFetchError(
ctx context.Context,
span trace.Span,
fetchErr error,
reqContext *requestContext,
metricAttrs []attribute.KeyValue,
metricAddOpt otelmetric.AddOption,
metricSliceAttrs []attribute.KeyValue,
) ([]attribute.KeyValue, otelmetric.MeasurementOption) {
rtrace.SetSanitizedSpanStatus(span, codes.Error, fetchErr.Error())
span.SetAttributes(rotel.WgRequestError.Bool(true))
span.RecordError(fetchErr)
// Extract downstream error codes from subgraph errors
var errorCodesAttr []string
if unwrapped, ok := fetchErr.(multiError); ok {
for _, e := range unwrapped.Unwrap() {
var subgraphError *resolve.SubgraphError
if !errors.As(e, &subgraphError) {
continue
}
for i, downstreamError := range subgraphError.DownstreamErrors {
var errorCode string
if downstreamError.Extensions != nil {
if value := downstreamError.Extensions.Get("code"); value != nil {
errorCode = string(value.GetStringBytes())
}
}
if errorCode == "" {
continue
}
errorCodesAttr = append(errorCodesAttr, errorCode)
span.AddEvent(fmt.Sprintf("Downstream error %d", i+1),
trace.WithAttributes(
rotel.WgSubgraphErrorExtendedCode.String(errorCode),
rotel.WgSubgraphErrorMessage.String(downstreamError.Message),
),
)
}
}
errorCodesAttr = unique.SliceElements(errorCodesAttr)
slices.Sort(errorCodesAttr)
}
if v, ok := reqContext.telemetry.metricSetAttrs[ContextFieldGraphQLErrorCodes]; ok && len(errorCodesAttr) > 0 {
metricSliceAttrs = append(metricSliceAttrs, attribute.StringSlice(v, errorCodesAttr))
}
f.metricStore.MeasureRequestError(ctx, metricSliceAttrs, metricAddOpt)
metricAttrs = append(metricAttrs, rotel.WgRequestError.Bool(true))
attrOpt := otelmetric.WithAttributeSet(attribute.NewSet(metricAttrs...))
return metricSliceAttrs, attrOpt
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/core/engine_loader_hooks.go` around lines 274 - 331, The
recordFetchError helper currently records the error and metrics but never sets
the span attribute that marks non-canceled request failures; update
recordFetchError to set the span attribute rotel.WgRequestError.Bool(true) for
real fetch failures (i.e., when fetchErr is not context.Canceled and not
context.DeadlineExceeded or equivalent cancellation checks) before returning so
traces and metrics stay in sync; reference the recordFetchError function and
rotel.WgRequestError to locate where to add span.SetAttributes(...) alongside
the existing metricAttrs append and otelmetric.WithAttributeSet creation.
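The error-code aggregation this comment references (`unique.SliceElements` followed by `slices.Sort`) deduplicates and orders the downstream codes so the metric attribute stays low-cardinality and deterministic. A small stdlib sketch of that step (`dedupSorted` is an illustrative name, not the router's helper):

```go
package main

import (
	"fmt"
	"slices"
)

// dedupSorted sorts the codes and drops duplicates, mirroring the
// unique.SliceElements + slices.Sort step in recordFetchError.
// slices.Compact removes adjacent duplicates, so sorting first
// yields a fully deduplicated, ordered slice.
func dedupSorted(codes []string) []string {
	slices.Sort(codes)
	return slices.Compact(codes)
}

func main() {
	fmt.Println(dedupSorted([]string{"UNAUTHORIZED", "TIMEOUT", "UNAUTHORIZED"}))
	// [TIMEOUT UNAUTHORIZED]
}
```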
