
Conversation

@jsokol805

Which problem is this PR solving?

OpenTelemetry OTLP/HTTP specification states:

If the server disconnects without returning a response, the client SHOULD retry and send the same request. The client SHOULD implement an exponential backoff strategy between retries to avoid overwhelming the server.
...
If the client cannot connect to the server, the client SHOULD retry the connection using an exponential backoff strategy between retries.  The interval between retries must have a random jitter.

The backoff infrastructure was already in place; it was the glue code around the request APIs (fetch, http, xhr) that reported a non-retryable state for errors that might be temporary.

Short description of the changes

Added a utility function that determines whether an error from a transport is plausibly a network error, then adjusted all three transports (fetch, http, XHR) to use it when handling errors.
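
As a rough illustration (not the exact code in this PR), such a classifier can check the error code against a list of codes that indicate transient network conditions, unwrapping the `cause` chain that fetch produces:

```typescript
// Minimal sketch, assuming Node.js-style error codes; the helper name and
// code list mirror the idea of this PR, not its exact implementation.
const RETRYABLE_NETWORK_ERROR_CODES = new Set([
  'ECONNRESET',
  'ECONNREFUSED',
  'ETIMEDOUT',
  'EPIPE',
  'EAI_AGAIN',
]);

function isExportNetworkErrorRetryable(error: unknown): boolean {
  if (!(error instanceof Error)) {
    return false;
  }
  const code = (error as NodeJS.ErrnoException).code;
  if (code != null && RETRYABLE_NETWORK_ERROR_CODES.has(code)) {
    return true;
  }
  // fetch wraps the underlying socket error, e.g.
  // new TypeError('fetch failed', { cause }) -- unwrap and re-check.
  return error.cause != null && isExportNetworkErrorRetryable(error.cause);
}
```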

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

  • Unit tests for all new functionality

Checklist:

  • Followed the style guidelines of this project
  • Unit tests have been added

@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 21, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@jsokol805 jsokol805 force-pushed the fix/otlp-network-errors-retryable branch 2 times, most recently from aba8e84 to 4722eeb on November 21, 2025 10:46
…xport

OpenTelemetry OTLP/HTTP specification states:

```
If the server disconnects without returning a response, the client SHOULD
retry and send the same request. The client SHOULD implement an exponential
backoff strategy between retries to avoid overwhelming the server.
...
If the client cannot connect to the server, the client SHOULD retry the
connection using an exponential backoff strategy between retries.
The interval between retries must have a random jitter.
```

The backoff infrastructure was already in place; it was the glue code
around the request APIs (fetch, http, xhr) that reported a non-retryable
state for errors that might be temporary.

@jsokol805 jsokol805 force-pushed the fix/otlp-network-errors-retryable branch from 4722eeb to 29233f3 on November 21, 2025 10:48
@codecov

codecov bot commented Nov 21, 2025

Codecov Report

❌ Patch coverage is 88.88889% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.37%. Comparing base (c071e6e) to head (29233f3).
⚠️ Report is 3 commits behind head on main.

Files with missing lines | Patch % | Lines
...xporter-base/src/transport/http-transport-utils.ts | 60.00% | 2 Missing ⚠️
...ages/otlp-exporter-base/src/is-export-retryable.ts | 95.45% | 1 Missing ⚠️
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #6147      +/-   ##
==========================================
- Coverage   95.39%   95.37%   -0.03%
==========================================
  Files         316      316
  Lines        9374     9398      +24
  Branches     2166     2175       +9
==========================================
+ Hits         8942     8963      +21
- Misses        432      435       +3
```
Files with missing lines | Coverage Δ
...ages/otlp-exporter-base/src/is-export-retryable.ts | 97.14% <95.45%> (-2.86%) ⬇️
...xporter-base/src/transport/http-transport-utils.ts | 95.74% <60.00%> (-4.26%) ⬇️

@pichlermarc pichlermarc left a comment

Hi - thanks for working on this! 🙂
This is going in the right direction - we just need to make sure that we don't swallow all errors now that most outcomes of the HTTP request are retryable, so that our end-users can still troubleshoot, if necessary.

Comment on lines 67 to 72

```diff
  if (isExportNetworkErrorRetryable(error)) {
    return {
-     status: 'failure',
-     error: new Error('Fetch request timed out', { cause: error }),
+     status: 'retryable',
+     retryInMillis: 0,
    };
  }
```

This transport is fully intended for the browser - it's for the browser and web workers. It will therefore never receive any undici errors.

Comment on lines 107 to 110

```typescript
onDone({
  status: 'retryable',
  retryInMillis: 0,
});
```

This changes the behavior quite significantly:

  • Errors are completely swallowed. If this always fails, the end-user will never see a log of what actually went wrong.
    • suggestion: I think to solve this we should allow attaching an optional error to a retryable status so that the RetryingTransport can propagate it back to the export delegate, which then logs it (see the sketch below).
  • retryInMillis: 0 might not be what we want. The RetryingTransport implements an exponential backoff as required by the spec. IMO we should let this backoff happen to avoid overwhelming the endpoint.
    • suggestion: I think we should omit retryInMillis here to let the RetryingTransport handle it.
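
A minimal sketch of what that suggested shape could look like (the field optionality here is the proposal, not the current API):

```typescript
// Sketch only: extend the retryable response so a cause can travel with it.
export interface ExportResponseRetryable {
  status: 'retryable';
  // When omitted, the RetryingTransport picks the exponential-backoff delay.
  retryInMillis?: number;
  // Optional cause, propagated back to the export delegate for logging.
  error?: Error;
}
```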

```diff
-         status: 'failure',
-         error: new Error('XHR request timed out'),
+         status: 'retryable',
+         retryInMillis: 0,
```

This timeout is also the full maximum of the request - if it is ever triggered, there's no time left for the export to retry.
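
One way to leave room for retries (an illustrative sketch, not code from this PR) is to budget each attempt against the remaining export deadline rather than the full timeout:

```typescript
// Sketch: derive a per-attempt timeout from a shared deadline so that a
// timed-out attempt still leaves budget for retries.
function remainingTimeoutMillis(deadlineMillis: number): number {
  return Math.max(0, deadlineMillis - Date.now());
}

// Hypothetical usage:
// const deadlineMillis = Date.now() + exportTimeoutMillis;
// xhr.timeout = remainingTimeoutMillis(deadlineMillis);
```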

```diff
-         status: 'failure',
-         error: new Error('XHR request errored'),
+         status: 'retryable',
+         retryInMillis: 0,
```

suggestion: I would omit retryInMillis here to let the RetryingTransport have the exponential backoff.

Comment on lines 52 to 55

```typescript
'UND_ERR_CONNECT_TIMEOUT',
'UND_ERR_HEADERS_TIMEOUT',
'UND_ERR_BODY_TIMEOUT',
'UND_ERR_SOCKET',
```

I don't think there's any code path right now that would produce undici errors.

Comment on lines 195 to 218
it('returns retryable when fetch throws network error with code', function (done) {
// arrange
const cause = new Error('network error') as NodeJS.ErrnoException;
cause.code = 'ECONNRESET';
const networkError = new TypeError('fetch failed', { cause });
sinon.stub(globalThis, 'fetch').rejects(networkError);
const transport = createFetchTransport(testTransportParameters);

//act
transport.send(testPayload, requestTimeout).then(response => {
// assert
try {
assert.strictEqual(response.status, 'retryable');
assert.strictEqual(
(response as ExportResponseRetryable).retryInMillis,
0
);
} catch (e) {
done(e);
}
done();
}, done /* catch any rejections */);
});
});

no need to test any Node.js things here - this is a browser-targeted test for a browser-targeted component.

- Remove reference to undici errors
- Add diagnostic verbose/info logs so we can better understand what's happening
  during the e2e test
- Fix how jitter gets applied (before, it was adding a flat 0.2 to the timeout
  in milliseconds); see the sketch below
- Add error reasons to retryable errors; ensure that error codes get passed
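
For context on the jitter fix, proportional jitter is typically applied as a random multiplicative factor rather than an additive constant; a minimal sketch (the 0.2 ratio comes from the bug description, the function name is illustrative):

```typescript
// Sketch: jitter as a random fraction of the backoff interval,
// not a flat 0.2 added to the timeout in milliseconds.
const JITTER_RATIO = 0.2;

function applyJitter(backoffMillis: number): number {
  // Pick a factor uniformly from [1 - JITTER_RATIO, 1 + JITTER_RATIO].
  const factor = 1 + (Math.random() * 2 - 1) * JITTER_RATIO;
  return backoffMillis * factor;
}
```
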
@jsokol805 jsokol805 marked this pull request as ready for review November 29, 2025 09:24
@jsokol805 jsokol805 requested a review from a team as a code owner November 29, 2025 09:24