fix(otlp-exporter-base): ensure retry on network errors during HTTP export #6147
Conversation
force-pushed from aba8e84 to 4722eeb (Compare)
fix(otlp-exporter-base): ensure retry on network errors during HTTP export

The OpenTelemetry OTLP/HTTP specification states:

```
If the server disconnects without returning a response, the client SHOULD retry and send the same request. The client SHOULD implement an exponential backoff strategy between retries to avoid overwhelming the server.
...
If the client cannot connect to the server, the client SHOULD retry the connection using an exponential backoff strategy between retries. The interval between retries must have a random jitter.
```

The backoff infrastructure was already in place; it was just the glue code around the request APIs (fetch, http, XHR) that was reporting a non-retryable state for errors that might be temporary.
force-pushed from 4722eeb to 29233f3 (Compare)
Codecov Report ❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #6147      +/-   ##
==========================================
- Coverage   95.39%   95.37%   -0.03%
==========================================
  Files         316      316
  Lines        9374     9398      +24
  Branches     2166     2175       +9
==========================================
+ Hits         8942     8963      +21
- Misses        432      435       +3
```
pichlermarc left a comment
Hi - thanks for working on this! 🙂
This is going in the right direction - we just need to make sure that we don't swallow all errors now that most outcomes of the HTTP request are retryable, so that our end-users can still troubleshoot, if necessary.
```diff
 if (isExportNetworkErrorRetryable(error)) {
   return {
-    status: 'failure',
-    error: new Error('Fetch request timed out', { cause: error }),
+    status: 'retryable',
+    retryInMillis: 0,
   };
 }
```
This transport is intended solely for browsers and web workers; it will therefore never receive any undici errors.
```ts
onDone({
  status: 'retryable',
  retryInMillis: 0,
});
```
This changes the behavior quite significantly:
- Errors are completely swallowed. If this always fails, the end-user will never be able to see a log of what actually went wrong.
  - Suggestion: I think to solve this we should allow attaching an optional error to a retryable status so that the RetryingTransport can propagate these back to the export delegate, which then logs it (a sketch of this follows below).
- `retryInMillis: 0` might not be what we want. The RetryingTransport implements an exponential backoff as required by the spec. IMO we should have this backoff happen to avoid overwhelming the endpoint.
  - Suggestion: I think we should omit `retryInMillis` here to let the RetryingTransport handle it.
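A minimal sketch of what the suggested shape could look like, assuming the existing `ExportResponseRetryable` type were extended with an optional `error` field and an optional `retryInMillis`; the actual types and field names in the package may differ:

```ts
// Sketch only: assumes ExportResponseRetryable gains an optional `error` field
// and that `retryInMillis` may be omitted.
interface ExportResponseRetryable {
  status: 'retryable';
  // omitted => RetryingTransport chooses the exponential-backoff delay itself
  retryInMillis?: number;
  // optional cause, propagated back to the export delegate so it can be logged
  error?: Error;
}

// Transport glue code could then report a retryable outcome without losing the cause:
function toRetryableResponse(cause: unknown): ExportResponseRetryable {
  return {
    status: 'retryable',
    error: new Error('Fetch request errored', { cause }),
  };
}
```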
```diff
-  status: 'failure',
-  error: new Error('XHR request timed out'),
+  status: 'retryable',
+  retryInMillis: 0,
```
this timeout here is also the full maximum of the request - if this is ever triggered, there's no time left for the export to retry.
```diff
-  status: 'failure',
-  error: new Error('XHR request errored'),
+  status: 'retryable',
+  retryInMillis: 0,
```
Suggestion: I would omit `retryInMillis` here to let the RetryingTransport have the exponential backoff.
```ts
  'UND_ERR_CONNECT_TIMEOUT',
  'UND_ERR_HEADERS_TIMEOUT',
  'UND_ERR_BODY_TIMEOUT',
  'UND_ERR_SOCKET',
```
I don't think there's any code path right now that would produce undici errors.
```ts
it('returns retryable when fetch throws network error with code', function (done) {
  // arrange
  const cause = new Error('network error') as NodeJS.ErrnoException;
  cause.code = 'ECONNRESET';
  const networkError = new TypeError('fetch failed', { cause });
  sinon.stub(globalThis, 'fetch').rejects(networkError);
  const transport = createFetchTransport(testTransportParameters);

  //act
  transport.send(testPayload, requestTimeout).then(response => {
    // assert
    try {
      assert.strictEqual(response.status, 'retryable');
      assert.strictEqual(
        (response as ExportResponseRetryable).retryInMillis,
        0
      );
    } catch (e) {
      done(e);
    }
    done();
  }, done /* catch any rejections */);
});
});
```
no need to test any Node.js things here - this is a browser-targeted test for a browser-targeted component.
- Remove reference to undici errors
- Add diagnostic verbose/info logs so we can better understand what's happening during the e2e test
- Fix how jitter gets applied (before, it was adding 0.2 to the timeout in milliseconds)
- Add error reasons to retryable errors; ensure that error codes get passed
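For context on the jitter fix, here is a minimal sketch of exponential backoff with random jitter as the spec requires; all constants and names below are assumptions for illustration, not the package's actual implementation:

```ts
// Sketch of spec-style exponential backoff with random jitter.
// INITIAL_BACKOFF_MILLIS, MAX_BACKOFF_MILLIS, BACKOFF_MULTIPLIER and
// JITTER_FACTOR are illustrative values, not the library's constants.
const INITIAL_BACKOFF_MILLIS = 1000;
const MAX_BACKOFF_MILLIS = 5000;
const BACKOFF_MULTIPLIER = 1.5;
const JITTER_FACTOR = 0.2;

function backoffWithJitterMillis(attempt: number): number {
  const base = Math.min(
    INITIAL_BACKOFF_MILLIS * BACKOFF_MULTIPLIER ** attempt,
    MAX_BACKOFF_MILLIS
  );
  // Jitter must scale with the backoff value; adding a raw 0.2 ms has no effect.
  const jitter = base * JITTER_FACTOR * Math.random();
  return base + jitter;
}
```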
Which problem is this PR solving?
OpenTelemetry OTLP/HTTP specification states:

> If the server disconnects without returning a response, the client SHOULD retry and send the same request. The client SHOULD implement an exponential backoff strategy between retries to avoid overwhelming the server.
> [...]
> If the client cannot connect to the server, the client SHOULD retry the connection using an exponential backoff strategy between retries. The interval between retries must have a random jitter.
The backoff infrastructure was already in place; it was just the glue code around the request APIs (fetch, http, XHR) that was reporting a non-retryable state for errors that might be temporary.
Short description of the changes
Added a utility function that categorizes whether an error from a transport is plausibly a network error, then adjusted all three transports (fetch, http, XHR) to use it when handling errors.
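As a rough illustration of the approach (not the PR's actual code): the function name `isExportNetworkErrorRetryable` appears in the diff, but the specific error codes and the `cause` traversal below are assumptions.

```ts
// Sketch only: the error codes and recursion into `cause` are illustrative assumptions.
const RETRYABLE_NETWORK_ERROR_CODES = new Set([
  'ECONNRESET',
  'ECONNREFUSED',
  'ETIMEDOUT',
  'EPIPE',
]);

export function isExportNetworkErrorRetryable(error: unknown): boolean {
  if (error instanceof Error) {
    const code = (error as NodeJS.ErrnoException).code;
    if (code != null && RETRYABLE_NETWORK_ERROR_CODES.has(code)) {
      return true;
    }
    // fetch wraps the underlying socket error in `cause`, so check it as well
    if (error.cause !== undefined) {
      return isExportNetworkErrorRetryable(error.cause);
    }
  }
  return false;
}
```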
Type of change
How Has This Been Tested?
Checklist: