feat(trace): use dynamic composite ratio sampler #1070
Conversation
Doh - license generation fails because of open-telemetry/opentelemetry-js#6002. Is it possible to skip some packages, or should we wait for that to be released?
Fixed in commit 46b6c3d
Some thoughts on how EDOT Node.js (and perhaps the other langs) should handle sampling, for cases when the specified SDK [...]. Proposal: the "EDOT SDK default sampler" should be (in pseudo-code) [...]
Considered, but not proposed:
@anuraaga If you have opinions, please let me know if this seems reasonable to you. I can bring it up at our APM agents weekly meeting next week to see about getting consensus.
- We should make sure there is a checklist item or a separate issue for docs for the sampling differences. I'm happy to help write those.
 * @param {number} ratio
 * @returns {Sampler} A ratio sampler which can have its ratio updated dynamically.
 */
function createDynamicCompositeParentThresholdTraceIdRatioBasedSampler(
Trying not to laugh at the length of this function name. :)
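For illustration, the shape of such a sampler can be sketched as a thin wrapper that delegates to a swappable inner ratio sampler. This is a hypothetical sketch, not the PR's actual code: `createDynamicRatioSampler` and the stub factory below are made-up names, and the real implementation builds on the composite samplers from `@opentelemetry/sampler-composite`.

```javascript
// Sketch (hypothetical, not the actual EDOT implementation): a sampler
// wrapper whose ratio can be updated dynamically by swapping the delegate.
// Assumes a Sampler-like interface with a shouldSample(...) method.
function createDynamicRatioSampler(createRatioSampler, initialRatio) {
    let delegate = createRatioSampler(initialRatio);
    return {
        // Delegate sampling decisions to the current inner sampler.
        shouldSample: (...args) => delegate.shouldSample(...args),
        // Replace the inner sampler, e.g. when central config updates
        // the sampling_rate setting.
        setRatio: (ratio) => {
            delegate = createRatioSampler(ratio);
        },
    };
}

// Usage with a stub ratio-sampler factory (made up for this sketch):
const makeStub = (ratio) => ({
    shouldSample: () => ({decision: ratio > 0 ? 'RECORD_AND_SAMPLE' : 'DROP'}),
});
const sampler = createDynamicRatioSampler(makeStub, 1);
console.log(sampler.shouldSample().decision); // 'RECORD_AND_SAMPLE'
sampler.setRatio(0); // e.g. central config sets sampling_rate to 0
console.log(sampler.shouldSample().decision); // 'DROP'
```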
Thanks @trentm - that proposal is basically how I think it should work too. If you can help make sure everyone is on the same page, I can update Java to it too. One point will be determining an OTEL_TRACES_SAMPLER value for this new sampler. It's possibly not strictly required in JS, where we imperatively create the SDK, but in other languages we would use the official distro mechanisms to implement that, which would expect the value. I created a value; currently it is not used because the sampler is forced on. We can use or change it; it's another point to get consensus on so all the languages behave the same. Thanks!
I agree with the proposal. Note that it's likely that the default sampler chain for the Java agent will include the rule-based-sampler too, or equivalent, as this is the declared mechanism for filtering out spans based on routes, eg no span created for [...]
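A minimal sketch of the selection logic under discussion, assuming the consensus lands on "use the EDOT default sampler only when `OTEL_TRACES_SAMPLER` is unset". The `chooseSampler` function and the stub factories are hypothetical names for illustration, not EDOT code:

```javascript
// Sketch (hypothetical, not the actual EDOT code): choose the EDOT default
// sampler only when the user has not specified OTEL_TRACES_SAMPLER.
function chooseSampler(env, createEdotDefaultSampler, createSpecSampler) {
    const name = env.OTEL_TRACES_SAMPLER;
    if (name === undefined || name === '') {
        // No user preference: use the EDOT dynamic composite ratio sampler.
        return createEdotDefaultSampler();
    }
    // User specified a sampler: honor the OTel-spec-defined sampler name.
    return createSpecSampler(name);
}

// Example usage with stub factories (made up for this sketch):
console.log(chooseSampler({}, () => 'edot-default', (n) => 'spec:' + n));
// → 'edot-default'
console.log(
    chooseSampler(
        {OTEL_TRACES_SAMPLER: 'always_on'},
        () => 'edot-default',
        (n) => 'spec:' + n
    )
);
// → 'spec:always_on'
```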
…ic-otel-node into composite-sampler
…nto composite-sampler
Thanks for all the help @trentm - I have updated based on the proposed behavior
initialConfig.sampling_rate = sdkInfo.samplingRate;
initialConfig.send_traces = !sdkInfo.contextPropagationOnly;
log.debug({initialConfig}, 'initial central config values');
lastAppliedConfig = {...initialConfig};
I couldn't test the `undefined` behavior without this, and it seems to intuitively make sense to treat the initial config as the first applied config. What do you think?
This results in some of the central config setters (the ones with `initialConfig` entries) being run when receiving a first config. For example, if I run an OpAMP server that will send an empty config:
% cd packages/mockopampserver
% cat config-empty.json
{}
% node lib/cli.js -F [email protected] | bunyan
...
Then run examples/central-config.js against that; when central config is first received, we get unnecessary re-application of default values and some slightly misleading logging:
% ELASTIC_OTEL_OPAMP_ENDPOINT=localhost:4320 \
ELASTIC_OTEL_EXPERIMENTAL_OPAMP_HEARTBEAT_INTERVAL=5000 \
node --import @elastic/opentelemetry-node central-config.js
{"name":"elastic-otel-node","level":30,"preamble":true,"distroVersion":"1.5.0","env":{"os":"darwin 24.6.0","arch":"arm64","runtime":"Node.js v22.17.1"},"config":{"logLevel":"info"},"msg":"start EDOT Node.js","time":"2025-10-17T20:10:08.455Z"}
{"name":"elastic-otel-node","level":40,"msg":"Unsupported OTLP metrics protocol: \"undefined\". Using http/protobuf.","time":"2025-10-17T20:10:08.476Z"}
{"name":"elastic-otel-node","level":30,"msg":"central-config: reset \"logging_level\" to \"info\"","time":"2025-10-17T20:10:08.529Z"}
{"name":"elastic-otel-node","level":30,"msg":"central-config: reset \"send_traces\" to \"true\"","time":"2025-10-17T20:10:08.530Z"}
{"name":"elastic-otel-node","level":30,"msg":"central-config: reset \"sampling_rate\" to \"1\"","time":"2025-10-17T20:10:08.530Z"}
{"name":"elastic-otel-node","level":30,"config":{},"appliedKeys":["logging_level","send_traces","sampling_rate"],"msg":"successfully applied remote config","time":"2025-10-17T20:10:08.530Z"}
...
I understand that this means the `sampling_rate: undefined` case cannot be tested with the "central-config-gen-telelemetry.js" fixture. FWIW, the `logging_level: undefined` case is being tested with a custom fixture that dynamically changes the central config during its run:
elastic-otel-node/packages/opentelemetry-node/test/fixtures/central-config-logging-level.js
Lines 45 to 63 in 2090bdc
if (process.env.ELASTIC_OTEL_OPAMP_ENDPOINT) {
    await setElasticConfig({logging_level: 'debug'});
    await barrierOpAMPClientDiagEvents(3, [DIAG_CH_SEND_SUCCESS]);
}
// 3. Now the `diag.debug` should result in an emitted record.
diag.verbose('verbose2');
diag.debug('debug2');
diag.info('info2');
diag.warn('warn2');
diag.error('error2');
// 4. Set central config to empty config, to simulate an Agent Configuration
//    in Kibana being removed. We expect the EDOT Node.js SDK to reset back
//    to the default (info) log level.
if (process.env.ELASTIC_OTEL_OPAMP_ENDPOINT) {
    await setElasticConfig({});
    await barrierOpAMPClientDiagEvents(3, [DIAG_CH_SEND_SUCCESS]);
}
Personally I'm fine with not having an automated test for the `undefined` case here: complexity/code-size of the tests is a trade-off. Or, I wouldn't object to a test case doing live-updating of the central config (as in the `logging_level` example).
For consideration, I'll push a change that removes this and drops the test case for `undefined`.
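One way to avoid the unnecessary re-application described above is to diff the incoming remote config against the last applied config, and only run setters for keys whose effective value actually changes. A minimal sketch under that assumption; `diffConfig` is a hypothetical helper, not the EDOT implementation:

```javascript
// Sketch (hypothetical helper names): only report a key as changed when the
// incoming remote config alters its effective value relative to the last
// applied config. A key missing from the incoming config means "reset to
// default", matching the central-config semantics discussed above.
function diffConfig(lastApplied, incoming, defaults) {
    const changes = {};
    const keys = new Set([
        ...Object.keys(lastApplied),
        ...Object.keys(incoming),
    ]);
    for (const key of keys) {
        const next = key in incoming ? incoming[key] : defaults[key];
        if (lastApplied[key] !== next) {
            changes[key] = next;
        }
    }
    return changes;
}

const defaults = {logging_level: 'info', sampling_rate: 1};
const last = {logging_level: 'info', sampling_rate: 1};
// An empty remote config changes nothing, so no setters run and no
// misleading "reset" log lines would be emitted.
console.log(diffConfig(last, {}, defaults)); // {}
console.log(diffConfig(last, {sampling_rate: 0.5}, defaults)); // { sampling_rate: 0.5 }
```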
> For consideration, I'll push a change that removes this and drops the test case for `undefined`.
commit ee65525
AFAICT the test failure is unrelated to this PR and is about metrics, but after several retries it seems to still always fail the same way.
I cannot reproduce locally on "main" running just that test file, and using the same Node.js v18 version:
Update: I cannot repro locally running the full test suite. I'm trying in #1080 with a small innocuous change to "main".
The tests are failing in test/central-config.test.js with Node.js 18 on #1070. Using this PR to poke at that.
The test failure is in test/central-config.test.js and there are changes to central config things in the PR, so it isn't out of the realm of possibility that it is due to changes in this PR.
I can repro locally on this feature branch running just the test file and with Node.js 18. It does not fail with Node.js v20.
See this commit message from the PR that added this (complex) [...]
Yuck. I don't recall ever chasing down why the behaviour of metric batches was different between Node v18.20.8 and v18.19.0. However, indeed the tests pass with v18.19.0.
Yuck. The answer then was to use the added [...]
Yes that works. For example, this change fixes the failing test case:

@@ -854,14 +862,23 @@ test('central-config', (suite) => {
         },
         // verbose: true,
         checkTelemetry: (t, col, stdout) => {
-            assertCentralConfigGenTelemetry(t, col, [
+            const ccAppliedRe = /^CENTRAL_CONFIG_APPLIED: (\d+)$/m;
+            const ccAppliedStr = ccAppliedRe.exec(stdout)[1];
+            const ccAppliedNs = BigInt(ccAppliedStr) * 1000000n;
+            assertCentralConfigGenTelemetry(
+                t,
+                col,
+                [
                 'spans',
                 'metrics',
                 'logs',
                 'instr-runtime-node',
                 'instr-undici',
                 'instr-http',
-            ]);
+                ],
+                ccAppliedNs
+            );
+
             const recs = stdout
                 .split(/\r?\n/g)
                 .filter((ln) => ln.startsWith('{'))

However, I think these Sampler-related tests really only care about spans being included or not, so let's just count spans and ignore the more general assertion:

@@ -854,14 +862,8 @@ test('central-config', (suite) => {
         },
         // verbose: true,
         checkTelemetry: (t, col, stdout) => {
-            assertCentralConfigGenTelemetry(t, col, [
-                'spans',
-                'metrics',
-                'logs',
-                'instr-runtime-node',
-                'instr-undici',
-                'instr-http',
-            ]);
+            t.equal(col.sortedSpans.length, 3, 'got all the spans');
+
             const recs = stdout
                 .split(/\r?\n/g)
                 .filter((ln) => ln.startsWith('{'))

passes that test case:
… whether the sampling rate is as expected). This bypasses the gross difference in metric collection batch behaviour with just Node.js v18.20.8 (the latest current version). See comments on the PR.
// First get as string to differentiate between missing and invalid.
if (getStringFromEnv('OTEL_TRACES_SAMPLER_ARG')) {
    const samplingRateArg = getNumberFromEnv('OTEL_TRACES_SAMPLER_ARG');
    if (samplingRateArg === undefined) {
Should also validate the [0,1] range here, else an invalid value crashes:
% OTEL_TRACES_SAMPLER_ARG=-0.2 ELASTIC_OTEL_OPAMP_ENDPOINT=localhost:4320 \
ELASTIC_OTEL_EXPERIMENTAL_OPAMP_HEARTBEAT_INTERVAL=5000 \
node --import @elastic/opentelemetry-node central-config.js
/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/node_modules/@opentelemetry/sampler-composite/build/src/traceidratio.js:26
throw new Error(`Invalid sampling probability: ${ratio}. Must be between 0 and 1.`);
^
Error: Invalid sampling probability: -0.2. Must be between 0 and 1.
at new ComposableTraceIDRatioBasedSampler (/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/node_modules/@opentelemetry/sampler-composite/build/src/traceidratio.js:26:19)
at createComposableTraceIDRatioBasedSampler (/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/node_modules/@opentelemetry/sampler-composite/build/src/traceidratio.js:58:12)
at newSampler (/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/lib/sampler.js:71:13)
at new DefaultSampler (/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/lib/sampler.js:30:26)
at createDefaultSampler (/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/lib/sampler.js:84:12)
at startNodeSDK (/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/lib/sdk.js:263:19)
at file:///Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/import.mjs:24:5
at ModuleJob.run (node:internal/modules/esm/module_job:329:25)
at async onImport.tracePromise.__proto__ (node:internal/modules/esm/loader:644:26)
at async asyncRunEntryPointWithESMLoader (node:internal/modules/run_main:112:9)
Node.js v22.17.1
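A minimal sketch of the suggested validation, treating out-of-range and non-numeric values the same as a missing value so the sampler constructor never throws. `parseSamplerArg` is a hypothetical helper name; the snippet under review uses `getStringFromEnv`/`getNumberFromEnv` instead of reading the raw string directly:

```javascript
// Sketch (hypothetical helper): validate OTEL_TRACES_SAMPLER_ARG, falling
// back to "missing" for invalid or out-of-[0,1] values rather than letting
// the ratio-sampler constructor throw at startup.
function parseSamplerArg(raw) {
    if (raw === undefined || raw === '') {
        return undefined; // missing: caller falls back to the default ratio
    }
    const num = Number(raw);
    if (!Number.isFinite(num) || num < 0 || num > 1) {
        // Invalid: warn and ignore, rather than crashing the SDK startup.
        console.warn(
            `invalid OTEL_TRACES_SAMPLER_ARG ${JSON.stringify(raw)}; ignoring`
        );
        return undefined;
    }
    return num;
}

console.log(parseSamplerArg('0.25')); // 0.25
console.log(parseSamplerArg('-0.2')); // undefined (plus a warning)
console.log(parseSamplerArg(undefined)); // undefined
```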
@@ -1,4 +1,5 @@
 export type NodeSDKConfiguration = import('@opentelemetry/sdk-node').NodeSDKConfiguration;
+export type Sampler = import('@opentelemetry/sdk-trace-base').Sampler;
Interesting that added jsdoc typedefs result in exporting that type in our types.
This shouldn't hold up this PR, just noting that it surprised me a little bit. IIUC, this means that whenever we use a `/** @typedef ... */` for convenience in "lib/sdk.js" we effectively change our public types. /cc @david-luna
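For illustration, a sketch of how a JSDoc `@typedef` in a plain `.js` file can surface in generated types: when `tsc` emits declarations from JSDoc (`declaration: true` with `allowJs`/`checkJs`), the typedef becomes a type alias in the emitted `.d.ts`, so a "convenience" typedef widens the public type surface. The `DemoSampler` name below is made up for this sketch:

```javascript
// Sketch: a JSDoc @typedef in a plain .js file. When tsc generates
// declarations from JSDoc, this typedef becomes an exported type alias in
// the emitted .d.ts, even if it was added only for local convenience.

/**
 * @typedef {{shouldSample: (ctx: unknown) => boolean}} DemoSampler
 */

/**
 * @param {DemoSampler} sampler
 * @returns {boolean}
 */
function sampleOnce(sampler) {
    return sampler.shouldSample(null);
}

console.log(sampleOnce({shouldSample: () => true})); // true
```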
…op 'sampling_rate: undefined' tests
- Use log.warn for applyErrs, because IMO it isn't typically an *error* case.
- Return an applyErr for the case of ignoring 'sampling_rate'. Now that this is just a log.warn, this feels more appropriate. Before this a user would see:
  {"name":"elastic-otel-node","level":30,"msg":"central-config: ignoring \"sampling_rate\" because non-default sampler in use","time":"2025-10-17T19:56:50.721Z"}
  {"name":"elastic-otel-node","level":30,"config":{"sampling_rate":0.5},"appliedKeys":["sampling_rate"],"msg":"successfully applied remote config","time":"2025-10-17T19:56:50.722Z"}
  After this, they see:
  {"name":"elastic-otel-node","level":40,"config":{"sampling_rate":0.5},"applyErrs":["central-config: ignoring \"sampling_rate\" because non-default sampler in use"],"msg":"could not apply all remote config settings","time":"2025-10-17T20:00:23.706Z"}
  which feels more accurate.
- Add the "set" vs. "reset" logic for the log message. Then for the `undefined` case we see "reset", e.g.:
  {"name":"elastic-otel-node","level":30,"msg":"central-config: reset \"logging_level\" to \"info\"","time":"2025-10-17T20:10:08.529Z"}
  {"name":"elastic-otel-node","level":30,"msg":"central-config: reset \"sampling_rate\" to \"1\"","time":"2025-10-17T20:10:08.530Z"}
A few more things for consideration in this commit (see the commit message):
Fixes #960
This wires up the new composite sampler to EDOT.
I followed the same approach as EDOT Java, which enables the sampler unconditionally, I believe:
https://github.com/elastic/elastic-otel-java/blob/1254b84e16c5cf6829f2b84bf63c902b71626448/custom/src/main/java/co/elastic/otel/ElasticAutoConfigurationCustomizerProvider.java#L95
Happy to change to checking OTEL_TRACES_SAMPLER if that's desired; presumably the behavior should be consistent between the two, though - I can update Java if we change it. Note that if going with the non-configurable approach, then this would also take care of #1045, because those settings would just not be supported. /cc @trentm @AlexanderWert @jackshirazi