Conversation

anuraaga

Fixes #960

This wires up the new composite sampler to EDOT.

I followed the same approach as EDOT Java, which I believe enables the sampler unconditionally:

https://github.com/elastic/elastic-otel-java/blob/1254b84e16c5cf6829f2b84bf63c902b71626448/custom/src/main/java/co/elastic/otel/ElasticAutoConfigurationCustomizerProvider.java#L95

Happy to change this to check OTEL_TRACES_SAMPLER if that's desired; presumably the behavior should be consistent between the two, though, and I can update Java if we change it. Note that if we go with the non-configurable approach, this would also take care of #1045, because those settings would simply not be supported.

/cc @trentm @AlexanderWert @jackshirazi

@anuraaga anuraaga requested a review from a team as a code owner October 10, 2025 05:42
@anuraaga
Author

Doh - license generation fails because of open-telemetry/opentelemetry-js#6002. Is it possible to skip some packages, or should we wait for that to be released?

@trentm
Member

trentm commented Oct 10, 2025

Doh - license generation fails

Fixed in commit 46b6c3d

@trentm
Member

trentm commented Oct 10, 2025

Some thoughts on how EDOT Node.js (and perhaps the other langs) should handle sampling, for the cases when the standard SDK OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG environment variables are or are not set, and for the case when an explicit sampler option to startNodeSDK() is provided.

Proposal:

The "EDOT SDK default sampler" should be (in pseudo-code) ComposableParentThreshold(ComposableTraceIDRatio(sampleRate)), with a default sampleRate of 1.0, and configurable via the central config sampling_rate value.

  1. If a sampler option is explicitly passed to startNodeSDK({ sampler: ... }), then that should win. I.e. the central-configurable default sampler is not used.
  2. If declarative config is supported by your SDK/agent (it is not in OTel JS) and OTEL_EXPERIMENTAL_CONFIG_FILE is provided, then this wins over any OTEL_TRACES_* vars. If it is possible to implement, the "EDOT SDK default sampler" should be used when no sampler is specified in the config file.
  3. If OTEL_TRACES_SAMPLER is provided and non-empty, then it should be used according to OTel spec (https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#general-sdk-configuration)
    • The sampling_rate central config var will not be used in this case. It would be nice to document this in the EDOT SDK's docs. (If, in the future, the OTel spec changes to use the composite samplers for OTEL_TRACES_SAMPLER values, then we can reconsider.)
    • The central config handling should log a warning or error when skipping a received sampling_rate.
  4. If OTEL_TRACES_SAMPLER_ARG is provided and is a valid sampling rate (and OTEL_TRACES_SAMPLER is not provided), then this should be used as the default sampling rate.
    • Note that if a sampling_rate is received from central config, and then a subsequent central config update removes the sampling_rate, the SDK should reset to the OTEL_TRACES_SAMPLER_ARG value and not back to 1.0.
  5. Otherwise, the "EDOT SDK default sampler" should be used.
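
To make the precedence concrete, here is a minimal sketch in JS of how the selection could look. buildOtelSpecSampler and createEdotDefaultSampler are hypothetical stand-ins for the spec-defined env-var handling and for the centrally-configurable composite sampler; this is not the actual implementation in the PR.

// Hypothetical sketch of the proposed sampler precedence.
function resolveSampler(opts) {
    // 1. An explicit `sampler` option to startNodeSDK() wins outright.
    if (opts.sampler) {
        return opts.sampler;
    }
    // 2. (Declarative config via OTEL_EXPERIMENTAL_CONFIG_FILE is not supported in OTel JS.)

    // 3. A non-empty OTEL_TRACES_SAMPLER is handled per the OTel spec; central
    //    config `sampling_rate` is then ignored (with a warning elsewhere).
    if (process.env.OTEL_TRACES_SAMPLER) {
        return buildOtelSpecSampler(
            process.env.OTEL_TRACES_SAMPLER,
            process.env.OTEL_TRACES_SAMPLER_ARG
        );
    }

    // 4./5. Otherwise use the "EDOT SDK default sampler", seeding its default
    //    rate from OTEL_TRACES_SAMPLER_ARG when that is a valid rate, else 1.0.
    let defaultRate = 1.0;
    const arg = Number(process.env.OTEL_TRACES_SAMPLER_ARG);
    if (process.env.OTEL_TRACES_SAMPLER_ARG && !Number.isNaN(arg) && arg >= 0 && arg <= 1) {
        defaultRate = arg;
    }
    return createEdotDefaultSampler(defaultRate);
}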

Considered, but not proposed:

  • I considered getting fancier with handling a given OTEL_TRACES_SAMPLER. For example, using the "EDOT SDK default sampler" if the OTEL_TRACES_SAMPLER value was invalid. Or, using the "EDOT SDK default sampler" if OTEL_TRACES_SAMPLER=parentbased_traceidratio was used. However, I think getting "fancy" here more likely makes for subtle bugs. We could/should document that our centrally-configurable default sampler is not used if OTEL_TRACES_SAMPLER is specified.

@anuraaga If you have opinions, please let me know if this seems reasonable to you. I can bring it up at our APM agents weekly meeting next week to see about getting consensus.

Member

@trentm trentm left a comment


  • We should make sure there is a checklist item or a separate issue for docs for the sampling differences. I'm happy to help write those.

* @param {number} ratio
* @returns {Sampler} A ratio sampler which can have its ratio updated dynamically.
*/
function createDynamicCompositeParentThresholdTraceIdRatioBasedSampler(
Member

@trentm trentm Oct 10, 2025


Trying not to laugh at the length of this function name. :)

@anuraaga
Author

If you have opinions, please let me know if this seems reasonable to you. I can bring it up at our APM agents weekly meeting next week to see about getting consensus.

Thanks @trentm - that proposal is basically how I think it should work too. If you can help make sure everyone is on the same page, I can update Java to match.

One point will be determining an OTEL_TRACES_SAMPLER value for this new sampler. It's possibly not strictly required in JS, where we imperatively create the SDK, but in other languages we would use the official distro mechanisms to implement this, and those expect such a value.

I created a value; currently it is not used because the sampler is forced on:

https://github.com/elastic/elastic-otel-java/blob/main/custom%2Fsrc%2Fmain%2Fjava%2Fco%2Felastic%2Fotel%2Fcompositesampling%2FCompositeParentBasedTraceIdRatioBasedSamplerProvider.java#L41

We can use or change it; it's another point to get consensus on so all the languages behave the same.

Thanks!

@jackshirazi

I agree with the proposal. Note that it's likely that the default sampler chain for the Java agent will include the rule-based sampler too, or an equivalent, as this is the declared mechanism for filtering out spans based on routes, e.g. no span created for /healthcheck.
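
Not part of this PR, but to illustrate the route-filtering idea on the Node.js side: a rule-based entry would sit in front of the default chain and drop matching spans. A rough sketch using the standard Sampler interface from @opentelemetry/sdk-trace-base; the route attribute name and the class itself are assumptions, not the Java agent's actual rule-based sampler.

const {SamplingDecision} = require('@opentelemetry/sdk-trace-base');

// Hypothetical sketch: drop spans for configured routes, delegate the rest.
class RuleBasedSampler {
    constructor(dropRoutes, delegate) {
        this._dropRoutes = dropRoutes; // e.g. ['/healthcheck']
        this._delegate = delegate;     // e.g. the EDOT default composite sampler
    }
    shouldSample(context, traceId, spanName, spanKind, attributes, links) {
        // Attribute name is an assumption; real rule matching is richer than this.
        const path = attributes && attributes['url.path'];
        if (path && this._dropRoutes.includes(path)) {
            return {decision: SamplingDecision.NOT_RECORD};
        }
        return this._delegate.shouldSample(
            context, traceId, spanName, spanKind, attributes, links
        );
    }
    toString() {
        return `RuleBasedSampler(${this._dropRoutes.join(',')})`;
    }
}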

Author

@anuraaga anuraaga left a comment


Thanks for all the help @trentm - I have updated based on the proposed behavior.

initialConfig.sampling_rate = sdkInfo.samplingRate;
initialConfig.send_traces = !sdkInfo.contextPropagationOnly;
log.debug({initialConfig}, 'initial central config values');
lastAppliedConfig = {...initialConfig};
Author


I couldn't test the undefined behavior without this, and it seems to intuitively make sense to treat the initial config as the first applied config. What do you think?

Member


This results in some of the central config setters (the ones with initialConfig entries) being run when receiving a first config. For example, if I run an OpAMP server that will send an empty config:

% cd packages/mockopampserver
% cat config-empty.json
{}
% node lib/cli.js -F [email protected] | bunyan
...

Then, running examples/central-config.js against that, when central config is first received we get unnecessary re-application of default values and some slightly misleading logging:

% ELASTIC_OTEL_OPAMP_ENDPOINT=localhost:4320 \
        ELASTIC_OTEL_EXPERIMENTAL_OPAMP_HEARTBEAT_INTERVAL=5000 \
        node --import @elastic/opentelemetry-node central-config.js

{"name":"elastic-otel-node","level":30,"preamble":true,"distroVersion":"1.5.0","env":{"os":"darwin 24.6.0","arch":"arm64","runtime":"Node.js v22.17.1"},"config":{"logLevel":"info"},"msg":"start EDOT Node.js","time":"2025-10-17T20:10:08.455Z"}
{"name":"elastic-otel-node","level":40,"msg":"Unsupported OTLP metrics protocol: \"undefined\". Using http/protobuf.","time":"2025-10-17T20:10:08.476Z"}
{"name":"elastic-otel-node","level":30,"msg":"central-config: reset \"logging_level\" to \"info\"","time":"2025-10-17T20:10:08.529Z"}
{"name":"elastic-otel-node","level":30,"msg":"central-config: reset \"send_traces\" to \"true\"","time":"2025-10-17T20:10:08.530Z"}
{"name":"elastic-otel-node","level":30,"msg":"central-config: reset \"sampling_rate\" to \"1\"","time":"2025-10-17T20:10:08.530Z"}
{"name":"elastic-otel-node","level":30,"config":{},"appliedKeys":["logging_level","send_traces","sampling_rate"],"msg":"successfully applied remote config","time":"2025-10-17T20:10:08.530Z"}

...

I understand that this means the sampling_rate: undefined case cannot be tested with the "central-config-gen-telemetry.js" fixture. FWIW, the logging_level: undefined case is being tested with a custom fixture that dynamically changes the central config during its run:

if (process.env.ELASTIC_OTEL_OPAMP_ENDPOINT) {
    await setElasticConfig({logging_level: 'debug'});
    await barrierOpAMPClientDiagEvents(3, [DIAG_CH_SEND_SUCCESS]);
}
// 3. Now the `diag.debug` should result in an emitted record.
diag.verbose('verbose2');
diag.debug('debug2');
diag.info('info2');
diag.warn('warn2');
diag.error('error2');
// 4. Set central config to empty config, to simulate an Agent Configuration
// in Kibana being removed. We expect the EDOT Node.js SDK to reset back
// to the default (info) log level.
if (process.env.ELASTIC_OTEL_OPAMP_ENDPOINT) {
    await setElasticConfig({});
    await barrierOpAMPClientDiagEvents(3, [DIAG_CH_SEND_SUCCESS]);
}

Personally I'm fine with not having an automated test for the undefined case here: complexity/code-size of the tests is a trade-off. Or, I wouldn't object to a test case doing live-updating of the central config (as in the logging_level example).

For consideration, I'll push a change that removes this and drops the test case for undefined.

Member


For consideration, I'll push a change that removes this and drops the test case for undefined.

commit ee65525

@anuraaga
Author

AFAICT the test failure is unrelated to this PR and is about metrics, but after several retries it seems to still always fail the same way

@trentm
Member

trentm commented Oct 17, 2025

but after several retries it seems to still always fail the same way

I cannot reproduce locally on "main" running just that test file, and using the same Node.js v18 version:

% node --version
v18.20.8

% node test/central-config.test.js
...
1..117
# tests 117
# pass  117

# ok

Update: I cannot repro locally running the full test suite: npm run test-services:start; npm t.

I'm trying in #1080 with a small innocuous change to "main".

trentm added a commit that referenced this pull request Oct 17, 2025
The tests are failing in test/central-config.test.js with Node.js 18
on #1070
Using this PR to poke at that.
@trentm
Member

trentm commented Oct 17, 2025

The test failure is in test/central-config.test.js, and there are changes to central config things in this PR, so it isn't out of the realm of possibility that it's due to changes in this PR.

@trentm
Member

trentm commented Oct 17, 2025

I can repro locally on this feature branch, running just the test file with Node.js 18. It does not fail with Node.js v20.

% node --version
v18.20.8

% node test/central-config.test.js

@trentm
Member

trentm commented Oct 17, 2025

See this commit message from the PR that added this (complex) assertCentralConfigGenTelemetry() utility:

d22713c

fix testing of metrics that failed on a particular version of node due to timing

This test would reliably fail with Node.js v18.20.8, but pass with v18.19.0.
The different node versions would have diff numbers of collected batches of
exported metrics for some reason, which tickled the imprecision in which
collected metrics to test. Using 'afterUnixNano' for metrics selection
makes that precise.

Yuck. I don't recall ever chasing down *why* the behaviour of metric batches was different between Node v18.20.8 and v18.19.0. However, indeed the tests pass with v18.19.0:

% node --version
v18.19.0

% DEV_TEST_FILTER=non-default node test/central-config.test.js
...
# ok

Yuck. The answer then was to use the added metricsAfterNs option. I'll try the same.

@trentm
Member

trentm commented Oct 17, 2025

The answer then was to use the added metricsAfterNs option. I'll try the same.

Yes, that works. For example, this change fixes the name: 'central-config-gen-telemetry.js sampling_rate=0.0 non-default sampler' test case:

@@ -854,14 +862,23 @@ test('central-config', (suite) => {
             },
             // verbose: true,
             checkTelemetry: (t, col, stdout) => {
-                assertCentralConfigGenTelemetry(t, col, [
+                const ccAppliedRe = /^CENTRAL_CONFIG_APPLIED: (\d+)$/m;
+                const ccAppliedStr = ccAppliedRe.exec(stdout)[1];
+                const ccAppliedNs = BigInt(ccAppliedStr) * 1000000n;
+                assertCentralConfigGenTelemetry(
+                    t,
+                    col,
+                    [
                         'spans',
                         'metrics',
                         'logs',
                         'instr-runtime-node',
                         'instr-undici',
                         'instr-http',
-                ]);
+                    ],
+                    ccAppliedNs
+                );
+
                 const recs = stdout
                     .split(/\r?\n/g)
                     .filter((ln) => ln.startsWith('{'))

However, I think these Sampler-related tests really only care about whether spans are included or not, so let's just count spans and ignore the more general assertCentralConfigGenTelemetry() utility. For example, this diff:

@@ -854,14 +862,8 @@ test('central-config', (suite) => {
             },
             // verbose: true,
             checkTelemetry: (t, col, stdout) => {
-                assertCentralConfigGenTelemetry(t, col, [
-                    'spans',
-                    'metrics',
-                    'logs',
-                    'instr-runtime-node',
-                    'instr-undici',
-                    'instr-http',
-                ]);
+                t.equal(col.sortedSpans.length, 3, 'got all the spans');
+
                 const recs = stdout
                     .split(/\r?\n/g)
                     .filter((ln) => ln.startsWith('{'))

passes that test case:

% DEV_TEST_FILTER=non-default node test/central-config.test.js
TAP version 13
# central-config
ok 1 DEV_TEST_FILTER should not result in an *empty* `testFixtures`
# Filtering "testFixtures" with DEV_TEST_FILTER="non-default"
# setup: MockOpAMPServer
# central-config-gen-telemetry.js sampling_rate=0.0 non-default sampler
ok 2 exited successfully: err=null
ok 3 got all the spans
ok 4 should be truthy
# teardown: MockOpAMPServer

1..4
# tests 4
# pass  4

# ok

… whether the sampling rate is as expected)

This bypasses the gross diff in metric collection batch behaviour with just Node.js v18.20.8
(the latest current version). See comments on the PR.

// First get as string to differentiate between missing and invalid.
if (getStringFromEnv('OTEL_TRACES_SAMPLER_ARG')) {
    const samplingRateArg = getNumberFromEnv('OTEL_TRACES_SAMPLER_ARG');
    if (samplingRateArg === undefined) {
Member


Should also validate the [0,1] range here, else an invalid value crashes:

% OTEL_TRACES_SAMPLER_ARG=-0.2 ELASTIC_OTEL_OPAMP_ENDPOINT=localhost:4320 \
        ELASTIC_OTEL_EXPERIMENTAL_OPAMP_HEARTBEAT_INTERVAL=5000 \
        node --import @elastic/opentelemetry-node central-config.js

/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/node_modules/@opentelemetry/sampler-composite/build/src/traceidratio.js:26
            throw new Error(`Invalid sampling probability: ${ratio}. Must be between 0 and 1.`);
                  ^

Error: Invalid sampling probability: -0.2. Must be between 0 and 1.
    at new ComposableTraceIDRatioBasedSampler (/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/node_modules/@opentelemetry/sampler-composite/build/src/traceidratio.js:26:19)
    at createComposableTraceIDRatioBasedSampler (/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/node_modules/@opentelemetry/sampler-composite/build/src/traceidratio.js:58:12)
    at newSampler (/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/lib/sampler.js:71:13)
    at new DefaultSampler (/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/lib/sampler.js:30:26)
    at createDefaultSampler (/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/lib/sampler.js:84:12)
    at startNodeSDK (/Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/lib/sdk.js:263:19)
    at file:///Users/trentm/el/elastic-otel-node2/packages/opentelemetry-node/import.mjs:24:5
    at ModuleJob.run (node:internal/modules/esm/module_job:329:25)
    at async onImport.tracePromise.__proto__ (node:internal/modules/esm/loader:644:26)
    at async asyncRunEntryPointWithESMLoader (node:internal/modules/run_main:112:9)

Node.js v22.17.1
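
A minimal sketch of guarding that, reusing the getStringFromEnv/getNumberFromEnv helpers from the snippet above; falling back to the default rate with a warning is my assumption about the desired behavior, not necessarily what the PR ends up doing:

// Sketch: validate OTEL_TRACES_SAMPLER_ARG before constructing the sampler.
let samplingRate = 1.0; // the EDOT default
// First get as string to differentiate between missing and invalid.
if (getStringFromEnv('OTEL_TRACES_SAMPLER_ARG')) {
    const samplingRateArg = getNumberFromEnv('OTEL_TRACES_SAMPLER_ARG');
    if (samplingRateArg === undefined || samplingRateArg < 0 || samplingRateArg > 1) {
        log.warn(`ignoring invalid OTEL_TRACES_SAMPLER_ARG; using default sampling rate ${samplingRate}`);
    } else {
        samplingRate = samplingRateArg;
    }
}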

@@ -1,4 +1,5 @@
export type NodeSDKConfiguration = import('@opentelemetry/sdk-node').NodeSDKConfiguration;
export type Sampler = import('@opentelemetry/sdk-trace-base').Sampler;
Member


Interesting that added jsdoc typedefs result in exporting that type in our types.
This shouldn't hold up this PR, just noting that it surprised me a little bit. IIUC, this means that whenever we use a /** @typedef ... */ for convenience in "lib/sdk.js" we effectively change our public types. /cc @david-luna
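
For context on why that happens (assuming the .d.ts files are generated by tsc's declaration emit from JSDoc): a top-level @typedef in a module is emitted as an exported type alias, e.g.

// lib/sdk.js -- a typedef added only for convenience inside the module...
/** @typedef {import('@opentelemetry/sdk-trace-base').Sampler} Sampler */

// ...ends up in the generated lib/sdk.d.ts as a public export:
// export type Sampler = import('@opentelemetry/sdk-trace-base').Sampler;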

- Use log.warn for applyErrs, because IMO it isn't typically an *error* case
- Return an applyErr for the case of ignoring 'sampling_rate'. Now that
  this is just a log.warn, this feels more appropriate. Before this a
  user would see:

        {"name":"elastic-otel-node","level":30,"msg":"central-config: ignoring \"sampling_rate\" because non-default sampler in use","time":"2025-10-17T19:56:50.721Z"}
        {"name":"elastic-otel-node","level":30,"config":{"sampling_rate":0.5},"appliedKeys":["sampling_rate"],"msg":"successfully applied remote config","time":"2025-10-17T19:56:50.722Z"}

  after this, they see:

        {"name":"elastic-otel-node","level":40,"config":{"sampling_rate":0.5},"applyErrs":["central-config: ignoring \"sampling_rate\" because non-default sampler in use"],"msg":"could not apply all remote config settings","time":"2025-10-17T20:00:23.706Z"}

  which feels more accurate.

- Add the "set" vs. "reset" logic for the log message. Then for the
  `undefined` case we see "reset", e.g.:

        {"name":"elastic-otel-node","level":30,"msg":"central-config: reset \"logging_level\" to \"info\"","time":"2025-10-17T20:10:08.529Z"}
        {"name":"elastic-otel-node","level":30,"msg":"central-config: reset \"sampling_rate\" to \"1\"","time":"2025-10-17T20:10:08.530Z"}
@trentm
Member

trentm commented Oct 17, 2025

A few more things for consideration in this commit (see the commit message):
f1211c1
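
For illustration, the "set" vs. "reset" wording described in that commit message can come down to a one-liner per setter; this helper is hypothetical, not the code in the commit:

// Hypothetical sketch of the "set" vs. "reset" log message choice.
function logCentralConfigChange(log, key, remoteValue, appliedValue) {
    // `remoteValue === undefined` means the setting was removed from central
    // config, so the SDK falls back to its default; log that as a "reset".
    const verb = remoteValue === undefined ? 'reset' : 'set';
    log.info(`central-config: ${verb} "${key}" to ${JSON.stringify(appliedValue)}`);
}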
