
Conversation

leehinman
Contributor

What does this PR do?

Adds an integration test that checks whether switching to the otel runtime for
monitoring causes the elastic-agent log files to be re-ingested. It also
checks that restarting the agent with the otel runtime does not result in
re-ingesting the log files.

Why is it important?

Re-ingestion is wasteful, and if restarts caused state to be lost we could
lose data.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

None. Integration test only

How to test this PR locally

mage -v integration:single TestMonitoringNoDuplicates

Related issues


Contributor

mergify bot commented Oct 14, 2025

This pull request does not have a backport label. Could you fix it @leehinman? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label that automatically backports to the 8./d branch; /d is the digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@leehinman leehinman added the skip-changelog and backport-9.2 (Automated backport to the 9.2 branch) labels Oct 14, 2025
@leehinman leehinman marked this pull request as ready for review October 14, 2025 17:50
@leehinman leehinman requested a review from a team as a code owner October 14, 2025 17:50
@leehinman leehinman added the Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) label Oct 14, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Comment on lines +1513 to +1514
// It is possible to have a batch in flight not get acked
// before elastic-agent shutsdown. That can lead to
Contributor

Is there no way to wait (for a reasonable amount of time, maybe a few seconds?) for that last batch to be indexed before restarting the Agent in the test? That would make the test more deterministic IMO.

Contributor Author

The test is checking to see if the "last" log message from startup has been indexed; that is part of what the healthcheck function is doing, and it does make the test much more consistent. This is also why the test disables monitoring and then restarts before enabling it with otel mode. But I haven't found a way to make this totally deterministic.

There is always a window where elastic-agent can produce a log, elastic-agent or agentbeat "sees" the log and tries to send it, and then we kill elastic-agent. In that scenario the registry on disk will always be behind what elasticsearch has indexed, and we will produce a duplicate on the next restart. But this should be a small number of duplicates, not the "entire file", which is what we are trying to make sure isn't happening.
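For illustration only, here is a minimal sketch of that kind of wait: polling Elasticsearch until the last startup log line is visible before restarting. The package name, helper name, index pattern, query, credentials, and timeouts are placeholders, not the actual healthcheck from this PR.

package integration

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// waitForLastStartupLog is a hypothetical helper: it blocks until the given
// startup log line has been indexed, or fails the test on timeout.
func waitForLastStartupLog(ctx context.Context, t *testing.T, esURL, esUser, esPass, lastLine string) {
	t.Helper()

	// Count documents whose message matches the last startup log line.
	query := fmt.Sprintf(`{"query":{"match_phrase":{"message":%q}}}`, lastLine)

	require.Eventually(t, func() bool {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet,
			esURL+"/logs-elastic_agent-*/_count", strings.NewReader(query))
		if err != nil {
			return false
		}
		req.Header.Set("Content-Type", "application/json")
		req.SetBasicAuth(esUser, esPass)

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return false
		}
		defer resp.Body.Close()

		var body struct {
			Count int `json:"count"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
			return false
		}
		return body.Count > 0
	}, 2*time.Minute, 5*time.Second, "last startup log line was never indexed")
}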

Contributor

We don't kill elastic-agent, though; we restart it in an orderly way. In principle, there shouldn't be any duplicates in this case, right?

Contributor Author

In principle, there shouldn't be any duplicates in this case, right?

I started out with this assumption too. But we don't have an "only once" delivery guarantee; we have an "at least once" delivery guarantee. Because of this, occasional duplicates are "normal behavior", but re-ingesting the whole file would not be, which is why I switched to accepting a "small percentage" of duplicates. Even if we shut down "in an orderly way", there are still timeouts that could cause us to exit without receiving a reply from elasticsearch even though the elasticsearch bulk request eventually succeeds. And these timeouts are necessary because we don't want to prevent shutting down due to network issues etc.
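To make the "small percentage" idea concrete, a rough sketch of the kind of assertion this implies follows; the document fields, the 5% threshold, and the helper name are illustrative rather than the exact logic used in the test.

package integration

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// logDoc identifies an ingested log event by its source file and offset; two
// documents with the same key are considered duplicates of one delivery.
type logDoc struct {
	LogFilePath string // e.g. log.file.path
	Offset      int64  // e.g. log.offset
}

// assertFewDuplicates is a hypothetical helper: it tolerates the occasional
// duplicate from an unacked in-flight batch, but fails if the duplicate ratio
// looks like a whole file was re-ingested.
func assertFewDuplicates(t *testing.T, docs []logDoc) {
	t.Helper()
	require.NotEmpty(t, docs, "expected at least one indexed log document")

	seen := map[logDoc]int{}
	duplicates := 0
	for _, d := range docs {
		seen[d]++
		if seen[d] > 1 {
			duplicates++ // each re-delivery of the same file offset counts once
		}
	}

	ratio := float64(duplicates) / float64(len(docs))
	require.LessOrEqualf(t, ratio, 0.05,
		"%d of %d documents are duplicates (%.1f%%); log files were likely re-ingested",
		duplicates, len(docs), ratio*100)
}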

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label Oct 15, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

swiatekm previously approved these changes Oct 15, 2025

@swiatekm (Contributor) left a comment

LGTM overall, comments are non-blocking. I am somewhat concerned that we're tolerating too many duplicates, though.

Comment on lines 1415 to 1423
policy.Agent.Monitoring["enabled"] = false
updatedPolicyBytes, err = yaml.Marshal(policy)
require.NoErrorf(t, err, "error marshalling policy, struct was %v", policy)
err = fut.Configure(ctx, updatedPolicyBytes)
require.NoError(t, err)

// restart to make sure beat processes have exited
restartBytes, err = fut.Exec(ctx, []string{"restart"})
require.NoErrorf(t, err, "Restart error: %s, output was: %s", err, string(restartBytes))
Contributor

This is probably fine, but the more idiomatic way would be to update the policy in Kibana and wait until the agent reports it as applied. Any specific reason to reload the configuration locally instead?

Contributor Author

That's a good idea. Do we have an example of using the API to change the runtime?

Contributor Author

@swiatekm updated to use the API to make changes, plus enrolled in Fleet.
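For illustration, one way such a change could be made through Kibana's Fleet API is sketched below. The request follows the agent policy update route (PUT /api/fleet/agent_policies/{id}); the policy name, field values, and credentials are placeholders, not the exact request this test ends up sending.

package integration

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// updateAgentPolicyMonitoring is a hypothetical helper that toggles agent
// monitoring on an existing Fleet agent policy via the Kibana API.
func updateAgentPolicyMonitoring(ctx context.Context, kibanaURL, user, pass, policyID string, enabled bool) error {
	monitoring := []string{}
	if enabled {
		monitoring = []string{"logs", "metrics"}
	}

	body, err := json.Marshal(map[string]any{
		"name":               "monitoring-dup-test-policy", // placeholder policy name
		"namespace":          "default",
		"monitoring_enabled": monitoring,
	})
	if err != nil {
		return err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPut,
		fmt.Sprintf("%s/api/fleet/agent_policies/%s", kibanaURL, policyID),
		bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("kbn-xsrf", "true") // Kibana requires this header on writes
	req.SetBasicAuth(user, pass)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("policy update failed: %s", resp.Status)
	}
	return nil
}

The agent then picks the change up from Fleet, and the test can wait for it to report the updated policy as applied before continuing.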


@leehinman leehinman marked this pull request as draft October 17, 2025 16:22
@leehinman
Contributor Author

Putting back in draft until #10645 is fixed.


Labels

  • backport-9.2: Automated backport to the 9.2 branch
  • skip-changelog
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team
  • Team:Elastic-Agent-Data-Plane: Label for the Agent Data Plane team


Development

Successfully merging this pull request may close these issues.

Ensure switching to the otel runtime does not result in input data duplication.

5 participants