
Conversation

leehinman
Contributor

What does this PR do?

Adds an integration test that checks whether switching to the otel runtime for
monitoring causes the elastic-agent log files to be re-ingested. It also
checks that restarting the agent with the otel runtime does not result in
re-ingesting the log files.

Why is it important?

Re-ingestion is wasteful, and if restarts caused state to be lost we could
lose data.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

None. Integration test only

How to test this PR locally

mage -v integration:single TestMonitoringNoDuplicates

Related issues


Contributor

mergify bot commented Oct 14, 2025

This pull request does not have a backport label. Could you fix it @leehinman? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label that automatically backports to the 8./d branch; /d is the digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@leehinman leehinman added the skip-changelog and backport-9.2 (Automated backport to the 9.2 branch) labels Oct 14, 2025
@leehinman leehinman marked this pull request as ready for review October 14, 2025 17:50
@leehinman leehinman requested a review from a team as a code owner October 14, 2025 17:50
@leehinman leehinman added the Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) label Oct 14, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Comment on lines +1513 to +1514
// It is possible to have a batch in flight not get acked
// before elastic-agent shutsdown. That can lead to
Contributor

Is there no way to wait (for a reasonable amount of time, maybe a few seconds?) for that last batch to be indexed before restarting the Agent in the test? That would make the test more deterministic IMO.

Contributor Author

The test is checking to see if the "last" log message from startup has been indexed; that is part of what the healthcheck function is doing, and it does make the test much more consistent. This is also why the test disables monitoring and then restarts before enabling it with otel mode. But I haven't found a way to make this totally deterministic.

There is always a window where elastic-agent can produce a log, elastic-agent or agentbeat "sees" the log and tries to send it, and then we kill elastic-agent. In that scenario the registry on disk will always be behind what elasticsearch has indexed, and we will produce a duplicate on the next restart. But this should be a small number of duplicates, not the "entire file", which is what we are trying to make sure isn't happening.
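For illustration only, here is a minimal sketch of that kind of wait: polling Elasticsearch until the last startup log line is visible before restarting. The package name, helper name, index pattern, query, credentials, and timeouts are placeholders, not the actual healthcheck from this PR.

package integration

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// waitForLastStartupLog is a hypothetical helper: it blocks until the given
// startup log line has been indexed, or fails the test on timeout.
func waitForLastStartupLog(ctx context.Context, t *testing.T, esURL, esUser, esPass, lastLine string) {
	t.Helper()

	// Count documents whose message matches the last startup log line.
	query := fmt.Sprintf(`{"query":{"match_phrase":{"message":%q}}}`, lastLine)

	require.Eventually(t, func() bool {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet,
			esURL+"/logs-elastic_agent-*/_count", strings.NewReader(query))
		if err != nil {
			return false
		}
		req.Header.Set("Content-Type", "application/json")
		req.SetBasicAuth(esUser, esPass)

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return false
		}
		defer resp.Body.Close()

		var body struct {
			Count int `json:"count"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
			return false
		}
		return body.Count > 0
	}, 2*time.Minute, 5*time.Second, "last startup log line was never indexed")
}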

Contributor

We don't kill elastic-agent, though; we restart it in an orderly way. In principle, there shouldn't be any duplicates in this case, right?

Contributor Author

In principle, there shouldn't be any duplicates in this case, right?

I started out with this assumption too. But we don't have an "only once" delivery guarantee; we have an "at least once" delivery guarantee. Because of this, occasional duplicates are "normal behavior", but re-ingesting the whole file would not be, which is why I switched to accepting a "small percentage" of duplicates. Even if we shut down "in an orderly way", there are still timeouts that could cause us to exit without receiving a reply from elasticsearch even though the elasticsearch bulk request eventually succeeds. And these timeouts are necessary because we don't want to prevent shutting down due to network issues etc.
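To make the "small percentage" idea concrete, a rough sketch of the kind of assertion this implies follows; the document fields, the 5% threshold, and the helper name are illustrative rather than the exact logic used in the test.

package integration

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// logDoc identifies an ingested log event by its source file and offset; two
// documents with the same key are considered duplicates of one delivery.
type logDoc struct {
	LogFilePath string // e.g. log.file.path
	Offset      int64  // e.g. log.offset
}

// assertFewDuplicates is a hypothetical helper: it tolerates the occasional
// duplicate from an unacked in-flight batch, but fails if the duplicate ratio
// looks like a whole file was re-ingested.
func assertFewDuplicates(t *testing.T, docs []logDoc) {
	t.Helper()
	require.NotEmpty(t, docs, "expected at least one indexed log document")

	seen := map[logDoc]int{}
	duplicates := 0
	for _, d := range docs {
		seen[d]++
		if seen[d] > 1 {
			duplicates++ // each re-delivery of the same file offset counts once
		}
	}

	ratio := float64(duplicates) / float64(len(docs))
	require.LessOrEqualf(t, ratio, 0.05,
		"%d of %d documents are duplicates (%.1f%%); log files were likely re-ingested",
		duplicates, len(docs), ratio*100)
}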

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label Oct 15, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

swiatekm previously approved these changes Oct 15, 2025

@swiatekm (Contributor) left a comment

LGTM overall, comments are non-blocking. I am somewhat concerned that we're tolerating too many duplicates, though.

Comment on lines 1415 to 1423
policy.Agent.Monitoring["enabled"] = false
updatedPolicyBytes, err = yaml.Marshal(policy)
require.NoErrorf(t, err, "error marshalling policy, struct was %v", policy)
err = fut.Configure(ctx, updatedPolicyBytes)
require.NoError(t, err)

// restart to make sure beat processes have exited
restartBytes, err = fut.Exec(ctx, []string{"restart"})
require.NoErrorf(t, err, "Restart error: %s, output was: %s", err, string(restartBytes))
Contributor

This is probably fine, but the more idiomatic way would be to update the policy in Kibana and wait until the agent reports it as applied. Any specific reason to reload the configuration locally instead?

Contributor Author

That's a good idea. Do we have an example of using the API to change the runtime?

Contributor Author

@swiatekm updated to use the API to make changes, plus enrolled in Fleet.
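For illustration, one way such a change could be made through Kibana's Fleet API is sketched below. The request follows the agent policy update route (PUT /api/fleet/agent_policies/{id}); the policy name, field values, and credentials are placeholders, not the exact request this test ends up sending.

package integration

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// updateAgentPolicyMonitoring is a hypothetical helper that toggles agent
// monitoring on an existing Fleet agent policy via the Kibana API.
func updateAgentPolicyMonitoring(ctx context.Context, kibanaURL, user, pass, policyID string, enabled bool) error {
	monitoring := []string{}
	if enabled {
		monitoring = []string{"logs", "metrics"}
	}

	body, err := json.Marshal(map[string]any{
		"name":               "monitoring-dup-test-policy", // placeholder policy name
		"namespace":          "default",
		"monitoring_enabled": monitoring,
	})
	if err != nil {
		return err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPut,
		fmt.Sprintf("%s/api/fleet/agent_policies/%s", kibanaURL, policyID),
		bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("kbn-xsrf", "true") // Kibana requires this header on writes
	req.SetBasicAuth(user, pass)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("policy update failed: %s", resp.Status)
	}
	return nil
}

The agent then picks the change up from Fleet, and the test can wait for it to report the updated policy as applied before continuing.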


@leehinman leehinman marked this pull request as draft October 17, 2025 16:22
@leehinman
Contributor Author

Putting back in draft until #10645 is fixed.


Labels

  • backport-9.2: Automated backport to the 9.2 branch
  • skip-changelog
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team
  • Team:Elastic-Agent-Data-Plane: Label for the Agent Data Plane team


Development

Successfully merging this pull request may close these issues.

Ensure switching to the otel runtime does not result in input data duplication.

5 participants