
Conversation

@mashhurs (Collaborator) commented Nov 20, 2025

Description

This change:

  • Improves visibility into the entire stack by printing docker container (LS, ES, elastic-agent) logs
  • Introduces a separate config for cases where elastic-agent doesn't use SSL params (found with the 9.3.0-SNAPSHOT stack version)
  • Increases the resource/VM spec to make sure the whole stack can spin up
  • Fixes a flaky unit test

Test CI: https://buildkite.com/elastic/logstash-filter-elastic-integration-e2e/builds/250#019aa917-7cd2-42bd-9fe9-7a44e840ae53

@mergify bot commented Nov 20, 2025

This pull request does not have a backport label. Could you fix it @mashhurs? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • If no backport is necessary, please add the backport-skip label

- assertThat(ex.getMessage(), stringContainsInOrder("fatal", "bad_certificate"));
+ assertThat(ex.getMessage(), allOf(
+     containsString("fatal"),
+     anyOf(containsString("bad_certificate"), containsString("certificate_required"))
@mashhurs (Collaborator, Author) commented:

On some hosts (mostly with recent builds), this test fails with the below error; it looks JVM-specific.
I can't reproduce it on my local machine (my JVM always uses built-in certificates), but taking a look at RFC 8446, the unit test does not send any certificates, so certificate_required looks like a correct validation here.

logstash-1       |   Test testBasicConnectivityDisablingVerification() FAILED
logstash-1       |
logstash-1       |   java.lang.AssertionError:
logstash-1       |   Expected: a string containing "fatal", "bad_certificate" in order
logstash-1       |        but: was "(certificate_required) Received fatal alert: certificate_required"
logstash-1       |       at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
logstash-1       |       at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
logstash-1       |       at co.elastic.logstash.filters.elasticintegration.ElasticsearchRestClientWireMockTest$MutualHttps.testBasicConnectivityDisablingVerification(ElasticsearchRestClientWireMockTest.java:178)
logstash-1       |       at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
logstash-1       |       at java.base/java.lang.reflect.Method.invoke(Method.java:580)
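The relaxed matcher accepts either TLS alert name, since which one the JVM raises varies. For illustration only, the same predicate can be sketched in Python (a hypothetical mirror of the hamcrest `allOf`/`anyOf` logic, not part of the PR):

```python
def matches_tls_failure(message):
    """True if the message mentions a fatal alert with either of the
    two certificate-related TLS alert names from RFC 8446."""
    return "fatal" in message and any(
        alert in message for alert in ("bad_certificate", "certificate_required")
    )

# Both observed failure messages now satisfy the check:
print(matches_tls_failure("(certificate_required) Received fatal alert: certificate_required"))  # True
print(matches_tls_failure("Received fatal alert: bad_certificate"))                              # True
```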

…tic_agent without setting SSL. With this change, Logstash is also opting out SSL settings for the versions greater than 9.3
input {
  elastic_agent {
    port => 5044
  }
@mashhurs (Collaborator, Author) commented:

I have found that elastic-package, when defining a stack version greater than 9.3, does not use SSL params in elastic-agent. So Logstash gets a Caused by: io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record error, CI link: https://buildkite.com/elastic/logstash-filter-elastic-integration-e2e/builds/249#019aa897-ce17-466d-b457-d5a9334aa87b

I have pinged @bhapas to clarify whether this is intentional.


@mashhurs Not sure, since I was not part of this change.

  "agents": {
    "provider": "gcp",
-   "machineType": "n2-standard-4",
+   "machineType": "n2-standard-16",
@mashhurs (Collaborator, Author) commented:

Sometimes ES is not reachable: when I spin up the VM and try many times, it doesn't respond. It looks like elastic-package fires up so many resources that ES becomes very slow. Changing the machine type and decreasing the number of packages to test (keeping only m365_defender, which has more processors).

for result_line in result.stdout.splitlines():
    print(f"{result_line}")

# although there was an error, let's check how LS performed and make sure errors weren't because of Logstash
time.sleep(2) # make sure LS processes the event all the way down to ES
@mashhurs (Collaborator, Author) commented:

This happened mostly when ES is very slow or unresponsive, but to make sure events are processed, we leave a small window before hitting the _node/stats API.
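A fixed sleep can still race a slow ES. As an alternative, one could poll until the stats report progress or a deadline passes; the sketch below is hypothetical (the `fetch_stats` callable and the `{"out": N}` shape are assumptions standing in for a real _node/stats request, not code from this PR):

```python
import time

def wait_for_events(fetch_stats, min_events=1, timeout=30.0, interval=2.0):
    """Poll a zero-arg callable returning a dict like the `events`
    section of Logstash's _node/stats (e.g. {"out": 3}) until it
    reports at least `min_events`, or until the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch_stats().get("out", 0) >= min_events:
            return True
        time.sleep(interval)
    return False

# Usage with a stub that "processes" events on the second poll:
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return {"out": 0} if calls["n"] < 2 else {"out": 5}

print(wait_for_events(fake_fetch, min_events=1, timeout=5.0, interval=0.01))  # True
```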

container_logs = container.logs().decode('utf-8')
for log_line in container_logs.splitlines():
    print(f" {log_line}")
print(f"{separator}\n")
@mashhurs (Collaborator, Author) commented:

This gives us better visibility into what happened across the entire stack.
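The log-dump pattern above can be sketched as a small self-contained helper. `FakeContainer` here is a stub standing in for a docker-py `Container` (assumption: the real script obtains containers from the docker SDK; only the `.name` attribute and bytes-returning `.logs()` method are relied on):

```python
def dump_container_logs(containers, separator="-" * 40):
    """Collect each container's logs into one printable string,
    with a separator line after each container for readability."""
    lines = []
    for container in containers:
        lines.append(f"==> {container.name} logs:")
        for log_line in container.logs().decode("utf-8").splitlines():
            lines.append(f"  {log_line}")
        lines.append(separator)
    return "\n".join(lines)

# Stub standing in for a docker-py Container during a dry run:
class FakeContainer:
    def __init__(self, name, raw):
        self.name, self._raw = name, raw
    def logs(self):
        return self._raw

out = dump_container_logs([FakeContainer("logstash-1", b"started\npipeline running")])
print(out)
```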

@mashhurs mashhurs self-assigned this Nov 22, 2025
@mashhurs mashhurs added the backport-skip Skip automated backport with mergify label Nov 22, 2025
@donoghuc donoghuc self-requested a review November 25, 2025 18:29
  import util

- INTEGRATION_PACKAGES_TO_TEST = ["apache", "m365_defender", "nginx", "tomcat"]
+ INTEGRATION_PACKAGES_TO_TEST = ["m365_defender"]
A Member commented:

Now that we are using a more capable VM, are we adding these back?

@mashhurs (Collaborator, Author) replied:

I am thinking of opting them out, as they do not provide meaningful value to the test cases. m365_defender has more complex conditions and more processors, which I consider the best candidate (along with the TI integration) for the test; nginx, apache and tomcat are simple integrations with only a handful of processors.
If you have a strong opinion I am happy to roll back (or at least add one or two of them back).

local_config_file_path = ".buildkite/scripts/e2e-pipeline/config/"
config_file = "serverless_pipeline.conf" if self.project_type == "serverless" else "pipeline.conf"
if self.__is_version_gte(self.stack_version, 9, 3):
    config_file = "pipeline-ea-without-ssl.conf"
A Member commented:

I can't think of a reason to special-case this. I think we should figure out why this all of a sudden does not use SSL rather than letting it run unencrypted.

@mashhurs (Collaborator, Author) commented Nov 26, 2025:

Yup, let's wait a bit to get a confirmation (or the plan for their fix). I would like to make sure the pipeline gets 🟢. I am thinking we may not need this if there is a bug in elastic-package's elastic-agent settings.
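The version gate in the hunk above relies on a `__is_version_gte` helper whose body is not shown in this thread. A minimal sketch of what such a helper might look like (hypothetical implementation, assuming versions like "9.3.0" or "9.3.0-SNAPSHOT"):

```python
def is_version_gte(stack_version, major, minor):
    """Return True if stack_version (e.g. "9.3.0-SNAPSHOT") is at
    least major.minor. Pre-release suffixes like -SNAPSHOT are
    ignored; assumes at least a major.minor version string."""
    core = stack_version.split("-")[0]              # drop "-SNAPSHOT"
    parts = [int(p) for p in core.split(".")[:2]]   # [major, minor]
    return (parts[0], parts[1]) >= (major, minor)

print(is_version_gte("9.3.0-SNAPSHOT", 9, 3))  # True
print(is_version_gte("9.2.1", 9, 3))           # False
```

Tuple comparison handles the major/minor ordering, so "10.0.0" also compares as greater than 9.3.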

@donoghuc (Member) left a comment:

I'm good with the added logging, the more flexible assertion for bad_certificate vs certificate_required, and the more capable VM. But from what I can tell, elastic_agent no longer using SSL seems like a legit bug we need to fix rather than encoding unencrypted communication into our tests. It feels like a legit regression we need to address. Can we split that out?

@elasticmachine (Collaborator) commented:
💚 Build Succeeded

History

cc @mashhurs

@mashhurs mashhurs requested a review from donoghuc November 26, 2025 22:35