[FIPS] Test that ES client will not connect to ES with invalid TLS certificate#5088

Merged
ycombinator merged 13 commits into elastic:main from ycombinator:fips-es-conn
Jul 10, 2025

Conversation

@ycombinator
Contributor

What is the problem this PR solves?

This PR ensures that any connections made by a FIPS-capable Fleet Server to Elasticsearch will only succeed if Elasticsearch is also FIPS-capable.

How does this PR solve the problem?

This PR adds a new test, TestConnectionTLS, that fakes an Elasticsearch HTTPS server that returns a TLS certificate that's been created with a key length of < 2048 bits, making it invalid for FIPS-compliant use.

If running in FIPS mode, the test asserts that Fleet Server's connection to Elasticsearch will fail with a TLS error.
If not running in FIPS mode, the test asserts that Fleet Server's connection to Elasticsearch will succeed.

How to test this PR locally

In a non-FIPS environment:

$ go test ./internal/pkg/es/... -v -test.run TestConnectionTLS -test.count 1
=== RUN   TestConnectionTLS
--- PASS: TestConnectionTLS (0.00s)
PASS
ok  	github.com/elastic/fleet-server/v7/internal/pkg/es	0.389s

In a FIPS environment, i.e. with the Microsoft Go fork installed and with the OpenSSL FIPS provider installed:

$ GOEXPERIMENT=systemcrypto go test --tags=requirefips ./internal/pkg/es/... -v -test.run TestConnectionTLS -test.count 1
=== RUN   TestConnectionTLS
2025/07/03 15:46:02 http: TLS handshake error from 127.0.0.1:52214: tls: failed to sign handshake: EVP_PKEY_sign_init failed
openssl error(s):
error:1C800069:Provider routines::invalid key length
	providers/common/securitycheck.c:65
--- PASS: TestConnectionTLS (0.00s)
PASS
ok  	github.com/elastic/fleet-server/v7/internal/pkg/es	0.022s

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@ycombinator ycombinator requested a review from a team as a code owner July 3, 2025 22:52
@prodsecmachine
prodsecmachine commented Jul 3, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane, backport-8.19, and backport-9.1 labels Jul 3, 2025
@michel-laterman
Contributor

We don't have unit tests that run with the FIPS provider here, we should probably change that if it's now required

@ycombinator
Contributor Author

@v1v @oakrizan @pazone @pkoutsovasilis could one of you please help me understand why CI is failing on this PR? See https://buildkite.com/elastic/fleet-server/builds/9107#0197e9e5-3d2b-407b-90cc-b958d69a7747. I don't understand which command is exiting with status 126.

In general, what can be done to make it easier to debug such issues? I tried adding set -o xtrace at the start of the shell script invoked by this CI step, but that didn't make it any more obvious which command is failing or why.

@v1v
Member

v1v commented Jul 8, 2025

I don't understand which command is exiting with status 126.

Does it work in main? Or is it just a new step?

I can see https://buildkite.com/elastic/fleet-server/builds/9100#0197e5b8-3de9-447a-b0de-0c0f679427f2/1 used .buildkite/scripts/unit_test_fipsonly.sh

Regardless, I see it uses with_msft_go. However, the pre-command hook also installs the Go version, so I don't understand the reason for running with_msft_go or with_go again.

I can see some errors when invoking mage -clean:

2025-07-08 14:05:46 CEST | + mage -clean
2025-07-08 14:05:46 CEST | No version is set for command mage
2025-07-08 14:05:46 CEST | Consider adding one of the following versions in your config file at
2025-07-08 14:05:46 CEST | mage 1.14.0
2025-07-08 14:05:46 CEST | mage 1.15.0

IIUC, this project does not use asdf to install mage, correct? If so, there might be some collision.

what can be done to make it easier to debug such issues?

The platform productivity team enabled a feature to launch VM images, so you can create a VM based on platform-ingest-fleet-server-ubuntu-2204-fips-1751684469


I've just found that you added platform-ingest-elastic-agent-ubuntu-2204-fips-1750467641, and that image is not intended to be used here.

That's the reason for the references to asdf. You should use the platform-ingest-fleet-server-ubuntu-2204-fips image.

I see you changed it recently, and https://buildkite.com/elastic/fleet-server/builds/9114#0197ead1-94ad-477c-8830-9063843a8b55/138-191 is failing for other reasons.

@ycombinator
Contributor Author

Thanks @v1v for the prompt response, as always! ❤️

I see you changed it recently, and https://buildkite.com/elastic/fleet-server/builds/9114#0197ead1-94ad-477c-8830-9063843a8b55/138-191 is failing for other reasons.

Yes, this was something @michel-laterman suggested to me off-PR (thanks!) and it seems to have got us past the 126 exit code failure so I'm in good shape now. 👍

@ycombinator
Contributor Author

The platform productivity team enabled a feature to launch VM images, so you can create a VM based on platform-ingest-fleet-server-ubuntu-2204-fips-1751684469

Thanks @v1v. I created an EC2 instance with this image and am able to reproduce the failure seen in CI over there. 👍

@ycombinator ycombinator force-pushed the fips-es-conn branch 2 times, most recently from ada8418 to 582d93f Compare July 8, 2025 23:46
@ycombinator
Contributor Author

ycombinator commented Jul 9, 2025

Both FIPS unit test steps are failing in CI on this PR like so:

Starting the unit tests...
+ mage test:unit test:junitReport
go: downloading github.com/elastic/elastic-agent-libs v0.21.0
... omitted for brevity ...
go: downloading github.com/go-logr/stdr v1.2.2
panic: LoadImport called with empty package path [recovered]
	panic: LoadImport called with empty package path

goroutine 1 [running]:
cmd/go/internal/load.(*preload).flush(0xc000782000)
	cmd/go/internal/load/pkg.go:1128 +0x6a
panic({0xa80aa0?, 0xc806e0?})
	runtime/panic.go:792 +0x132
cmd/go/internal/load.loadImport({0xc89160, 0x10c0220}, {0x0, 0x1, 0x0, 0x0, 0x0, 0x0}, 0x0, {0xc000120855, ...}, ...)
	cmd/go/internal/load/pkg.go:717 +0x1530
cmd/go/internal/load.(*Package).load(0xc002138c08, {0xc89160, 0x10c0220}, {0x0, 0x1, 0x0, 0x0, 0x0, 0x0}, {0xc00033ac30, ...}, ...)
	cmd/go/internal/load/pkg.go:2035 +0x24ce
cmd/go/internal/load.loadImport({0xc89160, 0x10c0220}, {0x0, 0x1, 0x0, 0x0, 0x0, 0x0}, 0xc000782000, {0xc00033ac30, ...}, ...)
	cmd/go/internal/load/pkg.go:780 +0x52f
cmd/go/internal/load.PackagesAndErrors({0xc89160?, 0x10c0220?}, {0x0, 0x1, 0x0, 0x0, 0x0, 0x0}, {0xc000039040, 0x1, ...})
	cmd/go/internal/load/pkg.go:2925 +0xa33
cmd/go/internal/test.runTest({0xc89160, 0x10c0220}, 0x1090880, {0xc000020090?, 0xa80aa0?, 0xa80c60?})
	cmd/go/internal/test/test.go:706 +0x369
main.invoke(0x1090880, {0xc000020080, 0x6, 0x6})
	cmd/go/main.go:341 +0x845
main.main()
	cmd/go/main.go:220 +0xe8b
Error: running "go test -tags=grpcnotrace,requirefips,ms_tls13kdf -v -race -coverprofile=build/coverage-linux.out ./..." failed with exit code 2

I manually spun up an EC2 VM with the same AMI as used in this PR, platform-ingest-fleet-server-ubuntu-2204-fips-1751684469, and am able to reproduce the failure.

$ pwd
/opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server
$ whoami
buildkite-agent
$ FIPS=true ./.buildkite/scripts/unit_test.sh
... omitted for brevity ...
+ with_msft_go
+ echo 'Setting up microsoft/go'
Setting up microsoft/go
+ create_workspace
+ [[ ! -d /opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server/bin ]]
+ check_platform_architeture
+ case "${hw_type}" in
+ arch_type=amd64
++ cat .go-version
+ MSFT_DOWNLOAD_URL=https://aka.ms/golang/release/latest/go1.24.4.linux-amd64.tar.gz
++ curl -sL -o - https://aka.ms/golang/release/latest/go1.24.4.linux-amd64.tar.gz
++ tar -xz -f - -C /opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server/bin
+ retry 5
+ local retries=5
+ shift
+ local count=0
+ return 0
+ export PATH=/opt/buildkite-agent/.asdf/shims:/opt/buildkite-agent/.asdf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/puppetlabs/bin:/opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server/bin:/opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server/bin/go/bin
+ PATH=/opt/buildkite-agent/.asdf/shims:/opt/buildkite-agent/.asdf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/puppetlabs/bin:/opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server/bin:/opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server/bin/go/bin
+ go version
go version go1.24.4 linux/amd64
+ which go
/opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server/bin/go/bin/go
++ go env GOPATH
+ export PATH=/opt/buildkite-agent/.asdf/shims:/opt/buildkite-agent/.asdf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/puppetlabs/bin:/opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server/bin:/opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server/bin/go/bin:/opt/buildkite-agent/go/bin
+ PATH=/opt/buildkite-agent/.asdf/shims:/opt/buildkite-agent/.asdf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/puppetlabs/bin:/opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server/bin:/opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server/bin/go/bin:/opt/buildkite-agent/go/bin
... omitted for brevity ...
Starting the unit tests...
+ mage test:unit test:junitReport
panic: LoadImport called with empty package path [recovered]
	panic: LoadImport called with empty package path

goroutine 1 [running]:
cmd/go/internal/load.(*preload).flush(0xc000bbabb0)
	cmd/go/internal/load/pkg.go:1128 +0x6a
panic({0xa80aa0?, 0xc806e0?})
	runtime/panic.go:792 +0x132
cmd/go/internal/load.loadImport({0xc89160, 0x10c0220}, {0x0, 0x1, 0x0, 0x0, 0x0, 0x0}, 0x0, {0x7fd1e732e855, ...}, ...)
	cmd/go/internal/load/pkg.go:717 +0x1530
cmd/go/internal/load.(*Package).load(0xc001a03808, {0xc89160, 0x10c0220}, {0x0, 0x1, 0x0, 0x0, 0x0, 0x0}, {0xc0005a32c0, ...}, ...)
	cmd/go/internal/load/pkg.go:2035 +0x24ce
cmd/go/internal/load.loadImport({0xc89160, 0x10c0220}, {0x0, 0x1, 0x0, 0x0, 0x0, 0x0}, 0xc000bbabb0, {0xc0005a32c0, ...}, ...)
	cmd/go/internal/load/pkg.go:780 +0x52f
cmd/go/internal/load.PackagesAndErrors({0xc89160?, 0x10c0220?}, {0x0, 0x1, 0x0, 0x0, 0x0, 0x0}, {0xc000030fe0, 0x1, ...})
	cmd/go/internal/load/pkg.go:2925 +0xa33
cmd/go/internal/test.runTest({0xc89160, 0x10c0220}, 0x1090880, {0xc000020090?, 0xa80aa0?, 0xa80c60?})
	cmd/go/internal/test/test.go:706 +0x369
main.invoke(0x1090880, {0xc000020080, 0x6, 0x6})
	cmd/go/main.go:341 +0x845
main.main()
	cmd/go/main.go:220 +0xe8b
Error: running "go test -tags=grpcnotrace,requirefips,ms_tls13kdf -v -race -coverprofile=build/coverage-linux.out ./..." failed with exit code 2

But I haven't figured out why it's happening or how to fix it. Some observations that might give some hints:

If I run the go test command above but provide a package list that starts with a folder, the command succeeds:

$ GOEXPERIMENT=systemcrypto ./bin/go/bin/go test -tags=grpcnotrace,requirefips,ms_tls13kdf -v -race -coverprofile=build/coverage-linux.out ./internal/...
=== RUN   TestNewDispatcher
--- PASS: TestNewDispatcher (0.00s)
... omitted for brevity ...
PASS
coverage: 80.0% of statements
ok  	github.com/elastic/fleet-server/v7/internal/pkg/ver	1.110s	coverage: 80.0% of statements

If I don't use the microsoft/go fork (by not specifying FIPS=true when running the unit_test.sh script), the command succeeds:

$ pwd
/opt/buildkite-agent/builds/bk-agent-prod-aws-1752018683648035648/elastic/fleet-server
$ rm -rf bin/go    # Cleanup microsoft/go installation
$ ./.buildkite/scripts/unit_test.sh
... omitted for brevity ...
Starting the unit tests...
+ mage test:unit test:junitReport
... omitted for brevity ...
PASS
coverage: 80.0% of statements
ok  	github.com/elastic/fleet-server/v7/internal/pkg/ver	1.043s	coverage: 80.0% of statements
?   	github.com/elastic/fleet-server/v7/version	[no test files]
$ echo $?
0

@v1v @michel-laterman have you seen this error before? any ideas as to what might be going on or how to try and fix this?

@ycombinator ycombinator force-pushed the fips-es-conn branch 2 times, most recently from 83ac0b6 to bc2796f Compare July 9, 2025 17:30
@ycombinator ycombinator requested a review from v1v July 9, 2025 21:04
michel-laterman previously approved these changes Jul 9, 2025
Member

@v1v v1v left a comment

LGTM (I only reviewed the CI changes)

@ycombinator ycombinator enabled auto-merge (squash) July 10, 2025 17:38
@ycombinator ycombinator merged commit c0ae099 into elastic:main Jul 10, 2025
9 checks passed
mergify bot pushed a commit that referenced this pull request Jul 10, 2025
…rtificate (#5088)

* Adding unit test for connecting to FIPS-incapable ES

* Make linter happy

* Reordering imports

* Run FIPS unit tests on FIPS VM

* Install Microsoft Go if FIPS=true

* Debugging

* Use fleet server FIPS VM image

* Debugging: extracting microsoft/go outside of fleet-server folder

* Explicitly specify Go distribution for tests

* Use temporary folder for microsoft/go SDK

* Don't pass GOEXPERIMENT=systemcrypto when running tests with Go stdlib

* Remove debugging statements

* Reduce VM size

(cherry picked from commit c0ae099)
ycombinator added a commit that referenced this pull request Jul 10, 2025
…rtificate (#5088) (#5142)

* Adding unit test for connecting to FIPS-incapable ES

* Make linter happy

* Reordering imports

* Run FIPS unit tests on FIPS VM

* Install Microsoft Go if FIPS=true

* Debugging

* Use fleet server FIPS VM image

* Debugging: extracting microsoft/go outside of fleet-server folder

* Explicitly specify Go distribution for tests

* Use temporary folder for microsoft/go SDK

* Don't pass GOEXPERIMENT=systemcrypto when running tests with Go stdlib

* Remove debugging statements

* Reduce VM size

(cherry picked from commit c0ae099)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
ycombinator added a commit that referenced this pull request Jul 10, 2025
…rtificate (#5088) (#5141)

* Adding unit test for connecting to FIPS-incapable ES

* Make linter happy

* Reordering imports

* Run FIPS unit tests on FIPS VM

* Install Microsoft Go if FIPS=true

* Debugging

* Use fleet server FIPS VM image

* Debugging: extracting microsoft/go outside of fleet-server folder

* Explicitly specify Go distribution for tests

* Use temporary folder for microsoft/go SDK

* Don't pass GOEXPERIMENT=systemcrypto when running tests with Go stdlib

* Remove debugging statements

* Reduce VM size

(cherry picked from commit c0ae099)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>