Skip to content

chore: network keepalive#319

Open
titouantanguy wants to merge 3 commits intomainfrom
titouan/chore/2835/network-keepalive
Open

chore: network keepalive#319
titouantanguy wants to merge 3 commits intomainfrom
titouan/chore/2835/network-keepalive

Conversation

@titouantanguy
Copy link
Contributor

@titouantanguy titouantanguy commented Dec 9, 2025

Description of changes

Adds keepalive to the gRPC connections (both client and server) as well as on the socat proxies.
Note that we may want to also add keepalive to the gRPC client in the connector ?

Issue ticket number and link

Shot at https://github.com/zama-ai/kms-internal/issues/2835

PR Checklist

I attest that all checked items are satisfied. Any deviation is clearly justified above.

  • Title follows conventional commits (e.g. chore: ...).
  • Tests added for every new pub item and test coverage has not decreased.
  • Public APIs and non-obvious logic documented; unfinished work marked as TODO(#issue).
  • unwrap/expect/panic only in tests or for invariant bugs (documented if present).
  • No dependency version changes OR (if changed) only minimal required fixes.
  • No architectural protocol changes OR linked spec PR/issue provided.
  • No breaking deployment config changes OR devops label + infra notified + infra-team reviewer assigned.
  • No breaking gRPC / serialized data changes OR commit marked with ! and affected teams notified.
  • No modifications to existing versionized structs OR backward compatibility tests updated.
  • No critical business logic / crypto changes OR ≥2 reviewers assigned.
  • No new sensitive data fields added OR Zeroize + ZeroizeOnDrop implemented.
  • No new public storage data OR data is verifiable (signature / digest).
  • No unsafe; if unavoidable: minimal, justified, documented, and test/fuzz covered.
  • Strongly typed boundaries: typed inputs validated at the edge; no untyped values or errors cross modules.
  • Self-review completed.

Dependency Update Questionnaire (only if deps changed or added)

Answer in the Cargo.toml next to the dependency (or here if updating):

  1. Ownership changes or suspicious concentration?
  2. Low popularity?
  3. Unusual version jump?
  4. Lacking documentation?
  5. Missing CI?
  6. No security / disclosure policy?
  7. Significant size increase?

More details and explanations for the checklist and dependency updates can be found in CONTRIBUTING.md

@cla-bot cla-bot bot added the cla-signed The CLA has been signed. label Dec 9, 2025
@titouantanguy titouantanguy changed the title Titouan/chore/2835/network keepalive chore: network keepalive Dec 9, 2025
@titouantanguy titouantanguy marked this pull request as ready for review December 9, 2025 16:33
@titouantanguy titouantanguy requested review from a team as code owners December 9, 2025 16:33
TOKEN_PORT=4100
KMS_SERVER_CONFIG_FILE="config.toml"
AWS_WEB_IDENTITY_TOKEN_FILE="token"
KEEPIDLE=30
Copy link
Contributor

@mkmks mkmks Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As (in)convenient as it is, one should avoid hardcoding configuration parameters in the enclave init script. There can be rare exceptions (like the ports for the logger and for the token reader) if they're critical for starting the application and aren't supposed to ever change, but keepalive parameters aren't among them, IMHO.

The reason to not do that is that whenever we want to change these parameters, we'll have to rebuild and redeploy the enclave image. It takes at least an hour in our current workflow.

I can accept hardcoded values for testing purposes but if they stay for longer than validating a solution, we have to extract them from the config file with yq in the init script.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. makes sense to make them configurable somehow. possibly via ENV vars or arguments to the script

Copy link
Contributor Author

@titouantanguy titouantanguy Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the right place for these parameters in the KMS config.toml file then ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and you already added tcp_keep_alive there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, but it works OK because these are expected fields on the KMS app.
Afaict, I can't add a section for socat in the toml file as KMS won't parse it correctly (unless ofc we had a socat struct in our KMS conf?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a separate section for socat, we can just use the same values we set for kms-server. Is there a good reason to be able to configure the socat keepalive settings separately from the kms-server keepalive settings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For KEEPIDLE and KEEPINTVL we could map them to tcp_keep_alive_secs and http2_keep_alive_interval_secs respectively , but there's no equivalent to KEEPCNT (tbh I've also no clue what a sensible value for this would be, so I'd be fine with leaving it default; whatever default is)

@titouantanguy titouantanguy added the docker Commits in PRs with this label trigger the build of docker images in CI label Dec 9, 2025
@github-actions
Copy link

github-actions bot commented Dec 9, 2025

Consolidated Tests Results 2025-12-12 - 10:30:55

Test Results

passed 4 passed

Details

tests 4 tests
clock not captured
tool junit-to-ctrf
build kind-testing arrow-right test-reporter link #360
pull-request chore: network keepalive link #319

test-reporter: Run #360

Tests 📝 Passed ✅ Failed ❌ Skipped ⏭️ Pending ⏳ Other ❓ Flaky 🍂 Duration ⏱️
4 4 0 0 0 0 0 not captured

🎉 All tests passed!

Tests

View All Tests
Test Name Status Flaky Duration
full_gen_tests_default_threshld_sequential_crs 2m 21s
test_threshld_insecure 3m 20s
full_gen_tests_default_centralzd_sequential_crs 1.8s
test_centralzd_insecure 2m 2s

🍂 No flaky tests in this run.

Github Test Reporter by CTRF 💚

🔄 This comment has been updated

@titouantanguy
Copy link
Contributor Author

titouantanguy commented Dec 12, 2025

Note that I removed the 60s timeout on the socat process, as my hope is that the keepalive is better suited for this job (and if we can, I guess we'd rather keep the same socat process than creating a new one?)
But not entirely sure that's a good choice, especially since it looks like the timeout indeed fixed the FD issue ?

@titouantanguy titouantanguy self-assigned this Dec 15, 2025
@dd23
Copy link
Member

dd23 commented Jan 19, 2026

This is slightly outdated and is only needed for the enclave proxies.
Might not be needed with tunnels any longer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed The CLA has been signed. docker Commits in PRs with this label trigger the build of docker images in CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants