Skip to content

Conversation

chrischdi
Copy link

The controller was for each call of newAWSSession doing the following:

  • creating a random file (usually in /tmp) with a prefix aws-shared-credentials.
  • use that file to setup the session (via aws-sdk-go's session.NewSessionWithOptions
  • delete that file again

This started to fill up the slab caches indefinitely and thus leading to unnecessary memory usage.

While the kernel should probably free that memory, we can prevent even getting into that situation by now:

  • once creating a temporary file
  • re-using that over and over again
  • cover the function with a mutex to protect against concurrent access

Covering with a mutex has the negative point of not being concurrent, but we only have two controllers, both running with a concurrency of 1.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 2, 2025
@openshift-ci-robot
Copy link
Contributor

@chrischdi: This pull request references Jira Issue OCPBUGS-38759, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The controller was for each call of newAWSSession doing the following:

  • creating a random file (usually in /tmp) with a prefix aws-shared-credentials.
  • use that file to setup the session (via aws-sdk-go's session.NewSessionWithOptions
  • delete that file again

This started to fill up the slab caches indefinitely and thus leading to unnecessary memory usage.

While the kernel should probably free that memory, we can prevent even getting into that situation by now:

  • once creating a temporary file
  • re-using that over and over again
  • cover the function with a mutex to protect against concurrent access

Covering with a mutex has the negative point of not being concurrent, but we only have two controllers, both running with a concurrency of 1.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from RadekManak and damdo October 2, 2025 15:55

var (
sharedCredentialsFileMutex sync.Mutex
sharedCredentialsFileName = path.Join(os.TempDir(), "aws-shared-credentials"+rand.String(16))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can rand.String output characters which are invalid in a path?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only uses bcdfghjklmnpqrstvwxz2456789

	// We omit vowels from the set of available characters to reduce the chances
	// of "bad words" being formed.
	alphanums = "bcdfghjklmnpqrstvwxz2456789"

So no :-)

}

f, err := ioutil.TempFile("", "aws-shared-credentials")
f, err := os.Create(sharedCredentialsFileName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still writing the file every time, even if it now has the same filename.

If this is enough to stem the bleeding then fine, I guess, but we should put a big comment here explaining why we're not using ioutil.TempFile() and why we shouldn't put it back, preferrably with a link to a RHEL bug.

However, ideally we would return the same file if the contents are the same. How about we give the file some name which is unique to the secret's contents and don't create it if it already exists? A couple of ideas:

  • aws-shared-credentials-[hash of secret contents]
  • aws-shared-credentials-[secret.name]-[secret.uid]-[secret.generation]

The latter seems pretty simple to write. A downside is that over time we accumulate every version of the credentials we have ever seen in the temp directory, but I suspect that would never be an issue in practice and the only actor who can exploit it has no incentive to DoS themselves.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still writing the file every time, even if it now has the same filename.

If this is enough to stem the bleeding then fine, I guess, but we should put a big comment here explaining why we're not using ioutil.TempFile() and why we shouldn't put it back, preferrably with a link to a RHEL bug.

Going to add a comment 👍 agree.

However, ideally we would return the same file if the contents are the same. How about we give the file some name which is unique to the secret's contents and don't create it if it already exists? A couple of ideas:

  • aws-shared-credentials-[hash of secret contents]
  • aws-shared-credentials-[secret.name]-[secret.uid]-[secret.generation]

The latter seems pretty simple to write. A downside is that over time we accumulate every version of the credentials we have ever seen in the temp directory, but I suspect that would never be an issue in practice and the only actor who can exploit it has no incentive to DoS themselves.

I'd prefer to not cache the credentials on disk. If we want to do some caching instead, we might want to cache the session object in memory and not need the whole write/read/delete cycle at all. But I think this goes too far for this PR.

Comment on lines 404 to 406
if len(sessionOptions.SharedConfigFiles) > 0 {
os.Remove(sessionOptions.SharedConfigFiles[0])
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not deleting these files in all error paths. Is it possible we're just leaving a lot of files in tmpfs?

Hmm, probably not, given that your reproducer doesn't do this. However, we should probably still fix it if we're going to continue writing these files on ever call. I recommend this should:

  • Be called from a defer() immediately after calling sharedCredentialsFileFromSecret()
  • (and be implemented as an unconditional for loop)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, PTAL

@chrischdi
Copy link
Author

/test e2e-aws-serial
/test okd-scos-e2e-aws-ovn

Copy link
Member

@damdo damdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 7, 2025
@chrischdi
Copy link
Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 7, 2025
@openshift-ci-robot
Copy link
Contributor

@chrischdi: This pull request references Jira Issue OCPBUGS-38759, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@chrischdi
Copy link
Author

/test okd-scos-e2e-aws-ovn

@chrischdi
Copy link
Author

/test e2e-aws-serial

4h timeout

@chrischdi
Copy link
Author

/test e2e-aws-serial
/test okd-scos-e2e-aws-ovn

@damdo
Copy link
Member

damdo commented Oct 8, 2025

@chrischdi I can see the serial has passed but failed on the gather phase due to timeout.

I think it might be necessary to shard it (see: openshift/release#64853 for an example)

@sunzhaohua2
Copy link

@huali9 can you help to test this? thanks!

@damdo
Copy link
Member

damdo commented Oct 9, 2025

/assign @mdbooth

Matt are you happy with this? TY

@chrischdi
Copy link
Author

/test help

Copy link
Contributor

openshift-ci bot commented Oct 13, 2025

@chrischdi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aws
/test e2e-aws-operator
/test e2e-aws-serial-1of2
/test e2e-aws-serial-2of2
/test e2e-aws-upgrade
/test goimports
/test golint
/test govet
/test images
/test okd-scos-images
/test unit
/test verify-deps

The following commands are available to trigger optional jobs:

/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-aws-ipi-mapi

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-api-provider-aws-main-e2e-aws
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-operator
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-2of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-upgrade
pull-ci-openshift-machine-api-provider-aws-main-goimports
pull-ci-openshift-machine-api-provider-aws-main-golint
pull-ci-openshift-machine-api-provider-aws-main-govet
pull-ci-openshift-machine-api-provider-aws-main-images
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-images
pull-ci-openshift-machine-api-provider-aws-main-unit
pull-ci-openshift-machine-api-provider-aws-main-verify-deps

In response to this:

/test help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi
Copy link
Author

/test pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2
/test pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-2of2

Copy link
Contributor

openshift-ci bot commented Oct 13, 2025

@chrischdi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aws
/test e2e-aws-operator
/test e2e-aws-serial-1of2
/test e2e-aws-serial-2of2
/test e2e-aws-upgrade
/test goimports
/test golint
/test govet
/test images
/test okd-scos-images
/test unit
/test verify-deps

The following commands are available to trigger optional jobs:

/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-aws-ipi-mapi

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-api-provider-aws-main-e2e-aws
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-operator
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-2of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-upgrade
pull-ci-openshift-machine-api-provider-aws-main-goimports
pull-ci-openshift-machine-api-provider-aws-main-golint
pull-ci-openshift-machine-api-provider-aws-main-govet
pull-ci-openshift-machine-api-provider-aws-main-images
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-images
pull-ci-openshift-machine-api-provider-aws-main-unit
pull-ci-openshift-machine-api-provider-aws-main-verify-deps

In response to this:

/test pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2
/test pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-2of2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi
Copy link
Author

/test pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2

Copy link
Contributor

openshift-ci bot commented Oct 13, 2025

@chrischdi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aws
/test e2e-aws-operator
/test e2e-aws-serial-1of2
/test e2e-aws-serial-2of2
/test e2e-aws-upgrade
/test goimports
/test golint
/test govet
/test images
/test okd-scos-images
/test unit
/test verify-deps

The following commands are available to trigger optional jobs:

/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-aws-ipi-mapi

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-api-provider-aws-main-e2e-aws
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-operator
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-2of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-upgrade
pull-ci-openshift-machine-api-provider-aws-main-goimports
pull-ci-openshift-machine-api-provider-aws-main-golint
pull-ci-openshift-machine-api-provider-aws-main-govet
pull-ci-openshift-machine-api-provider-aws-main-images
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-images
pull-ci-openshift-machine-api-provider-aws-main-unit
pull-ci-openshift-machine-api-provider-aws-main-verify-deps

In response to this:

/test pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi chrischdi force-pushed the pr-fix-client-reuse-single-creds-file branch from 80f90e1 to 36e2ff3 Compare October 13, 2025 13:01
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 13, 2025
@chrischdi
Copy link
Author

chrischdi commented Oct 13, 2025

(rebased and squashed os it takes the new 1-of-2/2-of-2 jobs for serial to make them succeed...

@damdo
Copy link
Member

damdo commented Oct 13, 2025

/test e2e-aws

@huali9
Copy link

huali9 commented Oct 14, 2025

/verified by @huali9
Details can refer to the bug

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 14, 2025
@openshift-ci-robot
Copy link
Contributor

@huali9: This PR has been marked as verified by @huali9.

In response to this:

/verified by @huali9
Details can refer to the bug

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@huali9
Copy link

huali9 commented Oct 14, 2025

/test e2e-aws

1 similar comment
@huali9
Copy link

huali9 commented Oct 14, 2025

/test e2e-aws

@chrischdi
Copy link
Author

/test okd-scos-e2e-aws-ovn

Copy link
Member

@damdo damdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re adding

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 14, 2025
Copy link
Contributor

@mdbooth mdbooth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also ensure we're at least reporting errors returned by deferred code. In particular f.Close() and f.Remove() an return errors.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 14, 2025
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Oct 14, 2025
Copy link
Member

@damdo damdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re adding

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 14, 2025
@damdo
Copy link
Member

damdo commented Oct 14, 2025

/test e2e-aws

Copy link
Contributor

@mdbooth mdbooth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Contributor

openshift-ci bot commented Oct 14, 2025

@chrischdi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-serial 80f90e1 link true /test e2e-aws-serial
ci/prow/e2e-aws 5cab19a link true /test e2e-aws
ci/prow/okd-scos-e2e-aws-ovn 5cab19a link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@damdo
Copy link
Member

damdo commented Oct 14, 2025

/test e2e-aws-serial

Copy link
Contributor

openshift-ci bot commented Oct 14, 2025

@damdo: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aws
/test e2e-aws-operator
/test e2e-aws-serial-1of2
/test e2e-aws-serial-2of2
/test e2e-aws-upgrade
/test goimports
/test golint
/test govet
/test images
/test okd-scos-images
/test unit
/test verify-deps

The following commands are available to trigger optional jobs:

/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-aws-ipi-mapi

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-api-provider-aws-main-e2e-aws
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-operator
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-2of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-upgrade
pull-ci-openshift-machine-api-provider-aws-main-goimports
pull-ci-openshift-machine-api-provider-aws-main-golint
pull-ci-openshift-machine-api-provider-aws-main-govet
pull-ci-openshift-machine-api-provider-aws-main-images
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-images
pull-ci-openshift-machine-api-provider-aws-main-unit
pull-ci-openshift-machine-api-provider-aws-main-verify-deps

In response to this:

/test e2e-aws-serial

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Member

@damdo damdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Copy link
Contributor

openshift-ci bot commented Oct 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damdo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants