Skip to content

Conversation

chrischdi
Copy link

The controller was for each call of newAWSSession doing the following:

  • creating a random file (usually in /tmp) with a prefix aws-shared-credentials.
  • use that file to setup the session (via aws-sdk-go's session.NewSessionWithOptions
  • delete that file again

This started to fill up the slab caches indefinitely and thus leading to unnecessary memory usage.

While the kernel should probably free that memory, we can prevent even getting into that situation by now:

  • once creating a temporary file
  • re-using that over and over again
  • cover the function with a mutex to protect against concurrent access

Covering with a mutex has the negative point of not being concurrent, but we only have two controllers, both running with a concurrency of 1.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 2, 2025
@openshift-ci-robot
Copy link
Contributor

@chrischdi: This pull request references Jira Issue OCPBUGS-38759, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The controller was for each call of newAWSSession doing the following:

  • creating a random file (usually in /tmp) with a prefix aws-shared-credentials.
  • use that file to setup the session (via aws-sdk-go's session.NewSessionWithOptions
  • delete that file again

This started to fill up the slab caches indefinitely and thus leading to unnecessary memory usage.

While the kernel should probably free that memory, we can prevent even getting into that situation by now:

  • once creating a temporary file
  • re-using that over and over again
  • cover the function with a mutex to protect against concurrent access

Covering with a mutex has the negative point of not being concurrent, but we only have two controllers, both running with a concurrency of 1.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from RadekManak and damdo October 2, 2025 15:55

var (
sharedCredentialsFileMutex sync.Mutex
sharedCredentialsFileName = path.Join(os.TempDir(), "aws-shared-credentials"+rand.String(16))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can rand.String output characters which are invalid in a path?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only uses bcdfghjklmnpqrstvwxz2456789

	// We omit vowels from the set of available characters to reduce the chances
	// of "bad words" being formed.
	alphanums = "bcdfghjklmnpqrstvwxz2456789"

So no :-)

}

f, err := ioutil.TempFile("", "aws-shared-credentials")
f, err := os.Create(sharedCredentialsFileName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still writing the file every time, even if it now has the same filename.

If this is enough to stem the bleeding then fine, I guess, but we should put a big comment here explaining why we're not using ioutil.TempFile() and why we shouldn't put it back, preferrably with a link to a RHEL bug.

However, ideally we would return the same file if the contents are the same. How about we give the file some name which is unique to the secret's contents and don't create it if it already exists? A couple of ideas:

  • aws-shared-credentials-[hash of secret contents]
  • aws-shared-credentials-[secret.name]-[secret.uid]-[secret.generation]

The latter seems pretty simple to write. A downside is that over time we accumulate every version of the credentials we have ever seen in the temp directory, but I suspect that would never be an issue in practice and the only actor who can exploit it has no incentive to DoS themselves.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still writing the file every time, even if it now has the same filename.

If this is enough to stem the bleeding then fine, I guess, but we should put a big comment here explaining why we're not using ioutil.TempFile() and why we shouldn't put it back, preferrably with a link to a RHEL bug.

Going to add a comment 👍 agree.

However, ideally we would return the same file if the contents are the same. How about we give the file some name which is unique to the secret's contents and don't create it if it already exists? A couple of ideas:

  • aws-shared-credentials-[hash of secret contents]
  • aws-shared-credentials-[secret.name]-[secret.uid]-[secret.generation]

The latter seems pretty simple to write. A downside is that over time we accumulate every version of the credentials we have ever seen in the temp directory, but I suspect that would never be an issue in practice and the only actor who can exploit it has no incentive to DoS themselves.

I'd prefer to not cache the credentials on disk. If we want to do some caching instead, we might want to cache the session object in memory and not need the whole write/read/delete cycle at all. But I think this goes too far for this PR.

Comment on lines 404 to 406
if len(sessionOptions.SharedConfigFiles) > 0 {
os.Remove(sessionOptions.SharedConfigFiles[0])
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not deleting these files in all error paths. Is it possible we're just leaving a lot of files in tmpfs?

Hmm, probably not, given that your reproducer doesn't do this. However, we should probably still fix it if we're going to continue writing these files on ever call. I recommend this should:

  • Be called from a defer() immediately after calling sharedCredentialsFileFromSecret()
  • (and be implemented as an unconditional for loop)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, PTAL

@chrischdi
Copy link
Author

/test e2e-aws-serial
/test okd-scos-e2e-aws-ovn

Copy link
Member

@damdo damdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 7, 2025
@chrischdi
Copy link
Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 7, 2025
@openshift-ci-robot
Copy link
Contributor

@chrischdi: This pull request references Jira Issue OCPBUGS-38759, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@chrischdi
Copy link
Author

/test okd-scos-e2e-aws-ovn

@chrischdi
Copy link
Author

/test e2e-aws-serial

4h timeout

@chrischdi
Copy link
Author

/test e2e-aws-serial
/test okd-scos-e2e-aws-ovn

@damdo
Copy link
Member

damdo commented Oct 8, 2025

@chrischdi I can see the serial has passed but failed on the gather phase due to timeout.

I think it might be necessary to shard it (see: openshift/release#64853 for an example)

@sunzhaohua2
Copy link

@huali9 can you help to test this? thanks!

@damdo
Copy link
Member

damdo commented Oct 9, 2025

/assign @mdbooth

Matt are you happy with this? TY

@chrischdi
Copy link
Author

/test help

Copy link
Contributor

openshift-ci bot commented Oct 13, 2025

@chrischdi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aws
/test e2e-aws-operator
/test e2e-aws-serial-1of2
/test e2e-aws-serial-2of2
/test e2e-aws-upgrade
/test goimports
/test golint
/test govet
/test images
/test okd-scos-images
/test unit
/test verify-deps

The following commands are available to trigger optional jobs:

/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-aws-ipi-mapi

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-api-provider-aws-main-e2e-aws
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-operator
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-2of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-upgrade
pull-ci-openshift-machine-api-provider-aws-main-goimports
pull-ci-openshift-machine-api-provider-aws-main-golint
pull-ci-openshift-machine-api-provider-aws-main-govet
pull-ci-openshift-machine-api-provider-aws-main-images
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-images
pull-ci-openshift-machine-api-provider-aws-main-unit
pull-ci-openshift-machine-api-provider-aws-main-verify-deps

In response to this:

/test help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi
Copy link
Author

/test pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2
/test pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-2of2

Copy link
Contributor

openshift-ci bot commented Oct 13, 2025

@chrischdi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aws
/test e2e-aws-operator
/test e2e-aws-serial-1of2
/test e2e-aws-serial-2of2
/test e2e-aws-upgrade
/test goimports
/test golint
/test govet
/test images
/test okd-scos-images
/test unit
/test verify-deps

The following commands are available to trigger optional jobs:

/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-aws-ipi-mapi

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-api-provider-aws-main-e2e-aws
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-operator
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-2of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-upgrade
pull-ci-openshift-machine-api-provider-aws-main-goimports
pull-ci-openshift-machine-api-provider-aws-main-golint
pull-ci-openshift-machine-api-provider-aws-main-govet
pull-ci-openshift-machine-api-provider-aws-main-images
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-images
pull-ci-openshift-machine-api-provider-aws-main-unit
pull-ci-openshift-machine-api-provider-aws-main-verify-deps

In response to this:

/test pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2
/test pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-2of2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi
Copy link
Author

/test pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2

Copy link
Contributor

openshift-ci bot commented Oct 13, 2025

@chrischdi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aws
/test e2e-aws-operator
/test e2e-aws-serial-1of2
/test e2e-aws-serial-2of2
/test e2e-aws-upgrade
/test goimports
/test golint
/test govet
/test images
/test okd-scos-images
/test unit
/test verify-deps

The following commands are available to trigger optional jobs:

/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-aws-ipi-mapi

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-api-provider-aws-main-e2e-aws
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-operator
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-2of2
pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-upgrade
pull-ci-openshift-machine-api-provider-aws-main-goimports
pull-ci-openshift-machine-api-provider-aws-main-golint
pull-ci-openshift-machine-api-provider-aws-main-govet
pull-ci-openshift-machine-api-provider-aws-main-images
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-api-provider-aws-main-okd-scos-images
pull-ci-openshift-machine-api-provider-aws-main-unit
pull-ci-openshift-machine-api-provider-aws-main-verify-deps

In response to this:

/test pull-ci-openshift-machine-api-provider-aws-main-e2e-aws-serial-1of2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi chrischdi force-pushed the pr-fix-client-reuse-single-creds-file branch from 80f90e1 to 36e2ff3 Compare October 13, 2025 13:01
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 13, 2025
Copy link
Contributor

openshift-ci bot commented Oct 13, 2025

New changes are detected. LGTM label has been removed.

@chrischdi
Copy link
Author

chrischdi commented Oct 13, 2025

(rebased and squashed os it takes the new 1-of-2/2-of-2 jobs for serial to make them succeed...

Copy link
Contributor

openshift-ci bot commented Oct 13, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from damdo. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@damdo
Copy link
Member

damdo commented Oct 13, 2025

/test e2e-aws

Copy link
Contributor

openshift-ci bot commented Oct 13, 2025

@chrischdi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-serial 80f90e1 link true /test e2e-aws-serial
ci/prow/e2e-aws 36e2ff3 link true /test e2e-aws
ci/prow/okd-scos-e2e-aws-ovn 36e2ff3 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants