@jfreden
Contributor

@jfreden jfreden commented Sep 24, 2025

During a serverless incident (INC-4832) caused by frequent OOM exceptions, we discovered that ~30% of the heap was occupied by ProjectMetadata instances.

The ProjectMetadata instances were retained by a lambda in IndicesPermission. See this example of a path to a GC root:
9a9f6dfd-bd11-41ac-a0e2-345a86ba0509

The lambda exists to make index access control lazy. Because evaluation is deferred, the lambda holds on to its reference to ProjectMetadata for the full request life cycle (as opposed to building the index permissions and dropping the reference). This becomes a problem when many concurrent searches (or other index actions that require an index permission check) are coupled with frequent ProjectMetadata updates: each in-flight request pins the ProjectMetadata instance its lambda captured, so superseded instances can't be garbage collected until those requests complete.
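To illustrate the retention pattern described above, here is a minimal sketch. It is not the actual IndicesPermission code: `Metadata`, `lazyCheck`, and the index names are hypothetical stand-ins, and the check itself is deliberately trivial.

```java
import java.util.List;
import java.util.function.Supplier;

public class LazyCaptureSketch {

    // Hypothetical stand-in for ProjectMetadata: a large snapshot that is replaced on every update.
    record Metadata(long version, List<String> indexNames) {}

    // Lazy variant: the returned Supplier captures `metadata`, so the whole snapshot
    // stays strongly reachable for as long as the Supplier itself is reachable.
    static Supplier<Boolean> lazyCheck(Metadata metadata, String index) {
        return () -> metadata.indexNames().contains(index);
    }

    public static void main(String[] args) {
        Metadata snapshot = new Metadata(1, List.of("logs", "metrics"));
        Supplier<Boolean> check = lazyCheck(snapshot, "logs");

        // The caller drops its own reference, but the lambda still holds one,
        // so this snapshot cannot be collected while `check` is alive.
        snapshot = null;

        // The snapshot is only read here, possibly long after newer versions exist.
        System.out.println(check.get());
    }
}
```

If the `Supplier` lives as long as the request, every in-flight request keeps one full snapshot alive, which matches the heap picture seen during the incident.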

I've proven this by:

  1. Adding a sleep to TransportSearchAction to simulate slow searches
  2. Hooking up VisualVM to Elasticsearch
  3. Launching "slow" searches with ProjectMetadata updates in between (triggered by creating new indices)
  4. Triggering GC manually through VisualVM
  5. Observing memory usage by ProjectMetadata while the searches are hanging (to simulate requests in flight)

Before any requests

Screenshot 2025-09-24 at 13 52 34

While requests are in flight

Screenshot 2025-09-24 at 14 31 41

Fix

To fix this issue I've moved the part that needed ProjectMetadata outside of the lambda.
ProjectMetadata was needed to resolve failure store indices. With this PR we will do some extra work up front that #88708 tried to remove, but I think that's an acceptable trade-off for the memory gain.
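The shape of the fix can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `Metadata`, `failureStoreIndices`, and `lazyCheck` are hypothetical names, and the real resolution logic is more involved.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class EagerResolveSketch {

    // Hypothetical stand-in for ProjectMetadata.
    record Metadata(long version, List<String> indexNames, Set<String> failureStoreIndices) {}

    static Supplier<Boolean> lazyCheck(Metadata metadata, List<String> requested, String index) {
        // Eager step: everything that needs `metadata` is resolved before the lambda is created.
        Set<String> requestedFailureStores = requested.stream()
            .filter(name -> metadata.failureStoreIndices().contains(name))
            .collect(Collectors.toUnmodifiableSet());

        // The lambda captures only the small resolved set and the index name,
        // so `metadata` becomes collectable once the caller drops it.
        return () -> requestedFailureStores.contains(index);
    }

    public static void main(String[] args) {
        Metadata snapshot = new Metadata(1, List.of("logs", "logs-fs"), Set.of("logs-fs"));
        Supplier<Boolean> check = lazyCheck(snapshot, List.of("logs", "logs-fs"), "logs-fs");
        snapshot = null; // no remaining reference to the snapshot from `check`
        System.out.println(check.get());
    }
}
```

The trade-off is the one noted above: the resolution runs even if the lazy part is never evaluated, in exchange for not pinning the snapshot for the lifetime of the request.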

To validate the fix, I ran the same test as above and could see that ProjectMetadata could be garbage collected as soon as authorization had finished.

Screenshot 2025-09-24 at 14 04 37

@jfreden jfreden added :Security/Security Security issues without another label >enhancement labels Sep 24, 2025
@elasticsearchmachine
Collaborator

Hi @jfreden, I've created a changelog YAML for you.

@jfreden jfreden marked this pull request as ready for review September 24, 2025 12:37
@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Sep 24, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-security (Team:Security)

Contributor

@slobodanadamovic slobodanadamovic left a comment

LGTM 👍

Nice job on tracking this down!

@jfreden jfreden added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 29, 2025
@elasticsearchmachine elasticsearchmachine merged commit 651314e into elastic:main Sep 29, 2025
40 checks passed
@jfreden jfreden deleted the fix_authz_memory_issue branch September 29, 2025 12:00

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement :Security/Security Security issues without another label Team:Security Meta label for security team v9.2.0
