Skip to content

Conversation

@szybia
Copy link
Contributor

@szybia szybia commented Sep 23, 2025

  • Add roles with DLS stats to security stats, using the same code and format as /_xpack/usage:
{
  "nodes": {
    "NzgeBRW-RyOeVOwGogV3lQ": {
      "roles": {
        "dls": {
          "bit_set_cache": {
            "count": 0,
            "memory": "0b",
            "memory_in_bytes": 0,
            "hits": 0,
            "misses": 0,
            "evictions": 0,
            "hits_time_in_millis": 0,
            "misses_time_in_millis": 0
          }
        }
      }
    }
  }
}

This is a follow up to #134835, which added the (empty) _security/stats API itself.

…-dls

* upstream/main: (188 commits)
  XContent parsing for DocStats and ShardFieldStats (elastic#135016)
  ESQL: Fix multi cluster tests not working with @repeat (elastic#135208)
  ESQL: Enable Lookup Join on Expression Tech Preview (elastic#134952)
  Use gradleProperty instead of findProperty (elastic#135203)
  Update 133980.yaml
  Small fixes
  Adjust commit sizes to reflect change in TieredMergePolicy
  Update 133980.yaml
  Update docs/changelog/133980.yaml
  Use released lucene 10.3.0
  Update to Lucene 10.3.0 RC2
  Aggs: Fix test (elastic#134308)
  correct docs version
  Use last lucene 10.3 snapshot
  [Automated] Update Lucene snapshot to 10.4.0-snapshot-28bd476b1a1
  Update ValuesSourceReaderOperatorTests test condition for changes in lucene (elastic#133807)
  [Automated] Update Lucene snapshot to 10.3.0-snapshot-878a3db9c2d
  [Automated] Update Lucene snapshot to 10.3.0-snapshot-878a3db9c2d
  [Automated] Update Lucene snapshot to 10.3.0-snapshot-d664abb181a
  [Automated] Update Lucene snapshot to 10.3.0-snapshot-de669618c3c
  ...
@szybia szybia added >enhancement :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC labels Sep 23, 2025
@szybia szybia marked this pull request as ready for review September 23, 2025 15:57
@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Sep 23, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @szybia, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-security (Team:Security)

…-dls

* upstream/main:
  Bump FLEET_AGENTS_MAPPINGS_VERSION so the new mapping applies on upgrades (elastic#134957)
  [ML] Adding custom headers support openai text embeddings (elastic#134960)
  Fix systemd notify to use a shared arena (elastic#135235)
@szybia szybia requested review from joegallo and tvernum September 23, 2025 16:05
} else {
rolesStore.usageStats(ActionListener.wrap(rolesStatsFuture::complete, rolesStatsFuture::completeExceptionally));
}
return new GetSecurityStatsNodeResponse(clusterService.localNode(), rolesStatsFuture.join());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will block a transport thread and NativeRolesStore.usageStats is potentially expensive.

Rather than using a future, we should override nodeOperationAsync instead and make use of the listener.

Copy link
Contributor Author

@szybia szybia Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, unless i'm missing something, from my reading of the code and testing, nodeOperation doesn't run on transport threads, but the threadpool that's defined at the top (management thread pool):

    @Override
    protected GetSecurityStatsNodeResponse nodeOperation(final GetSecurityStatsNodeRequest request, final Task task) {
        if (Transports.isTransportThread(Thread.currentThread())) {
            LOG.info("szy: transport thread");
            throw new IllegalArgumentException("transport thread");
        } else {
            LOG.info("szy: not transport thread");
        }
[2025-09-24T13:48:33,922][INFO ][o.e.x.s.a.s.TransportSecurityStatsAction] [runTask-0] szy: not transport thread: elasticsearch[runTask-0][management][T#1]
[2025-09-24T13:48:33,925][INFO ][o.e.x.s.a.s.TransportSecurityStatsAction] [runTask-1] szy: not transport thread: elasticsearch[runTask-1][management][T#2]
[2025-09-24T13:48:33,924][INFO ][o.e.x.s.a.s.TransportSecurityStatsAction] [runTask-2] szy: not transport thread: elasticsearch[runTask-2][management][T#2]

i also think this is backed up by the fact that there are no overrides in ES of nodeOperationAsync, everything overrides nodeOperation, and assuming some of those do heavy things, and if they were in transport this would've been spotted

based on javadoc, my final understanding of nodeOperationAsync is that it's just to get access to the listener, maybe leading to cleaner code, but i've tried it out and don't see the hype over what i currently have:

    @Override
    protected GetSecurityStatsNodeResponse nodeOperation(final GetSecurityStatsNodeRequest request, final Task task) {
        return null; // never called because we override nodeOperationAsync
    }

    @Override
    protected void nodeOperationAsync(final GetSecurityStatsNodeRequest request,
                                      final Task task,
                                      final ActionListener<GetSecurityStatsNodeResponse> listener) {
        if (rolesStore == null) {
            listener.onResponse(new GetSecurityStatsNodeResponse(clusterService.localNode(), null));
        } else {
            rolesStore.usageStats(
                ActionListener.wrap(
                    stats -> listener.onResponse(new GetSecurityStatsNodeResponse(clusterService.localNode(), stats)),
                    listener::onFailure
                )
            );
        }
    }

thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right it will be a management thread not a transport thread, but we still end up blocking that thread while doing the work on another thread pool.

That risks contention on the management pool, potentially even with deadlocks if the management pool is completely full with threads that are waiting for work from other threads.

my final understanding of nodeOperationAsync is that it's just to get access to the listener, maybe leading to cleaner code

Not for cleaner code, but so you can perform asynchronous operations. You need the listener because you can't produce the required return value on the current thread, you have to return it asynchronously and for that you need the listener (because blocking on a future isn't something we want to do).

To my knowledge there are no other TransportNodesAction that need to do asynchronous work in the node operation which is why there are no implementations of nodeOperationAsync

However, that opens the question of whether we actually should be doing this work on every node. I hadn't really thought about that when I agreed with using the existing usage stats.
The file and dls stats make sense in a per-node response, but the native stats are inherently cluster-wide stats - they're returning information from the security index - and it doesn't make sense to calculate and return the same stats on every node in the cluster.

We should probably rethink that.

Our options (long term, not necessarily to be handled this PR):

  1. Accept that it's OK for every node to calculate the same stats, and stick with what we have. There are two issues here:
    1. It's potentially a performance issue (not necessarily for native role stats, but if we continue down this path it could be)
    2. There's the chance for timing issues and having nodes return different stats. That's pretty confusing - what does it mean for 1 node to have 134 native roles and another node to have 135 native rules?
  2. Calculate them once, but slot that identical value into the JSON response for each node, that overcomes the two issues above, but at the cost of:
    1. It's just weird to pretend these are node stats if they're cluster stats
    2. Now we have the problem of hiding (lying) if 1 node actually has an incorrect view of the stats (e.g. a caching bug). The response would tell you that the node has the same view of the data as all other nodes, but that's not actually the case.
  3. Move some stats to the top level of the JSON (outside of "nodes"). That's harder for client to consume, but more accurate (we could calculate those stats on the coordinating node for the request).

I think option 3 is best, but maybe Data Management has a view on how stats APIs ought to work in these sorts of cases.

Copy link
Contributor Author

@szybia szybia Sep 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very fair points

after having a think, given i'm trying to ship DLS stats in 9.2 (FF 30th Sep)

i think best plan of attack is to just ship the DLS stats which don't seem to be contentious

and then have a think about structure for everything else

i've updated the code

  • it should be extensible to all your ideas above
  • follows _xpack/usage pattern
  • shouldn't be blocking now so i've left it as nodeOperation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: long term options: I had a quick sanity check conversation with @dakrone, and after that conversation I feel safe saying that for the moment that list of options looks good enough that we aren't trapped in some bad state if we release this with only the dls cache stats now and then worry about how to handle (for example) the native stats later.

…-dls

* upstream/main: (100 commits)
  ES|QL: Add FUSE operator tests (elastic#135307)
  [D0CS] Revise connector setup steps in documentation (elastic#135426)
  Fix DiscoveryDisruptionIT.testElectMasterWithLatestVersion (elastic#135396)
  [DOCS] Marks the change point agg as GA (elastic#134898)
  Rework ShardSearchContextId to explain use of searcher id better (elastic#135233)
  [CI] Handle caching bwc dependencies more gracefully (elastic#135417)
  Mute org.elasticsearch.gradle.TestClustersPluginFuncTest override jdk usage via ES_JAVA_HOME for known jdk os incompatibilities elastic#135413
  [Build] update eclipse formatter used by spotless (elastic#135382)
  [Test] Fix typo in build tool tests (elastic#135405)
  Fixes testSnapshotShutdownProgressTracker (elastic#134926)
  Mute org.elasticsearch.upgrades.StandardToLogsDbIndexModeRollingUpgradeIT testLogsIndexing {upgradedNodes=1} elastic#135313
  OTLP: remove feature flag (elastic#135401)
  [Docs] Convert asciidoc lifecycle markers into Docs V3 syntax (elastic#135347)
  Mute org.elasticsearch.upgrades.QueryableBuiltInRolesUpgradeIT testBuiltInRolesSyncedOnClusterUpgrade elastic#135194
  Mute org.elasticsearch.upgrades.IndexingIT testIndexing elastic#135407
  Mute org.elasticsearch.upgrades.DataStreamsUpgradeIT testDataStreamValidationDoesNotBreakUpgrade elastic#135406
  [CI] Handle git snapshot BWC versions correctly when calculating jdk fallback (elastic#135399)
  [Build] Update checkstyle from 10.3 to 11.0.1 (elastic#135381)
  Mute org.elasticsearch.upgrades.TextRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135238
  Mute org.elasticsearch.upgrades.MatchOnlyTextRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135325
  ...
@szybia szybia changed the title Add role stats to _security/stats Add DLS stats to _security/stats Sep 25, 2025
@szybia szybia requested a review from tvernum September 25, 2025 17:09
Copy link
Contributor

@tvernum tvernum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Thanks for iterating - sometimes I wish that the thing that look easy might actually turn out to be easy.

…-dls

* upstream/main: (55 commits)
  Mute org.elasticsearch.upgrades.MatchOnlyTextRollingUpgradeIT testIndexing {upgradedNodes=1} elastic#135525
  Address es819 tsdb doc values format performance bug (elastic#135505)
  Remove obsolete --add-opens from JDK API extractor tool (elastic#135445)
  [CI] Fix MergeWithFailureIT (elastic#135447)
  Increase wait time in AdaptiveAllocationsScalerServiceTests (elastic#135510)
  ES|QL: Handle multi values in FUSE (elastic#135448)
  Mute org.elasticsearch.upgrades.SyntheticSourceRollingUpgradeIT testIndexing {upgradedNodes=1} elastic#135512
  Add trust configuration for cross cluster api keys (elastic#134893)
  ESQL: Fix flakiness in SessionUtilsTests (elastic#135375)
  Mute org.elasticsearch.upgrades.LogsdbIndexingRollingUpgradeIT testIndexing {upgradedNodes=1} elastic#135511
  Mute org.elasticsearch.upgrades.MatchOnlyTextRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135325
  Require all functions to provide examples (elastic#135094)
  Mute org.elasticsearch.upgrades.SyntheticSourceRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135344
  Mute org.elasticsearch.upgrades.TextRollingUpgradeIT testIndexing {upgradedNodes=1} elastic#135236
  Mute org.elasticsearch.upgrades.TextRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135238
  Mute org.elasticsearch.upgrades.LogsdbIndexingRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135327
  Mute org.elasticsearch.upgrades.MatchOnlyTextRollingUpgradeIT testIndexing {upgradedNodes=3} elastic#135324
  Mute org.elasticsearch.upgrades.StandardToLogsDbIndexModeRollingUpgradeIT testLogsIndexing {upgradedNodes=3} elastic#135315
  ESQL: Handle right hand side of Inline Stats coming optimized with LocalRelation shortcut (elastic#135011)
  Mute org.elasticsearch.upgrades.TextRollingUpgradeIT testIndexing {upgradedNodes=3} elastic#135237
  ...
@szybia szybia enabled auto-merge (squash) September 26, 2025 10:43
…-dls

* upstream/main:
  [ML] Add ContextualAI inference service (elastic#134933)
@szybia szybia merged commit 30253d9 into elastic:main Sep 26, 2025
40 checks passed
@szybia szybia deleted the security-stats-with-dls branch September 26, 2025 12:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC Team:Security Meta label for security team v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants