-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add DLS stats to _security/stats
#135271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DLS stats to _security/stats
#135271
Conversation
…-dls * upstream/main: (188 commits) XContent parsing for DocStats and ShardFieldStats (elastic#135016) ESQL: Fix multi cluster tests not working with @repeat (elastic#135208) ESQL: Enable Lookup Join on Expression Tech Preview (elastic#134952) Use gradleProperty instead of findProperty (elastic#135203) Update 133980.yaml Small fixes Adjust commit sizes to reflect change in TieredMergePolicy Update 133980.yaml Update docs/changelog/133980.yaml Use released lucene 10.3.0 Update to Lucene 10.3.0 RC2 Aggs: Fix test (elastic#134308) correct docs version Use last lucene 10.3 snapshot [Automated] Update Lucene snapshot to 10.4.0-snapshot-28bd476b1a1 Update ValuesSourceReaderOperatorTests test condition for changes in lucene (elastic#133807) [Automated] Update Lucene snapshot to 10.3.0-snapshot-878a3db9c2d [Automated] Update Lucene snapshot to 10.3.0-snapshot-878a3db9c2d [Automated] Update Lucene snapshot to 10.3.0-snapshot-d664abb181a [Automated] Update Lucene snapshot to 10.3.0-snapshot-de669618c3c ...
|
Hi @szybia, I've created a changelog YAML for you. |
|
Pinging @elastic/es-security (Team:Security) |
…-dls * upstream/main: Bump FLEET_AGENTS_MAPPINGS_VERSION so the new mapping applies on upgrades (elastic#134957) [ML] Adding custom headers support openai text embeddings (elastic#134960) Fix systemd notify to use a shared arena (elastic#135235)
x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/security/stats/10_skeleton.yml
Show resolved
Hide resolved
...va/org/elasticsearch/xpack/core/security/action/stats/GetSecurityStatsNodeResponseTests.java
Outdated
Show resolved
Hide resolved
| } else { | ||
| rolesStore.usageStats(ActionListener.wrap(rolesStatsFuture::complete, rolesStatsFuture::completeExceptionally)); | ||
| } | ||
| return new GetSecurityStatsNodeResponse(clusterService.localNode(), rolesStatsFuture.join()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will block a transport thread and NativeRolesStore.usageStats is potentially expensive.
Rather than using a future, we should override nodeOperationAsync instead and make use of the listener.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, unless i'm missing something, from my reading of the code and testing, nodeOperation doesn't run on transport threads, but the threadpool that's defined at the top (management thread pool):
@Override
protected GetSecurityStatsNodeResponse nodeOperation(final GetSecurityStatsNodeRequest request, final Task task) {
if (Transports.isTransportThread(Thread.currentThread())) {
LOG.info("szy: transport thread");
throw new IllegalArgumentException("transport thread");
} else {
LOG.info("szy: not transport thread");
}[2025-09-24T13:48:33,922][INFO ][o.e.x.s.a.s.TransportSecurityStatsAction] [runTask-0] szy: not transport thread: elasticsearch[runTask-0][management][T#1]
[2025-09-24T13:48:33,925][INFO ][o.e.x.s.a.s.TransportSecurityStatsAction] [runTask-1] szy: not transport thread: elasticsearch[runTask-1][management][T#2]
[2025-09-24T13:48:33,924][INFO ][o.e.x.s.a.s.TransportSecurityStatsAction] [runTask-2] szy: not transport thread: elasticsearch[runTask-2][management][T#2]i also think this is backed up by the fact that there are no overrides in ES of nodeOperationAsync, everything overrides nodeOperation, and assuming some of those do heavy things, and if they were in transport this would've been spotted
based on javadoc, my final understanding of nodeOperationAsync is that it's just to get access to the listener, maybe leading to cleaner code, but i've tried it out and don't see the hype over what i currently have:
@Override
protected GetSecurityStatsNodeResponse nodeOperation(final GetSecurityStatsNodeRequest request, final Task task) {
return null; // never called because we override nodeOperationAsync
}
@Override
protected void nodeOperationAsync(final GetSecurityStatsNodeRequest request,
final Task task,
final ActionListener<GetSecurityStatsNodeResponse> listener) {
if (rolesStore == null) {
listener.onResponse(new GetSecurityStatsNodeResponse(clusterService.localNode(), null));
} else {
rolesStore.usageStats(
ActionListener.wrap(
stats -> listener.onResponse(new GetSecurityStatsNodeResponse(clusterService.localNode(), stats)),
listener::onFailure
)
);
}
}thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right it will be a management thread not a transport thread, but we still end up blocking that thread while doing the work on another thread pool.
That risks contention on the management pool, potentially even with deadlocks if the management pool is completely full with threads that are waiting for work from other threads.
my final understanding of nodeOperationAsync is that it's just to get access to the listener, maybe leading to cleaner code
Not for cleaner code, but so you can perform asynchronous operations. You need the listener because you can't produce the required return value on the current thread, you have to return it asynchronously and for that you need the listener (because blocking on a future isn't something we want to do).
To my knowledge there are no other TransportNodesAction that need to do asynchronous work in the node operation which is why there are no implementations of nodeOperationAsync
However, that opens the question of whether we actually should be doing this work on every node. I hadn't really thought about that when I agreed with using the existing usage stats.
The file and dls stats make sense in a per-node response, but the native stats are inherently cluster-wide stats - they're returning information from the security index - and it doesn't make sense to calculate and return the same stats on every node in the cluster.
We should probably rethink that.
Our options (long term, not necessarily to be handled this PR):
- Accept that it's OK for every node to calculate the same stats, and stick with what we have. There are two issues here:
- It's potentially a performance issue (not necessarily for
nativerole stats, but if we continue down this path it could be) - There's the chance for timing issues and having nodes return different stats. That's pretty confusing - what does it mean for 1 node to have 134 native roles and another node to have 135 native rules?
- It's potentially a performance issue (not necessarily for
- Calculate them once, but slot that identical value into the JSON response for each node, that overcomes the two issues above, but at the cost of:
- It's just weird to pretend these are node stats if they're cluster stats
- Now we have the problem of hiding (lying) if 1 node actually has an incorrect view of the stats (e.g. a caching bug). The response would tell you that the node has the same view of the data as all other nodes, but that's not actually the case.
- Move some stats to the top level of the JSON (outside of "nodes"). That's harder for client to consume, but more accurate (we could calculate those stats on the coordinating node for the request).
I think option 3 is best, but maybe Data Management has a view on how stats APIs ought to work in these sorts of cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very fair points
after having a think, given i'm trying to ship DLS stats in 9.2 (FF 30th Sep)
i think best plan of attack is to just ship the DLS stats which don't seem to be contentious
and then have a think about structure for everything else
i've updated the code
- it should be extensible to all your ideas above
- follows
_xpack/usagepattern - shouldn't be blocking now so i've left it as
nodeOperation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: long term options: I had a quick sanity check conversation with @dakrone, and after that conversation I feel safe saying that for the moment that list of options looks good enough that we aren't trapped in some bad state if we release this with only the dls cache stats now and then worry about how to handle (for example) the native stats later.
…-dls * upstream/main: (100 commits) ES|QL: Add FUSE operator tests (elastic#135307) [D0CS] Revise connector setup steps in documentation (elastic#135426) Fix DiscoveryDisruptionIT.testElectMasterWithLatestVersion (elastic#135396) [DOCS] Marks the change point agg as GA (elastic#134898) Rework ShardSearchContextId to explain use of searcher id better (elastic#135233) [CI] Handle caching bwc dependencies more gracefully (elastic#135417) Mute org.elasticsearch.gradle.TestClustersPluginFuncTest override jdk usage via ES_JAVA_HOME for known jdk os incompatibilities elastic#135413 [Build] update eclipse formatter used by spotless (elastic#135382) [Test] Fix typo in build tool tests (elastic#135405) Fixes testSnapshotShutdownProgressTracker (elastic#134926) Mute org.elasticsearch.upgrades.StandardToLogsDbIndexModeRollingUpgradeIT testLogsIndexing {upgradedNodes=1} elastic#135313 OTLP: remove feature flag (elastic#135401) [Docs] Convert asciidoc lifecycle markers into Docs V3 syntax (elastic#135347) Mute org.elasticsearch.upgrades.QueryableBuiltInRolesUpgradeIT testBuiltInRolesSyncedOnClusterUpgrade elastic#135194 Mute org.elasticsearch.upgrades.IndexingIT testIndexing elastic#135407 Mute org.elasticsearch.upgrades.DataStreamsUpgradeIT testDataStreamValidationDoesNotBreakUpgrade elastic#135406 [CI] Handle git snapshot BWC versions correctly when calculating jdk fallback (elastic#135399) [Build] Update checkstyle from 10.3 to 11.0.1 (elastic#135381) Mute org.elasticsearch.upgrades.TextRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135238 Mute org.elasticsearch.upgrades.MatchOnlyTextRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135325 ...
_security/stats_security/stats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thanks for iterating - sometimes I wish that the thing that look easy might actually turn out to be easy.
…-dls
* upstream/main: (55 commits)
Mute org.elasticsearch.upgrades.MatchOnlyTextRollingUpgradeIT testIndexing {upgradedNodes=1} elastic#135525
Address es819 tsdb doc values format performance bug (elastic#135505)
Remove obsolete --add-opens from JDK API extractor tool (elastic#135445)
[CI] Fix MergeWithFailureIT (elastic#135447)
Increase wait time in AdaptiveAllocationsScalerServiceTests (elastic#135510)
ES|QL: Handle multi values in FUSE (elastic#135448)
Mute org.elasticsearch.upgrades.SyntheticSourceRollingUpgradeIT testIndexing {upgradedNodes=1} elastic#135512
Add trust configuration for cross cluster api keys (elastic#134893)
ESQL: Fix flakiness in SessionUtilsTests (elastic#135375)
Mute org.elasticsearch.upgrades.LogsdbIndexingRollingUpgradeIT testIndexing {upgradedNodes=1} elastic#135511
Mute org.elasticsearch.upgrades.MatchOnlyTextRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135325
Require all functions to provide examples (elastic#135094)
Mute org.elasticsearch.upgrades.SyntheticSourceRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135344
Mute org.elasticsearch.upgrades.TextRollingUpgradeIT testIndexing {upgradedNodes=1} elastic#135236
Mute org.elasticsearch.upgrades.TextRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135238
Mute org.elasticsearch.upgrades.LogsdbIndexingRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135327
Mute org.elasticsearch.upgrades.MatchOnlyTextRollingUpgradeIT testIndexing {upgradedNodes=3} elastic#135324
Mute org.elasticsearch.upgrades.StandardToLogsDbIndexModeRollingUpgradeIT testLogsIndexing {upgradedNodes=3} elastic#135315
ESQL: Handle right hand side of Inline Stats coming optimized with LocalRelation shortcut (elastic#135011)
Mute org.elasticsearch.upgrades.TextRollingUpgradeIT testIndexing {upgradedNodes=3} elastic#135237
...
…-dls * upstream/main: [ML] Add ContextualAI inference service (elastic#134933)
roleswith DLS stats to security stats, using the same code and format as/_xpack/usage:{ "nodes": { "NzgeBRW-RyOeVOwGogV3lQ": { "roles": { "dls": { "bit_set_cache": { "count": 0, "memory": "0b", "memory_in_bytes": 0, "hits": 0, "misses": 0, "evictions": 0, "hits_time_in_millis": 0, "misses_time_in_millis": 0 } } } } } }This is a follow up to #134835, which added the (empty)
_security/statsAPI itself.