Skip to content

Conversation

@nguyen-andrew
Copy link
Member

@nguyen-andrew nguyen-andrew commented Dec 13, 2025

This PR introduces the security service for Redpanda Admin API v2. Starting with the OIDC-related RPCs, this v2 service initially aims to achieve functional parity with the v1 API by providing the following:

  • OIDC Identity Resolution - Resolves OIDC token identity and returns principal mapping with token expiration
  • Cluster-wide OIDC Key Refresh - Refreshes OIDC keys across all brokers in the cluster
  • Cluster-wide OIDC Session Revocation - Refreshes OIDC keys and revokes all active OIDC sessions cluster-wide

The implementation has the following characteristics:

  • For OIDC key refresh and ODIC session revocation, any broker can receive requests and will broadcast the operation across all nodes in the cluster.
  • For broadcasted/proxied operations, any error produced by a proxied request is returned to the client
  • Session revocation combines key refresh with active session termination

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Features

  • Added OIDC security service to Admin API v2 with identity resolution, cluster-wide key refresh, and session revocation capabilities. Administrators can now manage OIDC authentication security through the new v2 endpoints with proper cluster coordination and error handling.

Copilot AI review requested due to automatic review settings December 13, 2025 07:00
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR migrates OIDC-related Admin API endpoints from v1 to v2 using ConnectRPC, introducing the new security service for Admin API v2. The implementation provides functional parity with v1 endpoints while adding proper cluster-wide coordination and error aggregation.

Key changes:

  • Introduces v2 SecurityService with three OIDC RPCs: identity resolution, cluster-wide key refresh, and session revocation
  • Implements cluster-wide broadcasting for key refresh and session revocation operations with aggregated error handling
  • Adds comprehensive integration tests validating v2 endpoints alongside existing v1 tests

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/rptest/tests/redpanda_oauth_test.py Adds v2 endpoint tests (resolve identity, refresh keys, revoke sessions) and refactors metrics helpers for reuse
src/v/redpanda/application.cc Registers the new security service in the admin server
src/v/redpanda/admin/services/security.h Defines security service interface with OIDC RPC methods
src/v/redpanda/admin/services/security.cc Implements security service with cluster-wide operation broadcasting and error aggregation
src/v/redpanda/admin/services/BUILD Adds build target for security service library
src/v/redpanda/BUILD Adds security service dependency to redpanda application
proto/redpanda/core/admin/v2/security.proto Defines protobuf schema for security service RPCs
proto/redpanda/core/admin/v2/BUILD Adds build target for security proto library

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 13, 2025

CI test results

test results on build#77822
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
PartitionReassignmentsTest test_reassignments_cancel null integration https://buildkite.com/redpanda/redpanda/builds/77822#019b16a8-20c8-4f42-a2cc-13b1ae5a0406 FLAKY 9/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.1167, p0=0.7110, reject_threshold=0.0100. adj_baseline=0.3109, p1=0.1331, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionReassignmentsTest&test_method=test_reassignments_cancel
test results on build#77928
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
CrashLoopChecksTest test_cloud_storage_client_misconfiguration null integration https://buildkite.com/redpanda/redpanda/builds/77928#019b25c2-c795-4100-8e31-f6edd8f17288 FAIL 0/51 Test PASSES after retries.Inconclusive result after max retries(baseline=0.9539, p0=0.0944, reject_threshold=0.0100. adj_baseline=0.9999, p1=1.0000, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CrashLoopChecksTest&test_method=test_cloud_storage_client_misconfiguration
CrashLoopChecksTest test_cloud_storage_client_misconfiguration null integration https://buildkite.com/redpanda/redpanda/builds/77928#019b25c4-18b8-4b9f-a403-6f671ba7ace9 FLAKY 2/21 Test PASSES after retries.No significant increase in flaky rate(baseline=0.9375, p0=0.8741, reject_threshold=0.0100. adj_baseline=0.9998, p1=0.0000, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CrashLoopChecksTest&test_method=test_cloud_storage_client_misconfiguration
DataMigrationsApiTest test_higher_level_migration_api null integration https://buildkite.com/redpanda/redpanda/builds/77928#019b25c4-18bc-49a9-91cc-ee03a81f05e9 FLAKY 20/21 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0159, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.0471, p1=0.3812, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_higher_level_migration_api
PartitionReassignmentsTest test_reassignments_cancel null integration https://buildkite.com/redpanda/redpanda/builds/77928#019b25c2-c79c-4ff1-9e2c-076d45e994a5 FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0817, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.2256, p1=0.0775, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionReassignmentsTest&test_method=test_reassignments_cancel
test results on build#77977
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
WriteCachingFailureInjectionTest test_unavoidable_data_loss null integration https://buildkite.com/redpanda/redpanda/builds/77977#019b282f-cfb4-4c82-b053-60c28e559bca FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0580, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1642, p1=0.1664, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss

@nguyen-andrew nguyen-andrew marked this pull request as draft December 15, 2025 16:23
Comment on lines 46 to 48
security::oidc::authenticator auth{_controller->get_oidc_service().local()};
auto res = auth.authenticate(
auth_result->get_password().substr(authz_bearer_prefix.length()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This seems redundant to re-authenticate the token (which shoudl have already been authenticated) to just get the expiration timeout. I wonder if the you could either expand the auth_result to also contain that information or maybe expose something in security that can do a simple parse of the provided token.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree, maybe we can expand auth_result to include an auth_payload or something and use that instead instead of having to decode the token again - I'd like to explore that in a follow-on PR if that's okay.

@nguyen-andrew
Copy link
Member Author

nguyen-andrew commented Dec 16, 2025

Force push to address PR comments:

  • Added documentation to the security.proto
  • Updated resolve_oidc_identity to use failed_precondition_exception as suggested.
  • Simplified implementations of refresh_oidc_keys & revoke_oidc_sessions.
  • Updated the Admin V2 tests added in redpanda_oauth_test.py to use the generated client.

@nguyen-andrew nguyen-andrew marked this pull request as ready for review December 16, 2025 05:23
@nguyen-andrew nguyen-andrew self-assigned this Dec 16, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 16, 2025

Retry command for Build#77928

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/crash_loop_checks_test.py::CrashLoopChecksTest.test_cloud_storage_client_misconfiguration

@nguyen-andrew
Copy link
Member Author

Force push to rebase on latest dev.

@nguyen-andrew
Copy link
Member Author

Force push - needed to re-generate ducktape protos with updated proto documentation.

Copy link
Contributor

@michael-redpanda michael-redpanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks really nice!

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comment on lines 1 to 5
from . import broker_pb2
from . import cluster_pb2
from . import internal
from . import kafka_connections_pb2
from . import security_pb2
from . import shadow_link_pb2
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The internal import was removed. If this module exists and is used elsewhere in the codebase, this removal could break imports. However, since this appears to be part of a refactoring to add the security module and the file is a .pyi stub file, this may be intentional cleanup of an unused import.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@rockwotj rockwotj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love how this has been simplified since the first pass. Just one additional nit on top of Mike's feedback.

Creating a new proto file for security-related messages in Admin API v2.
This new proto initially includes definitions for OIDC-related services
and messages.
This commit adds the skeleton implementation of the OIDC security service
for the Redpanda Admin API v2. The service includes unimplemented stubs for
the following methods:
- resolve_oidc_identity
- refresh_oidc_keys
- revoke_oidc_sessions
@nguyen-andrew
Copy link
Member Author

Force push:

  • Addressing PR comments
  • Rebase on dev to update to latest changes to application.cc.

Generate Ducktape protobuf clients to test the new Admin API v2 OIDC
security endpoints.
@nguyen-andrew
Copy link
Member Author

Force push: accidentally forgot to include changes to tests/rptest/clients/admin/proto/redpanda/core/common/v1/tls_pb2.py that the regenerate_ducktape_protos.sh script generated.

Add OIDC identity resolution to Admin API v2, updates application
wiring, and extends oauth tests to validate the new functionality.
Implement refresh_oidc_keys RPC to refresh OIDC keys by calling
any broker. The receiving broker executes the refresh on all local
shards and proxies/broadcasts the request to all other nodes.
Brokers receiving proxied calls only execute locally and do not
rebroadcast to other nodes. The originating node aggregates failures
across all nodes/shards and returns an aggregated error if any
occurred, otherwise success.

Add tests ensuring a single-node call refreshes OIDC keys across
the cluster.
Implement RevokeOidcSessions RPC to refresh OIDC keys and revoke all
active OIDC sessions across the cluster. The receiving broker
refreshes OIDC keys on all local shards, revokes OIDC SASL
sessions with kafka::server, and broadcasts the request to all
other nodes in the cluster, with each broker performing local
key refresh and session revocation. The originating node aggregates
errors across all brokers and returns an aggregated error if any
node fails, otherwise success.

Add comprehensive test coverage validating session revocation with
various authentication scenarios. This enables administrators to
immediately invalidate all OIDC sessions when security policies
change or credentials are compromised.
@nguyen-andrew
Copy link
Member Author

Force push to use datetime.timezone.utc instead of datetime.UTC (python type checker seems run a version that doesn't have datetime.UTC).

Copy link
Contributor

@rockwotj rockwotj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, will leave the final approval to Mike

Copy link
Contributor

@michael-redpanda michael-redpanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@nguyen-andrew nguyen-andrew merged commit 4a94f35 into redpanda-data:dev Dec 17, 2025
30 checks passed
@nguyen-andrew nguyen-andrew deleted the admin-api-v2-oidc branch December 17, 2025 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants