Commit 19e9f75

Audit log MVP (#7339)
Initial implementation of [RFD 523](https://rfd.shared.oxide.computer/rfd/0523).

## High-level design

* Logging an operation has two steps, corresponding to two app layer methods called directly in the request handler:
  * `audit_log_entry_init`: called before anything else. If it fails, we bail, which guarantees nothing can happen without getting logged.
  * `audit_log_entry_complete`: called after the operation succeeds or fails, filling in the row with the success or failure result. Currently we only log the HTTP status code and possibly error message, but we will fill this in further with, e.g., the ID of the created resource (if applicable), and maybe the entire success response.
* This log is stored in CockroachDB and not somewhere else (like ClickHouse) because we need an immediate guarantee at write time that the audit log initialization happened before we proceed with the API operation.
* The audit log can only be retrieved by fleet viewers at `/v1/system/audit-log`.
* The audit log list is powered by a SQL view that filters for only completed entries.
* The audit log list is ordered by `time_completed`, not `time_started`. This turns out to be very important: see the doc comment on `audit_log_list` in `nexus/db-queries/src/db/datastore/audit_log.rs`.
* Audit log entries have unique IDs in order to let clients deduplicate them if they fetch overlapping ranges.
  * Timestamps could not be used as the primary key because (a) timestamp collisions are possible, and (b) we are ordering by `time_completed`, but not all entries in the audit log table have non-null `time_completed`.

## Operations logged

See `nexus/src/external_api/http_entrypoints.rs`. My goal was to start by logging the operations that create sessions and tokens. Eventually I think we want to log pretty much everything that's not a GET.

* `login_saml`: last step of SAML login, creates web session
* `login_local`: username/password login, creates web session
* `device_auth_confirm`: last step of token create
* `project_create` and `project_delete`
* `instance_create` and `instance_delete`
* `disk_create` and `disk_delete`

## Next steps

Things that are not in this PR, but which we will want to do soon, possibly as soon as this release. I put the highest priority items first.

### Log ID of created resource

For actions that create a resource, like disk or instance create, we need to at least log the ID of the resource created. Even for token and session creation, we can probably log the ID of the created token or session. We may also want to log names if we have them.

### Log display name of user and silo

We only have UUIDs for user and silo and they are not very pleasant to work with. It's a lot easier to see what's going on at a glance if we have display names. On top of that, after a user or silo is deleted, there isn't a way to look them up in the API by ID and get that info.

### Auto-complete uncompleted entries

Unlike with initialization (where we bail if it fails), we do not have a guarantee that audit log completion runs successfully, because we don't want to turn every loggable operation into a saga to enable rollbacks. To deal with this, we will likely need a background job to complete any rows hanging around uncompleted for longer than N minutes or hours. Because these will not have success or error info about the logged operation, we will probably need an explicit third kind of completed entry, like `success`/`error`/`timeout`.

### Versioned log format

We may want to indicate breaking changes to the log format so that customers update whatever system is consuming and storing the log.

### Silo-level audit log endpoint

In this PR, the audit log can only be retrieved by fleet viewers at a system-level endpoint. We will probably want to allow silo admins to retrieve an audit log scoped to their silo. That will require:

* A silo-scoped `/v1/audit-log` endpoint accessible only to silo admins that does more or less what the system-level one does, plus `where silo_id = <silo_id>`
* A `SiloAuditLog` authz resource alongside `AuditLog` that is tied to a specific silo
* More robust logging of the silo an operation takes place in, probably related to the point below about logging the putative user for login operations. The external authenticator actor is not in a silo, so currently we are not writing down what silo a login attempt is happening in.

### Log putative user for login operations

For failed login attempts we want to know who they were _trying_ to log in as. For SAML login this may not be meaningful, as we only get the request from the IdP after login was successful over there, but for password login we could log the username.

### Log full JSON response

We may want to go as far as to log the entire JSON response. One minor difficulty I ran into is that Dropshot handles serializing the response struct to JSON, so we don't have access to the serialized thing in the request handlers. Feels like a shame to serialize it twice, but we might have to if we want to write down the response.

### Clean up old entries

Background task to delete entries older than N days, as determined by our as-yet-undetermined retention policy. We need to keep an eye on how fast the table will grow, but it seems we already have some tables that are quite huge compared to this one and we don't clean them up yet, so I'm not too worried about it. We expect customers will want to frequently fetch the log and save it off-rack, so the retention period probably doesn't need to be very long.

### Log a bunch more events

Right now the audit log calls are a bit verbose. Dropshot deliberately does not support middleware, which would let us do this kind of thing automatically outside of the handlers. Finding a more ergonomic and less noisy way of doing the audit logging and latency logging might require a declarative macro.
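
The two-step design described under "High-level design" can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not the code in this commit: the `AuditLog`, `Entry`, and `Outcome` types, the `u64` IDs and timestamps, and the `entry_init`/`entry_complete`/`list` method names are all hypothetical stand-ins for the real datastore methods and the SQL view. The ID tie-break in `list` is also an assumption.

```rust
use std::collections::HashMap;

/// Result recorded at completion time (hypothetical shape: the commit
/// message only says an HTTP status and possibly an error message are logged).
#[derive(Clone, Debug, PartialEq)]
pub enum Outcome {
    Success { http_status: u16 },
    Error { http_status: u16, message: String },
}

/// One audit log row. IDs and timestamps are plain integers in this sketch.
#[derive(Clone, Debug)]
pub struct Entry {
    pub id: u64,
    pub operation: String,
    pub time_started: u64,
    pub time_completed: Option<u64>,
    pub outcome: Option<Outcome>,
}

/// In-memory stand-in for the audit log table.
#[derive(Default)]
pub struct AuditLog {
    entries: HashMap<u64, Entry>,
    next_id: u64,
}

impl AuditLog {
    /// Step 1: write the row before the operation runs. A caller is expected
    /// to use `?` on the result so that nothing happens without being logged.
    pub fn entry_init(&mut self, operation: &str, now: u64) -> Result<u64, String> {
        let id = self.next_id;
        self.next_id += 1;
        let entry = Entry {
            id,
            operation: operation.to_string(),
            time_started: now,
            time_completed: None,
            outcome: None,
        };
        self.entries.insert(id, entry);
        Ok(id)
    }

    /// Step 2: fill in the row once the operation has succeeded or failed.
    pub fn entry_complete(&mut self, id: u64, outcome: Outcome, now: u64) -> Result<(), String> {
        let entry = self
            .entries
            .get_mut(&id)
            .ok_or_else(|| format!("no audit log entry with id {id}"))?;
        if entry.time_completed.is_some() {
            return Err(format!("audit log entry {id} is already completed"));
        }
        entry.time_completed = Some(now);
        entry.outcome = Some(outcome);
        Ok(())
    }

    /// Mirrors the described view: only completed entries, ordered by
    /// `time_completed` (ties broken by ID, an assumption of this sketch).
    pub fn list(&self) -> Vec<&Entry> {
        let mut done: Vec<&Entry> = self
            .entries
            .values()
            .filter(|e| e.time_completed.is_some())
            .collect();
        done.sort_by_key(|e| (e.time_completed, e.id));
        done
    }
}
```

In this model an entry that was initialized first but completed last appears last in `list`, and an entry that was never completed does not appear at all, which is the gap the "Auto-complete uncompleted entries" item is meant to close.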
1 parent 811ee9a commit 19e9f75

File tree

34 files changed

+2405
-113
lines changed

Cargo.lock

Lines changed: 1 addition & 0 deletions

common/src/api/external/mod.rs

Lines changed: 1 addition & 0 deletions
@@ -925,6 +925,7 @@ pub enum ResourceType {
     Alert,
     AlertReceiver,
     AllowList,
+    AuditLogEntry,
     BackgroundTask,
     BgpConfig,
     BgpAnnounceSet,

nexus/auth/src/authn/mod.rs

Lines changed: 9 additions & 0 deletions
@@ -151,6 +151,15 @@ impl Context {
         &self.schemes_tried
     }
 
+    /// If the user is authenticated, return the last scheme in the list of
+    /// schemes tried, which is the one that worked.
+    pub fn scheme_used(&self) -> Option<&SchemeName> {
+        match &self.kind {
+            Kind::Authenticated(..) => self.schemes_tried().last(),
+            Kind::Unauthenticated => None,
+        }
+    }
+
     /// Returns an unauthenticated context for use internally
     pub fn internal_unauthenticated() -> Context {
         Context { kind: Kind::Unauthenticated, schemes_tried: vec![] }
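
To see what `scheme_used` returns in each case, here is a standalone sketch that reuses the method body from the hunk above. The surrounding types are simplified assumptions: the real `Kind::Authenticated` payload and `SchemeName` are defined elsewhere in `nexus/auth`, and the scheme names used in the usage note are made up for illustration.

```rust
/// Simplified stand-in for the real scheme name type (assumption).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SchemeName(pub &'static str);

/// Simplified stand-in: the real `Authenticated` variant carries actor
/// details, reduced here to a plain user name.
pub enum Kind {
    Authenticated(String),
    Unauthenticated,
}

pub struct Context {
    pub kind: Kind,
    pub schemes_tried: Vec<SchemeName>,
}

impl Context {
    /// Schemes attempted, in the order they were tried.
    pub fn schemes_tried(&self) -> &Vec<SchemeName> {
        &self.schemes_tried
    }

    /// Same logic as the method added in this commit: when authenticated,
    /// the last scheme tried is the one that worked.
    pub fn scheme_used(&self) -> Option<&SchemeName> {
        match &self.kind {
            Kind::Authenticated(..) => self.schemes_tried().last(),
            Kind::Unauthenticated => None,
        }
    }
}
```

With a context built as `Context { kind: Kind::Authenticated("alice".to_string()), schemes_tried: vec![SchemeName("scheme_a"), SchemeName("scheme_b")] }`, the method returns the `scheme_b` entry. With `Kind::Unauthenticated` it returns `None` even if schemes were tried, since none of them succeeded.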

nexus/auth/src/authz/api_resources.rs

Lines changed: 59 additions & 1 deletion
@@ -407,8 +407,66 @@ impl AuthorizedResource for IpPoolList {
         roleset: &'fut mut RoleSet,
     ) -> futures::future::BoxFuture<'fut, Result<(), Error>> {
         // There are no roles on the IpPoolList, only permissions. But we still
+        // need to load the Fleet-related roles to verify that the actor's role
+        // on the Fleet (possibly conferred from a Silo role).
+        load_roles_for_resource_tree(&FLEET, opctx, authn, roleset).boxed()
+    }
+
+    fn on_unauthorized(
+        &self,
+        _: &Authz,
+        error: Error,
+        _: AnyActor,
+        _: Action,
+    ) -> Error {
+        error
+    }
+
+    fn polar_class(&self) -> oso::Class {
+        Self::get_polar_class()
+    }
+}
+
+// Similar to IpPoolList, the audit log is a collection that doesn't exist in
+// the database as an entity distinct from its children (IP pools, or in this
+// case, audit log entries). We need a dummy resource here because we need
+// something to hang permissions off of. We need to be able to create audit log
+// children (entries) for login attempts, when there is no authenticated user,
+// as well as for normal requests with an authenticated user. For retrieval, we
+// want (to start out) to allow only fleet viewers to list children.
+
+#[derive(Clone, Copy, Debug)]
+pub struct AuditLog;
+
+/// Singleton representing the [`AuditLog`] for authz purposes
+pub const AUDIT_LOG: AuditLog = AuditLog;
+
+impl Eq for AuditLog {}
+
+impl PartialEq for AuditLog {
+    fn eq(&self, _: &Self) -> bool {
+        true
+    }
+}
+
+impl oso::PolarClass for AuditLog {
+    fn get_polar_class_builder() -> oso::ClassBuilder<Self> {
+        oso::Class::builder()
+            .with_equality_check()
+            .add_attribute_getter("fleet", |_: &AuditLog| FLEET)
+    }
+}
+
+impl AuthorizedResource for AuditLog {
+    fn load_roles<'fut>(
+        &'fut self,
+        opctx: &'fut OpContext,
+        authn: &'fut authn::Context,
+        roleset: &'fut mut RoleSet,
+    ) -> futures::future::BoxFuture<'fut, Result<(), Error>> {
+        // There are no roles on the AuditLog, only permissions. But we still
         // need to load the Fleet-related roles to verify that the actor has the
-        // "admin" role on the Fleet (possibly conferred from a Silo role).
+        // viewer role on the Fleet (possibly conferred from a Silo role).
         load_roles_for_resource_tree(&FLEET, opctx, authn, roleset).boxed()
     }
 

nexus/auth/src/authz/omicron.polar

Lines changed: 25 additions & 0 deletions
@@ -441,6 +441,31 @@ has_relation(fleet: Fleet, "parent_fleet", ip_pool_list: IpPoolList)
 has_permission(actor: AuthenticatedActor, "create_child", ip_pool: IpPool)
     if silo in actor.silo and silo.fleet = ip_pool.fleet;
 
+# Describes the policy for reading and writing the audit log
+resource AuditLog {
+    permissions = [
+        "list_children", # retrieve audit log
+        "create_child", # create audit log entry
+    ];
+
+    relations = { parent_fleet: Fleet };
+
+    # Fleet viewers can read the audit log
+    "list_children" if "viewer" on "parent_fleet";
+}
+
+# Any actor should be able to write to the audit log because we need to be able
+# to write to the log from any request, authenticated or not. Audit log writes
+# are always a byproduct of other operations: there are no endpoints that allow
+# the user to write to the log deliberately. Note we use AuthenticatedActor
+# because we don't really mean unauthenticated -- in the case of login
+# operations, we use the external authenticator actor that creates the session
+# to authorize the audit log write.
+has_permission(_actor: AuthenticatedActor, "create_child", _audit_log: AuditLog);
+
+has_relation(fleet: Fleet, "parent_fleet", audit_log: AuditLog)
+    if audit_log.fleet = fleet;
+
 # Describes the policy for creating and managing web console sessions.
 resource ConsoleSessionList {
     permissions = [ "create_child" ];
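
The effect of the `AuditLog` rules in this hunk can be restated as a small truth table in Rust. This is a hand-written model for illustration only, assuming that an unauthenticated actor holds no roles. The `Actor` and `Action` enums and the `audit_log_allows` function are hypothetical; the real decision is made by Oso evaluating the Polar policy, not by code like this.

```rust
/// The two permissions declared on the `AuditLog` resource.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Action {
    ListChildren, // retrieve the audit log
    CreateChild,  // create an audit log entry
}

/// Hypothetical actor model: whether the actor is authenticated and, if so,
/// whether it holds (directly or by conferral) the fleet "viewer" role.
#[derive(Clone, Copy, Debug)]
pub enum Actor {
    Unauthenticated,
    Authenticated { is_fleet_viewer: bool },
}

/// Restates the policy: any authenticated actor may create entries, and
/// only fleet viewers may list them.
pub fn audit_log_allows(actor: Actor, action: Action) -> bool {
    match (actor, action) {
        (Actor::Authenticated { .. }, Action::CreateChild) => true,
        (Actor::Authenticated { is_fleet_viewer }, Action::ListChildren) => is_fleet_viewer,
        // Assumption: with no authenticated actor, neither rule matches.
        (Actor::Unauthenticated, _) => false,
    }
}
```

Per the policy comment, login handlers do not hit the `Unauthenticated` arm for writes, because they authorize the write as the external authenticator actor, which counts as authenticated.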

nexus/auth/src/authz/oso_generic.rs

Lines changed: 1 addition & 0 deletions
@@ -101,6 +101,7 @@ pub fn make_omicron_oso(log: &slog::Logger) -> Result<OsoInit, anyhow::Error> {
     let classes = [
         // Hand-written classes
         Action::get_polar_class(),
+        AuditLog::get_polar_class(),
         AnyActor::get_polar_class(),
         AuthenticatedActor::get_polar_class(),
         BlueprintConfig::get_polar_class(),
