feat: define resource indexer architecture

scotwells · scotwells · commit 1fba80f725b0 · 2026-01-26T18:03:16.000-06:00
Creates a new architecture document that goes into detail on the design
of the indexing service.
diff --git a/README.md b/README.md
@@ -8,3 +8,10 @@ using CEL-based filtering. The service integrates natively with kubectl/RBAC and
 targets Meilisearch as the search backend.
 
 ![](./docs/diagrams/SearchServiceContext.png)
+
+## Documentation
+
+- [Architecture](./docs/architecture.md) — High-level design and component
+  overview
+- [Resource Indexer](./docs/components/resource-indexer.md) — Detailed design
+  for the indexing component
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -52,6 +52,9 @@ using powerful indexing and real-time event processing.
 - Manage index lifecycle (creation, updates, deletion)
 - Bootstrap indexes from existing state
 
+See the [Resource Indexer Architecture](./components/resource-indexer.md) for
+detailed design documentation.
+
 ### Controller Manager
 
 **Purpose**: Manages and validates resources for the search service
diff --git a/docs/components/resource-indexer.md b/docs/components/resource-indexer.md
@@ -0,0 +1,246 @@
+<!-- omit from toc -->
+# Resource Indexer Architecture
+
+- [Overview](#overview)
+- [Design Goals](#design-goals)
+- [Core Responsibilities](#core-responsibilities)
+  - [Event Processing Flow](#event-processing-flow)
+- [Event Consumption](#event-consumption)
+  - [Horizontal Scaling](#horizontal-scaling)
+- [Policy Management](#policy-management)
+  - [CEL Compilation](#cel-compilation)
+- [Document Transformation](#document-transformation)
+- [Persistence and Acknowledgment](#persistence-and-acknowledgment)
+  - [Batching](#batching)
+  - [Duplicate Handling](#duplicate-handling)
+- [Bootstrap Process](#bootstrap-process)
+  - [Multi-Cluster Bootstrap](#multi-cluster-bootstrap)
+- [Error Handling](#error-handling)
+- [Integration Points](#integration-points)
+- [Future Considerations](#future-considerations)
+
+
+## Overview
+
+The Resource Indexer is a core component of the Search service responsible for
+maintaining a searchable index of platform resources. It consumes audit log
+events from NATS JetStream, applies policy-based filtering, and writes indexed
+documents to the search backend.
+
+## Design Goals
+
+- **Real-time indexing**: Process resource changes within seconds of occurrence
+- **Policy-driven**: Index only resources matching active IndexPolicy
+  configurations
+- **Reliable delivery**: Guarantee at-least-once processing of all events
+- **Graceful recovery**: Resume processing from last known position after
+  restarts
+- **Horizontal scalability**: Scale throughput by adding instances without
+  coordination
+- **Minimal resource footprint**: Operate efficiently within constrained
+  environments
+
+## Core Responsibilities
+
+The Resource Indexer handles:
+
+- Consuming audit log events from NATS JetStream
+- Watching IndexPolicy resources and evaluating CEL filters
+- Transforming Kubernetes resources into searchable documents
+- Persisting documents to the index backend
+- Acknowledging events only after successful persistence
+
+### Event Processing Flow
+
+The following diagram illustrates how the indexer processes events, including
+policy matching, batching, and acknowledgment handling:
+
+```mermaid
+sequenceDiagram
+    participant JS as NATS JetStream
+    participant Indexer as Resource Indexer
+    participant Cache as Policy Cache
+    participant Meili as Meilisearch
+
+    rect rgb(240, 248, 255)
+        note right of JS: Event Matches Policy
+        JS->>Indexer: Deliver audit event
+        Indexer->>Cache: Evaluate policies
+        Cache-->>Indexer: Policy match + compiled CEL
+        Indexer->>Indexer: Evaluate CEL filter
+        Indexer->>Indexer: Transform resource to document
+        Indexer->>Indexer: Add to batch
+
+        alt Batch ready (size or time threshold)
+            Indexer->>Meili: Persist document batch
+            Meili-->>Indexer: Success
+            Indexer->>JS: Ack all events in batch
+        end
+    end
+
+    rect rgb(255, 248, 240)
+        note right of JS: Event Does Not Match Policy
+        JS->>Indexer: Deliver audit event
+        Indexer->>Cache: Evaluate policies
+        Cache-->>Indexer: No matching policy
+        Indexer->>JS: Ack (discard event)
+    end
+
+    rect rgb(255, 240, 240)
+        note right of JS: Persistence Failure
+        JS->>Indexer: Deliver audit event
+        Indexer->>Indexer: Transform and batch
+        Indexer->>Meili: Persist document batch
+        Meili-->>Indexer: Error
+        note right of Indexer: Do not ack — JetStream<br/>redelivers after timeout
+    end
+```
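The three branches in the diagram can be read as a single routing decision per event. The sketch below is illustrative only: `Action`, `Route`, and the `match` callback are hypothetical names, and it assumes the event payload is the resource JSON itself, which the real audit event format may not be.

```go
package main

import "encoding/json"

// Action is the outcome of routing one event (hypothetical type).
type Action int

const (
	// AckAndDiscard: nothing to index; ack so the event is not redelivered.
	AckAndDiscard Action = iota
	// AddToBatch: index the resource; ack later, once the batch is persisted.
	AddToBatch
)

// Route decides what to do with one raw event. match stands in for the
// policy cache lookup plus CEL evaluation (an assumption for this sketch).
func Route(raw []byte, match func(resource map[string]any) bool) (Action, map[string]any) {
	var resource map[string]any
	if err := json.Unmarshal(raw, &resource); err != nil {
		// Malformed event: retrying cannot fix it, so discard.
		return AckAndDiscard, nil
	}
	if resource == nil || !match(resource) {
		// No matching policy (or an empty payload): discard.
		return AckAndDiscard, nil
	}
	return AddToBatch, resource
}
```

The persistence-failure branch is not visible here because it happens later, when the batch is written.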
+
+## Event Consumption
+
+The indexer consumes audit log events from [NATS JetStream][jetstream] using
+[durable consumers][durable-consumers]. JetStream provides:
+
+- **Delivery guarantees**: At-least-once delivery with configurable ack timeouts
+- **Position tracking**: Durable consumers track acknowledged messages; on
+  restart, consumption resumes from the last acknowledged position
+- **Backpressure**: Pull-based consumption allows the indexer to control its
+  processing rate
+
+[jetstream]: https://docs.nats.io/nats-concepts/jetstream
+[durable-consumers]: https://docs.nats.io/nats-concepts/jetstream/consumers#durable-consumers
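A minimal sketch of the pull-and-ack loop these guarantees imply. `Message`, `ErrDrained`, and the `fetch` callback are hypothetical stand-ins for a JetStream client, not NATS APIs; the point is only that the indexer asks for work at its own pace and acks after handling succeeds.

```go
package main

import "errors"

// Message is a hypothetical stand-in for a JetStream message.
type Message struct {
	Data []byte
	Ack  func() error // marks the message as processed
}

// ErrDrained is an illustrative sentinel meaning "no more messages".
var ErrDrained = errors.New("source drained")

// Consume pulls at most batchSize messages at a time, so the consumer
// controls its own processing rate. A message is acked only after handle
// succeeds; failed messages stay unacked so the broker can redeliver them.
// It returns the number of messages acked.
func Consume(fetch func(n int) ([]Message, error), batchSize int, handle func([]byte) error) (int, error) {
	acked := 0
	for {
		msgs, err := fetch(batchSize)
		if errors.Is(err, ErrDrained) {
			return acked, nil
		}
		if err != nil {
			return acked, err
		}
		for _, m := range msgs {
			if err := handle(m.Data); err != nil {
				continue // no ack: redelivered after the ack timeout
			}
			if err := m.Ack(); err != nil {
				return acked, err
			}
			acked++
		}
	}
}
```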
+
+### Horizontal Scaling
+
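+The indexer uses JetStream [queue groups] for horizontal scaling. When multiple
+instances join the same queue group, JetStream distributes messages across them
+automatically, delivering each message to only one instance at a time. A
+message that is redelivered after an ack timeout may go to a different
+instance. With pull-based consumption, the same distribution comes from several
+instances fetching from one shared durable consumer.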
+
+![Resource Indexer horizontal scaling diagram](../diagrams/ResourceIndexerScaling.png)
+
+This lets throughput scale with the number of instances without coordination
+between them, up to the write capacity of the search backend.
+
+[queue groups]: https://docs.nats.io/nats-concepts/core-nats/queue
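The distribution behavior can be simulated with plain Go channels. This is not NATS code: `Distribute` is a hypothetical helper in which a shared channel plays the role of the stream and each goroutine plays one indexer instance.

```go
package main

import "sync"

// Distribute simulates queue-group delivery: several instances receive from
// one shared stream, and each message is handed to a single instance.
// It returns how many messages each instance processed.
func Distribute(messages []string, instances int, process func(instance int, msg string)) []int {
	stream := make(chan string)
	counts := make([]int, instances)
	var wg sync.WaitGroup
	for i := 0; i < instances; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for msg := range stream {
				process(id, msg)
				counts[id]++ // each goroutine writes only its own slot
			}
		}(i)
	}
	for _, m := range messages {
		stream <- m // exactly one receiver takes each send
	}
	close(stream)
	wg.Wait()
	return counts
}
```

Which instance receives which message is not deterministic, which mirrors why every instance needs the full policy cache.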
+
+## Policy Management
+
+IndexPolicy resources define what to index. The indexer watches these resources
+using a Kubernetes [informer], which provides:
+
+- **List-watch semantics**: Initial list of all policies followed by a watch
+  stream for changes
+- **Local cache**: In-memory store for fast lookups during event processing
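+- **Periodic resync**: Replays the cached objects to the event handlers at a
+  fixed interval so that missed handler work is corrected; a resync reads from
+  the local cache and does not re-list from the API server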
+
+Each indexer instance maintains its own policy cache. Since events can be routed
+to any instance (via queue groups), each instance caches all policies.
+IndexPolicy resources are typically small and few in number, so this
+replication is acceptable.
+
+### CEL Compilation
+
+[CEL expressions][CEL] in policies must be compiled before evaluation. To avoid
+recompilation on every event, compile expressions when policies are added or
+updated and cache the compiled programs alongside the policy.
+
+The indexer should wait for the informer cache to sync before processing events
+to ensure all active policies are available for matching.
+
+[informer]: https://pkg.go.dev/k8s.io/client-go/tools/cache#SharedInformer
+[CEL]: https://cel.dev
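A minimal sketch of compile-on-change caching. `Program`, `Compiler`, and `PolicyCache` are hypothetical types; a real indexer would plug a CEL library in behind `Compiler` and call `Upsert` and `Delete` from the informer's event handlers.

```go
package main

import (
	"fmt"
	"sync"
)

// Program is a hypothetical compiled filter: it reports whether a resource matches.
type Program func(resource map[string]any) (bool, error)

// Compiler is a hypothetical function that turns an expression into a Program.
type Compiler func(expr string) (Program, error)

// PolicyCache stores one compiled program per policy name.
type PolicyCache struct {
	mu       sync.RWMutex
	compile  Compiler
	programs map[string]Program
}

func NewPolicyCache(c Compiler) *PolicyCache {
	return &PolicyCache{compile: c, programs: make(map[string]Program)}
}

// Upsert compiles once per policy add or update, not once per event.
// If compilation fails, any previously cached program is left in place.
func (c *PolicyCache) Upsert(name, expr string) error {
	prog, err := c.compile(expr)
	if err != nil {
		return fmt.Errorf("compile policy %q: %w", name, err)
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.programs[name] = prog
	return nil
}

// Delete removes a policy's program when the policy is deleted.
func (c *PolicyCache) Delete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.programs, name)
}

// Match runs the cached programs and returns the names of matching policies.
// Programs that return an error are treated as non-matching.
func (c *PolicyCache) Match(resource map[string]any) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var matched []string
	for name, prog := range c.programs {
		if ok, err := prog(resource); err == nil && ok {
			matched = append(matched, name)
		}
	}
	return matched
}
```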
+
+## Document Transformation
+
+When an event matches a policy, the indexer transforms the Kubernetes resource
+into a searchable document:
+
+- Extract fields specified in the IndexPolicy field mappings
+- Normalize metadata (labels, annotations) into searchable formats
+- Use the resource's UID as the document identifier
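The three steps above can be sketched as follows. The field-mapping shape (document field name to dotted path), the `id` and `labels` field names, and the `key=value` label format are all assumptions for illustration; the IndexPolicy schema is not shown in this document.

```go
package main

import (
	"errors"
	"sort"
	"strings"
)

// lookup walks a dotted path such as "metadata.name" through nested maps.
func lookup(obj map[string]any, path string) (any, bool) {
	var cur any = obj
	for _, key := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		if cur, ok = m[key]; !ok {
			return nil, false
		}
	}
	return cur, true
}

// Transform builds a flat document from a decoded Kubernetes resource.
// fields maps a document field name to a dotted path in the resource.
func Transform(resource map[string]any, fields map[string]string) (map[string]any, error) {
	uid, ok := lookup(resource, "metadata.uid")
	if !ok {
		return nil, errors.New("resource has no metadata.uid")
	}
	doc := map[string]any{"id": uid} // UID is the document identifier
	for name, path := range fields {
		if name == "id" {
			continue // never let a mapping replace the identifier
		}
		if v, ok := lookup(resource, path); ok {
			doc[name] = v
		}
	}
	// Normalize labels into sorted "key=value" strings.
	if raw, ok := lookup(resource, "metadata.labels"); ok {
		if labels, ok := raw.(map[string]any); ok {
			flat := make([]string, 0, len(labels))
			for k, v := range labels {
				if s, ok := v.(string); ok {
					flat = append(flat, k+"="+s)
				}
			}
			sort.Strings(flat)
			doc["labels"] = flat
		}
	}
	return doc, nil
}
```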
+
+## Persistence and Acknowledgment
+
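+Documents are persisted to the index backend ([Meilisearch]). To guarantee
+at-least-once delivery, events are acknowledged only after successful
+persistence. Meilisearch processes document writes asynchronously as tasks, so
+"successful" should mean the task succeeded, not only that the request was
+accepted.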
+
+[Meilisearch]: https://www.meilisearch.com/docs
+
+### Batching
+
+For efficiency, batch multiple documents into a single write request. When a
+batch completes:
+
+1. Persist all documents to the index backend
+2. On success, acknowledge all events in the batch
+3. On failure, do not acknowledge — JetStream redelivers after ack timeout
+
+Events that don't match any policy should be acknowledged immediately to prevent
+reprocessing.
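The persist-then-ack ordering can be sketched with a small batcher. `Event`, `Batcher`, and `ErrBackendUnavailable` are hypothetical names, and `persist` stands in for the backend write; only the size threshold is shown.

```go
package main

import "errors"

// ErrBackendUnavailable is an illustrative error a persist function might return.
var ErrBackendUnavailable = errors.New("backend unavailable")

// Event pairs a document with the ack callback of the message that produced it.
type Event struct {
	Doc map[string]any
	Ack func() error
}

// Batcher collects events and writes them to the backend in one request.
// It is not safe for concurrent use.
type Batcher struct {
	maxSize int
	persist func(docs []map[string]any) error // hypothetical backend write
	pending []Event
}

func NewBatcher(maxSize int, persist func([]map[string]any) error) *Batcher {
	return &Batcher{maxSize: maxSize, persist: persist}
}

// Add queues an event and flushes once the size threshold is reached.
func (b *Batcher) Add(e Event) error {
	b.pending = append(b.pending, e)
	if len(b.pending) >= b.maxSize {
		return b.Flush()
	}
	return nil
}

// Flush persists the pending documents and only then acks their events.
// On a persist error nothing is acked; the batch is dropped from memory
// because the broker will redeliver the unacked events.
func (b *Batcher) Flush() error {
	if len(b.pending) == 0 {
		return nil
	}
	batch := b.pending
	b.pending = nil
	docs := make([]map[string]any, len(batch))
	for i, e := range batch {
		docs[i] = e.Doc
	}
	if err := b.persist(docs); err != nil {
		return err
	}
	for _, e := range batch {
		if err := e.Ack(); err != nil {
			return err
		}
	}
	return nil
}
```

A time threshold could be added by calling `Flush` from a ticker, with a mutex around `pending` if that ticker runs in another goroutine.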
+
+### Duplicate Handling
+
+At-least-once delivery means duplicates are possible (e.g., after a failure
+before acknowledgment). The index backend handles this via [document primary
+keys][meilisearch-primary-key] — reindexing the same resource overwrites the
+existing document.
+
+[meilisearch-primary-key]: https://www.meilisearch.com/docs/learn/core_concepts/primary_key
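A toy in-memory model of why primary keys make redelivery harmless. `Index` is a hypothetical stand-in, not the Meilisearch client; it assumes the document carries its key in an `id` field.

```go
package main

// Index is a toy stand-in for a search index keyed by a primary key.
type Index struct {
	docs map[string]map[string]any
}

func NewIndex() *Index {
	return &Index{docs: make(map[string]map[string]any)}
}

// Upsert adds the document, or replaces the one that has the same "id".
// It reports false when the document has no usable string id.
func (ix *Index) Upsert(doc map[string]any) bool {
	id, ok := doc["id"].(string)
	if !ok || id == "" {
		return false
	}
	ix.docs[id] = doc
	return true
}

// Len returns the number of distinct documents.
func (ix *Index) Len() int { return len(ix.docs) }

// Get returns the document stored under id, if any.
func (ix *Index) Get(id string) (map[string]any, bool) {
	d, ok := ix.docs[id]
	return d, ok
}
```

Writing the same resource twice leaves one document, so a duplicate event costs a redundant write but does not corrupt the index.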
+
+## Bootstrap Process
+
+On startup or when a new IndexPolicy is created, the indexer must populate the
+index with existing resources. The platform spans multiple project control
+planes, so bootstrap must list resources from each cluster.
+
+### Multi-Cluster Bootstrap
+
+The indexer uses the [multicluster-runtime] provider pattern to discover
+project control planes. For each discovered cluster:
+
+1. List resources matching the policy selector from that cluster's API
+2. Transform and index each resource
+3. Handle concurrent modifications during bootstrap gracefully
+
+The provider handles dynamic cluster discovery — as clusters come online or go
+offline, the indexer bootstraps or cleans up accordingly.
+
+After bootstrap completes, real-time indexing continues via the JetStream event
+stream, which already aggregates events from all control planes.
+
+[multicluster-runtime]: https://github.com/kubernetes-sigs/multicluster-runtime
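The per-cluster loop can be sketched as below. `Cluster`, `ErrUnreachable`, and the `transform` and `index` callbacks are hypothetical; the sketch does not use the multicluster-runtime API and assumes discovery has already produced the cluster list.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnreachable is an illustrative error a cluster's List function might return.
var ErrUnreachable = errors.New("cluster unreachable")

// Cluster is a hypothetical handle to one project control plane.
type Cluster struct {
	Name string
	// List returns the resources matching the policy selector.
	List func() ([]map[string]any, error)
}

// Bootstrap lists, transforms, and indexes resources cluster by cluster.
// A failing cluster is recorded and skipped so it does not block the others.
// It returns the number of documents indexed and the failures by cluster name.
func Bootstrap(
	clusters []Cluster,
	transform func(map[string]any) (map[string]any, error),
	index func([]map[string]any) error,
) (int, map[string]error) {
	indexed := 0
	failures := map[string]error{}
	for _, c := range clusters {
		resources, err := c.List()
		if err != nil {
			failures[c.Name] = fmt.Errorf("list: %w", err)
			continue
		}
		docs := make([]map[string]any, 0, len(resources))
		for _, r := range resources {
			doc, err := transform(r)
			if err != nil {
				continue // skip resources that cannot be transformed
			}
			docs = append(docs, doc)
		}
		if len(docs) == 0 {
			continue
		}
		if err := index(docs); err != nil {
			failures[c.Name] = fmt.Errorf("index: %w", err)
			continue
		}
		indexed += len(docs)
	}
	return indexed, failures
}
```

Returning failures per cluster lets the caller retry only the clusters that did not complete.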
+
+## Error Handling
+
+- **Transient failures**: Retry with exponential backoff for network errors and
+  temporary unavailability
+- **Malformed events**: Log and skip events that cannot be parsed; acknowledge
+  to prevent redelivery loops
+- **Backend unavailability**: Buffer events in memory (bounded) while attempting
+  reconnection; pause consumption if buffer fills
+- **Policy evaluation errors**: Log and skip events with CEL evaluation
+  failures; do not block processing of other events
+
+## Integration Points
+
+| System | Protocol | Purpose |
+|--------|----------|---------|
+| [NATS JetStream][jetstream] | NATS | Consume audit log events (aggregated from all clusters) |
+| Search API Server | HTTPS | Watch IndexPolicy resources |
+| Project Control Planes | HTTPS | Bootstrap existing resources |
+| [Meilisearch] | HTTPS/JSON | Persist indexed documents |
+
+## Future Considerations
+
+- **Control plane deletion**: When a project control plane is deleted, indexed
+  resources from that cluster must be cleaned up. Ideally, the platform emits
+  deletion events for all resources before the control plane is removed,
+  allowing event-driven cleanup. If this isn't guaranteed, the indexer may need
+  to track source cluster metadata and delete documents when a cluster is
+  disengaged.
+- **Dead letter handling**: Route persistently failing events to a dead letter
+  queue for manual inspection
+- **Metrics and observability**: Expose indexing lag, throughput, and error
+  rates via Prometheus
+- **Multi-tenancy**: Support tenant-isolated indexes with policy-based routing
+- **Policy-based sharding**: For very large deployments, assign subsets of
+  policies to instances using consistent hashing
diff --git a/docs/diagrams/ResourceIndexerScaling.png b/docs/diagrams/ResourceIndexerScaling.png
diff --git a/docs/diagrams/resource-indexer-scaling.puml b/docs/diagrams/resource-indexer-scaling.puml
@@ -0,0 +1,26 @@
+@startuml ResourceIndexerScaling
+!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Container.puml
+
+title Resource Indexer Horizontal Scaling
+
+System_Ext(jetstream, "NATS JetStream", "Audit log event stream aggregated from all control planes")
+
+System_Boundary(indexerGroup, "Queue Group: resource-indexer") {
+    Container(indexer1, "Indexer #1", "Go", "Processes subset of events")
+    Container(indexer2, "Indexer #2", "Go", "Processes subset of events")
+    Container(indexer3, "Indexer #3", "Go", "Processes subset of events")
+}
+
+System_Ext(meilisearch, "Meilisearch", "Search index backend")
+
+Rel_D(jetstream, indexer1, "Delivers event", "NATS")
+Rel_D(jetstream, indexer2, "Delivers event", "NATS")
+Rel_D(jetstream, indexer3, "Delivers event", "NATS")
+
+Rel_D(indexer1, meilisearch, "Writes documents", "HTTPS")
+Rel_D(indexer2, meilisearch, "Writes documents", "HTTPS")
+Rel_D(indexer3, meilisearch, "Writes documents", "HTTPS")
+
+SHOW_LEGEND()
+
+@enduml