feat: define resource indexer architecture

scotwells · scotwells · commit 5af8c63d9cc3 · 2026-01-26T17:36:23.000-06:00
Creates a new architecture document that goes into detail on the design
of the indexing service.
diff --git a/README.md b/README.md
@@ -8,3 +8,10 @@ using CEL-based filtering. The service integrates natively with kubectl/RBAC and
 targets Meilisearch as the search backend.
 
 ![](./docs/diagrams/SearchServiceContext.png)
+
+## Documentation
+
+- [Architecture](./docs/architecture.md) — High-level design and component
+  overview
+- [Resource Indexer](./docs/components/resource-indexer.md) — Detailed design
+  for the indexing component
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -52,6 +52,9 @@ using powerful indexing and real-time event processing.
 - Manage index lifecycle (creation, updates, deletion)
 - Bootstrap indexes from existing state
 
+See the [Resource Indexer Architecture](./components/resource-indexer.md) for
+detailed design documentation.
+
 ### Controller Manager
 
 **Purpose**: Manages and validates resources for the search service
diff --git a/docs/components/resource-indexer.md b/docs/components/resource-indexer.md
@@ -0,0 +1,198 @@
+<!-- omit from toc -->
+# Resource Indexer Architecture
+
+- [Overview](#overview)
+- [Design Goals](#design-goals)
+- [Core Responsibilities](#core-responsibilities)
+- [Event Consumption](#event-consumption)
+  - [Horizontal Scaling](#horizontal-scaling)
+- [Policy Management](#policy-management)
+  - [CEL Compilation](#cel-compilation)
+- [Document Transformation](#document-transformation)
+- [Persistence and Acknowledgment](#persistence-and-acknowledgment)
+  - [Batching](#batching)
+  - [Duplicate Handling](#duplicate-handling)
+- [Bootstrap Process](#bootstrap-process)
+  - [Multi-Cluster Bootstrap](#multi-cluster-bootstrap)
+- [Error Handling](#error-handling)
+- [Integration Points](#integration-points)
+- [Future Considerations](#future-considerations)
+
+## Overview
+
+The Resource Indexer is a core component of the Search service responsible for
+maintaining a searchable index of platform resources. It consumes audit log
+events from NATS JetStream, applies policy-based filtering, and writes indexed
+documents to the search backend.
+
+## Design Goals
+
+- **Real-time indexing**: Process resource changes within seconds of occurrence
+- **Policy-driven**: Index only resources matching active IndexPolicy
+  configurations
+- **Reliable delivery**: Guarantee at-least-once processing of all events
+- **Graceful recovery**: Resume processing from last known position after
+  restarts
+- **Minimal resource footprint**: Operate efficiently within constrained
+  environments
+
+## Core Responsibilities
+
+The Resource Indexer handles:
+
+- Consuming audit log events from NATS JetStream
+- Watching IndexPolicy resources and evaluating CEL filters
+- Transforming Kubernetes resources into searchable documents
+- Persisting documents to the index backend
+- Acknowledging events only after successful persistence
+
+## Event Consumption
+
+The indexer consumes audit log events from NATS JetStream using durable
+consumers. JetStream provides:
+
+- **Delivery guarantees**: At-least-once delivery with configurable ack timeouts
+- **Position tracking**: Durable consumers track acknowledged messages; on
+  restart, consumption resumes from the last acknowledged position
+- **Backpressure**: Pull-based consumption allows the indexer to control its
+  processing rate
+
+### Horizontal Scaling
+
+The indexer uses JetStream [queue groups] for horizontal scaling. Instances in
+the same queue group share the stream: each delivery goes to one instance, and
+an unacknowledged message may be redelivered to a different instance.
+
+```
+                       Queue Group: "resource-indexer"
+                                  │
+          ┌───────────────────────┼───────────────────────┐
+          │                       │                       │
+          ▼                       ▼                       ▼
+   ┌──────────────┐       ┌──────────────┐       ┌──────────────┐
+   │  Indexer #1  │       │  Indexer #2  │       │  Indexer #3  │
+   └──────────────┘       └──────────────┘       └──────────────┘
+```
+
+This enables throughput to scale with instance count without coordination between instances.
+
+[queue groups]: https://docs.nats.io/nats-concepts/core-nats/queue
+
+## Policy Management
+
+IndexPolicy resources define what to index. The indexer watches these resources
+using a Kubernetes [informer], which provides:
+
+- **List-watch semantics**: Initial list of all policies followed by a watch
+  stream for changes
+- **Local cache**: In-memory store for fast lookups during event processing
+- **Automatic resync**: Periodic replay of cached policies to correct handler drift
+
+Each indexer instance maintains its own policy cache. Since events can be routed
+to any instance (via queue groups), each instance caches all policies.
+IndexPolicy resources are typically small and few in number, so this
+replication is acceptable.
+
+### CEL Compilation
+
+[CEL expressions][CEL] in policies must be compiled before evaluation. To avoid
+recompilation on every event, compile expressions when policies are added or
+updated and cache the compiled programs alongside the policy.
+
+The indexer should wait for the informer cache to sync before processing events
+to ensure all active policies are available for matching.
+
+[informer]: https://pkg.go.dev/k8s.io/client-go/tools/cache#SharedInformer
+[CEL]: https://cel.dev
+
+## Document Transformation
+
+When an event matches a policy, the indexer transforms the Kubernetes resource
+into a searchable document:
+
+- Extract fields specified in the IndexPolicy field mappings
+- Normalize metadata (labels, annotations) into searchable formats
+- Use the resource's UID as the document identifier
+
+## Persistence and Acknowledgment
+
+Documents are persisted to the index backend (Meilisearch). To guarantee
+at-least-once processing, events are acknowledged only after the backend
+confirms that the write succeeded, not merely that it was accepted.
+
+### Batching
+
+For efficiency, batch multiple documents into a single write request. When a
+batch is flushed:
+
+1. Persist all documents to the index backend
+2. On success, acknowledge all events in the batch
+3. On failure, do not acknowledge; JetStream redelivers after the ack timeout
+
+Events that don't match any policy should be acknowledged immediately to prevent
+reprocessing.
+
+### Duplicate Handling
+
+At-least-once delivery means duplicates are possible (e.g., after a failure
+before acknowledgment). The index backend handles this via document ID upserts —
+reindexing the same resource overwrites the existing document.
+
+## Bootstrap Process
+
+On startup or when a new IndexPolicy is created, the indexer must populate the
+index with existing resources. The platform spans multiple project control
+planes, so bootstrap must list resources from each cluster.
+
+### Multi-Cluster Bootstrap
+
+The indexer uses the [multicluster-runtime] provider pattern to discover
+project control planes. For each discovered cluster:
+
+1. List resources matching the policy selector from that cluster's API
+2. Transform and index each resource
+3. Handle concurrent modifications during bootstrap gracefully
+
+The provider handles dynamic cluster discovery — as clusters come online or go
+offline, the indexer bootstraps or cleans up accordingly.
+
+After bootstrap completes, real-time indexing continues via the JetStream event
+stream, which already aggregates events from all control planes.
+
+[multicluster-runtime]: https://github.com/kubernetes-sigs/multicluster-runtime
+
+## Error Handling
+
+- **Transient failures**: Retry with exponential backoff for network errors and
+  temporary unavailability
+- **Malformed events**: Log and skip events that cannot be parsed; acknowledge
+  to prevent redelivery loops
+- **Backend unavailability**: Buffer events in memory (bounded) while attempting
+  reconnection; pause consumption if the buffer fills
+- **Policy evaluation errors**: Log and skip events with CEL evaluation
+  failures; do not block processing of other events
+
+## Integration Points
+
+| System | Protocol | Purpose |
+|--------|----------|---------|
+| NATS JetStream | NATS | Consume audit log events (aggregated from all clusters) |
+| Search API Server | HTTPS | Watch IndexPolicy resources |
+| Project Control Planes | HTTPS | Bootstrap existing resources |
+| Meilisearch | HTTPS/JSON | Persist indexed documents |
+
+## Future Considerations
+
+- **Control plane deletion**: When a project control plane is deleted, indexed
+  resources from that cluster must be cleaned up. Ideally, the platform emits
+  deletion events for all resources before the control plane is removed,
+  allowing event-driven cleanup. If this isn't guaranteed, the indexer may need
+  to track source cluster metadata and delete documents when a cluster is
+  disengaged.
+- **Dead letter handling**: Route persistently failing events to a dead letter
+  queue for manual inspection
+- **Metrics and observability**: Expose indexing lag, throughput, and error
+  rates via Prometheus
+- **Multi-tenancy**: Support tenant-isolated indexes with policy-based routing
+- **Policy-based sharding**: For very large deployments, assign subsets of
+  policies to instances using consistent hashing