|
| 1 | +<!-- omit from toc --> |
| 2 | +# Resource Indexer Architecture |
| 3 | + |
| 4 | +- [Overview](#overview) |
| 5 | +- [Design Goals](#design-goals) |
| 6 | +- [Core Responsibilities](#core-responsibilities) |
| 7 | +- [Event Consumption](#event-consumption) |
| 8 | + - [Horizontal Scaling](#horizontal-scaling) |
| 9 | +- [Policy Management](#policy-management) |
| 10 | + - [CEL Compilation](#cel-compilation) |
| 11 | +- [Document Lifecycle](#document-lifecycle) |
| 12 | +- [Persistence and Acknowledgment](#persistence-and-acknowledgment) |
| 13 | + - [Batching](#batching) |
| 14 | + - [Duplicate Handling](#duplicate-handling) |
| 15 | +- [Bootstrap Process](#bootstrap-process) |
| 16 | + - [Multi-Cluster Bootstrap](#multi-cluster-bootstrap) |
| 17 | +- [Error Handling](#error-handling) |
| 18 | +- [Integration Points](#integration-points) |
| 19 | +- [Future Considerations](#future-considerations) |
| 20 | + |
| 21 | + |
| 22 | +## Overview |
| 23 | + |
| 24 | +The Resource Indexer is a core component of the Search service responsible for |
| 25 | +maintaining a searchable index of platform resources. It consumes audit log |
| 26 | +events from NATS JetStream, applies policy-based filtering, and writes indexed |
| 27 | +documents to the search backend. |
| 28 | + |
| 29 | +## Design Goals |
| 30 | + |
| 31 | +- **Real-time indexing**: Process resource changes within seconds of occurrence |
| 32 | +- **Policy-driven**: Index only resources matching active IndexPolicy |
| 33 | + configurations |
| 34 | +- **Reliable delivery**: Guarantee at-least-once processing of all events |
| 35 | +- **Graceful recovery**: Resume processing from last known position after |
| 36 | + restarts |
| 37 | +- **Horizontal scalability**: Scale throughput by adding instances without |
| 38 | + coordination |
| 39 | +- **Minimal resource footprint**: Operate efficiently within constrained |
| 40 | + environments |
| 41 | + |
| 42 | +## Core Responsibilities |
| 43 | + |
| 44 | +The Resource Indexer handles: |
| 45 | + |
| 46 | +- Consuming audit log events from NATS JetStream |
| 47 | +- Watching IndexPolicy resources and evaluating CEL filters |
| 48 | +- Transforming Kubernetes resources into searchable documents |
| 49 | +- Persisting documents to the index backend |
| 50 | +- Acknowledging events only after successful persistence |
| 51 | + |
| 52 | +### Event Processing Flow |
| 53 | + |
| 54 | +The following diagram illustrates how the indexer processes events, including |
| 55 | +policy matching, batching, and acknowledgment handling: |
| 56 | + |
| 57 | +```mermaid |
| 58 | +sequenceDiagram |
| 59 | + participant JS as NATS JetStream |
| 60 | + participant Indexer as Resource Indexer |
| 61 | + participant Cache as Policy Cache |
| 62 | + participant Meili as Meilisearch |
| 63 | +
|
| 64 | + rect rgb(240, 248, 255) |
| 65 | + note right of JS: Create/Update Matches Policy |
| 66 | + JS->>Indexer: Deliver audit event |
| 67 | + Indexer->>Cache: Evaluate policies |
| 68 | + Cache-->>Indexer: Policy match + compiled CEL |
| 69 | + Indexer->>Indexer: Evaluate CEL filter |
| 70 | + Indexer->>Indexer: Transform resource to document |
| 71 | + Indexer->>Indexer: Add upsert to batch |
| 72 | +
|
| 73 | + alt Batch ready (size or time threshold) |
| 74 | + Indexer->>Meili: Persist document batch |
| 75 | + Meili-->>Indexer: Success |
| 76 | + Indexer->>JS: Ack all events in batch |
| 77 | + end |
| 78 | + end |
| 79 | +
|
| 80 | + rect rgb(255, 248, 240) |
| 81 | + note right of JS: Update No Longer Matches Policy |
| 82 | + JS->>Indexer: Deliver audit event (update) |
| 83 | + Indexer->>Cache: Evaluate policies |
| 84 | + Cache-->>Indexer: No matching policy |
| 85 | + Indexer->>Indexer: Add delete to batch |
| 86 | + Indexer->>Meili: Delete document by UID |
| 87 | + Meili-->>Indexer: Success |
| 88 | + Indexer->>JS: Ack |
| 89 | + end |
| 90 | +
|
| 91 | + rect rgb(255, 245, 238) |
| 92 | + note right of JS: Resource Deleted |
| 93 | + JS->>Indexer: Deliver audit event (delete) |
| 94 | + Indexer->>Indexer: Add delete to batch |
| 95 | + Indexer->>Meili: Delete document by UID |
| 96 | + Meili-->>Indexer: Success |
| 97 | + Indexer->>JS: Ack |
| 98 | + end |
| 99 | +
|
| 100 | + rect rgb(255, 240, 240) |
| 101 | + note right of JS: Persistence Failure |
| 102 | + JS->>Indexer: Deliver audit event |
| 103 | + Indexer->>Indexer: Transform and batch |
| 104 | + Indexer->>Meili: Persist document batch |
| 105 | + Meili-->>Indexer: Error |
| 106 | + note right of Indexer: Do not ack — JetStream<br/>redelivers after timeout |
| 107 | + end |
| 108 | +``` |
| 109 | + |
| 110 | +## Event Consumption |
| 111 | + |
| 112 | +The indexer consumes audit log events from [NATS JetStream][jetstream] using |
| 113 | +[durable consumers][durable-consumers]. JetStream provides: |
| 114 | + |
| 115 | +- **Delivery guarantees**: At-least-once delivery with configurable ack timeouts |
| 116 | +- **Position tracking**: Durable consumers track acknowledged messages; on |
| 117 | + restart, consumption resumes from the last acknowledged position |
| 118 | +- **Backpressure**: Pull-based consumption allows the indexer to control its |
| 119 | + processing rate |
| 120 | + |
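The position-tracking and backpressure properties listed above can be illustrated with a small in-memory model. This is a sketch only: `DurableConsumer`, `fetch`, and `ack` are hypothetical names and not the NATS client API, sequence numbers are zero-based for brevity, and unacknowledged messages are returned on the very next fetch rather than after an ack timeout.

```python
from dataclasses import dataclass, field


@dataclass
class DurableConsumer:
    """In-memory stand-in for a durable pull consumer (not the NATS client API)."""

    stream: list[str]
    acked: set[int] = field(default_factory=set)

    def fetch(self, batch_size: int) -> list[tuple[int, str]]:
        """Return up to batch_size unacknowledged (sequence, payload) pairs.

        Pull-based: the caller chooses batch_size, which is how it applies
        backpressure. Unacked messages come back on the next fetch; a real
        consumer would wait for the ack timeout first.
        """
        pending = [
            (seq, payload)
            for seq, payload in enumerate(self.stream)
            if seq not in self.acked
        ]
        return pending[:batch_size]

    def ack(self, seq: int) -> None:
        """Record that the message at seq was fully processed."""
        self.acked.add(seq)


consumer = DurableConsumer(stream=["create pod-a", "update pod-a", "delete pod-a"])
first = consumer.fetch(batch_size=2)
consumer.ack(first[0][0])               # only the first event was persisted
resumed = consumer.fetch(batch_size=2)  # e.g. after a restart
print(resumed)  # [(1, 'update pod-a'), (2, 'delete pod-a')]
```

Because the acknowledged set lives with the consumer rather than with the process, a restarted indexer picks up at the first unacknowledged message without keeping any offset of its own.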
| 121 | +[jetstream]: https://docs.nats.io/nats-concepts/jetstream |
| 122 | +[durable-consumers]: https://docs.nats.io/nats-concepts/jetstream/consumers#durable-consumers |
| 123 | + |
| 124 | +### Horizontal Scaling |
| 125 | + |
| 126 | +The indexer uses JetStream [queue groups] for horizontal scaling. When multiple |
| 127 | +instances join the same queue group, JetStream distributes messages across them, |
| 128 | +so each delivery goes to one instance (a redelivery may go to a different one). |
| 129 | + |
| 130 | +<p align="center"> |
| 131 | + <img src="../diagrams/ResourceIndexerScaling.png" alt="Resource Indexer horizontal scaling diagram"> |
| 132 | +</p> |
| 133 | + |
| 134 | +This lets throughput scale with the number of instances, without coordination between them. |
| 135 | + |
| 136 | +[queue groups]: https://docs.nats.io/nats-concepts/core-nats/queue |
| 137 | + |
| 138 | +## Policy Management |
| 139 | + |
| 140 | +IndexPolicy resources define what to index. The indexer watches these resources |
| 141 | +using a Kubernetes [informer], which provides: |
| 142 | + |
| 143 | +- **List-watch semantics**: Initial list of all policies followed by a watch |
| 144 | + stream for changes |
| 145 | +- **Local cache**: In-memory store for fast lookups during event processing |
| 146 | +- **Automatic resync**: Periodic re-list to correct any drift |
| 147 | + |
| 148 | +Each indexer instance maintains its own policy cache. Since events can be routed |
| 149 | +to any instance (via queue groups), each instance caches all policies. |
| 150 | +IndexPolicy resources are typically small and few in number, so this |
| 151 | +replication is acceptable. |
| 152 | + |
| 153 | +### CEL Compilation |
| 154 | + |
| 155 | +[CEL expressions][CEL] in policies must be compiled before evaluation. To avoid |
| 156 | +recompilation on every event, compile expressions when policies are added or |
| 157 | +updated and cache the compiled programs alongside the policy. |
| 158 | + |
| 159 | +The indexer should wait for the informer cache to sync before processing events |
| 160 | +to ensure all active policies are available for matching. |
| 161 | + |
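A minimal sketch of the compile-on-change cache and the sync gate described above. Everything here is an assumption for illustration: `compile_filter` is a toy stand-in that only understands `a.b.c == "literal"` and is not CEL, and `PolicyCache` with its handler names is hypothetical rather than the informer API.

```python
import re
from typing import Callable

Program = Callable[[dict], bool]
_EXPR = re.compile(r'^\s*([\w.]+)\s*==\s*"([^"]*)"\s*$')


def compile_filter(expr: str) -> Program:
    """Toy stand-in for CEL compilation: supports only `a.b.c == "literal"`."""
    m = _EXPR.match(expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    path, literal = m.group(1).split("."), m.group(2)

    def program(resource: dict) -> bool:
        node = resource
        for key in path:
            if not isinstance(node, dict) or key not in node:
                return False
            node = node[key]
        return node == literal

    return program


class PolicyCache:
    """Holds one compiled program per policy; handlers mirror informer events."""

    def __init__(self) -> None:
        self._programs: dict[str, Program] = {}
        self.compilations = 0
        self.synced = False  # set once the initial list has been processed

    def on_add_or_update(self, name: str, expr: str) -> None:
        # Compile once per policy change, never per event.
        self._programs[name] = compile_filter(expr)
        self.compilations += 1

    def on_delete(self, name: str) -> None:
        self._programs.pop(name, None)

    def matching(self, resource: dict) -> list[str]:
        """Return the names of policies whose filter accepts the resource."""
        if not self.synced:
            raise RuntimeError("policy cache has not synced yet")
        return sorted(n for n, p in self._programs.items() if p(resource))


cache = PolicyCache()
cache.on_add_or_update("prod-only", 'metadata.namespace == "prod"')
cache.synced = True
print(cache.matching({"metadata": {"namespace": "prod"}}))  # ['prod-only']
```

Refusing to match before `synced` is set makes the "wait for the informer cache" requirement explicit: an event evaluated against a partially populated cache could otherwise be treated as matching no policy.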
| 162 | +[informer]: https://pkg.go.dev/k8s.io/client-go/tools/cache#SharedInformer |
| 163 | +[CEL]: https://cel.dev |
| 164 | + |
| 165 | +## Document Lifecycle |
| 166 | + |
| 167 | +The indexer manages documents in the search index based on audit events: |
| 168 | + |
| 169 | +| Event Type | Policy Match | Action | |
| 170 | +|------------|--------------|--------| |
| 171 | +| Create | Yes | Upsert document | |
| 172 | +| Update | Yes | Upsert document | |
| 173 | +| Update | No | Delete document (was previously indexed) | |
| 174 | +| Delete | — | Delete document | |
| 175 | + |
| 176 | +When a resource is updated and no longer matches any policy (e.g., labels |
| 177 | +changed, CEL filter no longer passes), the indexer deletes the document from the |
| 178 | +index. This ensures the index doesn't accumulate stale documents for resources |
| 179 | +that no longer meet indexing criteria. |
| 180 | + |
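The table above maps directly onto a small decision function. The sketch below assumes lowercase event type strings and adds one case the table leaves open: a create event that matches no policy is treated as a skip, since no document could exist for it yet.

```python
from enum import Enum


class Action(Enum):
    UPSERT = "upsert"
    DELETE = "delete"
    SKIP = "skip"


def decide_action(event_type: str, matches_policy: bool) -> Action:
    """Pick the index operation for an audit event (event names are assumed)."""
    if event_type == "delete":
        return Action.DELETE  # policy match is irrelevant for deletions
    if event_type in ("create", "update"):
        if matches_policy:
            return Action.UPSERT
        # An update that stopped matching may have been indexed earlier,
        # so its document is removed; an unmatched create has nothing to remove.
        return Action.DELETE if event_type == "update" else Action.SKIP
    raise ValueError(f"unknown event type: {event_type!r}")


print(decide_action("update", matches_policy=False))  # Action.DELETE
```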
| 181 | +### Transformation |
| 182 | + |
| 183 | +When an event matches a policy, the indexer transforms the Kubernetes resource |
| 184 | +into a searchable document: |
| 185 | + |
| 186 | +- Extract fields specified in the IndexPolicy field mappings |
| 187 | +- Normalize metadata (labels, annotations) into searchable formats |
| 188 | +- Use the resource's UID as the document identifier |
| 189 | + |
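The three transformation steps could look roughly like the following. The shape of the field mappings (document field name to dotted path) and the `key=value` normalization of labels and annotations are assumptions made for this sketch; the IndexPolicy schema may define them differently.

```python
from typing import Any, Optional


def get_path(obj: Any, dotted_path: str) -> Optional[Any]:
    """Walk a dotted path through nested dicts; return None if any key is missing."""
    node = obj
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def to_document(resource: dict, field_mappings: dict[str, str]) -> dict:
    """Build a flat, searchable document from a Kubernetes-style resource."""
    meta = resource.get("metadata", {})
    doc: dict[str, Any] = {"id": meta["uid"]}  # resource UID is the document id
    for target, path in field_mappings.items():
        value = get_path(resource, path)
        if value is not None:  # unmapped or absent fields are simply omitted
            doc[target] = value
    # Assumed normalization: maps become sorted "key=value" strings.
    doc["labels"] = sorted(f"{k}={v}" for k, v in meta.get("labels", {}).items())
    doc["annotations"] = sorted(
        f"{k}={v}" for k, v in meta.get("annotations", {}).items()
    )
    return doc


deployment = {
    "metadata": {"uid": "1234-abcd", "name": "web", "labels": {"tier": "fe", "app": "web"}},
    "spec": {"replicas": 3},
}
print(to_document(deployment, {"name": "metadata.name", "replicas": "spec.replicas"}))
```

Keying the document on the UID rather than on namespace and name means a resource that is deleted and recreated under the same name produces a new document instead of silently inheriting the old one.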
| 190 | +## Persistence and Acknowledgment |
| 191 | + |
| 192 | +Documents are persisted to the index backend ([Meilisearch]). To guarantee |
| 193 | +at-least-once delivery, events are only acknowledged after successful |
| 194 | +persistence. |
| 195 | + |
| 196 | +[Meilisearch]: https://www.meilisearch.com/docs |
| 197 | + |
| 198 | +### Batching |
| 199 | + |
| 200 | +For efficiency, batch multiple documents into a single write request. When a |
| 201 | +batch completes: |
| 202 | + |
| 203 | +1. Persist all documents to the index backend |
| 204 | +2. On success, acknowledge all events in the batch |
| 205 | +3. On failure, do not acknowledge — JetStream redelivers after ack timeout |
| 206 | + |
| 207 | +Create events that match no policy should be acknowledged immediately to prevent |
| 208 | +reprocessing. Updates that match no policy are acknowledged after the delete persists. |
| 209 | + |
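The persist-then-acknowledge rule can be sketched as a small batcher that flushes on a size or age threshold. The class name, parameters, and default thresholds are illustrative assumptions; `persist` and `ack` stand in for the index write and the JetStream acknowledgment.

```python
import time
from typing import Callable, Optional


class Batcher:
    """Collects index operations and acks their events only after persistence."""

    def __init__(
        self,
        persist: Callable[[list[dict]], None],
        ack: Callable[[int], None],
        max_size: int = 100,
        max_age: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._persist, self._ack = persist, ack
        self._max_size, self._max_age, self._clock = max_size, max_age, clock
        self._pending: list[tuple[int, dict]] = []  # (event sequence, operation)
        self._opened_at: Optional[float] = None

    def add(self, seq: int, operation: dict) -> None:
        if not self._pending:
            self._opened_at = self._clock()
        self._pending.append((seq, operation))
        self.flush_if_ready()

    def flush_if_ready(self) -> bool:
        """Flush when the batch is full or old enough; call this from a timer too."""
        if not self._pending:
            return False
        full = len(self._pending) >= self._max_size
        stale = self._clock() - self._opened_at >= self._max_age
        return self.flush() if (full or stale) else False

    def flush(self) -> bool:
        batch, self._pending = self._pending, []
        if not batch:
            return False
        try:
            self._persist([op for _, op in batch])
        except Exception:
            # No ack: the events are redelivered after the ack timeout,
            # so the failed batch is dropped from memory here.
            return False
        for seq, _ in batch:
            self._ack(seq)
        return True


store: list[dict] = []
acked: list[int] = []
batcher = Batcher(persist=store.extend, ack=acked.append, max_size=2, max_age=60.0)
batcher.add(1, {"id": "a"})
batcher.add(2, {"id": "b"})  # size threshold reached: persist, then ack both
print(acked)  # [1, 2]
```

Dropping a failed batch instead of retrying it in place keeps the sketch short; it relies entirely on redelivery, which is the behavior step 3 describes.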
| 210 | +### Duplicate Handling |
| 211 | + |
| 212 | +At-least-once delivery means duplicates are possible (e.g., after a failure |
| 213 | +before acknowledgment). The index backend handles this via [document primary |
| 214 | +keys][meilisearch-primary-key] — reindexing the same resource overwrites the |
| 215 | +existing document. |
| 216 | + |
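The effect of primary keys on redelivered events can be shown with a dictionary-backed stand-in for the index. `FakeIndex` is hypothetical and models only the overwrite-by-key behavior, not the Meilisearch API.

```python
class FakeIndex:
    """Dictionary-backed stand-in for an index keyed by primary key."""

    def __init__(self) -> None:
        self.docs: dict[str, dict] = {}

    def upsert(self, documents: list[dict]) -> None:
        for doc in documents:
            self.docs[doc["id"]] = doc  # same id overwrites, never duplicates

    def delete(self, ids: list[str]) -> None:
        for doc_id in ids:
            self.docs.pop(doc_id, None)  # deleting a missing id is a no-op


index = FakeIndex()
event_doc = {"id": "uid-1", "name": "web", "replicas": 3}
index.upsert([event_doc])
index.upsert([event_doc])   # redelivery of the same event
index.delete(["uid-2"])     # redelivered delete for an absent document
print(len(index.docs))  # 1
```

Both operations are idempotent, so processing an event twice leaves the index in the same state as processing it once.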
| 217 | +[meilisearch-primary-key]: https://www.meilisearch.com/docs/learn/core_concepts/primary_key |
| 218 | + |
| 219 | +## Bootstrap Process |
| 220 | + |
| 221 | +On startup or when a new IndexPolicy is created, the indexer must populate the |
| 222 | +index with existing resources. The platform spans multiple project control |
| 223 | +planes, so bootstrap must list resources from each cluster. |
| 224 | + |
| 225 | +### Multi-Cluster Bootstrap |
| 226 | + |
| 227 | +The indexer uses the [multicluster-runtime] provider pattern to discover |
| 228 | +project control planes. For each discovered cluster: |
| 229 | + |
| 230 | +1. List resources matching the policy selector from that cluster's API |
| 231 | +2. Transform and index each resource |
| 232 | +3. Handle concurrent modifications during bootstrap gracefully |
| 233 | + |
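The per-cluster steps above might be sketched as follows. The `clusters` mapping of name to list function is a hypothetical stand-in for whatever the multicluster-runtime provider exposes, and the `cluster` field on each document is an assumption. The sketch relies on idempotent upserts to tolerate overlap with live events and does not attempt to order bootstrap writes against them.

```python
from typing import Callable

Lister = Callable[[], list[dict]]


def bootstrap(
    clusters: dict[str, Lister],
    matches: Callable[[dict], bool],
    upsert: Callable[[list[dict]], None],
) -> tuple[dict[str, int], list[str]]:
    """Index existing resources per cluster; return (indexed counts, failed clusters)."""
    indexed: dict[str, int] = {}
    failed: list[str] = []
    for name, list_resources in clusters.items():
        try:
            resources = list_resources()  # step 1: list from that cluster's API
        except Exception:
            failed.append(name)  # one unreachable cluster must not stop the rest
            continue
        docs = [
            {"id": r["metadata"]["uid"], "cluster": name, "name": r["metadata"]["name"]}
            for r in resources
            if matches(r)
        ]
        if docs:
            upsert(docs)  # step 2: transform and index
        indexed[name] = len(docs)
    return indexed, failed


def pod(uid: str, namespace: str) -> dict:
    return {"metadata": {"uid": uid, "name": uid, "namespace": namespace}}


written: list[dict] = []
result = bootstrap(
    {"project-a": lambda: [pod("u1", "prod"), pod("u2", "dev")], "project-b": lambda: []},
    matches=lambda r: r["metadata"]["namespace"] == "prod",
    upsert=written.extend,
)
print(result)  # ({'project-a': 1, 'project-b': 0}, [])
```

Returning the failed cluster names lets the caller retry those clusters later without repeating the ones that succeeded.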
| 234 | +The provider handles dynamic cluster discovery — as clusters come online or go |
| 235 | +offline, the indexer bootstraps or cleans up accordingly. |
| 236 | + |
| 237 | +After bootstrap completes, real-time indexing continues via the JetStream event |
| 238 | +stream, which already aggregates events from all control planes. |
| 239 | + |
| 240 | +[multicluster-runtime]: https://github.com/kubernetes-sigs/multicluster-runtime |
| 241 | + |
| 242 | +## Error Handling |
| 243 | + |
| 244 | +- **Transient failures**: Retry with exponential backoff for network errors and |
| 245 | + temporary unavailability |
| 246 | +- **Malformed events**: Log and skip events that cannot be parsed; acknowledge |
| 247 | + to prevent redelivery loops |
| 248 | +- **Backend unavailability**: Buffer events in memory (bounded) while attempting |
| 249 | + reconnection; pause consumption if buffer fills |
| 250 | +- **Policy evaluation errors**: Log, acknowledge, and skip events with CEL |
| 251 | + evaluation failures; do not block processing of other events |
| 252 | + |
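Two of the strategies above, exponential backoff for transient failures and acknowledge-and-skip for malformed events, can be sketched as follows. `TransientError`, the delay values, and the function names are assumptions for illustration; how failures are classified as transient is left open.

```python
import json
import logging
import time
from typing import Any, Callable

log = logging.getLogger("indexer")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, temporary outages)."""


def retry_with_backoff(
    operation: Callable[[], Any],
    attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Run operation, doubling the wait after each transient failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # give up; the caller leaves the events unacknowledged
            sleep(delay)
            delay = min(delay * 2, max_delay)


def handle_event(raw: bytes, process: Callable[[dict], None], ack: Callable[[], None]) -> bool:
    """Parse and process one event; malformed payloads are logged, acked, and skipped."""
    try:
        event = json.loads(raw)
    except ValueError:  # covers JSON syntax errors and invalid UTF-8
        log.warning("skipping malformed event: %r", raw[:80])
        ack()  # acknowledging prevents an endless redelivery loop
        return False
    process(event)
    return True
```

Injecting `sleep` keeps the retry helper testable without real delays, and capping the delay at `max_delay` bounds how long a single batch can stall the pipeline.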
| 253 | +## Integration Points |
| 254 | + |
| 255 | +| System | Protocol | Purpose | |
| 256 | +|--------|----------|---------| |
| 257 | +| [NATS JetStream][jetstream] | NATS | Consume audit log events (aggregated from all clusters) | |
| 258 | +| Search API Server | HTTPS | Watch IndexPolicy resources | |
| 259 | +| Project Control Planes | HTTPS | Bootstrap existing resources | |
| 260 | +| [Meilisearch] | HTTPS/JSON | Persist indexed documents | |
| 261 | + |
| 262 | +## Future Considerations |
| 263 | + |
| 264 | +- **Control plane deletion**: When a project control plane is deleted, indexed |
| 265 | + resources from that cluster must be cleaned up. Ideally, the platform emits |
| 266 | + deletion events for all resources before the control plane is removed, |
| 267 | + allowing event-driven cleanup. If this isn't guaranteed, the indexer may need |
| 268 | + to track source cluster metadata and delete documents when a cluster is |
| 269 | + disengaged. |
| 270 | +- **Dead letter handling**: Route persistently failing events to a dead letter |
| 271 | + queue for manual inspection |
| 272 | +- **Metrics and observability**: Expose indexing lag, throughput, and error |
| 273 | + rates via Prometheus |
| 274 | +- **Multi-tenancy**: Support tenant-isolated indexes with policy-based routing |
| 275 | +- **Policy-based sharding**: For very large deployments, assign subsets of |
| 276 | + policies to instances using consistent hashing |