| 1 | +<!-- omit from toc --> |
| 2 | +# Resource Indexer Architecture |
| 3 | + |
| 4 | +- [Overview](#overview) |
| 5 | +- [Design Goals](#design-goals) |
| 6 | +- [Core Responsibilities](#core-responsibilities) |
| 7 | +- [Event Consumption](#event-consumption) |
| 8 | + - [Horizontal Scaling](#horizontal-scaling) |
| 9 | +- [Policy Management](#policy-management) |
| 10 | + - [CEL Compilation](#cel-compilation) |
| 11 | +- [Document Transformation](#document-transformation) |
| 12 | +- [Persistence and Acknowledgment](#persistence-and-acknowledgment) |
| 13 | + - [Batching](#batching) |
| 14 | + - [Duplicate Handling](#duplicate-handling) |
| 15 | +- [Bootstrap Process](#bootstrap-process) |
| | + - [Multi-Cluster Bootstrap](#multi-cluster-bootstrap) |
| 16 | +- [Error Handling](#error-handling) |
| 17 | +- [Integration Points](#integration-points) |
| 18 | +- [Future Considerations](#future-considerations) |
| 19 | + |
| 20 | + |
| 21 | +## Overview |
| 22 | + |
| 23 | +The Resource Indexer is a core component of the Search service responsible for |
| 24 | +maintaining a searchable index of platform resources. It consumes audit log |
| 25 | +events from NATS JetStream, applies policy-based filtering, and writes indexed |
| 26 | +documents to the search backend. |
| 27 | + |
| 28 | +## Design Goals |
| 29 | + |
| 30 | +- **Real-time indexing**: Process resource changes within seconds of occurrence |
| 31 | +- **Policy-driven**: Index only resources matching active IndexPolicy |
| 32 | + configurations |
| 33 | +- **Reliable delivery**: Guarantee at-least-once processing of all events |
| 34 | +- **Graceful recovery**: Resume processing from last known position after |
| 35 | + restarts |
| 36 | +- **Minimal resource footprint**: Operate efficiently within constrained |
| 37 | + environments |
| 38 | + |
| 39 | +## Core Responsibilities |
| 40 | + |
| 41 | +The Resource Indexer handles: |
| 42 | + |
| 43 | +- Consuming audit log events from NATS JetStream |
| 44 | +- Watching IndexPolicy resources and evaluating CEL filters |
| 45 | +- Transforming Kubernetes resources into searchable documents |
| 46 | +- Persisting documents to the index backend |
| 47 | +- Acknowledging events only after successful persistence |
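The responsibilities above chain into a per-event decision about when to acknowledge. The following is a minimal sketch of that decision, not the indexer's actual code: `Action`, `Matcher`, and `Decide` are hypothetical names, and the event payload is assumed to be the resource JSON itself, whereas real audit events wrap the object in an envelope.

```go
package main

import "encoding/json"

// Action tells the consumer loop what to do with the JetStream message.
type Action int

const (
	AckImmediately  Action = iota // nothing to index; ack so it is not redelivered
	AckAfterPersist               // ack only once the document has been stored
)

// Matcher reports whether any active policy selects the resource.
// It is a hypothetical stand-in for the cached CEL programs.
type Matcher func(resource map[string]any) bool

// Decide parses one raw event and classifies it.
func Decide(raw []byte, matches Matcher) (map[string]any, Action) {
	var resource map[string]any
	if err := json.Unmarshal(raw, &resource); err != nil || resource == nil {
		// Malformed payload: skip it and ack to avoid a redelivery loop.
		return nil, AckImmediately
	}
	if !matches(resource) {
		// No policy selects this resource: ack without indexing.
		return nil, AckImmediately
	}
	return resource, AckAfterPersist
}
```

A caller would transform and persist the returned resource only when the action is `AckAfterPersist`.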
| 48 | + |
| 49 | +## Event Consumption |
| 50 | + |
| 51 | +The indexer consumes audit log events from NATS JetStream using durable |
| 52 | +consumers. JetStream provides: |
| 53 | + |
| 54 | +- **Delivery guarantees**: At-least-once delivery with configurable ack timeouts |
| 55 | +- **Position tracking**: Durable consumers track acknowledged messages; on |
| 56 | + restart, consumption resumes from the last acknowledged position |
| 57 | +- **Backpressure**: Pull-based consumption allows the indexer to control its |
| 58 | + processing rate |
| 59 | + |
| 60 | +### Horizontal Scaling |
| 61 | + |
| 62 | +The indexer scales horizontally with NATS [queue groups]. When multiple |
| 63 | +instances join the same queue group, messages are distributed across them |
| 64 | +automatically, and each delivery goes to exactly one instance in the group. |
| 65 | + |
| 66 | +``` |
| 67 | + Queue Group: "resource-indexer" |
| 68 | + │ |
| 69 | + ┌───────────────────────┼───────────────────────┐ |
| 70 | + │ │ │ |
| 71 | + ▼ ▼ ▼ |
| 72 | + ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ |
| 73 | + │ Indexer #1 │ │ Indexer #2 │ │ Indexer #3 │ |
| 74 | + └──────────────┘ └──────────────┘ └──────────────┘ |
| 75 | +``` |
| 76 | + |
| 77 | +This lets throughput grow with instance count without cross-instance coordination. |
| 78 | + |
| 79 | +[queue groups]: https://docs.nats.io/nats-concepts/core-nats/queue |
| 80 | + |
| 81 | +## Policy Management |
| 82 | + |
| 83 | +IndexPolicy resources define what to index. The indexer watches these resources |
| 84 | +using a Kubernetes [informer], which provides: |
| 85 | + |
| 86 | +- **List-watch semantics**: Initial list of all policies followed by a watch |
| 87 | + stream for changes |
| 88 | +- **Local cache**: In-memory store for fast lookups during event processing |
| 89 | +- **Automatic resync**: Periodic re-list to correct any drift |
| 90 | + |
| 91 | +Each indexer instance maintains its own policy cache. Since events can be routed |
| 92 | +to any instance (via queue groups), each instance caches all policies. |
| 93 | +IndexPolicy resources are typically small and few in number, so this |
| 94 | +replication is acceptable. |
| 95 | + |
| 96 | +### CEL Compilation |
| 97 | + |
| 98 | +[CEL expressions][CEL] in policies must be compiled before evaluation. To avoid |
| 99 | +recompilation on every event, compile expressions when policies are added or |
| 100 | +updated and cache the compiled programs alongside the policy. |
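The compile-on-update idea can be sketched as a small cache that the informer's event handlers feed. This is an illustration under assumptions: `Program`, `CompileFunc`, and `PolicyCache` are hypothetical names, and the compiler is injected as a function so the sketch carries no CEL dependency. In a real indexer it would wrap a CEL library.

```go
package main

import (
	"fmt"
	"sync"
)

// Program is a placeholder for a compiled CEL program (hypothetical type).
type Program func(resource map[string]any) (bool, error)

// CompileFunc turns an expression string into a Program.
type CompileFunc func(expr string) (Program, error)

type cachedPolicy struct {
	expr    string
	program Program
}

// PolicyCache stores one compiled program per policy name.
type PolicyCache struct {
	mu       sync.RWMutex
	compile  CompileFunc
	policies map[string]cachedPolicy
}

func NewPolicyCache(compile CompileFunc) *PolicyCache {
	return &PolicyCache{compile: compile, policies: make(map[string]cachedPolicy)}
}

// Upsert is meant to be called from the informer's add and update handlers.
// It recompiles only when the expression text has changed, so periodic
// resyncs of unchanged policies cost nothing.
func (c *PolicyCache) Upsert(name, expr string) error {
	c.mu.RLock()
	old, ok := c.policies[name]
	c.mu.RUnlock()
	if ok && old.expr == expr {
		return nil
	}
	prog, err := c.compile(expr)
	if err != nil {
		// The previously cached program, if any, stays in place.
		return fmt.Errorf("compile policy %q: %w", name, err)
	}
	c.mu.Lock()
	c.policies[name] = cachedPolicy{expr: expr, program: prog}
	c.mu.Unlock()
	return nil
}

// Delete is meant to be called from the informer's delete handler.
func (c *PolicyCache) Delete(name string) {
	c.mu.Lock()
	delete(c.policies, name)
	c.mu.Unlock()
}

// Lookup returns the cached program on the event-processing hot path.
func (c *PolicyCache) Lookup(name string) (Program, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	p, ok := c.policies[name]
	return p.program, ok
}
```

Keeping the old program when a new expression fails to compile is one possible choice; rejecting the policy outright would be equally defensible.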
| 101 | + |
| 102 | +The indexer should wait for the informer cache to sync before processing events |
| 103 | +to ensure all active policies are available for matching. |
| 104 | + |
| 105 | +[informer]: https://pkg.go.dev/k8s.io/client-go/tools/cache#SharedInformer |
| 106 | +[CEL]: https://cel.dev |
| 107 | + |
| 108 | +## Document Transformation |
| 109 | + |
| 110 | +When an event matches a policy, the indexer transforms the Kubernetes resource |
| 111 | +into a searchable document: |
| 112 | + |
| 113 | +- Extract fields specified in the IndexPolicy field mappings |
| 114 | +- Normalize metadata (labels, annotations) into searchable formats |
| 115 | +- Use the resource's UID as the document identifier |
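The three steps above could look roughly like the sketch below. It assumes the resource has already been decoded into nested maps and that field mappings are expressed as dotted paths. Both are assumptions for illustration, as are the names `Document`, `lookupPath`, `normalizeLabels`, and `Transform`, and the `key=value` label format.

```go
package main

import (
	"sort"
	"strings"
)

// Document is the flat record sent to the index backend (hypothetical shape).
type Document map[string]any

// lookupPath walks a dotted path such as "spec.displayName" through nested maps.
func lookupPath(obj map[string]any, path string) (any, bool) {
	var cur any = obj
	for _, key := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		cur, ok = m[key]
		if !ok {
			return nil, false
		}
	}
	return cur, true
}

// normalizeLabels turns a label map into sorted "key=value" strings so the
// labels can be indexed as a filterable list.
func normalizeLabels(raw any) []string {
	m, ok := raw.(map[string]any)
	if !ok {
		return nil
	}
	out := make([]string, 0, len(m))
	for k, v := range m {
		if s, ok := v.(string); ok {
			out = append(out, k+"="+s)
		}
	}
	sort.Strings(out)
	return out
}

// Transform builds a Document from a decoded resource and a policy's field
// mappings (document field name -> dotted source path). It returns false
// when the resource has no UID to use as the document identifier.
func Transform(resource map[string]any, fields map[string]string) (Document, bool) {
	rawUID, _ := lookupPath(resource, "metadata.uid")
	uid, _ := rawUID.(string)
	if uid == "" {
		return nil, false
	}
	doc := Document{"id": uid}
	for name, path := range fields {
		if v, ok := lookupPath(resource, path); ok {
			doc[name] = v
		}
	}
	if labels, ok := lookupPath(resource, "metadata.labels"); ok {
		doc["labels"] = normalizeLabels(labels)
	}
	return doc, true
}
```

Fields whose source path is absent are simply omitted from the document rather than written as empty values.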
| 116 | + |
| 117 | +## Persistence and Acknowledgment |
| 118 | + |
| 119 | +Documents are persisted to the index backend (Meilisearch). To guarantee |
| 120 | +at-least-once delivery, events are only acknowledged after successful |
| 121 | +persistence. |
| 122 | + |
| 123 | +### Batching |
| 124 | + |
| 125 | +For efficiency, batch multiple documents into a single write request. When a |
| 126 | +batch is flushed: |
| 127 | + |
| 128 | +1. Persist all documents to the index backend |
| 129 | +2. On success, acknowledge all events in the batch |
| 130 | +3. On failure, acknowledge nothing; JetStream redelivers after the ack timeout |
| 131 | + |
| 132 | +Events that don't match any policy should be acknowledged immediately to prevent |
| 133 | +reprocessing. |
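The persist-then-acknowledge ordering can be sketched as follows. `Message` and `Backend` are hypothetical interfaces that expose only what the sketch needs; they are not the NATS or Meilisearch client APIs. `AckFunc`, `UpsertFunc`, and `ErrBackendUnavailable` exist only to make the example easy to exercise.

```go
package main

import "errors"

// ErrBackendUnavailable is an example error a Backend may return.
var ErrBackendUnavailable = errors.New("index backend unavailable")

// Message is the minimal view of a JetStream message used here.
type Message interface {
	Ack() error
}

// Backend persists a batch of documents in a single request.
type Backend interface {
	Upsert(docs []map[string]any) error
}

// AckFunc and UpsertFunc adapt plain functions to the interfaces above.
type AckFunc func() error

func (f AckFunc) Ack() error { return f() }

type UpsertFunc func(docs []map[string]any) error

func (f UpsertFunc) Upsert(docs []map[string]any) error { return f(docs) }

type pending struct {
	msg Message
	doc map[string]any
}

// Batcher collects matched events until maxSize is reached.
type Batcher struct {
	backend Backend
	maxSize int
	items   []pending
}

func NewBatcher(backend Backend, maxSize int) *Batcher {
	if maxSize < 1 {
		maxSize = 1
	}
	return &Batcher{backend: backend, maxSize: maxSize}
}

// Add queues one event and flushes when the batch is full.
func (b *Batcher) Add(msg Message, doc map[string]any) error {
	b.items = append(b.items, pending{msg: msg, doc: doc})
	if len(b.items) >= b.maxSize {
		return b.Flush()
	}
	return nil
}

// Flush persists first and acknowledges second. If persistence fails, no
// message is acked and the batch is dropped locally; redelivery after the
// ack timeout brings the events back.
func (b *Batcher) Flush() error {
	if len(b.items) == 0 {
		return nil
	}
	items := b.items
	b.items = nil
	docs := make([]map[string]any, len(items))
	for i, p := range items {
		docs[i] = p.doc
	}
	if err := b.backend.Upsert(docs); err != nil {
		return err
	}
	var firstErr error
	for _, p := range items {
		if err := p.msg.Ack(); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}
```

A production version would also flush on a timer so that a partially filled batch does not wait past the ack timeout; that is left out here for brevity.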
| 134 | + |
| 135 | +### Duplicate Handling |
| 136 | + |
| 137 | +At-least-once delivery means duplicates are possible (e.g., after a failure |
| 138 | +before acknowledgment). The index backend absorbs them through upserts keyed by |
| 139 | +document ID: reindexing the same resource overwrites the existing document. |
| 140 | + |
| 141 | +## Bootstrap Process |
| 142 | + |
| 143 | +On startup or when a new IndexPolicy is created, the indexer must populate the |
| 144 | +index with existing resources. The platform spans multiple project control |
| 145 | +planes, so bootstrap must list resources from each cluster. |
| 146 | + |
| 147 | +### Multi-Cluster Bootstrap |
| 148 | + |
| 149 | +The indexer uses the [multicluster-runtime] provider pattern to discover |
| 150 | +project control planes. For each discovered cluster: |
| 151 | + |
| 152 | +1. List resources matching the policy selector from that cluster's API |
| 153 | +2. Transform and index each resource |
| 154 | +3. Handle concurrent modifications during bootstrap gracefully |
| 155 | + |
| 156 | +The provider handles dynamic cluster discovery: as clusters come online or go |
| 157 | +offline, the indexer bootstraps or cleans up accordingly. |
| 158 | + |
| 159 | +After bootstrap completes, real-time indexing continues via the JetStream event |
| 160 | +stream, which already aggregates events from all control planes. |
| 161 | + |
| 162 | +[multicluster-runtime]: https://github.com/kubernetes-sigs/multicluster-runtime |
| 163 | + |
| 164 | +## Error Handling |
| 165 | + |
| 166 | +- **Transient failures**: Retry with exponential backoff for network errors and |
| 167 | + temporary unavailability |
| 168 | +- **Malformed events**: Log and skip events that cannot be parsed; acknowledge |
| 169 | + to prevent redelivery loops |
| 170 | +- **Backend unavailability**: Buffer events in memory (bounded) while attempting |
| 171 | + reconnection; pause consumption if the buffer fills |
| 172 | +- **Policy evaluation errors**: Log and skip events with CEL evaluation |
| 173 | + failures; do not block processing of other events |
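The retry rule for transient failures can be sketched as a capped exponential backoff. This is one possible shape, not the indexer's implementation: `ErrTransient`, `backoffDelay`, `Retry`, and `noSleep` are hypothetical names, errors are assumed to be classified by wrapping a sentinel, and jitter is omitted to keep the example deterministic.

```go
package main

import (
	"errors"
	"time"
)

// ErrTransient marks failures worth retrying, such as network errors.
// Callers are assumed to wrap it with fmt.Errorf("...: %w", ErrTransient).
var ErrTransient = errors.New("transient failure")

// backoffDelay returns base * 2^attempt, capped at maxDelay.
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

// Retry runs op up to attempts times, sleeping between tries. It stops
// early on success or on any error that is not marked transient.
func Retry(attempts int, base, maxDelay time.Duration, sleep func(time.Duration), op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = op()
		if err == nil || !errors.Is(err, ErrTransient) {
			return err
		}
		if i < attempts-1 {
			sleep(backoffDelay(i, base, maxDelay))
		}
	}
	return err
}

// noSleep skips waiting; it is handy when exercising Retry in tests.
func noSleep(time.Duration) {}
```

In production the `sleep` argument would be `time.Sleep`, and adding random jitter to each delay helps keep multiple instances from retrying in lockstep.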
| 174 | + |
| 175 | +## Integration Points |
| 176 | + |
| 177 | +| System | Protocol | Purpose | |
| 178 | +|--------|----------|---------| |
| 179 | +| NATS JetStream | NATS | Consume audit log events (aggregated from all clusters) | |
| 180 | +| Search API Server | HTTPS | Watch IndexPolicy resources | |
| 181 | +| Project Control Planes | HTTPS | Bootstrap existing resources | |
| 182 | +| Meilisearch | HTTPS/JSON | Persist indexed documents | |
| 183 | + |
| 184 | +## Future Considerations |
| 185 | + |
| 186 | +- **Control plane deletion**: When a project control plane is deleted, indexed |
| 187 | + resources from that cluster must be cleaned up. Ideally, the platform emits |
| 188 | + deletion events for all resources before the control plane is removed, |
| 189 | + allowing event-driven cleanup. If this isn't guaranteed, the indexer may need |
| 190 | + to track source cluster metadata and delete documents when a cluster is |
| 191 | + disengaged. |
| 192 | +- **Dead letter handling**: Route persistently failing events to a dead letter |
| 193 | + queue for manual inspection |
| 194 | +- **Metrics and observability**: Expose indexing lag, throughput, and error |
| 195 | + rates via Prometheus |
| 196 | +- **Multi-tenancy**: Support tenant-isolated indexes with policy-based routing |
| 197 | +- **Policy-based sharding**: For very large deployments, assign subsets of |
| 198 | + policies to instances using consistent hashing |
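As a rough illustration of the policy-based sharding idea, the sketch below builds a consistent-hash ring with virtual nodes and maps each policy name to one instance. Everything here is hypothetical (`Ring`, `NewRing`, `Owner`, the FNV-1a hash, the replica count); it only shows the technique, not a planned design.

```go
package main

import (
	"hash/fnv"
	"sort"
	"strconv"
)

// Ring maps policy names to instances using consistent hashing.
type Ring struct {
	hashes []uint32          // sorted positions of virtual nodes
	owner  map[uint32]string // position -> instance name
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places `replicas` virtual nodes per instance on the ring.
// Instances are sorted first so every process builds the same ring.
func NewRing(instances []string, replicas int) *Ring {
	sorted := append([]string(nil), instances...)
	sort.Strings(sorted)
	r := &Ring{owner: make(map[uint32]string)}
	for _, inst := range sorted {
		for i := 0; i < replicas; i++ {
			h := hashKey(inst + "#" + strconv.Itoa(i))
			if _, taken := r.owner[h]; taken {
				continue // rare hash collision: keep the first owner
			}
			r.owner[h] = inst
			r.hashes = append(r.hashes, h)
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// Owner returns the instance responsible for a policy: the first virtual
// node at or after the policy's hash, wrapping around the ring.
func (r *Ring) Owner(policy string) string {
	if len(r.hashes) == 0 {
		return ""
	}
	h := hashKey(policy)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0
	}
	return r.owner[r.hashes[i]]
}
```

The property that makes this attractive is that removing an instance reassigns only the policies that instance owned; all other assignments stay put.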