Commit bac97bd

feat: define resource indexer architecture
Creates a new architecture document that goes into detail on the design of the indexing service.
1 parent 557854a commit bac97bd

5 files changed, +310 -0 lines changed

README.md

Lines changed: 7 additions & 0 deletions
@@ -8,3 +8,10 @@ using CEL-based filtering. The service integrates natively with kubectl/RBAC and
targets Meilisearch as the search backend.

![](./docs/diagrams/SearchServiceContext.png)

## Documentation

- [Architecture](./docs/architecture.md): High-level design and component
  overview
- [Resource Indexer](./docs/components/resource-indexer.md): Detailed design
  for the indexing component

docs/architecture.md

Lines changed: 3 additions & 0 deletions
@@ -52,6 +52,9 @@ using powerful indexing and real-time event processing.
- Manage index lifecycle (creation, updates, deletion)
- Bootstrap indexes from existing state

See the [Resource Indexer Architecture](./components/resource-indexer.md) for
detailed design documentation.

### Controller Manager

**Purpose**: Manages and validates resources for the search service

docs/components/resource-indexer.md

Lines changed: 276 additions & 0 deletions
@@ -0,0 +1,276 @@
<!-- omit from toc -->
# Resource Indexer Architecture

- [Overview](#overview)
- [Design Goals](#design-goals)
- [Core Responsibilities](#core-responsibilities)
  - [Event Processing Flow](#event-processing-flow)
- [Event Consumption](#event-consumption)
  - [Horizontal Scaling](#horizontal-scaling)
- [Policy Management](#policy-management)
  - [CEL Compilation](#cel-compilation)
- [Document Lifecycle](#document-lifecycle)
  - [Transformation](#transformation)
- [Persistence and Acknowledgment](#persistence-and-acknowledgment)
  - [Batching](#batching)
  - [Duplicate Handling](#duplicate-handling)
- [Bootstrap Process](#bootstrap-process)
  - [Multi-Cluster Bootstrap](#multi-cluster-bootstrap)
- [Error Handling](#error-handling)
- [Integration Points](#integration-points)
- [Future Considerations](#future-considerations)

## Overview

The Resource Indexer is a core component of the Search service responsible for
maintaining a searchable index of platform resources. It consumes audit log
events from NATS JetStream, applies policy-based filtering, and writes indexed
documents to the search backend.

## Design Goals

- **Real-time indexing**: Process resource changes within seconds of occurrence
- **Policy-driven**: Index only resources matching active IndexPolicy
  configurations
- **Reliable delivery**: Guarantee at-least-once processing of all events
- **Graceful recovery**: Resume processing from last known position after
  restarts
- **Horizontal scalability**: Scale throughput by adding instances without
  coordination
- **Minimal resource footprint**: Operate efficiently within constrained
  environments

## Core Responsibilities

The Resource Indexer handles:

- Consuming audit log events from NATS JetStream
- Watching IndexPolicy resources and evaluating CEL filters
- Transforming Kubernetes resources into searchable documents
- Persisting documents to the index backend
- Acknowledging events only after successful persistence

### Event Processing Flow

The following diagram illustrates how the indexer processes events, including
policy matching, batching, and acknowledgment handling:

```mermaid
sequenceDiagram
    participant JS as NATS JetStream
    participant Indexer as Resource Indexer
    participant Cache as Policy Cache
    participant Meili as Meilisearch

    rect rgb(240, 248, 255)
        note right of JS: Create/Update Matches Policy
        JS->>Indexer: Deliver audit event
        Indexer->>Cache: Evaluate policies
        Cache-->>Indexer: Policy match + compiled CEL
        Indexer->>Indexer: Evaluate CEL filter
        Indexer->>Indexer: Transform resource to document
        Indexer->>Indexer: Add upsert to batch

        alt Batch ready (size or time threshold)
            Indexer->>Meili: Persist document batch
            Meili-->>Indexer: Success
            Indexer->>JS: Ack all events in batch
        end
    end

    rect rgb(255, 248, 240)
        note right of JS: Update No Longer Matches Policy
        JS->>Indexer: Deliver audit event (update)
        Indexer->>Cache: Evaluate policies
        Cache-->>Indexer: No matching policy
        Indexer->>Indexer: Add delete to batch
        Indexer->>Meili: Delete document by UID
        Meili-->>Indexer: Success
        Indexer->>JS: Ack
    end

    rect rgb(255, 245, 238)
        note right of JS: Resource Deleted
        JS->>Indexer: Deliver audit event (delete)
        Indexer->>Indexer: Add delete to batch
        Indexer->>Meili: Delete document by UID
        Meili-->>Indexer: Success
        Indexer->>JS: Ack
    end

    rect rgb(255, 240, 240)
        note right of JS: Persistence Failure
        JS->>Indexer: Deliver audit event
        Indexer->>Indexer: Transform and batch
        Indexer->>Meili: Persist document batch
        Meili-->>Indexer: Error
        note right of Indexer: Do not ack. JetStream<br/>redelivers after timeout
    end
```

## Event Consumption

The indexer consumes audit log events from [NATS JetStream][jetstream] using
[durable consumers][durable-consumers]. JetStream provides:

- **Delivery guarantees**: At-least-once delivery with configurable ack timeouts
- **Position tracking**: Durable consumers track acknowledged messages; on
  restart, consumption resumes from the last acknowledged position
- **Backpressure**: Pull-based consumption allows the indexer to control its
  processing rate

[jetstream]: https://docs.nats.io/nats-concepts/jetstream
[durable-consumers]: https://docs.nats.io/nats-concepts/jetstream/consumers#durable-consumers

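The position-tracking and backpressure properties above can be illustrated with a toy in-memory model. This is only a sketch: `DurableConsumer`, `Fetch`, and `Ack` are hypothetical names chosen for the example and are not the NATS client API. The model ignores ack timeouts and simply treats every unacknowledged message as eligible for delivery again.

```go
package main

// Message is a stand-in for a JetStream message; Seq is its stream sequence.
type Message struct {
	Seq  uint64
	Data string
}

// DurableConsumer is a toy, in-memory model of a durable pull consumer.
// It is not the NATS client API; it only mimics the ack bookkeeping.
type DurableConsumer struct {
	stream []Message
	acked  map[uint64]bool
}

// NewDurableConsumer creates a consumer over a fixed slice of messages.
func NewDurableConsumer(stream []Message) *DurableConsumer {
	return &DurableConsumer{stream: stream, acked: make(map[uint64]bool)}
}

// Fetch returns up to n unacknowledged messages in stream order. Because the
// caller chooses n and when to call, it controls its own processing rate.
func (c *DurableConsumer) Fetch(n int) []Message {
	var out []Message
	for _, m := range c.stream {
		if len(out) == n {
			break
		}
		if !c.acked[m.Seq] {
			out = append(out, m)
		}
	}
	return out
}

// Ack records that a message was fully processed and must not be delivered again.
func (c *DurableConsumer) Ack(seq uint64) { c.acked[seq] = true }
```

Because the acknowledgment state lives with the consumer rather than with the indexer process, an indexer that restarts after acknowledging only part of a batch sees the unacknowledged remainder again on its next `Fetch`.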
### Horizontal Scaling

The indexer uses JetStream [queue groups] for horizontal scaling. When multiple
instances join the same queue group, JetStream distributes messages across them
automatically, so each message is delivered to only one instance at a time. A
message that is not acknowledged before the ack timeout may be redelivered to a
different instance.

<p align="center">
  <img src="../diagrams/ResourceIndexerScaling.png" alt="Resource Indexer horizontal scaling diagram">
</p>

This lets throughput scale with the number of instances without coordination
between them.

[queue groups]: https://docs.nats.io/nats-concepts/core-nats/queue

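The distribution behavior described in this section resembles the competing-consumers pattern. The sketch below models it with goroutines reading from one shared Go channel in place of a queue group; `Distribute` is a hypothetical helper, and the split between workers is decided by the Go scheduler, not by any NATS algorithm.

```go
package main

import "sync"

// Distribute fans events out to `workers` goroutines that all read from one
// shared channel, and returns how many events each worker handled. Every
// event is received by exactly one worker. workers must be at least 1.
func Distribute(events []string, workers int) []int {
	queue := make(chan string)
	counts := make([]int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for range queue {
				counts[id]++ // only goroutine id writes counts[id]
			}
		}(w)
	}
	for _, e := range events {
		queue <- e
	}
	close(queue)
	wg.Wait()
	return counts
}
```

The workers never talk to each other; the shared queue is the only coordination point, which is the property the design relies on.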
## Policy Management

IndexPolicy resources define what to index. The indexer watches these resources
using a Kubernetes [informer], which provides:

- **List-watch semantics**: Initial list of all policies followed by a watch
  stream for changes
- **Local cache**: In-memory store for fast lookups during event processing
- **Automatic resync**: Periodic re-list to correct any drift

Each indexer instance maintains its own policy cache. Since events can be routed
to any instance (via queue groups), each instance caches all policies.
IndexPolicy resources are typically small and few in number, so this
replication is acceptable.

### CEL Compilation

[CEL expressions][CEL] in policies must be compiled before evaluation. To avoid
recompilation on every event, compile expressions when policies are added or
updated and cache the compiled programs alongside the policy.

The indexer should wait for the informer cache to sync before processing events
to ensure all active policies are available for matching.

[informer]: https://pkg.go.dev/k8s.io/client-go/tools/cache#SharedInformer
[CEL]: https://cel.dev

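A minimal sketch of the compile-once caching idea follows. To stay dependency-free it does not use a real CEL library: `compile` is a stand-in that accepts only `key=value` filters, and `PolicyCache`, `Upsert`, `Delete`, and `Match` are hypothetical names. In the indexer, `Upsert` and `Delete` would be driven by the informer's add, update, and delete notifications.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Program is a compiled filter. In the real indexer this would be a compiled
// CEL program; here it is a plain function so the sketch needs no dependencies.
type Program func(labels map[string]string) bool

// compile is a stand-in for CEL compilation. It accepts only "key=value".
func compile(expr string) (Program, error) {
	key, value, ok := strings.Cut(expr, "=")
	if !ok || key == "" {
		return nil, fmt.Errorf("invalid filter %q", expr)
	}
	return func(labels map[string]string) bool { return labels[key] == value }, nil
}

// PolicyCache stores one compiled program per policy name.
type PolicyCache struct {
	mu       sync.RWMutex
	programs map[string]Program
	compiles int // successful compilations, kept only for illustration
}

// NewPolicyCache returns an empty cache.
func NewPolicyCache() *PolicyCache {
	return &PolicyCache{programs: make(map[string]Program)}
}

// Upsert compiles the expression once, when the policy is added or updated,
// so that event processing never has to compile.
func (c *PolicyCache) Upsert(name, expr string) error {
	prog, err := compile(expr)
	if err != nil {
		return err // any previously cached program stays in place
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.programs[name] = prog
	c.compiles++
	return nil
}

// Delete drops the cached program for a removed policy.
func (c *PolicyCache) Delete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.programs, name)
}

// Match evaluates the cached programs and returns the names of the policies
// that match, in no particular order.
func (c *PolicyCache) Match(labels map[string]string) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var matched []string
	for name, prog := range c.programs {
		if prog(labels) {
			matched = append(matched, name)
		}
	}
	return matched
}
```

Keeping the old program when a new expression fails to compile is one possible choice; rejecting the policy outright would be another, and the design above does not prescribe either.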
## Document Lifecycle

The indexer manages documents in the search index based on audit events:

| Event Type | Policy Match | Action |
|------------|--------------|--------|
| Create | Yes | Upsert document |
| Update | Yes | Upsert document |
| Update | No | Delete document (was previously indexed) |
| Delete | Any | Delete document |

When a resource is updated and no longer matches any policy (e.g., labels
changed, CEL filter no longer passes), the indexer deletes the document from the
index. This ensures the index doesn't accumulate stale documents for resources
that no longer meet indexing criteria.

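The lifecycle table can be read as a small decision function. The sketch below is one way to express it; the type and constant names are hypothetical. The table has no row for a create event that matches no policy, so the `Skip` result for that case is an assumption based on the Batching section, which says non-matching events are acknowledged without a write.

```go
package main

// EventType is the kind of change reported by an audit event.
type EventType int

const (
	Create EventType = iota
	Update
	Delete
)

// Action is what the indexer does to the search index.
type Action int

const (
	Skip   Action = iota // nothing to write; acknowledge the event
	Upsert               // add or replace the document
	Remove               // delete the document by UID
)

// Decide maps an audit event type and policy match result to an index action.
func Decide(t EventType, matchesPolicy bool) Action {
	switch {
	case t == Delete:
		return Remove // deletes apply regardless of policy match
	case matchesPolicy:
		return Upsert // create or update that matches a policy
	case t == Update:
		return Remove // may have been indexed before the update
	default:
		return Skip // create that matches no policy (assumed behavior)
	}
}
```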
### Transformation

When an event matches a policy, the indexer transforms the Kubernetes resource
into a searchable document:

- Extract fields specified in the IndexPolicy field mappings
- Normalize metadata (labels, annotations) into searchable formats
- Use the resource's UID as the document identifier

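As a rough illustration of these three steps, the sketch below transforms an unstructured resource (nested maps, as decoded from JSON) into a flat document. The shape of the field mappings (document field name to dotted path), the `id` and `labels` field names, and the `key=value` label format are all assumptions made for the example; the IndexPolicy schema is not defined in this document.

```go
package main

import (
	"sort"
	"strings"
)

// Document is a hypothetical flat document shape for the search index.
type Document map[string]any

// lookup walks a dotted path such as "spec.replicas" through nested maps.
func lookup(obj map[string]any, path string) (any, bool) {
	var cur any = obj
	for _, key := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		if cur, ok = m[key]; !ok {
			return nil, false
		}
	}
	return cur, true
}

// Transform builds a document from a resource. fields maps document field
// names to dotted paths in the resource. It reports false when the resource
// has no UID, because the UID is the document identifier.
func Transform(resource map[string]any, fields map[string]string) (Document, bool) {
	uid, ok := lookup(resource, "metadata.uid")
	if !ok {
		return nil, false
	}
	doc := Document{"id": uid}
	for name, path := range fields {
		if v, found := lookup(resource, path); found {
			doc[name] = v
		}
	}
	// Normalize labels into a sorted list of "key=value" strings.
	if raw, found := lookup(resource, "metadata.labels"); found {
		if labels, isMap := raw.(map[string]any); isMap {
			pairs := make([]string, 0, len(labels))
			for k, v := range labels {
				if s, isStr := v.(string); isStr {
					pairs = append(pairs, k+"="+s)
				}
			}
			sort.Strings(pairs)
			doc["labels"] = pairs
		}
	}
	return doc, true
}
```

Fields whose path is missing from the resource are simply left out of the document, which keeps the transformation tolerant of optional fields.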
## Persistence and Acknowledgment

Documents are persisted to the index backend ([Meilisearch]). To guarantee
at-least-once delivery, events are only acknowledged after successful
persistence.

[Meilisearch]: https://www.meilisearch.com/docs

### Batching

For efficiency, batch multiple documents into a single write request. When a
batch is ready (it reaches its size or time threshold):

1. Persist all documents to the index backend
2. On success, acknowledge all events in the batch
3. On failure, do not acknowledge any of them; JetStream redelivers after the
   ack timeout

Events that require no index change (for example, a create event that matches
no policy) should be acknowledged immediately to prevent redelivery.

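The persist-then-acknowledge ordering can be sketched as follows. `Batcher`, `Event`, and `ErrBackendUnavailable` are hypothetical names, and `Persist` stands in for the backend write. Only the size threshold is implemented inside `Add`; the time threshold is assumed to be a timer elsewhere that calls `Flush`. The type is not safe for concurrent use as written.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBackendUnavailable is a hypothetical sentinel a Persist function may return.
var ErrBackendUnavailable = errors.New("backend unavailable")

// Event pairs a document with the callback that acknowledges its source message.
type Event struct {
	DocID string
	Ack   func()
}

// Batcher accumulates events and writes them to the backend as one request.
type Batcher struct {
	MaxSize int
	Persist func(docIDs []string) error // stand-in for the backend write
	pending []Event
}

// Add queues an event and flushes once MaxSize events are pending.
func (b *Batcher) Add(e Event) error {
	b.pending = append(b.pending, e)
	if len(b.pending) >= b.MaxSize {
		return b.Flush()
	}
	return nil
}

// Flush persists the pending batch, then acknowledges every event in it.
// On failure nothing is acknowledged, so the stream redelivers the events.
func (b *Batcher) Flush() error {
	if len(b.pending) == 0 {
		return nil
	}
	batch := b.pending
	b.pending = nil // dropped either way; unacked events return via redelivery
	ids := make([]string, len(batch))
	for i, e := range batch {
		ids[i] = e.DocID
	}
	if err := b.Persist(ids); err != nil {
		return fmt.Errorf("persist batch of %d: %w", len(ids), err)
	}
	for _, e := range batch {
		e.Ack()
	}
	return nil
}
```

Dropping the failed batch from memory instead of retrying it locally is a deliberate simplification here: the unacknowledged events are still held by the stream, so redelivery acts as the retry path.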
### Duplicate Handling

At-least-once delivery means duplicates are possible (e.g., after a failure
before acknowledgment). The index backend handles this via [document primary
keys][meilisearch-primary-key]: reindexing the same resource overwrites the
existing document.

[meilisearch-primary-key]: https://www.meilisearch.com/docs/learn/core_concepts/primary_key

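The reason duplicates are harmless is that writes keyed by a primary key are idempotent. The sketch below shows the idea with an in-memory map standing in for the index; `Index` and its methods are hypothetical and are not the Meilisearch client API.

```go
package main

// Index is an in-memory stand-in for a search index keyed by primary key.
type Index struct {
	docs map[string]map[string]any
}

// NewIndex returns an empty index.
func NewIndex() *Index { return &Index{docs: make(map[string]map[string]any)} }

// Upsert adds the document, or replaces the one already stored under id.
func (ix *Index) Upsert(id string, doc map[string]any) { ix.docs[id] = doc }

// Delete removes the document; deleting a missing id is a no-op.
func (ix *Index) Delete(id string) { delete(ix.docs, id) }

// Len reports how many documents the index holds.
func (ix *Index) Len() int { return len(ix.docs) }
```

Applying the same upsert or the same delete twice leaves the index in the same state as applying it once, so a redelivered event cannot create a second copy of a document.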
## Bootstrap Process

On startup or when a new IndexPolicy is created, the indexer must populate the
index with existing resources. The platform spans multiple project control
planes, so bootstrap must list resources from each cluster.

### Multi-Cluster Bootstrap

The indexer uses the [multicluster-runtime] provider pattern to discover
project control planes. For each discovered cluster:

1. List resources matching the policy selector from that cluster's API
2. Transform and index each resource
3. Handle concurrent modifications during bootstrap gracefully

The provider handles dynamic cluster discovery: as clusters come online or go
offline, the indexer bootstraps or cleans up accordingly.

After bootstrap completes, real-time indexing continues via the JetStream event
stream, which already aggregates events from all control planes.

[multicluster-runtime]: https://github.com/kubernetes-sigs/multicluster-runtime

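The per-cluster loop in steps 1 and 2 might look roughly like the sketch below. It does not use multicluster-runtime: the `Cluster` interface, `StaticCluster`, `ErrUnreachable`, and `Bootstrap` are hypothetical names introduced for the example, and cluster discovery and the handling of concurrent modifications (step 3) are out of scope. Skipping a failing cluster instead of aborting is an assumption, not something the design states.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnreachable is a hypothetical sentinel for a cluster that cannot be listed.
var ErrUnreachable = errors.New("cluster unreachable")

// Cluster is a hypothetical handle to one project control plane.
type Cluster interface {
	Name() string
	List() ([]map[string]any, error) // resources matching the policy selector
}

// StaticCluster is a fixed, in-memory Cluster used for illustration.
type StaticCluster struct {
	ClusterName string
	Resources   []map[string]any
	Err         error
}

func (s StaticCluster) Name() string                    { return s.ClusterName }
func (s StaticCluster) List() ([]map[string]any, error) { return s.Resources, s.Err }

// Bootstrap lists resources from every cluster and hands each one to index.
// A failing cluster is recorded and skipped so the others still bootstrap.
func Bootstrap(clusters []Cluster, index func(cluster string, resource map[string]any) error) (indexed int, errs []error) {
	for _, c := range clusters {
		resources, err := c.List()
		if err != nil {
			errs = append(errs, fmt.Errorf("list %s: %w", c.Name(), err))
			continue
		}
		for _, r := range resources {
			if err := index(c.Name(), r); err != nil {
				errs = append(errs, fmt.Errorf("index %s: %w", c.Name(), err))
				continue
			}
			indexed++
		}
	}
	return indexed, errs
}
```

Because indexing is an upsert keyed by UID, running bootstrap while live events are also being processed does not produce duplicate documents, although it does not by itself decide which of two concurrent writes is newer.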
## Error Handling

- **Transient failures**: Retry with exponential backoff for network errors and
  temporary unavailability
- **Malformed events**: Log and skip events that cannot be parsed; acknowledge
  to prevent redelivery loops
- **Backend unavailability**: Buffer events in memory (bounded) while attempting
  reconnection; pause consumption if buffer fills
- **Policy evaluation errors**: Log and skip events with CEL evaluation
  failures; do not block processing of other events

## Integration Points

| System | Protocol | Purpose |
|--------|----------|---------|
| [NATS JetStream][jetstream] | NATS | Consume audit log events (aggregated from all clusters) |
| Search API Server | HTTPS | Watch IndexPolicy resources |
| Project Control Planes | HTTPS | Bootstrap existing resources |
| [Meilisearch] | HTTPS/JSON | Persist indexed documents |

## Future Considerations

- **Control plane deletion**: When a project control plane is deleted, indexed
  resources from that cluster must be cleaned up. Ideally, the platform emits
  deletion events for all resources before the control plane is removed,
  allowing event-driven cleanup. If this isn't guaranteed, the indexer may need
  to track source cluster metadata and delete documents when a cluster is
  disengaged.
- **Dead letter handling**: Route persistently failing events to a dead letter
  queue for manual inspection
- **Metrics and observability**: Expose indexing lag, throughput, and error
  rates via Prometheus
- **Multi-tenancy**: Support tenant-isolated indexes with policy-based routing
- **Policy-based sharding**: For very large deployments, assign subsets of
  policies to instances using consistent hashing
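
To make the policy-based sharding idea concrete, the sketch below assigns each policy to one instance. It uses rendezvous (highest-random-weight) hashing, which is a different technique from a ring-based consistent hash but gives the same key property: removing an instance only reassigns the policies that instance owned. `Owner` and `score` are hypothetical names, and nothing here reflects a decided design.

```go
package main

import "hash/fnv"

// score hashes an (instance, policy) pair to a pseudo-random weight.
func score(instance, policy string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(instance))
	h.Write([]byte{0}) // separator so ("ab", "c") differs from ("a", "bc")
	h.Write([]byte(policy))
	return h.Sum64()
}

// Owner returns the instance with the highest score for the policy, or ""
// when there are no instances.
func Owner(policy string, instances []string) string {
	var best string
	var bestScore uint64
	for i, inst := range instances {
		if s := score(inst, policy); i == 0 || s > bestScore {
			best, bestScore = inst, s
		}
	}
	return best
}
```

Every instance can compute `Owner` locally from the same instance list, so no coordination service is needed beyond agreeing on membership.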
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
@startuml ResourceIndexerScaling
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Container.puml

System_Ext(jetstream, "NATS JetStream", "Audit log event stream aggregated from all control planes")

System_Boundary(indexerGroup, "Queue Group: resource-indexer") {
    Container(indexer1, "Indexer #1", "Go", "Processes subset of events")
    Container(indexer2, "Indexer #2", "Go", "Processes subset of events")
    Container(indexer3, "Indexer #3", "Go", "Processes subset of events")
}

System_Ext(meilisearch, "Meilisearch", "Search index backend")

Rel_D(jetstream, indexer1, "Delivers event", "NATS")
Rel_D(jetstream, indexer2, "Delivers event", "NATS")
Rel_D(jetstream, indexer3, "Delivers event", "NATS")

Rel_D(indexer1, meilisearch, "Writes documents", "HTTPS")
Rel_D(indexer2, meilisearch, "Writes documents", "HTTPS")
Rel_D(indexer3, meilisearch, "Writes documents", "HTTPS")

SHOW_LEGEND()

@enduml
