Commit b0bc320

feat: define resource indexer architecture
Creates a new architecture document that goes into detail on the design of the indexing service.
1 parent 557854a commit b0bc320

5 files changed

+354
-1
lines changed


README.md

Lines changed: 7 additions & 0 deletions
@@ -8,3 +8,10 @@ using CEL-based filtering. The service integrates natively with kubectl/RBAC and
88
targets Meilisearch as the search backend.
99

1010
![](./docs/diagrams/SearchServiceContext.png)
11+
12+
## Documentation
13+
14+
- [Architecture](./docs/architecture.md) — High-level design and component
15+
overview
16+
- [Resource Indexer](./docs/components/resource-indexer.md) — Detailed design
17+
for the indexing component

docs/architecture.md

Lines changed: 4 additions & 1 deletion
@@ -48,10 +48,13 @@ using powerful indexing and real-time event processing.
4848
- Filter events based on active index policies
4949
- Evaluate [CEL expressions][CEL] for conditional indexing
5050
- Extract and transform resource data for indexing
51-
- Write to index backend with proper error handling and retries
51+
- Manage documents in index backend with proper error handling and retries
5252
- Manage index lifecycle (creation, updates, deletion)
5353
- Bootstrap indexes from existing state
5454

55+
See the [Resource Indexer Architecture](./components/resource-indexer.md) for
56+
detailed design documentation.
57+
5558
### Controller Manager
5659

5760
**Purpose**: Manages and validates resources for the search service

docs/components/resource-indexer.md

Lines changed: 312 additions & 0 deletions
@@ -0,0 +1,312 @@
1+
<!-- omit from toc -->
2+
# Resource Indexer Architecture
3+
4+
- [Overview](#overview)
5+
- [Design Goals](#design-goals)
6+
- [Core Responsibilities](#core-responsibilities)
7+
- [Event Processing Flow](#event-processing-flow)
8+
- [Event Consumption](#event-consumption)
9+
- [Horizontal Scaling](#horizontal-scaling)
10+
- [Kubernetes Configuration](#kubernetes-configuration)
11+
- [Policy Management](#policy-management)
12+
- [CEL Compilation](#cel-compilation)
13+
- [Document Lifecycle](#document-lifecycle)
14+
- [Transformation](#transformation)
15+
- [Persistence and Acknowledgment](#persistence-and-acknowledgment)
16+
- [Batching](#batching)
17+
- [Duplicate Handling](#duplicate-handling)
18+
- [Bootstrap Process](#bootstrap-process)
19+
- [Multi-Cluster Bootstrap](#multi-cluster-bootstrap)
20+
- [Error Handling](#error-handling)
21+
- [Integration Points](#integration-points)
22+
- [Future Considerations](#future-considerations)
23+
24+
25+
## Overview
26+
27+
The Resource Indexer is a core component of the Search service responsible for
28+
maintaining a searchable index of platform resources. It consumes audit log
29+
events from NATS JetStream, applies policy-based filtering, and manages indexed
30+
documents in the search backend.
31+
32+
## Design Goals
33+
34+
- **Real-time indexing**: Process resource changes within seconds of occurrence
35+
- **Policy-driven**: Index only resources matching active IndexPolicy
36+
configurations
37+
- **Reliable delivery**: Guarantee at-least-once processing of all events
38+
- **Graceful recovery**: Resume processing from last known position after
39+
restarts
40+
- **Horizontal scalability**: Scale throughput by adding instances without
41+
coordination
42+
43+
## Core Responsibilities
44+
45+
The Resource Indexer handles:
46+
47+
- Consuming audit log events from NATS JetStream
48+
- Watching IndexPolicy resources and evaluating CEL filters
49+
- Transforming Kubernetes resources into searchable documents
50+
- Persisting documents to the index backend
51+
- Acknowledging events only after successful persistence
52+
53+
### Event Processing Flow
54+
55+
The following diagram illustrates how the indexer processes events, including
56+
policy matching, batching, and acknowledgment handling:
57+
58+
```mermaid
sequenceDiagram
    participant JS as NATS JetStream
    participant Indexer as Resource Indexer
    participant Cache as Policy Cache
    participant Meili as Meilisearch

    rect rgb(240, 248, 255)
        note right of JS: Create/Update Matches Policy
        JS->>Indexer: Deliver audit event
        Indexer->>Cache: Evaluate policies
        Cache-->>Indexer: Policy match + compiled CEL
        Indexer->>Indexer: Evaluate CEL filter
        Indexer->>Indexer: Transform resource to document
        Indexer->>Indexer: Queue document for upsert

        alt Upsert queue ready (size or time threshold)
            Indexer->>Meili: Add/replace documents batch
            Meili-->>Indexer: Success
            Indexer->>JS: Ack all events in batch
        end
    end

    rect rgb(245, 245, 245)
        note right of JS: Create Does Not Match Policy
        JS->>Indexer: Deliver audit event (create)
        Indexer->>Cache: Evaluate policies
        Cache-->>Indexer: No matching policy
        Indexer->>JS: Ack (never indexed)
    end

    rect rgb(255, 248, 240)
        note right of JS: Update No Longer Matches Policy
        JS->>Indexer: Deliver audit event (update)
        Indexer->>Cache: Evaluate policies
        Cache-->>Indexer: No matching policy
        Indexer->>Indexer: Queue delete by UID

        alt Delete queue ready (size or time threshold)
            Indexer->>Meili: Delete documents batch
            Meili-->>Indexer: Success
            Indexer->>JS: Ack all events in batch
        end
    end

    rect rgb(255, 245, 238)
        note right of JS: Resource Deleted
        JS->>Indexer: Deliver audit event (delete)
        Indexer->>Indexer: Queue delete by UID

        alt Delete queue ready (size or time threshold)
            Indexer->>Meili: Delete documents batch
            Meili-->>Indexer: Success
            Indexer->>JS: Ack all events in batch
        end
    end

    rect rgb(255, 240, 240)
        note right of JS: Persistence Failure
        JS->>Indexer: Deliver audit event
        Indexer->>Indexer: Transform and queue
        Indexer->>Meili: Flush queue
        Meili-->>Indexer: Error
        note right of Indexer: Do not ack — JetStream<br/>redelivers after timeout
    end
```
124+
125+
## Event Consumption
126+
127+
The indexer consumes audit log events from [NATS JetStream][jetstream] using
128+
[durable consumers][durable-consumers]. JetStream provides:
129+
130+
- **Delivery guarantees**: At-least-once delivery with configurable ack timeouts
131+
- **Position tracking**: Durable consumers track acknowledged messages; on
132+
restart, consumption resumes from the last acknowledged position
133+
- **Backpressure**: Pull-based consumption allows the indexer to control its
134+
processing rate
135+
136+
[jetstream]: https://docs.nats.io/nats-concepts/jetstream
137+
[durable-consumers]: https://docs.nats.io/nats-concepts/jetstream/consumers#durable-consumers
138+
139+
### Horizontal Scaling
140+
141+
The indexer uses JetStream [queue groups] for horizontal scaling. When multiple
142+
instances join the same queue group, JetStream distributes messages across them
143+
automatically: each message goes to one instance at a time, and an unacknowledged message may be redelivered to a different instance.
144+
145+
<p align="center">
146+
<img src="../diagrams/ResourceIndexerScaling.png" alt="Resource Indexer horizontal scaling diagram">
147+
</p>
148+
149+
This lets throughput scale with the number of instances, without coordination between them, for as long as the index backend keeps up.
150+
151+
[queue groups]: https://docs.nats.io/nats-concepts/core-nats/queue
152+
153+
### Kubernetes Configuration
154+
155+
In Kubernetes environments, JetStream resources (Streams, Consumers) can be
156+
managed declaratively using [NACK] (NATS Controllers for Kubernetes). NACK
157+
provides CRDs for defining Streams and Consumers as Kubernetes resources,
158+
enabling GitOps workflows and consistent configuration across environments.
159+
160+
[NACK]: https://github.com/nats-io/nack
161+
162+
## Policy Management
163+
164+
IndexPolicy resources define what to index. The indexer watches these resources
165+
using a Kubernetes [informer], which provides:
166+
167+
- **List-watch semantics**: Initial list of all policies followed by a watch
168+
stream for changes
169+
- **Local cache**: In-memory store for fast lookups during event processing
170+
- **Automatic resync**: Periodic re-list to correct any drift
171+
172+
Each indexer instance maintains its own policy cache. Since events can be routed
173+
to any instance (via queue groups), each instance caches all policies.
174+
IndexPolicy resources are typically small and few in number, so this
175+
replication is acceptable.
176+
177+
### CEL Compilation
178+
179+
[CEL expressions][CEL] in policies must be compiled before evaluation. To avoid
180+
recompilation on every event, compile expressions when policies are added or
181+
updated and cache the compiled programs alongside the policy.
182+
183+
The indexer should wait for the informer cache to sync before processing events
184+
to ensure all active policies are available for matching.
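
As a rough illustration of the compile-once approach described above, the sketch below caches one compiled program per policy. It is not the indexer's actual code: `Program`, `CompileFunc`, `PolicyCache`, and `toyCompile` are hypothetical names, and `toyCompile` is a stand-in that only understands `kind == "X"` so the example runs without a CEL library. In a real indexer the `Upsert` and `Delete` methods would presumably be wired to the informer's add, update, and delete handlers.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Program stands in for a compiled CEL program (hypothetical type).
type Program func(resource map[string]any) (bool, error)

// CompileFunc stands in for a real CEL compiler (hypothetical signature).
type CompileFunc func(expr string) (Program, error)

type cachedPolicy struct {
	expr string
	prog Program
}

// PolicyCache keeps one compiled program per policy name.
type PolicyCache struct {
	mu       sync.RWMutex
	compile  CompileFunc
	policies map[string]cachedPolicy
}

func NewPolicyCache(compile CompileFunc) *PolicyCache {
	return &PolicyCache{compile: compile, policies: map[string]cachedPolicy{}}
}

// Upsert is meant to be called from the informer's add and update handlers.
// It recompiles only when the expression text has changed.
func (c *PolicyCache) Upsert(name, expr string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if old, ok := c.policies[name]; ok && old.expr == expr {
		return nil
	}
	prog, err := c.compile(expr)
	if err != nil {
		return fmt.Errorf("compile policy %q: %w", name, err)
	}
	c.policies[name] = cachedPolicy{expr: expr, prog: prog}
	return nil
}

// Delete is meant to be called from the informer's delete handler.
func (c *PolicyCache) Delete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.policies, name)
}

// Lookup returns the cached program for use while processing events.
func (c *PolicyCache) Lookup(name string) (Program, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	p, ok := c.policies[name]
	return p.prog, ok
}

// toyCompile only understands `kind == "X"`; it exists to keep the sketch runnable.
func toyCompile(expr string) (Program, error) {
	const prefix = `kind == "`
	if len(expr) <= len(prefix) || !strings.HasPrefix(expr, prefix) || !strings.HasSuffix(expr, `"`) {
		return nil, fmt.Errorf("unsupported expression: %s", expr)
	}
	want := expr[len(prefix) : len(expr)-1]
	return func(r map[string]any) (bool, error) { return r["kind"] == want, nil }, nil
}

func main() {
	cache := NewPolicyCache(toyCompile)
	if err := cache.Upsert("pods", `kind == "Pod"`); err != nil {
		panic(err)
	}
	prog, _ := cache.Lookup("pods")
	match, _ := prog(map[string]any{"kind": "Pod"})
	fmt.Println("matches:", match)
}
```

In this sketch a failed compilation leaves the previously cached program in place; whether a broken policy should instead be evicted is a design choice the document does not settle.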
185+
186+
[informer]: https://pkg.go.dev/k8s.io/client-go/tools/cache#SharedInformer
187+
[CEL]: https://cel.dev
188+
189+
## Document Lifecycle
190+
191+
The indexer manages documents in the search index based on audit events:
192+
193+
| Event Type | Policy Match | Action |
|------------|--------------|--------|
| Create | Yes | Upsert document |
| Create | No | Acknowledge only (never indexed) |
| Update | Yes | Upsert document |
| Update | No | Delete document (may have been indexed) |
| Delete | N/A | Delete document |
200+
201+
When a resource is updated and no longer matches any policy (e.g., labels
202+
changed, CEL filter no longer passes), the indexer queues a delete. Since the
203+
indexer doesn't track what was previously indexed, it always attempts deletion
204+
for non-matching updates — Meilisearch treats deletes of non-existent documents
205+
as no-ops.
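
The lifecycle rules above reduce to a small decision function. The following is a minimal sketch of that mapping; the type and constant names (`EventType`, `Action`, `Decide`, and so on) are made up for illustration and do not come from the indexer's code.

```go
package main

import "fmt"

// EventType is the kind of change reported by an audit event (hypothetical type).
type EventType string

const (
	Create EventType = "create"
	Update EventType = "update"
	Delete EventType = "delete"
)

// Action is what the indexer does in response (hypothetical type).
type Action string

const (
	UpsertDocument Action = "upsert"
	DeleteDocument Action = "delete"
	AckOnly        Action = "ack-only"
)

// Decide mirrors the lifecycle table: deletes always remove the document,
// matching creates and updates upsert it, a non-matching update removes it
// because it may have been indexed earlier, and a non-matching create is
// acknowledged without touching the index.
func Decide(t EventType, matchesPolicy bool) Action {
	switch {
	case t == Delete:
		return DeleteDocument
	case matchesPolicy:
		return UpsertDocument
	case t == Update:
		return DeleteDocument
	default:
		return AckOnly
	}
}

func main() {
	for _, t := range []EventType{Create, Update, Delete} {
		for _, match := range []bool{true, false} {
			fmt.Printf("%-6s match=%-5t -> %s\n", t, match, Decide(t, match))
		}
	}
}
```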
206+
207+
### Transformation
208+
209+
When an event matches a policy, the indexer transforms the Kubernetes resource
210+
into a searchable document:
211+
212+
- Extract fields specified in the IndexPolicy field mappings
213+
- Normalize metadata (labels, annotations) into searchable formats
214+
- Use the resource's UID as the document identifier
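
A transformation along these lines could look like the sketch below. The `FieldMapping` shape (a document field name plus a dot-separated path into the resource) is an assumption, since the IndexPolicy schema is not shown here, and flattening labels into sorted `key=value` strings is just one possible normalization.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// FieldMapping is a hypothetical mapping: Path is a dot-separated path into
// the resource, Name is the field name used in the search document.
type FieldMapping struct {
	Name string
	Path string
}

// lookup walks nested maps along a dot-separated path.
func lookup(obj map[string]any, path string) (any, bool) {
	var cur any = obj
	for _, key := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		cur, ok = m[key]
		if !ok {
			return nil, false
		}
	}
	return cur, true
}

// Transform builds a document keyed by the resource UID, copies the mapped
// fields, and flattens labels into sorted "key=value" strings.
func Transform(resource map[string]any, mappings []FieldMapping) (map[string]any, error) {
	raw, _ := lookup(resource, "metadata.uid")
	uid, _ := raw.(string)
	if uid == "" {
		return nil, errors.New("resource has no metadata.uid")
	}
	doc := map[string]any{"id": uid}
	for _, m := range mappings {
		if v, ok := lookup(resource, m.Path); ok {
			doc[m.Name] = v
		}
	}
	if raw, ok := lookup(resource, "metadata.labels"); ok {
		if labels, ok := raw.(map[string]any); ok {
			flat := make([]string, 0, len(labels))
			for k, v := range labels {
				flat = append(flat, fmt.Sprintf("%s=%v", k, v))
			}
			sort.Strings(flat)
			doc["labels"] = flat
		}
	}
	return doc, nil
}

func main() {
	resource := map[string]any{
		"kind": "Deployment",
		"metadata": map[string]any{
			"uid":    "5f0c2c1e-8a4b-4c2d-9d57-2f1f0c3a9b11",
			"name":   "web",
			"labels": map[string]any{"tier": "frontend", "app": "web"},
		},
	}
	doc, err := Transform(resource, []FieldMapping{{Name: "name", Path: "metadata.name"}})
	fmt.Println(doc, err)
}
```

The dot-separated path is a simplification: it cannot address keys that themselves contain dots (common in label and annotation keys), so a real mapping syntax would need escaping or a different path format.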
215+
216+
## Persistence and Acknowledgment
217+
218+
Documents are persisted to the index backend ([Meilisearch]). To guarantee
219+
at-least-once delivery, events are only acknowledged after successful
220+
persistence.
221+
222+
[Meilisearch]: https://www.meilisearch.com/docs
223+
224+
### Batching
225+
226+
For efficiency, batch multiple operations before persisting. The indexer
227+
maintains separate queues for upserts and deletes since Meilisearch requires
228+
separate API calls for each operation type. When either queue reaches its size
229+
or time threshold:
230+
231+
1. Flush pending upserts via the [add documents][meilisearch-add-documents]
232+
endpoint
233+
2. Flush pending deletes via the [delete documents][meilisearch-delete-documents]
234+
endpoint
235+
3. On success, acknowledge all events whose operations were flushed
236+
4. On failure, do not acknowledge — JetStream redelivers after ack timeout
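
The steps above can be sketched as a small batcher that acknowledges only after a successful flush. This is an illustration under stated assumptions, not the indexer's implementation: `Msg` and `Backend` are hypothetical interfaces standing in for a JetStream message and the Meilisearch client, `fakeMsg` and `memBackend` exist only to make the example runnable, and only the size threshold is shown (a timer calling `Flush` would cover the time threshold). The sketch is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// Msg stands in for a JetStream message; only Ack is needed here.
type Msg interface{ Ack() error }

// Backend stands in for the search backend client.
type Backend interface {
	Upsert(docs []map[string]any) error
	Delete(ids []string) error
}

// Batcher queues operations per document ID so that the latest operation
// for an ID wins, and remembers which messages still await acknowledgment.
type Batcher struct {
	backend Backend
	maxSize int
	upserts map[string]map[string]any
	deletes map[string]struct{}
	pending []Msg
}

func NewBatcher(b Backend, maxSize int) *Batcher {
	return &Batcher{backend: b, maxSize: maxSize,
		upserts: map[string]map[string]any{}, deletes: map[string]struct{}{}}
}

func (b *Batcher) QueueUpsert(m Msg, id string, doc map[string]any) error {
	delete(b.deletes, id) // a later upsert supersedes a queued delete
	b.upserts[id] = doc
	return b.enqueue(m)
}

func (b *Batcher) QueueDelete(m Msg, id string) error {
	delete(b.upserts, id) // a later delete supersedes a queued upsert
	b.deletes[id] = struct{}{}
	return b.enqueue(m)
}

func (b *Batcher) enqueue(m Msg) error {
	b.pending = append(b.pending, m)
	if len(b.pending) >= b.maxSize {
		return b.Flush()
	}
	return nil
}

// Flush persists both queues and acks only if every write succeeded.
// On failure the batch is dropped unacknowledged and redelivery is relied on.
func (b *Batcher) Flush() error {
	docs := make([]map[string]any, 0, len(b.upserts))
	for _, d := range b.upserts {
		docs = append(docs, d)
	}
	ids := make([]string, 0, len(b.deletes))
	for id := range b.deletes {
		ids = append(ids, id)
	}
	pending := b.pending
	b.upserts, b.deletes, b.pending = map[string]map[string]any{}, map[string]struct{}{}, nil

	if len(docs) > 0 {
		if err := b.backend.Upsert(docs); err != nil {
			return fmt.Errorf("flush upserts: %w", err)
		}
	}
	if len(ids) > 0 {
		if err := b.backend.Delete(ids); err != nil {
			return fmt.Errorf("flush deletes: %w", err)
		}
	}
	for _, m := range pending {
		if err := m.Ack(); err != nil {
			return fmt.Errorf("ack: %w", err)
		}
	}
	return nil
}

// fakeMsg and memBackend are test doubles used only by this example.
type fakeMsg struct{ acked bool }

func (m *fakeMsg) Ack() error { m.acked = true; return nil }

type memBackend struct {
	fail bool
	docs map[string]map[string]any
}

func (s *memBackend) Upsert(docs []map[string]any) error {
	if s.fail {
		return errors.New("backend unavailable")
	}
	for _, d := range docs {
		s.docs[d["id"].(string)] = d
	}
	return nil
}

func (s *memBackend) Delete(ids []string) error {
	if s.fail {
		return errors.New("backend unavailable")
	}
	for _, id := range ids {
		delete(s.docs, id)
	}
	return nil
}

func main() {
	be := &memBackend{docs: map[string]map[string]any{}}
	b := NewBatcher(be, 2)
	m1, m2 := &fakeMsg{}, &fakeMsg{}
	_ = b.QueueUpsert(m1, "uid-1", map[string]any{"id": "uid-1"})
	_ = b.QueueUpsert(m2, "uid-2", map[string]any{"id": "uid-2"})
	fmt.Println(len(be.docs), m1.acked, m2.acked)
}
```

Keying the queues by document ID is a choice made here so that an upsert and a delete for the same resource in one batch cannot be applied in the wrong order; the document itself does not prescribe how to handle that case.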
237+
238+
Create events that don't match any policy can be acknowledged immediately since
239+
the resource was never indexed. Update and delete events that don't match should
240+
still queue a delete operation — the indexer cannot know whether the resource
241+
was previously indexed, and Meilisearch delete operations are idempotent.
242+
243+
[meilisearch-add-documents]: https://www.meilisearch.com/docs/reference/api/documents#add-or-replace-documents
244+
[meilisearch-delete-documents]: https://www.meilisearch.com/docs/reference/api/documents#delete-documents-by-batch
245+
246+
### Duplicate Handling
247+
248+
At-least-once delivery means duplicates are possible (e.g., after a failure
249+
before acknowledgment). The index backend handles this via [document primary
250+
keys][meilisearch-primary-key] — reindexing the same resource overwrites the
251+
existing document.
252+
253+
[meilisearch-primary-key]: https://www.meilisearch.com/docs/learn/core_concepts/primary_key
254+
255+
## Bootstrap Process
256+
257+
On startup or when a new IndexPolicy is created, the indexer must populate the
258+
index with existing resources. The platform spans multiple project control
259+
planes, so bootstrap must list resources from each cluster.
260+
261+
### Multi-Cluster Bootstrap
262+
263+
The indexer uses the [multicluster-runtime] provider pattern to discover
264+
project control planes. For each discovered cluster:
265+
266+
1. List resources matching the policy selector from that cluster's API
267+
2. Transform and index each resource
268+
3. Handle concurrent modifications during bootstrap gracefully
269+
270+
The provider handles dynamic cluster discovery — as clusters come online or go
271+
offline, the indexer bootstraps or cleans up accordingly.
272+
273+
After bootstrap completes, real-time indexing continues via the JetStream event
274+
stream, which already aggregates events from all control planes.
275+
276+
[multicluster-runtime]: https://github.com/kubernetes-sigs/multicluster-runtime
277+
278+
## Error Handling
279+
280+
- **Transient failures**: Retry with exponential backoff for network errors and
281+
temporary unavailability
282+
- **Malformed events**: Log and skip events that cannot be parsed; acknowledge
283+
to prevent redelivery loops
284+
- **Backend unavailability**: Buffer events in memory (bounded) while attempting
285+
reconnection; pause consumption if buffer fills
286+
- **Policy evaluation errors**: Log and skip events with CEL evaluation
287+
failures; do not block processing of other events
288+
289+
## Integration Points
290+
291+
| System | Protocol | Purpose |
|--------|----------|---------|
| [NATS JetStream][jetstream] | NATS | Consume audit log events (aggregated from all clusters) |
| Search API Server | HTTPS | Watch IndexPolicy resources |
| Project Control Planes | HTTPS | Bootstrap existing resources |
| [Meilisearch] | HTTPS/JSON | Manage indexed documents |
297+
298+
## Future Considerations
299+
300+
- **Control plane deletion**: When a project control plane is deleted, indexed
301+
resources from that cluster must be cleaned up. Ideally, the platform emits
302+
deletion events for all resources before the control plane is removed,
303+
allowing event-driven cleanup. If this isn't guaranteed, the indexer may need
304+
to track source cluster metadata and delete documents when a cluster is
305+
disengaged.
306+
- **Dead letter handling**: Route persistently failing events to a dead letter
307+
queue for manual inspection
308+
- **Metrics and observability**: Expose indexing lag, throughput, and error
309+
rates via Prometheus
310+
- **Multi-tenancy**: Support tenant-isolated indexes with policy-based routing
311+
- **Policy-based sharding**: For very large deployments, assign subsets of
312+
policies to instances using consistent hashing

docs/diagrams/ResourceIndexerScaling.png

Binary file, 24.8 KB
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
@startuml ResourceIndexerScaling
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Container.puml

System_Ext(jetstream, "NATS JetStream", "Audit log event stream aggregated from all control planes")

System_Boundary(indexerGroup, "Queue Group: resource-indexer") {
    Container(indexer1, "Indexer #1", "Go", "Policy matching, CEL filtering, batch writes")
    Container(indexer2, "Indexer #2", "Go", "Policy matching, CEL filtering, batch writes")
    Container(indexer3, "Indexer #3", "Go", "Policy matching, CEL filtering, batch writes")
}

note right of indexerGroup
    Each event delivered to
    exactly one instance —
    scales linearly without
    coordination
end note

System_Ext(meilisearch, "Meilisearch", "Persistent search index")

Rel_D(jetstream, indexer1, "Distributes event", "NATS")
Rel_D(jetstream, indexer2, "Distributes event", "NATS")
Rel_D(jetstream, indexer3, "Distributes event", "NATS")

Rel_D(indexer1, meilisearch, "Batched upserts/deletes", "HTTPS")
Rel_D(indexer2, meilisearch, "Batched upserts/deletes", "HTTPS")
Rel_D(indexer3, meilisearch, "Batched upserts/deletes", "HTTPS")

SHOW_LEGEND()

@enduml
