Commit 1fba80f

feat: define resource indexer architecture
Creates a new architecture document that goes into detail on the design of the indexing service.
1 parent 557854a commit 1fba80f

5 files changed

+282
-0
lines changed

README.md

Lines changed: 7 additions & 0 deletions
@@ -8,3 +8,10 @@ using CEL-based filtering. The service integrates natively with kubectl/RBAC and
 targets Meilisearch as the search backend.
 
 ![](./docs/diagrams/SearchServiceContext.png)
+
+## Documentation
+
+- [Architecture](./docs/architecture.md): High-level design and component
+  overview
+- [Resource Indexer](./docs/components/resource-indexer.md): Detailed design
+  for the indexing component

docs/architecture.md

Lines changed: 3 additions & 0 deletions
@@ -52,6 +52,9 @@ using powerful indexing and real-time event processing.
 - Manage index lifecycle (creation, updates, deletion)
 - Bootstrap indexes from existing state
 
+See the [Resource Indexer Architecture](./components/resource-indexer.md) for
+detailed design documentation.
+
 ### Controller Manager
 
 **Purpose**: Manages and validates resources for the search service
docs/components/resource-indexer.md

Lines changed: 246 additions & 0 deletions
@@ -0,0 +1,246 @@
<!-- omit from toc -->
# Resource Indexer Architecture

- [Overview](#overview)
- [Design Goals](#design-goals)
- [Core Responsibilities](#core-responsibilities)
- [Event Consumption](#event-consumption)
  - [Horizontal Scaling](#horizontal-scaling)
- [Policy Management](#policy-management)
  - [CEL Compilation](#cel-compilation)
- [Document Transformation](#document-transformation)
- [Persistence and Acknowledgment](#persistence-and-acknowledgment)
  - [Batching](#batching)
  - [Duplicate Handling](#duplicate-handling)
- [Bootstrap Process](#bootstrap-process)
  - [Multi-Cluster Bootstrap](#multi-cluster-bootstrap)
- [Error Handling](#error-handling)
- [Integration Points](#integration-points)
- [Future Considerations](#future-considerations)

## Overview

The Resource Indexer is a core component of the Search service responsible for
maintaining a searchable index of platform resources. It consumes audit log
events from NATS JetStream, applies policy-based filtering, and writes indexed
documents to the search backend.

## Design Goals

- **Real-time indexing**: Process resource changes within seconds of occurrence
- **Policy-driven**: Index only resources matching active IndexPolicy
  configurations
- **Reliable delivery**: Guarantee at-least-once processing of all events
- **Graceful recovery**: Resume processing from last known position after
  restarts
- **Horizontal scalability**: Scale throughput by adding instances without
  coordination
- **Minimal resource footprint**: Operate efficiently within constrained
  environments

## Core Responsibilities

The Resource Indexer handles:

- Consuming audit log events from NATS JetStream
- Watching IndexPolicy resources and evaluating CEL filters
- Transforming Kubernetes resources into searchable documents
- Persisting documents to the index backend
- Acknowledging events only after successful persistence

### Event Processing Flow

The following diagram illustrates how the indexer processes events, including
policy matching, batching, and acknowledgment handling:

```mermaid
sequenceDiagram
    participant JS as NATS JetStream
    participant Indexer as Resource Indexer
    participant Cache as Policy Cache
    participant Meili as Meilisearch

    rect rgb(240, 248, 255)
        note right of JS: Event Matches Policy
        JS->>Indexer: Deliver audit event
        Indexer->>Cache: Evaluate policies
        Cache-->>Indexer: Policy match + compiled CEL
        Indexer->>Indexer: Evaluate CEL filter
        Indexer->>Indexer: Transform resource to document
        Indexer->>Indexer: Add to batch

        alt Batch ready (size or time threshold)
            Indexer->>Meili: Persist document batch
            Meili-->>Indexer: Success
            Indexer->>JS: Ack all events in batch
        end
    end

    rect rgb(255, 248, 240)
        note right of JS: Event Does Not Match Policy
        JS->>Indexer: Deliver audit event
        Indexer->>Cache: Evaluate policies
        Cache-->>Indexer: No matching policy
        Indexer->>JS: Ack (discard event)
    end

    rect rgb(255, 240, 240)
        note right of JS: Persistence Failure
        JS->>Indexer: Deliver audit event
        Indexer->>Indexer: Transform and batch
        Indexer->>Meili: Persist document batch
        Meili-->>Indexer: Error
        note right of Indexer: Do not ack. JetStream<br/>redelivers after timeout
    end
```

## Event Consumption

The indexer consumes audit log events from [NATS JetStream][jetstream] using
[durable consumers][durable-consumers]. JetStream provides:

- **Delivery guarantees**: At-least-once delivery with configurable ack timeouts
- **Position tracking**: Durable consumers track acknowledged messages; on
  restart, consumption resumes from the last acknowledged position
- **Backpressure**: Pull-based consumption allows the indexer to control its
  processing rate

[jetstream]: https://docs.nats.io/nats-concepts/jetstream
[durable-consumers]: https://docs.nats.io/nats-concepts/jetstream/consumers#durable-consumers

### Horizontal Scaling

The indexer uses JetStream [queue groups] for horizontal scaling. When multiple
instances join the same queue group, JetStream distributes messages across them
automatically: each message is delivered to one instance at a time, although
at-least-once delivery means an unacknowledged message may later be redelivered
to a different instance. With pull-based consumption, the equivalent is for all
instances to fetch from the same durable consumer.

![Resource Indexer horizontal scaling diagram](../diagrams/ResourceIndexerScaling.png)

This lets throughput scale with the number of instances, without coordination
between them, up to the write capacity of the index backend.

[queue groups]: https://docs.nats.io/nats-concepts/core-nats/queue

## Policy Management

IndexPolicy resources define what to index. The indexer watches these resources
using a Kubernetes [informer], which provides:

- **List-watch semantics**: Initial list of all policies followed by a watch
  stream for changes
- **Local cache**: In-memory store for fast lookups during event processing
- **Automatic resync**: Periodic replay of cached policies to the event
  handlers, so handler work that was missed or failed is retried (the informer
  re-lists from the API server only when its watch cannot be resumed)

Each indexer instance maintains its own policy cache. Since events can be routed
to any instance (via queue groups), each instance caches all policies.
IndexPolicy resources are typically small and few in number, so this
replication is acceptable.

### CEL Compilation

[CEL expressions][CEL] in policies must be compiled before evaluation. To avoid
recompilation on every event, compile expressions when policies are added or
updated and cache the compiled programs alongside the policy.

The indexer should wait for the informer cache to sync before processing events
to ensure all active policies are available for matching.

[informer]: https://pkg.go.dev/k8s.io/client-go/tools/cache#SharedInformer
[CEL]: https://cel.dev

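As a rough illustration of compiling once per policy change rather than once per event, the sketch below keeps compiled programs in a map that informer handlers would update. Everything in it is an assumption made for illustration: `PolicyCache`, `Program`, and `compileFilter` are hypothetical names, and `compileFilter` is a toy stand-in that only understands `kind == "<value>"`, not a real CEL compiler.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Program is a compiled filter. A real indexer would wrap a compiled CEL
// program here; a plain Go function is used as a stand-in (assumption).
type Program func(resource map[string]any) bool

// compileFilter is a hypothetical stand-in for a CEL compiler. It accepts
// only expressions of the form `kind == "<value>"`.
func compileFilter(expr string) (Program, error) {
	const prefix = `kind == "`
	if len(expr) <= len(prefix) || !strings.HasPrefix(expr, prefix) || !strings.HasSuffix(expr, `"`) {
		return nil, fmt.Errorf("unsupported expression: %q", expr)
	}
	want := expr[len(prefix) : len(expr)-1]
	return func(r map[string]any) bool { return r["kind"] == want }, nil
}

// PolicyCache stores one compiled program per policy name.
type PolicyCache struct {
	mu       sync.RWMutex
	programs map[string]Program
}

func NewPolicyCache() *PolicyCache {
	return &PolicyCache{programs: make(map[string]Program)}
}

// Upsert compiles the expression and stores the result. It is meant to be
// called from the informer's add and update handlers. On a compile error the
// previously cached program, if any, is left in place.
func (c *PolicyCache) Upsert(name, expr string) error {
	prog, err := compileFilter(expr)
	if err != nil {
		return fmt.Errorf("policy %s: %w", name, err)
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.programs[name] = prog
	return nil
}

// Delete drops a policy. It is meant for the informer's delete handler.
func (c *PolicyCache) Delete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.programs, name)
}

// Match returns the sorted names of policies whose filter accepts the
// resource. No compilation happens on this path.
func (c *PolicyCache) Match(resource map[string]any) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var matched []string
	for name, prog := range c.programs {
		if prog(resource) {
			matched = append(matched, name)
		}
	}
	sort.Strings(matched)
	return matched
}

func main() {
	cache := NewPolicyCache()
	if err := cache.Upsert("deployments", `kind == "Deployment"`); err != nil {
		fmt.Println("compile error:", err)
		return
	}
	fmt.Println(cache.Match(map[string]any{"kind": "Deployment"}))
	fmt.Println(cache.Match(map[string]any{"kind": "Pod"}))
}
```

Keeping the old program when a new expression fails to compile is one possible choice; rejecting the policy outright would be equally defensible.
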
## Document Transformation

When an event matches a policy, the indexer transforms the Kubernetes resource
into a searchable document:

- Extract fields specified in the IndexPolicy field mappings
- Normalize metadata (labels, annotations) into searchable formats
- Use the resource's UID as the document identifier

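The three steps above could look roughly like the following sketch. It assumes resources arrive as decoded JSON (`map[string]any`) and that field mappings are a map from document field name to a dotted source path; that mapping shape, and the names `Transform` and `lookup`, are assumptions for illustration, since the IndexPolicy schema is not defined here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// lookup walks a dotted path such as "metadata.name" through nested maps.
func lookup(obj map[string]any, path string) (any, bool) {
	var cur any = obj
	for _, key := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		if cur, ok = m[key]; !ok {
			return nil, false
		}
	}
	return cur, true
}

// Transform builds a searchable document from a resource. fields maps a
// document field name to a dotted source path (hypothetical mapping shape).
func Transform(resource map[string]any, fields map[string]string) (map[string]any, error) {
	uid, _ := lookup(resource, "metadata.uid")
	id, ok := uid.(string)
	if !ok || id == "" {
		return nil, fmt.Errorf("resource has no metadata.uid")
	}

	doc := make(map[string]any)
	for name, path := range fields {
		if v, found := lookup(resource, path); found {
			doc[name] = v
		}
	}

	// Normalize labels into sorted "key=value" strings so they can be
	// searched as plain text.
	if raw, found := lookup(resource, "metadata.labels"); found {
		if labels, isMap := raw.(map[string]any); isMap {
			flat := make([]string, 0, len(labels))
			for k, v := range labels {
				flat = append(flat, fmt.Sprintf("%s=%v", k, v))
			}
			sort.Strings(flat)
			doc["labels"] = flat
		}
	}

	// The resource UID is the document identifier; set it last so a field
	// mapping cannot overwrite it.
	doc["id"] = id
	return doc, nil
}

func main() {
	resource := map[string]any{
		"kind": "Deployment",
		"metadata": map[string]any{
			"uid":    "abc-123",
			"name":   "web",
			"labels": map[string]any{"app": "web"},
		},
	}
	doc, err := Transform(resource, map[string]string{"name": "metadata.name", "kind": "kind"})
	if err != nil {
		fmt.Println("transform error:", err)
		return
	}
	fmt.Println(doc)
}
```
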
## Persistence and Acknowledgment

Documents are persisted to the index backend ([Meilisearch]). To guarantee
at-least-once delivery, events are only acknowledged after successful
persistence.

[Meilisearch]: https://www.meilisearch.com/docs

### Batching

For efficiency, batch multiple documents into a single write request. When a
batch is ready (its size or time threshold is reached):

1. Persist all documents to the index backend
2. On success, acknowledge all events in the batch
3. On failure, do not acknowledge; JetStream redelivers after the ack timeout

Events that don't match any policy should be acknowledged immediately to prevent
reprocessing.

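A minimal sketch of the persist-then-acknowledge ordering, under some assumptions: `Batcher`, `Event`, and the `persist` callback are hypothetical names, the acknowledgment is modeled as a plain callback instead of a JetStream message, and only the size threshold is shown (a time threshold would call `Flush` from a ticker).

```go
package main

import (
	"errors"
	"fmt"
)

// errBackendDown simulates a failed write to the index backend.
var errBackendDown = errors.New("backend unavailable")

// Event pairs a document with the callback that acknowledges the message it
// came from.
type Event struct {
	Doc map[string]any
	Ack func()
}

// Batcher accumulates events and flushes once maxSize events are pending.
type Batcher struct {
	maxSize int
	persist func(docs []map[string]any) error
	pending []Event
}

// Add queues an event and flushes when the size threshold is reached.
func (b *Batcher) Add(e Event) error {
	b.pending = append(b.pending, e)
	if len(b.pending) >= b.maxSize {
		return b.Flush()
	}
	return nil
}

// Flush persists all pending documents in one write and acknowledges the
// events only if that write succeeds.
func (b *Batcher) Flush() error {
	if len(b.pending) == 0 {
		return nil
	}
	batch := b.pending
	b.pending = nil

	docs := make([]map[string]any, len(batch))
	for i, e := range batch {
		docs[i] = e.Doc
	}
	if err := b.persist(docs); err != nil {
		// No ack: the batch is dropped locally and the stream is expected
		// to redeliver these events after the ack timeout.
		return fmt.Errorf("persist batch of %d: %w", len(docs), err)
	}
	for _, e := range batch {
		e.Ack()
	}
	return nil
}

func main() {
	acked := 0
	backendUp := false
	b := &Batcher{
		maxSize: 2,
		persist: func(docs []map[string]any) error {
			if !backendUp {
				return errBackendDown
			}
			return nil
		},
	}
	ev := Event{Doc: map[string]any{"id": "abc-123"}, Ack: func() { acked++ }}

	_ = b.Add(ev)
	fmt.Println("first flush:", b.Add(ev), "acked:", acked)

	backendUp = true
	_ = b.Add(ev)
	fmt.Println("second flush:", b.Add(ev), "acked:", acked)
}
```
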
### Duplicate Handling

At-least-once delivery means duplicates are possible (e.g., after a failure
before acknowledgment). The index backend handles this via [document primary
keys][meilisearch-primary-key]: reindexing the same resource overwrites the
existing document.

[meilisearch-primary-key]: https://www.meilisearch.com/docs/learn/core_concepts/primary_key

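The reason redelivery is harmless can be shown with an in-memory stand-in for the index. This is only an illustration of add-or-replace semantics keyed by primary key; `Index` and its methods are hypothetical and do not represent the Meilisearch client API.

```go
package main

import "fmt"

// Index is an in-memory stand-in for a search index keyed by primary key.
type Index struct {
	docs map[string]map[string]any
}

func NewIndex() *Index {
	return &Index{docs: make(map[string]map[string]any)}
}

// Upsert adds the document, or replaces the existing document that has the
// same "id". Applying the same document twice leaves a single copy.
func (ix *Index) Upsert(doc map[string]any) error {
	id, ok := doc["id"].(string)
	if !ok || id == "" {
		return fmt.Errorf("document has no string id")
	}
	ix.docs[id] = doc
	return nil
}

// Get returns the document stored under id, if any.
func (ix *Index) Get(id string) (map[string]any, bool) {
	doc, ok := ix.docs[id]
	return doc, ok
}

// Len reports how many documents are stored.
func (ix *Index) Len() int { return len(ix.docs) }

func main() {
	ix := NewIndex()
	doc := map[string]any{"id": "abc-123", "name": "web"}

	// The same event delivered twice results in one document.
	_ = ix.Upsert(doc)
	_ = ix.Upsert(doc)
	fmt.Println("documents after duplicate delivery:", ix.Len())
}
```
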
## Bootstrap Process

On startup or when a new IndexPolicy is created, the indexer must populate the
index with existing resources. The platform spans multiple project control
planes, so bootstrap must list resources from each cluster.

### Multi-Cluster Bootstrap

The indexer uses the [multicluster-runtime] provider pattern to discover
project control planes. For each discovered cluster:

1. List resources matching the policy selector from that cluster's API
2. Transform and index each resource
3. Handle concurrent modifications during bootstrap gracefully

The provider handles dynamic cluster discovery: as clusters come online or go
offline, the indexer bootstraps or cleans up accordingly.

After bootstrap completes, real-time indexing continues via the JetStream event
stream, which already aggregates events from all control planes.

[multicluster-runtime]: https://github.com/kubernetes-sigs/multicluster-runtime

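The per-cluster loop might be sketched as below. The `Cluster` interface and `staticCluster` type are hypothetical and are not the multicluster-runtime API; the policy selector is reduced to a resource kind, and skipping an unreachable cluster while continuing with the rest is an assumed behavior, not something this document specifies.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable simulates a cluster whose API cannot be listed.
var errUnreachable = errors.New("cluster unreachable")

// Resource is a decoded Kubernetes object.
type Resource = map[string]any

// Cluster is a hypothetical view of one project control plane.
type Cluster interface {
	Name() string
	List(kind string) ([]Resource, error)
}

// staticCluster serves a fixed set of resources, for demonstration only.
type staticCluster struct {
	name  string
	items []Resource
	err   error
}

func (c staticCluster) Name() string { return c.name }

func (c staticCluster) List(kind string) ([]Resource, error) {
	if c.err != nil {
		return nil, c.err
	}
	var out []Resource
	for _, r := range c.items {
		if r["kind"] == kind {
			out = append(out, r)
		}
	}
	return out, nil
}

// Bootstrap lists resources of the given kind from every cluster and passes
// each one to index. A cluster that fails is recorded and skipped so the
// remaining clusters are still bootstrapped.
func Bootstrap(clusters []Cluster, kind string, index func(cluster string, r Resource) error) (int, map[string]error) {
	indexed := 0
	failed := make(map[string]error)
	for _, c := range clusters {
		items, err := c.List(kind)
		if err != nil {
			failed[c.Name()] = err
			continue
		}
		for _, r := range items {
			if err := index(c.Name(), r); err != nil {
				failed[c.Name()] = err
				break
			}
			indexed++
		}
	}
	return indexed, failed
}

func main() {
	clusters := []Cluster{
		staticCluster{name: "project-a", items: []Resource{{"kind": "Deployment"}, {"kind": "Pod"}}},
		staticCluster{name: "project-b", err: errUnreachable},
	}
	indexed, failed := Bootstrap(clusters, "Deployment", func(cluster string, r Resource) error {
		fmt.Println("indexing from", cluster)
		return nil
	})
	fmt.Println("indexed:", indexed, "failed clusters:", len(failed))
}
```
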
## Error Handling

- **Transient failures**: Retry with exponential backoff for network errors and
  temporary unavailability
- **Malformed events**: Log and skip events that cannot be parsed; acknowledge
  to prevent redelivery loops
- **Backend unavailability**: Buffer events in memory (bounded) while attempting
  reconnection; pause consumption if buffer fills
- **Policy evaluation errors**: Log and skip events with CEL evaluation
  failures; do not block processing of other events

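For the transient-failure case, a capped exponential backoff could be sketched as follows. The function names, the doubling factor, and the cap are illustrative assumptions; jitter is left out for brevity, and the sleep function is injected so the demo does not actually wait.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// backoffDelay returns base doubled once per attempt, capped at limit.
// attempt is zero-based: attempt 0 waits base, attempt 1 waits 2*base.
func backoffDelay(attempt int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= limit {
			return limit
		}
	}
	if d > limit {
		return limit
	}
	return d
}

// retry calls op until it succeeds or attempts are exhausted, waiting with
// exponential backoff between tries.
func retry(attempts int, base, limit time.Duration, sleep func(time.Duration), op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			sleep(backoffDelay(i, base, limit))
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	op := func() error {
		calls++
		if calls < 3 {
			return errors.New("temporarily unavailable")
		}
		return nil
	}
	// Print the delay instead of sleeping so the demo finishes immediately.
	logDelay := func(d time.Duration) { fmt.Println("would wait", d) }

	err := retry(5, 100*time.Millisecond, time.Second, logDelay, op)
	fmt.Println("calls:", calls, "error:", err)
}
```

In production code, `time.Sleep` (or a context-aware wait) would be passed as the `sleep` argument.
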
## Integration Points

| System | Protocol | Purpose |
|--------|----------|---------|
| [NATS JetStream][jetstream] | NATS | Consume audit log events (aggregated from all clusters) |
| Search API Server | HTTPS | Watch IndexPolicy resources |
| Project Control Planes | HTTPS | Bootstrap existing resources |
| [Meilisearch] | HTTPS/JSON | Persist indexed documents |

## Future Considerations

- **Control plane deletion**: When a project control plane is deleted, indexed
  resources from that cluster must be cleaned up. Ideally, the platform emits
  deletion events for all resources before the control plane is removed,
  allowing event-driven cleanup. If this isn't guaranteed, the indexer may need
  to track source cluster metadata and delete documents when a cluster is
  disengaged.
- **Dead letter handling**: Route persistently failing events to a dead letter
  queue for manual inspection
- **Metrics and observability**: Expose indexing lag, throughput, and error
  rates via Prometheus
- **Multi-tenancy**: Support tenant-isolated indexes with policy-based routing
- **Policy-based sharding**: For very large deployments, assign subsets of
  policies to instances using consistent hashing
21 KB
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
@startuml ResourceIndexerScaling
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Container.puml

title Resource Indexer Horizontal Scaling

System_Ext(jetstream, "NATS JetStream", "Audit log event stream aggregated from all control planes")

System_Boundary(indexerGroup, "Queue Group: resource-indexer") {
    Container(indexer1, "Indexer #1", "Go", "Processes subset of events")
    Container(indexer2, "Indexer #2", "Go", "Processes subset of events")
    Container(indexer3, "Indexer #3", "Go", "Processes subset of events")
}

System_Ext(meilisearch, "Meilisearch", "Search index backend")

Rel_D(jetstream, indexer1, "Delivers event", "NATS")
Rel_D(jetstream, indexer2, "Delivers event", "NATS")
Rel_D(jetstream, indexer3, "Delivers event", "NATS")

Rel_D(indexer1, meilisearch, "Writes documents", "HTTPS")
Rel_D(indexer2, meilisearch, "Writes documents", "HTTPS")
Rel_D(indexer3, meilisearch, "Writes documents", "HTTPS")

SHOW_LEGEND()

@enduml
