Commit 5af8c63

feat: define resource indexer architecture
Creates a new architecture document that goes into detail on the design of the indexing service.
1 parent 557854a commit 5af8c63

3 files changed: +208 -0 lines changed

README.md

Lines changed: 7 additions & 0 deletions
@@ -8,3 +8,10 @@ using CEL-based filtering. The service integrates natively with kubectl/RBAC and
 targets Meilisearch as the search backend.

 ![](./docs/diagrams/SearchServiceContext.png)
+
+## Documentation
+
+- [Architecture](./docs/architecture.md): High-level design and component
+  overview
+- [Resource Indexer](./docs/components/resource-indexer.md): Detailed design
+  for the indexing component

docs/architecture.md

Lines changed: 3 additions & 0 deletions
@@ -52,6 +52,9 @@ using powerful indexing and real-time event processing.
 - Manage index lifecycle (creation, updates, deletion)
 - Bootstrap indexes from existing state

+See the [Resource Indexer Architecture](./components/resource-indexer.md) for
+detailed design documentation.
+
 ### Controller Manager

 **Purpose**: Manages and validates resources for the search service

docs/components/resource-indexer.md

Lines changed: 198 additions & 0 deletions
@@ -0,0 +1,198 @@
<!-- omit from toc -->
# Resource Indexer Architecture

- [Overview](#overview)
- [Design Goals](#design-goals)
- [Core Responsibilities](#core-responsibilities)
- [Event Consumption](#event-consumption)
  - [Horizontal Scaling](#horizontal-scaling)
- [Policy Management](#policy-management)
  - [CEL Compilation](#cel-compilation)
- [Document Transformation](#document-transformation)
- [Persistence and Acknowledgment](#persistence-and-acknowledgment)
  - [Batching](#batching)
  - [Duplicate Handling](#duplicate-handling)
- [Bootstrap Process](#bootstrap-process)
  - [Multi-Cluster Bootstrap](#multi-cluster-bootstrap)
- [Error Handling](#error-handling)
- [Integration Points](#integration-points)
- [Future Considerations](#future-considerations)


## Overview

The Resource Indexer is a core component of the Search service responsible for
maintaining a searchable index of platform resources. It consumes audit log
events from NATS JetStream, applies policy-based filtering, and writes indexed
documents to the search backend.

## Design Goals

- **Real-time indexing**: Process resource changes within seconds of occurrence
- **Policy-driven**: Index only resources matching active IndexPolicy
  configurations
- **Reliable delivery**: Guarantee at-least-once processing of all events
- **Graceful recovery**: Resume processing from last known position after
  restarts
- **Minimal resource footprint**: Operate efficiently within constrained
  environments

## Core Responsibilities

The Resource Indexer handles:

- Consuming audit log events from NATS JetStream
- Watching IndexPolicy resources and evaluating CEL filters
- Transforming Kubernetes resources into searchable documents
- Persisting documents to the index backend
- Acknowledging events only after successful persistence

## Event Consumption

The indexer consumes audit log events from NATS JetStream using durable
consumers. JetStream provides:

- **Delivery guarantees**: At-least-once delivery with configurable ack timeouts
- **Position tracking**: Durable consumers track acknowledged messages; on
  restart, consumption resumes from the last acknowledged position
- **Backpressure**: Pull-based consumption allows the indexer to control its
  processing rate

### Horizontal Scaling

The indexer uses JetStream [queue groups] for horizontal scaling. When multiple
instances join the same queue group, JetStream distributes messages across them
automatically, so each message is delivered to only one instance at a time. A
message that is not acknowledged before the ack timeout may be redelivered to a
different instance.

```
                Queue Group: "resource-indexer"

       ┌───────────────────────┼───────────────────────┐
       │                       │                       │
       ▼                       ▼                       ▼
┌──────────────┐        ┌──────────────┐        ┌──────────────┐
│  Indexer #1  │        │  Indexer #2  │        │  Indexer #3  │
└──────────────┘        └──────────────┘        └──────────────┘
```

This lets throughput scale roughly linearly with the number of instances,
without coordination between them.
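
The distribution behaviour can be sketched without a NATS server by letting a Go
channel stand in for the shared consumer. This is only a model: `distribute` and
its parameters are hypothetical names, and redelivery after an ack timeout is
not simulated.

```go
package main

import (
	"fmt"
	"sync"
)

// distribute fans messages out to n workers that share one queue.
// The channel stands in for the shared JetStream consumer: each message
// is received by exactly one worker, with no coordination between workers.
// It returns how many messages each worker handled.
func distribute(messages []string, n int) []int {
	queue := make(chan string)
	counts := make([]int, n)
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for range queue {
				counts[id]++ // each worker writes only to its own slot
			}
		}(i)
	}

	for _, m := range messages {
		queue <- m
	}
	close(queue)
	wg.Wait()
	return counts
}

func main() {
	counts := distribute([]string{"a", "b", "c", "d", "e", "f"}, 3)
	total := 0
	for _, c := range counts {
		total += c
	}
	// The split between workers varies from run to run; the total does not.
	fmt.Println("per-worker counts:", counts, "total:", total)
}
```

How the messages are split between workers is not deterministic, which mirrors
the document's point that any instance may receive any event.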

[queue groups]: https://docs.nats.io/nats-concepts/core-nats/queue

## Policy Management

IndexPolicy resources define what to index. The indexer watches these resources
using a Kubernetes [informer], which provides:

- **List-watch semantics**: Initial list of all policies followed by a watch
  stream for changes
- **Local cache**: In-memory store for fast lookups during event processing
- **Automatic resync**: Periodic re-list to correct any drift

Each indexer instance maintains its own policy cache. Since events can be routed
to any instance (via queue groups), each instance caches all policies.
IndexPolicy resources are typically small and few in number, so this
replication is acceptable.

### CEL Compilation

[CEL expressions][CEL] in policies must be compiled before evaluation. To avoid
recompilation on every event, compile expressions when policies are added or
updated and cache the compiled programs alongside the policy.

The indexer should wait for the informer cache to sync before processing events
to ensure all active policies are available for matching.
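
A minimal sketch of the compile-on-update cache, assuming the informer's add,
update, and delete handlers call into it. `PolicyCache`, `Program`, and
`CompileFunc` are hypothetical names, and the compiler used in `main` is a toy
stand-in rather than a real CEL library. Keeping the previous program when a
new expression fails to compile is one possible choice, not something the
design above prescribes.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// Program is a stand-in for a compiled CEL program (hypothetical type).
type Program func(resource map[string]any) bool

// CompileFunc compiles an expression. A real indexer would wrap a CEL library here.
type CompileFunc func(expr string) (Program, error)

// PolicyCache stores one compiled program per policy name.
type PolicyCache struct {
	mu       sync.RWMutex
	compile  CompileFunc
	programs map[string]Program
}

func NewPolicyCache(compile CompileFunc) *PolicyCache {
	return &PolicyCache{compile: compile, programs: make(map[string]Program)}
}

// Upsert compiles the expression once and caches the result. It is meant to be
// called when a policy is added or updated. If compilation fails, any
// previously cached program for that policy is left in place.
func (c *PolicyCache) Upsert(name, expr string) error {
	prog, err := c.compile(expr)
	if err != nil {
		return fmt.Errorf("compile policy %q: %w", name, err)
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.programs[name] = prog
	return nil
}

// Delete drops the cached program when a policy is removed.
func (c *PolicyCache) Delete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.programs, name)
}

// Match evaluates the cached programs against a resource and returns the
// sorted names of matching policies. No compilation happens on this path.
func (c *PolicyCache) Match(resource map[string]any) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var matched []string
	for name, prog := range c.programs {
		if prog(resource) {
			matched = append(matched, name)
		}
	}
	sort.Strings(matched)
	return matched
}

func main() {
	compiles := 0
	// Toy compiler (assumption): the "expression" is just the kind to match.
	cache := NewPolicyCache(func(expr string) (Program, error) {
		compiles++
		if expr == "" {
			return nil, errors.New("empty expression")
		}
		return func(r map[string]any) bool { return r["kind"] == expr }, nil
	})

	if err := cache.Upsert("pods", "Pod"); err != nil {
		fmt.Println("error:", err)
	}
	for i := 0; i < 3; i++ {
		fmt.Println(cache.Match(map[string]any{"kind": "Pod"}))
	}
	fmt.Println("compilations:", compiles)
}
```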

[informer]: https://pkg.go.dev/k8s.io/client-go/tools/cache#SharedInformer
[CEL]: https://cel.dev

## Document Transformation

When an event matches a policy, the indexer transforms the Kubernetes resource
into a searchable document:

- Extract fields specified in the IndexPolicy field mappings
- Normalize metadata (labels, annotations) into searchable formats
- Use the resource's UID as the document identifier
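
The three steps above can be sketched as follows. Several details are
assumptions made for illustration: field mappings are modelled as a map from
document field name to a dotted path, labels are normalized to sorted
`key=value` strings, and the identifier is stored under an `id` field. The
actual IndexPolicy schema is not shown in this document.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// lookup walks a dotted path such as "metadata.name" through nested maps.
func lookup(obj map[string]any, path string) (any, bool) {
	var cur any = obj
	for _, key := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		cur, ok = m[key]
		if !ok {
			return nil, false
		}
	}
	return cur, true
}

// toDocument builds a flat document from a decoded Kubernetes resource.
// fields maps a document field name to the dotted path it is read from.
func toDocument(resource map[string]any, fields map[string]string) (map[string]any, error) {
	uid, _ := lookup(resource, "metadata.uid")
	id, ok := uid.(string)
	if !ok || id == "" {
		return nil, errors.New("resource has no metadata.uid")
	}

	doc := make(map[string]any)
	for name, path := range fields {
		if v, found := lookup(resource, path); found {
			doc[name] = v
		}
	}

	// Normalize labels into sorted "key=value" strings (assumed format).
	if raw, found := lookup(resource, "metadata.labels"); found {
		if labels, isMap := raw.(map[string]any); isMap {
			pairs := make([]string, 0, len(labels))
			for k, v := range labels {
				pairs = append(pairs, fmt.Sprintf("%s=%v", k, v))
			}
			sort.Strings(pairs)
			doc["labels"] = pairs
		}
	}

	doc["id"] = id // the resource UID is the document identifier
	return doc, nil
}

func main() {
	resource := map[string]any{
		"kind": "Deployment",
		"metadata": map[string]any{
			"uid":    "1b4e28ba-2fa1-11d2-883f-0016d3cca427",
			"name":   "web",
			"labels": map[string]any{"tier": "frontend", "app": "web"},
		},
	}
	doc, err := toDocument(resource, map[string]string{"name": "metadata.name", "kind": "kind"})
	fmt.Println(doc, err)
}
```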

## Persistence and Acknowledgment

Documents are persisted to the index backend (Meilisearch). To guarantee
at-least-once delivery, events are only acknowledged after successful
persistence.

### Batching

For efficiency, batch multiple documents into a single write request. When a
batch is ready to be flushed:

1. Persist all documents to the index backend
2. On success, acknowledge all events in the batch
3. On failure, do not acknowledge any of them; JetStream redelivers the events
   after the ack timeout expires

Events that don't match any policy should be acknowledged immediately to prevent
reprocessing.
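
A sketch of the persist-then-acknowledge rule for one batch. `Event`, `Writer`,
`flush`, and `fakeWriter` are hypothetical names; the in-memory writer stands in
for the index backend so the example runs on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a hypothetical view of a consumed message: the document built
// from it, plus a callback that acknowledges the underlying message.
type Event struct {
	Doc map[string]any
	Ack func()
}

// Writer persists a batch of documents in a single request.
type Writer interface {
	Write(docs []map[string]any) error
}

// flush persists every document in the batch and acknowledges the events
// only if the write succeeded. On failure nothing is acknowledged, so the
// broker redelivers the events once the ack timeout expires.
func flush(w Writer, batch []Event) error {
	if len(batch) == 0 {
		return nil
	}
	docs := make([]map[string]any, 0, len(batch))
	for _, e := range batch {
		docs = append(docs, e.Doc)
	}
	if err := w.Write(docs); err != nil {
		return fmt.Errorf("persist batch of %d documents: %w", len(docs), err)
	}
	for _, e := range batch {
		e.Ack()
	}
	return nil
}

// fakeWriter is an in-memory stand-in for the index backend.
type fakeWriter struct {
	fail   bool
	stored int
}

func (f *fakeWriter) Write(docs []map[string]any) error {
	if f.fail {
		return errors.New("backend unavailable")
	}
	f.stored += len(docs)
	return nil
}

func main() {
	acked := 0
	batch := []Event{
		{Doc: map[string]any{"id": "a"}, Ack: func() { acked++ }},
		{Doc: map[string]any{"id": "b"}, Ack: func() { acked++ }},
	}

	w := &fakeWriter{fail: true}
	fmt.Println("first attempt:", flush(w, batch), "acked:", acked)

	w.fail = false
	fmt.Println("second attempt:", flush(w, batch), "acked:", acked)
}
```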

### Duplicate Handling

At-least-once delivery means duplicates are possible (e.g., after a failure
before acknowledgment). The index backend handles this through upserts keyed by
document ID: reindexing the same resource overwrites the existing document.
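
The effect of ID-keyed upserts can be illustrated with an in-memory map in
place of the backend. `Index` and its `id` field are assumptions made for the
example, not the backend's API.

```go
package main

import "fmt"

// Index is an in-memory stand-in for the index backend, keyed by document ID.
type Index map[string]map[string]any

// Upsert inserts the document, or replaces the existing one with the same ID.
func (ix Index) Upsert(doc map[string]any) error {
	id, ok := doc["id"].(string)
	if !ok || id == "" {
		return fmt.Errorf("document has no string id")
	}
	ix[id] = doc
	return nil
}

func main() {
	ix := Index{}
	doc := map[string]any{"id": "uid-1", "name": "web"}

	// The same event delivered twice leaves a single document behind.
	for i := 0; i < 2; i++ {
		if err := ix.Upsert(doc); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Println("documents stored:", len(ix))
}
```

Because a redelivered event produces the same document ID, processing it again
changes nothing beyond rewriting the same document.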

## Bootstrap Process

On startup or when a new IndexPolicy is created, the indexer must populate the
index with existing resources. The platform spans multiple project control
planes, so bootstrap must list resources from each cluster.

### Multi-Cluster Bootstrap

The indexer uses the [multicluster-runtime] provider pattern to discover
project control planes. For each discovered cluster:

1. List resources matching the policy selector from that cluster's API
2. Transform and index each resource
3. Handle concurrent modifications during bootstrap gracefully

The provider handles dynamic cluster discovery: as clusters come online or go
offline, the indexer bootstraps or cleans up accordingly.

After bootstrap completes, real-time indexing continues via the JetStream event
stream, which already aggregates events from all control planes.

[multicluster-runtime]: https://github.com/kubernetes-sigs/multicluster-runtime

## Error Handling

- **Transient failures**: Retry with exponential backoff for network errors and
  temporary unavailability
- **Malformed events**: Log and skip events that cannot be parsed; acknowledge
  to prevent redelivery loops
- **Backend unavailability**: Buffer events in a bounded in-memory queue while
  attempting reconnection, and pause consumption when the buffer fills; buffered
  events stay unacknowledged until they are persisted
- **Policy evaluation errors**: Log and skip events with CEL evaluation
  failures; do not block processing of other events

## Integration Points

| System | Protocol | Purpose |
|--------|----------|---------|
| NATS JetStream | NATS | Consume audit log events (aggregated from all clusters) |
| Search API Server | HTTPS | Watch IndexPolicy resources |
| Project Control Planes | HTTPS | Bootstrap existing resources |
| Meilisearch | HTTPS/JSON | Persist indexed documents |

## Future Considerations

- **Control plane deletion**: When a project control plane is deleted, indexed
  resources from that cluster must be cleaned up. Ideally, the platform emits
  deletion events for all resources before the control plane is removed,
  allowing event-driven cleanup. If this isn't guaranteed, the indexer may need
  to track source cluster metadata and delete documents when a cluster is
  disengaged.
- **Dead letter handling**: Route persistently failing events to a dead letter
  queue for manual inspection
- **Metrics and observability**: Expose indexing lag, throughput, and error
  rates via Prometheus
- **Multi-tenancy**: Support tenant-isolated indexes with policy-based routing
- **Policy-based sharding**: For very large deployments, assign subsets of
  policies to instances using consistent hashing
