
Commit 75d614a

docs: add work partitioning section
Signed-off-by: irozzo-1A <iacopo@sysdig.com>
1 parent b13d105 commit 75d614a

proposals/20251205-multi-thread-falco-design.md

Lines changed: 104 additions & 5 deletions
@@ -29,14 +29,113 @@ This document does not cover low-level implementation details that will be addre

* The kernel driver (modern eBPF probe) writes events into per-TGID ring buffers. Only the modern eBPF probe is supported, as it relies on [BPF_MAP_TYPE_RINGBUF](https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_RINGBUF/), which does not have a per-CPU design, as opposed to the `BPF_MAP_TYPE_PERF_EVENT_ARRAY` used by the legacy eBPF probe.
* Each buffer is associated with an event loop worker thread that processes events from its assigned ring buffer (a minimal sketch of such a worker loop follows the list).
* The `libsinsp` state (e.g., the thread state) is maintained in a shared data structure, allowing all workers to access data pushed by other workers. This is crucial for handling events like clone() that rely on data written by other partitions. It requires designing lightweight synchronization mechanisms to ensure efficient access to shared state without introducing significant contention. A dedicated proposal document will address the design of the shared state, the synchronization mechanisms, and data consistency.
* Falco's rule evaluation is performed in parallel by multiple worker threads, each evaluating rules against the events they process. Current Falco plugins are not expected to be thread-safe. A dedicated proposal document will address the design of a thread-safe plugin architecture.
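
To make the per-partition event loop concrete, here is a minimal C++ sketch. `ring_buffer`, `shared_state`, and their member functions are illustrative placeholders, not the existing libscap/libsinsp API.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Illustrative placeholders for the driver-facing and shared-state types.
struct event {
    uint64_t tgid;  // thread group the event belongs to
    uint16_t type;  // event type (clone, open, ...)
};

struct ring_buffer {
    // Non-blocking read from this partition's buffer; false when empty.
    bool next(event& out) { (void)out; return false; }
};

struct shared_state {
    void apply(const event& evt) { (void)evt; }           // update thread table, fds, ...
    void evaluate_rules(const event& evt) { (void)evt; }  // rule matching
};

// One instance of this loop runs per partition/worker thread.
void worker_loop(ring_buffer& rb, shared_state& state, std::atomic<bool>& stop) {
    event evt{};
    while (!stop.load(std::memory_order_relaxed)) {
        if (!rb.next(evt)) {
            std::this_thread::yield();  // nothing to read in this partition
            continue;
        }
        state.apply(evt);           // may touch shared state (synchronization needed)
        state.evaluate_rules(evt);  // fully parallel across workers
    }
}
```

Each worker only ever reads from its own ring buffer; sharing is confined to the `libsinsp` state.
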
### Work Partitioning Strategies

A crucial and challenging design aspect is partitioning the work to achieve a good trade-off among the following properties:

1. **Even load balancing** across worker threads
2. **Low contention** on shared data (or no shared data at all)
3. **Avoiding temporal inconsistencies and causality violations** (e.g., processing a file-opening event before the related process-forking event)

The first two properties are primarily focused on performance, while the third is essential for the correctness of the solution. These aspects are intrinsically linked.

Based on the analysis below, **Static Partitioning by TGID** is the proposed approach for the initial implementation.

#### Static Partitioning by TGID (Thread Group ID / Process ID)

Events are routed at the kernel driver level, based on their TGID, to a ring buffer dedicated to a specific partition. Each partition is consumed by a dedicated worker thread. The routing can be accomplished with a simple hash-and-modulo operation on the desired number of worker threads:

```
hash(event->tgid) % num_workers
```
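
As a minimal illustration of this routing (the names are illustrative, not the driver's actual API), a user-space equivalent of the hash-and-modulo selection could look like:

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>

// Pick the partition (and therefore the worker thread) for an event,
// based only on the TGID of the thread that generated it.
std::size_t partition_for(uint64_t tgid, std::size_t num_workers) {
    return std::hash<uint64_t>{}(tgid) % num_workers;
}

int main() {
    // All events of TGID 4242 consistently map to the same worker.
    std::printf("worker=%zu\n", partition_for(4242, 8));
}
```

Because the mapping depends only on the TGID, all events of a given process land in the same ring buffer and are consumed by the same worker.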

**Pros:**

* Reduced need for thread synchronization. Only fork/clone and proc exit events require synchronization, as handling them requires reading/writing thread information belonging to threads that might reside in a different partition.
* Guaranteed in-order processing of events related to the same thread group/process, as they are handled by the same worker thread. This limits the chance of temporal inconsistencies.

**Cons:**

* **Load Imbalance / "Hot" Process Vulnerability**: Static partitioning is susceptible to uneven worker load distribution, as a small number of high-activity ("hot") processes can overload the worker thread assigned to their TGID.
* **Cross-Partition Temporal Inconsistency**: Events that require information from a parent thread (e.g., fork/clone events) can still lead to causality issues. If the parent's related event is handled by a different, lagging partition, the required context might be incomplete or arrive out of order. Load imbalance amplifies this issue. Missing thread information is easy to detect, but there are also cases where the information is present yet stale, or reflects a point in time later than the clone event.

**Mitigations:**

* **Last-Resort Fetching**: Fetch the thread information from a different channel to resolve the drift (e.g., proc scan, eBPF iterator). This is considered a last resort because it risks slowing down the event processing loop, potentially negating the performance benefits of multi-threading.
* **Context Synchronization**: Wait for the required thread information to become available. This can be decomposed into two orthogonal concerns:

**How to handle the wait:**

* **Wait/Sleep (Blocking)**: The worker thread blocks (sleeping or spinning) until the required data becomes available. Simple to implement, but the worker is idle during the wait, reducing throughput.
* **Deferring (Non-blocking)**: The event is copied/buffered for later processing; the worker continues with other events from its ring buffer. More complex (it requires event copying, a pending queue, and a retry mechanism), but keeps the worker productive.

**How to detect data readiness:**

* **Polling**: Periodically check whether the required data is available (spin-check for Wait/Sleep, or periodic retry for Deferring). Simple but wastes CPU cycles.
* **Signaling**: Partitions proactively notify each other when data is ready. More efficient but requires coordination infrastructure (e.g., condition variables, eventfd, or message queues).

These combine into four possible approaches (sketches of two of the cells follow the table):

| | Polling | Signaling |
|---|---------|-----------|
| **Wait/Sleep** | Spin-check until ready | Sleep on condition variable, wake on signal |
| **Deferring** | Periodically retry deferred events | Process deferred events when signaled |
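
As an illustration of the Deferring + Polling cell, here is a minimal C++ sketch of a per-worker pending queue. The `event`, `thread_table`, and `required_tgid` names are illustrative assumptions, not existing libsinsp types:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_set>

// Illustrative stand-ins for libsinsp types; not the real API.
struct event {
    uint64_t tgid;           // thread group that generated the event
    uint64_t required_tgid;  // e.g., the parent TGID a clone child event depends on
};

struct thread_table {
    std::unordered_set<uint64_t> known_tgids;
    bool has(uint64_t tgid) const { return known_tgids.count(tgid) != 0; }
};

// Per-worker pending queue: events whose required context is not ready yet.
class deferral_queue {
public:
    void defer(const event& evt) { pending_.push_back(evt); }

    // Called periodically from the worker loop (polling): retry deferred events
    // whose required thread information has shown up in the shared state.
    template <typename Process>
    void retry_ready(const thread_table& threads, Process&& process) {
        std::size_t n = pending_.size();
        for (std::size_t i = 0; i < n; ++i) {
            event evt = pending_.front();
            pending_.pop_front();
            if (threads.has(evt.required_tgid)) {
                process(evt);             // context is now available
            } else {
                pending_.push_back(evt);  // still not ready, keep it queued
            }
        }
    }

private:
    std::deque<event> pending_;
};
```

The worker would call `retry_ready()` between batches read from its ring buffer, so deferred events are retried without blocking the loop.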

**Synchronization point**: A natural synchronization point is the **clone exit parent event**. At this point, the parent process has completed setting up the child's initial state (inherited file descriptors, environment, etc.), making it safe to start processing events for the newly created thread group.

**Special case (`vfork()` / `CLONE_VFORK`)**: When `vfork()` is used, the parent thread is blocked until the child calls `exec()` or exits, delaying the clone exit parent event. An alternative synchronization point may be needed (e.g., adding back the clone enter parent event).
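
To make the Wait/Sleep + Signaling cell concrete, here is a minimal C++ sketch of a handshake keyed on the clone exit parent event. `clone_barrier` and its methods are illustrative assumptions, not part of the current libsinsp code base:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <unordered_set>

// Shared registry of thread groups whose parent has completed the clone exit
// event, i.e. whose initial state (fds, environment, ...) is fully populated.
class clone_barrier {
public:
    // Called by the worker that processes the clone exit parent event.
    void mark_child_ready(uint64_t child_tgid) {
        {
            std::lock_guard<std::mutex> lk(m_);
            ready_.insert(child_tgid);
        }
        cv_.notify_all();  // wake any worker waiting on this child
    }

    // Called by the worker that owns the child's partition before it starts
    // processing the child's events (Wait/Sleep + Signaling cell).
    void wait_for_child_ready(uint64_t child_tgid) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return ready_.count(child_tgid) != 0; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::unordered_set<uint64_t> ready_;
};
```

A real implementation would also bound the wait and fall back to last-resort fetching (and account for the `vfork()` case above); the sketch only shows the basic cross-partition handshake.
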
### Other Considered Approaches

#### Static Partitioning by TID (Thread ID)

Similar to the previous approach, but events are routed by TID instead of TGID.

**Pros:**

* Guaranteed in-order processing of events related to the same thread, as they are handled by the same worker thread. This limits the chance of temporal inconsistencies.
* Good load balancing across partitions, since work is spread at thread rather than process granularity.

**Cons:**

* **Cross-Partition Temporal Inconsistency**: Temporal inconsistencies can occur when accessing/writing information belonging to other processes or to the thread group leader (e.g., the environment and file descriptor information are stored in the thread group leader).

#### Functional Partitioning (Pipelining)

Instead of partitioning the data, this approach partitions the work by splitting processing into two phases (a minimal sketch follows the list):

1. **Extraction**: Runs in a single thread; the state is updated only in this phase.
2. **Processing**: Runs on a thread chosen from a worker thread pool; the state is accessed but not modified. Rule matching takes place in this phase.
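
Here is a minimal C++ sketch of the two-phase split using a simple hand-off queue. `event`, the queue, and the worker count are illustrative assumptions; a real implementation would sit on top of libscap/libsinsp rather than a toy queue:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct event { uint64_t tgid; /* payload omitted */ };

// Minimal thread-safe hand-off queue between the extraction thread and the pool.
class work_queue {
public:
    void push(event evt) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(evt); }
        cv_.notify_one();
    }
    // Blocks until an event is available; returns false once closed and drained.
    bool pop(event& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<event> q_;
    bool closed_ = false;
};

int main() {
    work_queue q;

    // Processing phase: parallel, read-only state access + rule matching.
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) {
        pool.emplace_back([&q] {
            event evt;
            while (q.pop(evt)) {
                // match_rules(evt): rule matching would run here.
            }
        });
    }

    // Extraction phase: single thread, the only writer of the shared state.
    std::thread extractor([&q] {
        for (uint64_t tgid = 1; tgid <= 1000; ++tgid) {
            // extract_state(evt): state updates would happen here only.
            q.push(event{tgid});
        }
        q.close();
    });

    extractor.join();
    for (auto& t : pool) t.join();
}
```

The sketch deliberately ignores the MVCC and backpressure concerns discussed in the cons below.
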
**Pros:**

* The state handling remains single-threaded, avoiding any synchronization issues on the write side.
* The load balancing of the Processing phase is good, as it does not require any form of stickiness: every worker can take any event, and a simple round-robin policy can be applied.

**Cons:**

* The "Extraction" stage is likely to become the bottleneck; a single thread here limits total throughput regardless of how many cores are available.
* Since the extraction and processing phases run concurrently, some form of MVCC (multi-version concurrency control) is needed to maintain multiple versions of the state, matching the in-flight events, in order to ensure data consistency.
* Processing multiple events in parallel involves changes at the driver and libscap level. Today we process one event at a time directly from driver memory, without copying; to process multiple events in parallel, the ring buffer handling must be adapted so that `next()` does not consume the event. Some flow control (e.g., backpressure) is also needed to avoid processing too many events in parallel; this last problem only arises if the processing phase is slower than the extraction phase.
#### Comparison Summary

| Approach | Load Balancing | Contention | Temporal Consistency |
|----------|----------------|------------|----------------------|
| TGID | Moderate (hot process risk) | Low | Good (within process) |
| TID | Good | Higher | Partial (thread-level only) |
| Pipelining | Good (processing phase) | Low (writes) | Requires MVCC |

#### Rationale for TGID Partitioning

TGID partitioning was chosen because it offers the best balance between synchronization complexity and correctness guarantees. TID partitioning increases cross-partition access to thread group leader data (e.g., the file descriptor table, working directory, and environment variables), raising the coordination cost. Functional partitioning, while elegant in its separation of concerns, introduces a single-threaded bottleneck in the extraction phase that limits scalability regardless of the available cores, and it requires complex MVCC mechanisms for data consistency as well as additional machinery for handling multiple events in parallel.

### Risks and Mitigations

- **Increased Complexity**: Multi-threading introduces complexity in terms of synchronization and state management. Mitigation: Careful design of shared state and synchronization mechanisms, along with thorough testing.
- **Synchronization Overhead vs Performance Gains**: The overhead of synchronization might negate the performance benefits of multi-threading. Mitigation: Use lightweight synchronization techniques and minimize shared state access.
- **Synchronization Overhead vs Data Consistency**: To keep the synchronization overhead on the shared state low, we might need to relax some data consistency guarantees. Mitigation: Analyze the trade-offs and ensure that any relaxed guarantees do not compromise security.
- **Uneven load balancing**: On large systems with a few syscall-intensive processes, the load might not be evenly distributed across worker threads. Mitigation: Evaluate different load balancing strategies, such as per-TID partitioning. This would increase contention on the shared state, so a careful analysis of the trade-offs is needed.
