Problem: a customer confirmed that CD convergence is still slow for large node counts -- see #816.

Solutions: in November 2025, after our last round of improvements, we noted two solution strategies to look into once necessary:
> Any kind of sharding on a per-clique level may be super useful. And/or server-side applies.
We went ahead with server-side apply (SSA) in #822. Subsequently, we measured that SSA-based convergence actually performs and scales worse than the pre-SSA convergence method. We then started to explore per-clique sharding in #826; initial measurements suggest that it yields the desired scaling behavior. A selection of measurement results is shown in the plot below.
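For context, here is a minimal sketch (in Go, against the dynamic Kubernetes client) of the write pattern the SSA-based method relies on: each CD daemon applies only its own entry of the CD's nodes list under its own field manager, and conflict-free merging additionally requires that list to be declared as a map-keyed list (`x-kubernetes-list-type: map`) in the CRD schema. The GVR, namespace, object name, field names, and the use of the status subresource are illustrative assumptions, not the driver's actual code.

```go
// Sketch: a single CD daemon server-side-applies only the nodes-list entry
// it owns. Assumes the CRD declares the list as x-kubernetes-list-type: map
// (keyed by "name"), so entries written by different field managers merge
// without conflicts. GVR, namespace, and field names are assumptions.
package main

import (
	"context"
	"encoding/json"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical GVR for the ComputeDomain CRD.
	gvr := schema.GroupVersionResource{
		Group:    "resource.nvidia.com",
		Version:  "v1beta1",
		Resource: "computedomains",
	}

	// Apply configuration containing only the fields this daemon owns:
	// one element of the nodes list, keyed by this daemon's node name.
	patch := map[string]interface{}{
		"apiVersion": "resource.nvidia.com/v1beta1",
		"kind":       "ComputeDomain",
		"metadata":   map[string]interface{}{"name": "cd-example"},
		"status": map[string]interface{}{
			"nodes": []interface{}{
				map[string]interface{}{
					"name":      "node-42",
					"cliqueID":  "clique-7",
					"ipAddress": "10.0.0.42",
				},
			},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		log.Fatal(err)
	}

	// One field manager per daemon; Force claims ownership of this entry.
	force := true
	_, err = client.Resource(gvr).Namespace("nvidia-dra-driver").Patch(
		context.TODO(), "cd-example", types.ApplyPatchType, data,
		metav1.PatchOptions{FieldManager: "cd-daemon-node-42", Force: &force},
		"status",
	)
	if err != nil {
		log.Fatal(err)
	}
}
```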
Plot: convergence time over node count for the different convergence methods (log-log axes).
Main conclusions:
- The gray data points correspond to the pre-SSA method -- they fall roughly on a straight line in the log-log plot, i.e. convergence time grows as a power law (polynomially) in the node count.
- The magenta samples confirm that per-clique sharding (as proposed in #826, "Maintain CD daemon info in a new per-CD, per-clique ComputeDomainClique object") can indeed lead to quasi-constant-time scaling behavior (see the hypothetical type sketch after this list).
- The dashed lines represent SSA / SSA-with-fixes; they show that SSA performs worse than the pre-SSA method for anything beyond a handful of nodes (we had tested the SSA patch #822, "CD daemon: use SSA for conflict-free nodes list updates", with just four nodes before merging).
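For illustration, here is a hypothetical sketch of a per-CD, per-clique object along the lines of what #826 proposes (all type and field names are assumptions, not the actual API in that PR). The point of the sharding is that daemons in different cliques write to disjoint objects, so the write rate per object is bounded by clique size rather than by total node count:

```go
// Hypothetical Go types for a per-CD, per-clique object, to illustrate the
// sharding idea behind #826. Field names are assumptions.
package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ComputeDomainClique holds the CD daemon info for all nodes of a single
// clique within a single ComputeDomain. One such object exists per
// (ComputeDomain, clique) pair.
type ComputeDomainClique struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Nodes lists the CD daemons that have announced themselves for this
	// clique. Each daemon only ever updates its own entry, so contention
	// is limited to the (small) set of daemons sharing a clique.
	Nodes []ComputeDomainNode `json:"nodes,omitempty"`
}

// ComputeDomainNode describes a single CD daemon.
type ComputeDomainNode struct {
	Name      string `json:"name"`
	IPAddress string `json:"ipAddress"`
	CliqueID  string `json:"cliqueID"`
}
```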
Caveats:
- The measurements were taken in a simulation environment (#827, "WIP: simulate large node counts: run multiple dummy CD daemons per node"). A real environment may have relevant bottlenecks that the simulation does not cover -- unlikely, but it has to be pointed out.
- We still need to measure larger N -- towards O(10^4). The per-clique sharding technique effectively relies on being able to issue thousands of independent write requests per second to the API server. The API server appears to keep up, but depending on how exactly it is deployed and on other workload in the cluster, there is a natural point of contention.
- Each data point above corresponds to a single measurement. There is of course variance across repetitions, which we did not thoroughly quantify. The main conclusions about the scaling behavior of the different methods are nevertheless likely to be robust. In the future, we'll measure variance through repetitions ("Eine Messung ist keine Messung" -- one measurement is no measurement); a minimal sketch of such a repetition harness follows this list.
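On the repetition caveat: a minimal sketch of how variance could be quantified in the simulation harness, where measureConvergence is a placeholder for whatever the harness actually times (not an existing function):

```go
// Sketch: repeat the same convergence measurement R times and report mean
// and standard deviation, instead of relying on a single sample per node
// count. measureConvergence is a placeholder for the simulation harness.
package main

import (
	"fmt"
	"math"
	"time"
)

// measureConvergence would spin up `nodes` dummy CD daemons and time how
// long it takes until the CD reports all of them as ready. Placeholder only.
func measureConvergence(nodes int) time.Duration {
	return time.Duration(nodes) * time.Millisecond
}

func main() {
	const repetitions = 10
	nodes := 1024

	samples := make([]float64, repetitions)
	sum := 0.0
	for i := range samples {
		samples[i] = measureConvergence(nodes).Seconds()
		sum += samples[i]
	}
	mean := sum / repetitions

	// Sample standard deviation across the repetitions.
	sq := 0.0
	for _, s := range samples {
		sq += (s - mean) * (s - mean)
	}
	stddev := math.Sqrt(sq / (repetitions - 1))

	fmt.Printf("N=%d: mean=%.3fs stddev=%.3fs over %d runs\n",
		nodes, mean, stddev, repetitions)
}
```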
Appendix: the same plot, using linear scales instead of the log-log representation:

Plot: convergence time over node count for the different convergence methods (linear axes).