be specialized per statement by setting :attr:`.Statement.retry_policy`.
Retries are presently attempted on the same coordinator, but this may change in the future.

Please see :class:`.policies.RetryPolicy` for further details.

How to fight connection storms?
-------------------------------

Scylla employs a shard-per-core architecture to efficiently utilize CPU, memory, and cache.
Data and load are effectively sharded not only across hosts but also across individual shards.

Every ScyllaDB driver routes queries to the specific shard responsible for the requested data.
This eliminates the need for routing logic on the server side.
To support this, the driver maintains a separate connection to every shard in the cluster.

The downside of this approach is the total number of connections that each host in the cluster must handle.
Typically, the number of connections on a given node can be calculated as:
number of clients × number of shards.

For example, a node with 64 shards and 1,000 clients would handle 64,000 connections.
In production clusters, the number of clients can reach hundreds of thousands, and nodes may need to handle up to 2 million connections.
Once established, these connections consume very few resources.

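The arithmetic above can be captured in a couple of lines (the numbers are illustrative, and the helper name is ours, not part of the driver):

.. code-block:: python

    # Every client keeps one connection per shard on each node,
    # so a node's connection count grows multiplicatively.
    def connections_per_node(clients: int, shards_per_node: int) -> int:
        return clients * shards_per_node

    print(connections_per_node(1_000, 64))  # 64000, matching the example above
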
However, when nodes or clients are restarted, all these connections must be re-established.
The process of establishing a new connection involves several steps:

1. Establishing a TCP/TLS connection
2. Protocol negotiation
3. Authentication
4. Running discovery queries: ``SELECT * FROM system.local`` and ``SELECT * FROM system.peers``
If any of these steps fails, the connection is dropped, and a new attempt is made to connect to the same shard, restarting the process.

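The restart-on-any-failure behaviour can be sketched as a loop. This is a simplified illustration with hypothetical helper names, not the driver's actual internals:

.. code-block:: python

    import time

    def connect_to_shard(steps, delay=0.0, max_attempts=5):
        """Run every setup step in order; if any step fails, drop the
        connection and restart the whole sequence from the beginning."""
        for _ in range(max_attempts):
            try:
                for step in steps:  # TCP/TLS, negotiation, auth, discovery
                    step()
                return True         # all steps succeeded
            except ConnectionError:
                time.sleep(delay)   # back off, then retry from scratch
        return False

Note that a failure in the last step (discovery) still repeats the earlier steps, which is part of what makes reconnection storms expensive.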
When a large number of clients attempt to open connections to the same node simultaneously, it can lead to:

1. CPU consumption spikes
2. Latency spikes
3. Service unavailability
4. Clients failing to create and initialize connections

To avoid these problems, it is important to limit the rate at which clients establish connections to nodes.

For this purpose, we introduced the ``ShardConnectionBackoffPolicy``, specifically the ``LimitedConcurrencyShardConnectionBackoffPolicy``.
This policy does two things:

1. Introduces a backoff delay between each connection the driver creates to a host
2. Limits the number of pending connection attempts

For example, with a configuration allowing only ``1`` pending connection and a backoff of ``0.1`` seconds:

.. code-block:: python

    cluster = Cluster(
        shard_connection_backoff_policy=LimitedConcurrencyShardConnectionBackoffPolicy(
            backoff_policy=ConstantShardConnectionBackoffSchedule(0.1),
            max_concurrent=1,
        )
    )
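
A rough way to reason about this configuration (illustrative numbers, and assuming each connection attempt completes faster than the backoff interval):

.. code-block:: python

    backoff = 0.1       # seconds between attempts per concurrency slot
    max_concurrent = 1  # pending connection attempts at a time
    shards = 64         # shards on the target node

    rate = max_concurrent / backoff  # at most ~10 new connections per second
    full_pool_time = shards / rate   # ~6.4 seconds to fill the pool to one node

Raising ``max_concurrent`` or shortening the backoff speeds up pool creation at the cost of a sharper load spike on the node.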
137+