Commit 138c959

add draft ddos post
1 parent 0e2d431 commit 138c959

1 file changed

+201
-0
lines changed

@@ -0,0 +1,201 @@
1+
---
2+
title: "The Subtle Art of Taming Flows and Coroutines in Kotlin, or 'How Not to DDoS Yourself with Server-Sent Events'"
3+
date: 2025-09-06
4+
description: "A tale of how elegant SSE code passed code review, worked perfectly locally and in staging, but nearly brought down our production servers when thousands of users connected simultaneously during a real DDoS attack."
5+
tags:
6+
[
7+
"kotlin",
8+
"ktor",
9+
"server-sent-events",
10+
"coroutines",
11+
"flow",
12+
"performance",
13+
"production",
14+
]
15+
draft: true
16+
---
17+
18+
I originally wanted to write a post about Server-Sent Events in general, and how delightfully cool they are. SSE provides a clean, standardized way for servers to push real-time updates to web clients over a simple HTTP connection. The [MDN documentation](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events) showcases how straightforward the client-side implementation is, while [Ktor's SSE support](https://ktor.io/docs/server-server-sent-events.html) makes the server-side equally elegant. SSE strikes a perfect balance: simpler than WebSockets when you only need one-way communication, yet more efficient than polling.
19+
20+
But this isn't that post.
21+
22+
Instead, this is a story about how seemingly innocent Flow and coroutine code can bite you in production in the most unexpected ways. It's about the subtle difference between "working" and "working under load." And it's about how a tiny change in flow control can mean the difference between a robust server and an accidental self-DDoS.
23+
24+
## The Setup: A Perfect Storm
25+
26+
Picture this: It's a Friday around lunchtime. Our team has just deployed a beautiful new SSE endpoint for real-time notifications. The code passed code review with flying colors, worked flawlessly in local development, and sailed through our staging environment. We were proud of our clean, idiomatic Kotlin, a textbook example of modern coroutine and Flow usage.
27+
28+
Then we deployed to production.
29+
30+
At the exact same time, a known hacker group decided to launch a DDoS attack against our infrastructure. Thousands of legitimate users were online, each with active SSE connections for real-time updates. The combination of external attack traffic and internal connection management created the perfect storm.
31+
32+
Our servers didn't just struggle—they started consuming resources at an alarming rate. Memory usage spiked, CPU utilization maxed out, and we were essentially DDoS'ing ourselves from the inside while fighting off the external attack.
33+
34+
## The Puzzle: Two Approaches, One Problem
35+
36+
Here's the code that went to production. Can you spot which approach will leak resources under load?
37+
38+
### Approach A: Collect & Return
39+
40+
```kotlin
routing {
    sse("/events") {
        val sessionId = call.sessionId()
        // Each event is paired with the client's current authentication status
        val eventFlow: Flow<Pair<String, Boolean>> =
            merge(someGlobalEventFlow, someClientSpecificEventFlow(sessionId))
                .map { event -> Pair(event, checkIfClientIsAuthenticated(sessionId)) }

        // Approach A: collect && return
        eventFlow.collect { (event, clientIsAuthenticated) ->
            // if the client is not authenticated, close the connection and stop collecting events
            if (!clientIsAuthenticated) {
                return@collect
            }

            // try to send the event to the client, returning true if the client is still connected
            val clientIsConnected = trySendEvent(event)
            if (!clientIsConnected) {
                return@collect
            }
        }

        close()
    }
}
```
65+
66+
### Approach B: OnEach & Cancel
67+
68+
```kotlin
routing {
    sse("/events") {
        val sessionId = call.sessionId()
        // Each event is paired with the client's current authentication status
        val eventFlow: Flow<Pair<String, Boolean>> =
            merge(someGlobalEventFlow, someClientSpecificEventFlow(sessionId))
                .map { event -> Pair(event, checkIfClientIsAuthenticated(sessionId)) }

        // Approach B: onEach && cancel
        try {
            eventFlow.onEach { (event, clientIsAuthenticated) ->
                // if the client is not authenticated, cancel the flow to stop collecting events
                if (!clientIsAuthenticated) {
                    cancel()
                    // cancel() returns normally, so skip the send below for this emission
                    return@onEach
                }

                // try to send the event to the client, returning true if the client is still connected
                val clientIsConnected = trySendEvent(event)
                if (!clientIsConnected) {
                    cancel()
                }
            }.collect { }
        } catch (e: CancellationException) {
            // do nothing, we've cancelled the flow intentionally
        } finally {
            close()
        }
    }
}
```
97+
98+
Both approaches look reasonable at first glance. Both handle authentication checking and client disconnection. Both compile cleanly and work perfectly with a handful of concurrent connections.
99+
100+
But only one of them will behave correctly under production load.
101+
102+
## The Difference: A Tale of Two Control Flows
103+
104+
The critical difference lies in how each approach handles early termination of the Flow collection.
105+
106+
### Approach A: The Resource Leak
107+
108+
```kotlin
if (!clientIsAuthenticated) {
    return@collect // This only skips the current emission!
}
```
113+
114+
Here's the subtle trap: `return@collect` doesn't stop the collection—it only skips processing the current emission. The `collect` block continues waiting for the next emission from the Flow. This means:
115+
116+
1. The coroutine keeps running
117+
2. The SSE connection remains open
118+
3. The Flow continues producing events
119+
4. `close()` is never reached while the upstream Flow is still active
120+
5. Resources accumulate with each "disconnected" client
121+
122+
So while `return@collect` _appears_ to be the coroutine equivalent of a `break` within a regular loop, it actually behaves like a `continue`. Precisely what we _don't_ want!
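To see the `continue`-like behavior in isolation, here is a minimal sketch that uses a small finite `flowOf` stream instead of the real event flows. The helper name `collectSkippingTwo` is made up for illustration, and the snippet assumes `kotlinx-coroutines-core` is on the classpath.

```kotlin
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

// Hypothetical helper: collects a small flow and tries to "stop" at the value 2.
suspend fun collectSkippingTwo(): List<Int> {
    val seen = mutableListOf<Int>()
    flowOf(1, 2, 3, 4).collect { value ->
        if (value == 2) {
            return@collect // returns from the lambda for this emission only
        }
        seen += value
    }
    return seen
}

fun main() = runBlocking {
    println(collectSkippingTwo()) // prints [1, 3, 4]
}
```

The values 3 and 4 are still delivered after the "early return", which is exactly what happened to our unauthenticated and disconnected clients.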
123+
124+
Under normal conditions with a few dozen connections, this might go unnoticed (and it sure did!). But when thousands of connections are established during a DDoS attack and then clients become unauthenticated or disconnect, those zombie collectors pile up quickly. Very quickly!
125+
126+
### Approach B: Clean Termination
127+
128+
```kotlin
if (!clientIsAuthenticated) {
    cancel() // This requests cancellation of the collecting coroutine; think `break` within a loop
}
```
133+
134+
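The `cancel()` call cancels the coroutine that runs the SSE session. It does not throw by itself; the `CancellationException` is raised at the next cancellation check (for example the next emission or suspension point), which: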
135+
136+
1. Terminates the collecting coroutine
137+
2. Exits the `collect` block
138+
3. Triggers the `finally` block
139+
4. Calls `close()` to clean up the SSE connection
140+
5. Properly releases all associated resources
141+
142+
The `try-catch-finally` structure ensures that when we intentionally cancel the operation, cleanup happens correctly.
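One detail is easy to get wrong here: `cancel()` returns normally, and the exception only appears at the next cancellation check. The sketch below illustrates this with a plain `launch` and a finite flow standing in for the SSE session. The name `cancelDemo` and the log messages are made up for illustration, and the recorded order reflects how current `kotlinx.coroutines` versions check for cancellation on `emit`.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// Hypothetical demo: records what runs before and after cancel() is called.
fun cancelDemo(): List<String> = runBlocking {
    val log = mutableListOf<String>()
    val job = launch {
        try {
            flowOf(1, 2, 3).onEach { value ->
                if (value == 2) {
                    cancel() // cancels the scope created by launch; does not throw
                    log += "still running after cancel()"
                }
                log += "processed $value"
            }.collect()
        } catch (e: CancellationException) {
            log += "cancelled"
            throw e // rethrow so the coroutine machinery sees the cancellation
        } finally {
            log += "cleanup"
        }
    }
    job.join()
    log
}

fun main() {
    // [processed 1, still running after cancel(), processed 2, cancelled, cleanup]
    println(cancelDemo())
}
```

The value 3 is never processed, and the `finally` block runs. The rest of the lambda for value 2 still executes, though, so any code that must not run after `cancel()` needs an explicit early return.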
143+
144+
## The Production Reality
145+
146+
During our incident, Approach A created a cascading resource leak. Every time a client disconnected or became unauthenticated (which happened frequently during the DDoS), we accumulated:
147+
148+
- An active coroutine waiting for the next Flow emission
149+
- An open SSE connection consuming server resources
150+
- Memory allocated for the Flow processing pipeline
151+
- Background tasks polling for authentication status
152+
153+
With thousands of connections being established and "abandoned" in this way, our servers quickly became overwhelmed—not just by the external attack, but by our own leaked resources.
154+
155+
## The Fix and Lessons Learned
156+
157+
The fix was embarrassingly simple: replace `return@collect` with `cancel()` and add proper exception handling. But the lessons were profound:
158+
159+
### 1. Load Testing Reveals Truth
160+
161+
Code that works with 10 concurrent connections might fail catastrophically with 10,000. Our staging environment, optimized for cost over scale, simply couldn't reproduce the production load patterns.
162+
163+
### 2. Resource Management Is Critical
164+
165+
In languages with garbage collection, it's easy to forget about resource leaks. But when dealing with network connections, coroutines, and flows, explicit cleanup becomes crucial.
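A related detail: once a coroutine is cancelled, suspending calls inside its `finally` block are cancelled too unless they are wrapped in `withContext(NonCancellable)`. The following is a minimal sketch of that pattern, not the code from our service. `cleanupDemo` is a hypothetical name, `awaitCancellation()` stands in for a long-lived `collect`, and the `delay` stands in for a suspending `close()`.

```kotlin
import kotlinx.coroutines.*

// Hypothetical demo: suspending cleanup that still completes after cancellation.
fun cleanupDemo(): List<String> = runBlocking {
    val log = mutableListOf<String>()
    val job = launch {
        try {
            awaitCancellation() // stands in for a long-lived collect
        } finally {
            // Without NonCancellable, the delay below would throw immediately
            // because this coroutine is already cancelled.
            withContext(NonCancellable) {
                delay(10) // stands in for a suspending close()
                log += "closed"
            }
        }
    }
    yield() // let the launched coroutine start and suspend first
    job.cancelAndJoin()
    log
}

fun main() {
    println(cleanupDemo()) // prints [closed]
}
```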
166+
167+
### 3. Control Flow Matters
168+
169+
The difference between "skip this iteration" and "stop collecting" is subtle in code (and in this case _very_ easy to miss!) but massive in production impact. Understanding the exact semantics of coroutine cancellation is essential for robust server applications.
170+
171+
### 4. Timing Is Everything
172+
173+
Our code worked perfectly—until it didn't. The combination of high load and external pressure revealed edge cases that never appeared under normal conditions.
174+
175+
## Best Practices for SSE and Flow Management
176+
177+
1. **Always use explicit cancellation** when you need to terminate Flow collection early
178+
2. **Implement proper cleanup** in `finally` blocks or with `use` for `Closeable` resources
179+
3. **Test under realistic load** with tools that can simulate thousands of concurrent connections
180+
4. **Monitor resource usage** in production to catch accumulation patterns early
181+
5. **Understand coroutine lifecycle** and how cancellation propagates through your system
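As an alternative to calling `cancel()` by hand, early termination can also be expressed declaratively with the `takeWhile` operator, which ends the collection as soon as its predicate returns `false`. This is a sketch of that idea rather than the fix we shipped. `streamUntilDone`, `trySend`, and `onClose` are hypothetical names standing in for the SSE handler, `trySendEvent`, and `close()`.

```kotlin
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

// Hypothetical helper: stops collecting as soon as the client is unauthenticated
// or a send fails, then always runs the cleanup callback.
suspend fun streamUntilDone(
    events: Flow<Pair<String, Boolean>>,
    trySend: suspend (String) -> Boolean,
    onClose: () -> Unit,
) {
    try {
        events
            .takeWhile { (event, authenticated) -> authenticated && trySend(event) }
            .collect()
    } finally {
        onClose()
    }
}

// Runs the helper against a small finite flow and reports what was sent.
fun takeWhileDemo(): Pair<List<String>, Boolean> = runBlocking {
    val sent = mutableListOf<String>()
    var closed = false
    streamUntilDone(
        events = flowOf("a" to true, "b" to true, "c" to false, "d" to true),
        trySend = { event -> sent.add(event) }, // pretend every send succeeds
        onClose = { closed = true },
    )
    Pair(sent.toList(), closed)
}

fun main() {
    println(takeWhileDemo()) // prints ([a, b], true)
}
```

Because the flow completes normally here, no `CancellationException` needs to be caught, and the event for the unauthenticated client is never sent.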
182+
183+
## A Happy Ending
184+
185+
After deploying the fix, our servers stabilized even under the continued DDoS attack. The external attackers were eventually blocked, but more importantly, we confirmed that the corrected code stays stable under extreme load.
186+
187+
The corrected approach handles thousands of SSE connections gracefully, properly cleaning up resources when clients disconnect, and maintaining predictable memory usage even under attack conditions.
188+
189+
## Conclusion
190+
191+
Server-Sent Events are indeed a powerful and elegant technology for real-time web applications. Kotlin's coroutines and Flow provide beautiful abstractions for handling asynchronous streams. But the devil, as always, is in the details.
192+
193+
The difference between `return@collect` and `cancel()` might seem trivial, but in production systems serving thousands of users, these subtleties become the difference between stability and catastrophic failure.
194+
195+
Sometimes the most dangerous bugs are the ones that hide in plain sight, looking perfectly reasonable until the moment they're not.
196+
197+
Remember: when dealing with flows and coroutines, always clean up your resources. Your production servers will thank you.
198+
199+
---
200+
201+
_Special thanks to the DDoS attackers for providing the load testing we apparently needed. Your service is not requested, but occasionally educational._
