Asynchronous Programming 101
============================


The Story
---------

Let's say we want to build a search engine. We'll use a single-core computer to
build our index. To keep things simple, our tasks are to fetch web pages
(an I/O operation) and to process their content (a CPU operation). Each task
looks like this:

.. image:: ../images/why_single_task.png
    :align: center

We have lots of web pages to index, so we simply handle them one by one:

.. image:: ../images/why_throughput.png
    :align: center
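
A minimal sketch of this sequential, blocking approach might look like the
following. This is only an illustration, not code from this documentation; the
URLs and helper names are made up:

.. code-block:: python

    import urllib.request

    def fetch_page(url):
        # Blocking I/O: the thread waits here until the whole response arrives.
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def process_page(html):
        # CPU work: pretend this extracts the words we want to index.
        return html.split()

    def build_index(urls):
        index = {}
        for url in urls:                  # handle pages strictly one by one
            html = fetch_page(url)
            index[url] = process_page(html)
        return index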

Let's assume the time of each task is constant: each second, 2 tasks are done.
Thus we can say that the throughput of the current system is 2 tasks/sec. How
can we improve the throughput? An obvious answer is to add more CPU cores:

.. image:: ../images/why_multicore.png
    :align: center

This simply doubles our throughput to 4 tasks/sec, and it scales linearly as we
add more CPU cores, as long as the network is not a bottleneck. But can we
improve the throughput of each individual CPU core? The answer is yes: we can
use multi-threading:

.. image:: ../images/why_multithreading.png
    :align: center

Wait a second! The 2 threads barely finished 6 tasks in a little over 2
seconds, a throughput of only about 2.7 tasks/sec, much lower than the 4
tasks/sec we got with 2 cores. What's wrong with multi-threading? From the
diagram we can see:

* There are yellow bars taking up extra time.
* The green bars can still overlap with any bar in the other thread, but
  non-green bars cannot overlap with non-green bars in the other thread.

The yellow bars are time taken by `context switches
<https://en.wikipedia.org/wiki/Context_switch>`_, a necessary part of allowing
multiple threads or processes to run concurrently on a single CPU core.
One CPU core can do only one thing at a time (let's assume a world without
`Hyper-threading <https://en.wikipedia.org/wiki/Hyper-threading>`_ or similar),
so in order to run several threads concurrently the CPU must `split its
time <https://en.wikipedia.org/wiki/Time-sharing>`_ into small
slices and run a little bit of each thread within those slices. The yellow bars
represent the overhead the CPU pays to switch context and run a different
thread. The scale is a bit dramatic, but it helps make the point.

Wait again: the green bars overlap between the two threads. Is the CPU doing
two things at the same time? No. The CPU is doing nothing in the middle of a
green bar, because it's waiting for the HTTP response (I/O). That's how
multi-threading could improve the throughput to 2.7 tasks/sec instead of
decreasing it to 1.7 tasks/sec. If you actually run CPU-intensive tasks with
multiple threads on a single core, you will see no improvement at all. Like
the multiplexed red bars (in practice there might be more context switches,
depending on the task), the tasks appear to run at the same time, but the
total time for all of them to finish is actually longer than running them one
by one. That's also why this is called concurrency instead of parallelism.
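
You can see this for yourself with a rough, self-contained sketch (not taken
from the diagrams above) that times two threads doing I/O-bound work versus
CPU-bound work; the simulated fetch simply sleeps, standing in for waiting on
an HTTP response:

.. code-block:: python

    import time
    from concurrent.futures import ThreadPoolExecutor

    def fake_fetch(_):
        time.sleep(0.5)                        # I/O-bound: the thread just waits

    def fake_process(_):
        sum(i * i for i in range(10**6))       # CPU-bound: the thread needs the core

    def timed(func, workers):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(func, range(4)))
        return time.perf_counter() - start

    # Two threads overlap the waiting, so this is roughly twice as fast.
    print("I/O-bound :", timed(fake_fetch, 1), "vs", timed(fake_fetch, 2))
    # CPU work cannot overlap on one core (and, in CPython, the GIL prevents it
    # even on several cores), so two threads take about the same time or longer.
    print("CPU-bound :", timed(fake_process, 1), "vs", timed(fake_process, 2))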

As you might imagine, throughput will improve less with each additional thread,
until throughput begins to decrease because context switches are wasting too
much time, not to mention the extra memory footprint taken by new threads. It
is usually not practical to have tens of thousands of threads running on a single
CPU core. How, then, is it possible to have tens of thousands of I/O-bound tasks
running concurrently on a single CPU core? This is the once-famous `C10k
problem <https://en.wikipedia.org/wiki/C10k_problem>`_, usually solved by
asynchronous I/O:

.. image:: ../images/why_coroutine.png
    :align: center

.. note::

    Asynchronous I/O and coroutines are two different things, but they usually
    go together. Here we will stick with coroutines for simplicity.

Awesome! The throughput is 3.7 tasks/sec, nearly as good as the 4 tasks/sec of
2 CPU cores. Although this is not real data, coroutines do take much less time
to context switch than OS threads and have a much smaller memory footprint,
making them an ideal option for the C10k problem.
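
For a taste of what this looks like in Python, here is a hedged sketch of the
same crawl written with ``asyncio``. The ``fetch_page()`` coroutine below only
simulates the network wait with ``asyncio.sleep()``; it is not a real HTTP
client, and the URLs are made up:

.. code-block:: python

    import asyncio

    async def fetch_page(url):
        # While this coroutine waits, the event loop runs other coroutines.
        await asyncio.sleep(0.5)              # stand-in for a real HTTP request
        return f"<html>content of {url}</html>"

    def process_page(html):
        return html.split()                   # CPU work still blocks the thread

    async def handle(url, index):
        html = await fetch_page(url)
        index[url] = process_page(html)

    async def main(urls):
        index = {}
        # All fetches wait concurrently within a single thread.
        await asyncio.gather(*(handle(url, index) for url in urls))
        return index

    urls = [f"https://example.com/page/{i}" for i in range(10000)]
    asyncio.run(main(urls))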


Cooperative multitasking
------------------------

So what is a coroutine?

In the last diagram above, you may have noticed a difference compared to the
previous diagrams: the green bars are overlapping within the same thread.
That is because in the last diagram our code is using asynchronous I/O,
whereas previously we were using blocking I/O. As the name suggests, blocking
I/O will block the thread until the I/O result is ready, so there can be only
one blocking I/O operation running in a thread at a time. To achieve concurrency
with blocking I/O, either multi-threading or multi-processing must be used.
In contrast, asynchronous I/O allows thousands (or even more) of concurrent
I/O reads and writes within the same thread, with each I/O operation blocking
only the coroutine performing the I/O rather than the whole thread. Like
multi-threading, coroutines provide a means to have concurrency during I/O,
but unlike multi-threading this concurrency occurs within a single thread.

Threads are scheduled by the operating system using an approach called `preemptive
multitasking <https://en.wikipedia.org/wiki/Preemption_(computing)>`_. For
example, in the previous multi-threading diagram there was only one CPU core. When
Thread 2 tried to start processing the first web page content, Thread 1 hadn't
finished processing its own. The OS brutally interrupted Thread 1 and shared
some resource (time) with Thread 2. But Thread 1 also needed CPU time to finish
its processing, so in turn, after a while, the OS had to pause Thread 2 and
resume Thread 1. Depending on the size of the tasks, such turns may happen
several times, so that every thread gets a fair chance to run. It goes
something like this:

.. code-block:: none

    Thread 1: I wanna run!
    OS: Okay, here you go...
    Thread 2: I wanna run!
    OS: Urh, alright one sec ... Thread 1, hold on for a while!
    Thread 1: Well I'm not done yet, but you are the boss.
    OS: It won't be long. Thread 2, it's your turn now.
    Thread 2: Yay! (&%#$@..+*&#)
    Thread 1: Can I run now?
    OS: Just a moment please ... Thread 2, give it a break!
    Thread 2: Alright ... but I really need the CPU.
    OS: You'll have it later. Thread 1, hurry up!

In contrast, coroutines are scheduled cooperatively, with the help of an event
manager. The event manager lives in the same thread as the coroutines, and
unlike the OS scheduler, which forces context switches on threads, the event
manager acts only when coroutines pause themselves. A thread knows when it
wants to run, but a coroutine doesn't: only the event manager knows which
coroutine should run next. The event manager may only trigger the next coroutine
to run after the previous coroutine yields control to wait for an event (e.g.
waiting for an HTTP response). This approach to achieving concurrency is called
`cooperative multitasking
<https://en.wikipedia.org/wiki/Cooperative_multitasking>`_. It's like this:

.. code-block:: none

    Coroutine 1: Let me know when event A arrives. I'm done here before that.
    Event manager: Okay. What about you, coroutine 2?
    Coroutine 2: Um I've got nothing to do here before event B.
    Event manager: Cool, I'll be watching.
    Event manager: (after a while) Hey coroutine 1, event A is here!
    Coroutine 1: Awesome! Let me see ... looks good, but I need event C now.
    Event manager: Very well. Seems event B arrived just now, coroutine 2?
    Coroutine 2: Oh wonderful! Let me store it in a file ... There! I'm all done.
    Event manager: Sweet! Since there's no sign of event C yet, I'll sleep for a while.
    (silence)
    Event manager: Damn, event C timed out!
    Coroutine 1: Arrrrh gotta kill myself with an exception :S
    Event manager: Up to you :/

With coroutines, a task cannot be paused externally; it can only pause itself
from within. When there are a lot of coroutines, concurrency depends on each of
them pausing from time to time to wait for events. If you wrote a coroutine that
never paused, it would allow no concurrency at all while running, because no
other coroutine would get a chance to run. On the other hand, you can feel safe
in the code between pauses, because no other coroutine can run at the same time
and mess up shared state. That's why in the last diagram the red bars are not
interleaved the way they are with threads.

.. tip::

    In Python and asyncio, ``async def`` declares a coroutine, and ``await``
    yields control to the event loop (the event manager).

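A minimal, self-contained sketch of this cooperation (the names and timings are
made up for illustration):

.. code-block:: python

    import asyncio

    async def worker(name, delay):
        print(f"{name}: waiting for my event")
        # ``await`` is the pause: control goes back to the event loop here.
        await asyncio.sleep(delay)
        print(f"{name}: event arrived, finishing up")

    async def main():
        # Both workers run concurrently in one thread; the loop switches
        # between them only at their ``await`` points.
        await asyncio.gather(worker("coroutine 1", 1), worker("coroutine 2", 0.5))

    asyncio.run(main())

If ``worker()`` called ``time.sleep()`` instead of ``await asyncio.sleep()``, it
would never yield control, and the other coroutine could not run until it
finished.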


Pros and cons
-------------

Asynchronous I/O may handle tens of thousands of concurrent I/O operations in
the same thread. This saves a lot of the CPU time spent on context switching,
and a lot of the memory that multi-threading would need. Therefore, if you are
dealing with lots of I/O-bound tasks concurrently, asynchronous I/O can use
limited CPU and memory efficiently to deliver greater throughput.

With coroutines, you can write naturally sequential code that is cooperatively
scheduled. If your business logic is complex, coroutines can greatly improve
the readability of asynchronous I/O code.

However, for a single task, asynchronous I/O can actually impair throughput. For
example, for a simple ``recv()`` operation, blocking I/O would just block until
the result is returned, but asynchronous I/O requires additional steps: register
for the read event, wait until the event arrives, try to ``recv()``, repeat
until a result is returned, and finally feed the result to a callback. With
coroutines, the framework cost is even larger. Thanks to uvloop_ this cost has
been minimized in Python, but it is still additional overhead compared to raw
blocking I/O.
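
Roughly speaking, those extra steps look like the sketch below, built on the
standard ``selectors`` module. It is a simplification under the assumption of a
single socket; real event loops do much more:

.. code-block:: python

    import selectors
    import socket

    # A connected pair of sockets stands in for a real network connection.
    server, client = socket.socketpair()
    client.setblocking(False)

    sel = selectors.DefaultSelector()

    def on_data(data):
        print("received:", data)

    # Step 1: register interest in the read event on our socket.
    sel.register(client, selectors.EVENT_READ)

    server.sendall(b"the HTTP response bytes")   # the peer answers at some point

    while True:
        # Step 2: wait until an event arrives (thousands of sockets could be
        # registered and watched by this single call).
        sel.select()
        # Step 3: try to recv(); repeat if the data is not actually ready yet.
        try:
            data = client.recv(4096)
        except BlockingIOError:
            continue
        break

    sel.unregister(client)
    # Step 4: feed the result to a callback.
    on_data(data)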

Timing in asynchronous I/O is also less predictable because of its cooperative
nature. For example, in a coroutine you may want to sleep for 1 second. However,
if another coroutine takes control and runs for 2 seconds, by the time control
gets back to the first coroutine, 2 seconds have already passed. Therefore,
``sleep(1)`` means waiting for at least 1 second. In practice, you should try
your best to make sure that the code between two ``await`` statements finishes
as quickly as possible, being literally cooperative. Still, there can be code
outside your control, so it is important to keep this unpredictability of
timing in mind.
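
The effect is easy to reproduce. In this sketch the second coroutine
deliberately blocks the loop with ``time.sleep()`` (something you should never
do in real asyncio code), so the 1-second sleep of the first coroutine takes
noticeably longer:

.. code-block:: python

    import asyncio
    import time

    async def sleeper():
        start = time.perf_counter()
        await asyncio.sleep(1)                 # ask for "at least" 1 second
        print(f"woke up after {time.perf_counter() - start:.1f}s")

    async def hog():
        time.sleep(2)    # blocks the whole event loop: nobody else can run

    async def main():
        await asyncio.gather(sleeper(), hog())

    asyncio.run(main())   # prints roughly "woke up after 2.0s", not 1.0s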

Finally, asynchronous programming is complicated. Writing good asynchronous code
is easier said than done, and debugging it is harder than debugging similar
synchronous code. Especially when a whole team is working on the same piece of
asynchronous code, things can easily go wrong. Therefore, a general suggestion
is to use asynchronous I/O carefully, and only for I/O-bound, high-concurrency
scenarios. It is not a drop-in performance boost; it is more like a double-edged
sword for concurrency. And if you are dealing with time-critical tasks, think
twice before using it.


.. _uvloop: https://github.com/MagicStack/uvloop