Commit 1ce1c9c

redo and split why-async-orm doc into 2
1 parent 6e3e4a4 commit 1ce1c9c

File tree

4 files changed

+459
-273
lines changed


docs/.tx/config

Lines changed: 6 additions & 0 deletions
@@ -259,3 +259,9 @@ source_file = _build/gettext/tutorials/announcement.pot
 source_lang = zh
 type = PO
 
+[gino_1_0.explanation--async]
+file_filter = locale/<lang>/LC_MESSAGES/explanation/async.po
+source_file = _build/gettext/explanation/async.pot
+source_lang = en
+type = PO
+

docs/explanation/async.rst

Lines changed: 214 additions & 0 deletions
@@ -0,0 +1,214 @@
Asynchronous Programming 101
============================


The Story
---------

Let's say we want to build a search engine. We'll use a single-core computer to
build our index. To make things simpler, our tasks are to fetch web pages
(an I/O operation) and process their content (a CPU operation). Each task looks
like this:

.. image:: ../images/why_single_task.png
    :align: center

We have lots of web pages to index, so we simply handle them one by one:

.. image:: ../images/why_throughput.png
    :align: center

Let's assume the time of each task is constant: each second, 2 tasks are done.
Thus we can say that the throughput of the current system is 2 tasks/sec. How
can we improve the throughput? An obvious answer is to add more CPU cores:

.. image:: ../images/why_multicore.png
    :align: center

This simply doubles our throughput to 4 tasks/sec, and it scales linearly as we
add more CPU cores, provided the network is not a bottleneck. But can we improve
the throughput of each CPU core? The answer is yes: we can use
multi-threading:

.. image:: ../images/why_multithreading.png
    :align: center

Wait a second! The 2 threads barely finished 6 tasks in 2 seconds, a
throughput of only 2.7 tasks/sec, much lower than the 4 tasks/sec with 2 cores.
What's wrong with multi-threading? From the diagram we can see:

* There are yellow bars taking up extra time.
* The green bars can still overlap with any bar in the other thread, but
* non-green bars cannot overlap with non-green bars in the other thread.

The yellow bars are time taken by `context switches
<https://en.wikipedia.org/wiki/Context_switch>`_, a necessary part of allowing
multiple threads or processes to run on a single CPU core concurrently.
One CPU core can do only one thing at a time (let's assume a world without
`Hyper-threading <https://en.wikipedia.org/wiki/Hyper-threading>`_ or similar),
so in order to run several threads concurrently the CPU must `split its
time <https://en.wikipedia.org/wiki/Time-sharing>`_ into small
slices, and run a little bit of each thread within these slices. The yellow bar
is the overhead for the CPU to switch context to run a different thread. The
scale is a bit dramatic, but it helps make the point.

But wait, the green bars are overlapping between threads. Is the CPU
doing two things at the same time? No, the CPU is doing nothing in the middle
of the green bar, because it's waiting for the HTTP response (I/O). That's how
multi-threading could improve the throughput to 2.7 tasks/sec, instead of
decreasing it to 1.7 tasks/sec. If you actually try running CPU-intensive
tasks with multi-threading on a single core, there won't be any improvement. Like
the multiplexed red bars (in practice there might be more context switches
depending on the task), they appear to be running at the same time, but the
total time for all of them to finish is actually longer than running the tasks one
by one. That's also why this is called concurrency instead of parallelism.

As you might imagine, throughput will improve less with each additional thread,
until throughput begins to decrease because context switches are wasting too
much time, not to mention the extra memory footprint taken by new threads. It
is usually not practical to have tens of thousands of threads running on a single
CPU core. How, then, is it possible to have tens of thousands of I/O-bound tasks
running concurrently on a single CPU core? This is the once-famous `C10k
problem <https://en.wikipedia.org/wiki/C10k_problem>`_, usually solved by
asynchronous I/O:

.. image:: ../images/why_coroutine.png
    :align: center

.. note::

    Asynchronous I/O and coroutines are two different things, but they usually
    go together. Here we will stick with coroutines for simplicity.

Awesome! The throughput is 3.7 tasks/sec, nearly as good as 4 tasks/sec of 2
CPU cores. Though this is not real data, compared to OS threads coroutines
do take much less time to context switch and have a lower memory footprint,
thus making them an ideal option for the C10k problem.

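To put the scale in perspective, here is a minimal asyncio sketch (an illustration added here, not real benchmark code): it runs 10,000 simulated fetch tasks concurrently in a single thread, with ``asyncio.sleep()`` standing in for the network wait.

.. code-block:: python

    import asyncio

    async def fetch(page):
        # Pretend to wait for an HTTP response (I/O); only this coroutine
        # is suspended, the thread stays free to run the others.
        await asyncio.sleep(1)
        return f"content of page {page}"

    async def main():
        # 10,000 concurrent I/O-bound tasks in one thread, far more than
        # one OS thread per task could reasonably handle.
        results = await asyncio.gather(*(fetch(i) for i in range(10_000)))
        print(len(results), "pages fetched")

    asyncio.run(main())

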
Cooperative multitasking
------------------------

So what is a coroutine?

In the last diagram above, you may have noticed a difference compared to the
previous diagrams: the green bars are overlapping within the same thread.
That is because in the last diagram our code is using asynchronous I/O,
whereas previously we were using blocking I/O. As the name suggests, blocking
I/O will block the thread until the I/O result is ready. Thus, there can be only
one blocking I/O operation running in a thread at a time. To achieve concurrency
with blocking I/O, either multi-threading or multi-processing must be used.
In contrast, asynchronous I/O allows thousands (or even more) of concurrent
I/O reads and writes within the same thread, with each I/O operation blocking
only the coroutine performing the I/O rather than the whole thread. Like
multi-threading, coroutines provide a means to have concurrency during I/O,
but unlike multi-threading this concurrency occurs within a single thread.

Threads are scheduled by the operating system using an approach called `preemptive
multitasking <https://en.wikipedia.org/wiki/Preemption_(computing)>`_. For
example, in the previous multi-threading diagram there was only one CPU core. When
Thread 2 tried to start processing the first web page content, Thread 1 hadn't
finished processing its own. The OS brutally interrupted Thread 1 and handed the
resource (CPU time) over to Thread 2. But Thread 1 also needed CPU time to finish
its processing, so after a while the OS in turn had to pause Thread 2 and resume
Thread 1. Depending on the size of the task, such turns may happen several times,
so that every thread has a fair chance to run. It is something like this:

.. code-block:: none

    Thread 1: I wanna run!
    OS: Okay, here you go...
    Thread 2: I wanna run!
    OS: Urh, alright one sec ... Thread 1, hold on for a while!
    Thread 1: Well I'm not done yet, but you are the boss.
    OS: It won't be long. Thread 2, it's your turn now.
    Thread 2: Yay! (&%#$@..+*&#)
    Thread 1: Can I run now?
    OS: Just a moment please ... Thread 2, give it a break!
    Thread 2: Alright ... but I really need the CPU.
    OS: You'll have it later. Thread 1, hurry up!

In contrast, coroutines schedule themselves cooperatively, with the help
of an event manager. The event manager lives in the same thread as the
coroutines, and unlike the OS scheduler that forces context switches on threads,
the event manager acts only when coroutines pause themselves. A thread knows
when it wants to run, but coroutines don't: only the event manager knows which
coroutine should run. The event manager may only trigger the next coroutine to
run after the previous coroutine yields control to wait for an event (e.g.
waiting for an HTTP response). This approach to achieving concurrency is called
`cooperative multitasking
<https://en.wikipedia.org/wiki/Cooperative_multitasking>`_. It's like this:

.. code-block:: none

    Coroutine 1: Let me know when event A arrives. I'm done here before that.
    Event manager: Okay. What about you, coroutine 2?
    Coroutine 2: Um, I've got nothing to do here before event B.
    Event manager: Cool, I'll be watching.
    Event manager: (after a while) Hey coroutine 1, event A is here!
    Coroutine 1: Awesome! Let me see ... looks good, but I need event C now.
    Event manager: Very well. Seems event B arrived just now, coroutine 2?
    Coroutine 2: Oh wonderful! Let me store it in a file ... There! I'm all done.
    Event manager: Sweet! Since there's no sign of event C yet, I'll sleep for a while.
    (silence)
    Event manager: Damn, event C timed out!
    Coroutine 1: Arrrrh, gotta kill myself with an exception :S
    Event manager: Up to you :/

For coroutines, a task cannot be paused externally; it can only pause
itself from within. When there are a lot of coroutines, concurrency depends on
each of them pausing from time to time to wait for events. If you wrote a
coroutine that never paused, it would allow no concurrency at all when running,
because no other coroutine would have a chance to run. On the other hand, you
can feel safe in the code between pauses, because no other coroutine can
run at the same time to mess up shared state. That's why in the last diagram
the red bars are not interleaved like threads.

.. tip::

    In Python and asyncio, ``async def`` declares a coroutine, and ``await``
    yields control to the event loop (the event manager).

Pros and cons
-------------

Asynchronous I/O may handle tens of thousands of concurrent I/O operations in
the same thread. This can save a lot of CPU time on context switching, and a lot
of memory compared to multi-threading. Therefore, if you are dealing with lots of
I/O-bound tasks concurrently, asynchronous I/O can efficiently use limited CPU and
memory to deliver greater throughput.

With coroutines, you can naturally write sequential code that is cooperatively
scheduled. If your business logic is complex, coroutines can greatly improve the
readability of asynchronous I/O code.

However, for a single task, asynchronous I/O can actually impair throughput. For
example, for a simple ``recv()`` operation, blocking I/O would just block until
the result is returned, but asynchronous I/O requires additional steps:
register for the read event, wait until the event arrives, try to ``recv()``, repeat
until a result returns, and finally feed the result to a callback. With coroutines,
the framework cost is even larger. Thanks to uvloop_ this cost has been minimized
in Python, but it is still additional overhead compared to raw blocking I/O.

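Those steps can be sketched with the standard ``selectors`` module (a simplified illustration only; real event loops do this with callbacks registered on the loop, and the names ``recv_async`` and ``on_result`` are made up for this example). Here ``sock`` is assumed to be a connected, non-blocking socket:

.. code-block:: python

    import selectors
    import socket

    def recv_async(sock: socket.socket, nbytes: int, on_result) -> None:
        sel = selectors.DefaultSelector()
        sel.register(sock, selectors.EVENT_READ)  # 1. register for the read event
        while True:
            sel.select()                          # 2. wait until the event arrives
            try:
                data = sock.recv(nbytes)          # 3. try to recv()
            except BlockingIOError:
                continue                          # 4. not ready yet, repeat
            sel.unregister(sock)
            sel.close()
            on_result(data)                       # 5. feed the result to a callback
            return

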
Timing in asynchronous I/O is also less predictable because of its cooperative
nature. For example, in a coroutine you may want to sleep for 1 second. However,
if another coroutine received control and ran for 2 seconds, by the time we get
back to the first coroutine 2 seconds have already passed. Therefore, ``sleep(1)``
means waiting for at least 1 second. In practice, you should try your best to make
sure that all code between two ``await`` points finishes ASAP, being literally
cooperative. Still, there can be code outside your control, so it is important to
keep this unpredictability of timing in mind.

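Here is a small, contrived example of that effect (the names and timings are illustrative only): the first coroutine asks to sleep for 1 second, but because the second coroutine blocks the thread for 2 seconds without yielding, it cannot be woken up until roughly 2 seconds have passed.

.. code-block:: python

    import asyncio
    import time

    async def sleeper():
        start = time.monotonic()
        await asyncio.sleep(1)  # "at least 1 second"
        print(f"woke up after {time.monotonic() - start:.1f}s")  # ~2.0s here

    async def blocker():
        # Blocking work that never yields control to the event loop.
        time.sleep(2)

    async def main():
        await asyncio.gather(sleeper(), blocker())

    asyncio.run(main())

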
Finally, asynchronous programming is complicated. Writing good asynchronous code
is easier said than done, and debugging it is more difficult than debugging
similar synchronous code. Especially when a whole team is working on the
same piece of asynchronous code, it can easily go wrong. Therefore, a general
suggestion is to use asynchronous I/O carefully, for I/O-bound high-concurrency
scenarios only. It's not a drop-in performance boost, but more like a
double-edged sword for concurrency. And if you are dealing with time-critical
tasks, think twice to be sure.


.. _uvloop: https://github.com/MagicStack/uvloop
