Commit edee20d

committed
Rewrite README
1 parent bdde3f0 commit edee20d

1 file changed

+181
-22
lines changed

README.rst

Lines changed: 181 additions & 22 deletions
@@ -52,11 +52,11 @@ Async:
5252

5353
.. code-block:: python
5454
55-
from anyio import sleep # AsyncSonyFlake supports both asyncio and trio
55+
import anyio
5656
from sonyflake_turbo import AsyncSonyFlake, SonyFlake
5757
5858
sf = SonyFlake(0x1337, 0xCAFE, start_time=1749081600)
59-
asf = AsyncSonyFlake(sf, sleep)
59+
asf = AsyncSonyFlake(sf, sleep=anyio.sleep) # defaults to asyncio.sleep
6060
6161
print("one", await asf)
6262
print("n", await asf(5))
@@ -68,23 +68,182 @@ Async:
6868
Important Notes
6969
===============
7070

71-
SonyFlake algorithm produces IDs at rate 256 IDs per 10msec per 1 Machine ID.
72-
One obvious way to increase the throughput is to use multiple generators with
73-
different Machine IDs. This library provides a way to do exactly that by
74-
passing multiple Machine IDs to the constructor of the `SonyFlake` class.
75-
Generated IDs are non-repeating and are always increasing. But be careful! You
76-
should be conscious about assigning Machine IDs to different processes and/or
77-
machines to avoid collisions. This library does not come with any Machine ID
78-
management features, so it's up to you to figure this out.
79-
80-
This library has limited free-threaded mode support. It won't crash, but
81-
you won't get much performance gain from multithreaded usage. Consider
82-
creating generators per thread instead of sharing them across multiple
83-
threads.
84-
85-
This library also contains pure-Python implementation as a fallback in case of
86-
C extension unavailability (e.g. with PyPy or when installed with
87-
``--no-binary`` flag).
71+
Vanilla SonyFlake Difference
72+
----------------------------
73+
74+
In vanilla SonyFlake, whenever the counter overflows, it simply waits for the
75+
next 10ms window, which severely limits throughput: a single generator
76+
produces at most 256 IDs per 10ms.
77+
78+
The Turbo version is basically the same as vanilla SonyFlake, except that it
79+
accepts more than one Machine ID in the constructor args. On counter overflow,
80+
it advances to the next "unexhausted" Machine ID and resumes generation.
81+
It waits for the next 10ms window only when all Machine IDs are exhausted.
82+
83+
This behavior is not much different from having multiple vanilla ID generators
84+
in parallel, but by doing so we ensure produced IDs are always monotonically
85+
increasing (per generator instance) and avoid potential concurrency issues
86+
(by not doing concurrency).
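The scheme described above can be sketched in pure Python as follows. This is a toy model, not the library's implementation: the class name ``TurboSketch``, its ``clock`` and ``sleep`` parameters, and the choice to sort Machine IDs are assumptions made for illustration. The bit layout (time, then Machine ID, then an 8-bit counter) follows the table in the Machine IDs section.

```python
import time

TICK = 0.01         # 10 ms time unit
COUNTER_MAX = 0xFF  # 8-bit counter: 256 IDs per Machine ID per tick


class TurboSketch:
    """Toy model of the multi-Machine-ID scheme (not the library's code)."""

    def __init__(self, *machine_ids, start_time=1409529600,
                 clock=time.time, sleep=time.sleep):
        # Assumption: sorting keeps IDs increasing when we move on to the
        # next Machine ID within the same tick.
        self.machine_ids = sorted(set(machine_ids))
        self.start_time = start_time
        self.clock = clock
        self.sleep = sleep
        self.tick = -1     # last used time slot
        self.index = 0     # position in self.machine_ids
        self.counter = -1  # last used counter value

    def __iter__(self):
        return self

    def __next__(self):
        now = int((self.clock() - self.start_time) / TICK)
        if now > self.tick:
            # New time window: start over with the first Machine ID.
            self.tick, self.index, self.counter = now, 0, 0
        elif self.counter < COUNTER_MAX:
            self.counter += 1
        elif self.index + 1 < len(self.machine_ids):
            # Counter overflow: advance to the next unexhausted Machine ID.
            self.index, self.counter = self.index + 1, 0
        else:
            # Every Machine ID is exhausted: update state for the next
            # window first, then sleep until that window starts.
            self.tick, self.index, self.counter = self.tick + 1, 0, 0
            wake_at = self.start_time + self.tick * TICK
            self.sleep(max(0.0, wake_at - self.clock()))
        machine_id = self.machine_ids[self.index]
        return (self.tick << 24) | (machine_id << 8) | self.counter
```

With two Machine IDs the sketch hands out 512 IDs per window before it has to sleep, and the IDs it returns stay strictly increasing for that instance.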
87+
88+
A few other features in comparison to other SonyFlake implementations found in
89+
the wild:
90+
91+
* Optional C extension module, for extra performance in CPython.
92+
* Async-framework-agnostic wrapper.
93+
* Thread-safe. Also has free-threading/nogil support.
94+
95+
.. note::
96+
97+
Safe for concurrent use; internal locking ensures correctness. Sleeps are
98+
always done after internal state updates.
99+
100+
.. _Locks: https://docs.python.org/3/library/threading.html#lock-objects
101+
102+
Machine IDs
103+
-----------
104+
105+
A Machine ID is a 16-bit integer in the range ``0x0000`` to ``0xFFFF``. Machine IDs
106+
are encoded as part of the SonyFlake ID:
107+
108+
+----+-----------------+------------+---------+
109+
| | Time | Machine ID | Counter |
110+
+====+=================+============+=========+
111+
| 0x | 0874AD4993 [#]_ | CAFE | 04 |
112+
+----+-----------------+------------+---------+
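As a minimal sketch of the layout in the table above, the three fields can be pulled apart with shifts and masks. The helper names ``split_id`` and ``id_datetime`` are hypothetical and not part of the library; the default ``start_time`` is the one documented in the Start Time section.

```python
from datetime import datetime, timedelta, timezone


def split_id(sonyflake_id):
    """Return (time ticks, Machine ID, counter) for a 64-bit ID."""
    counter = sonyflake_id & 0xFF              # lowest 8 bits
    machine_id = (sonyflake_id >> 8) & 0xFFFF  # next 16 bits
    ticks = sonyflake_id >> 24                 # remaining bits, 10 ms units
    return ticks, machine_id, counter


def id_datetime(sonyflake_id, start_time=1409529600):
    """Convert the time component to an aware UTC datetime."""
    ticks, _, _ = split_id(sonyflake_id)
    epoch = datetime.fromtimestamp(start_time, tz=timezone.utc)
    # Integer milliseconds avoid float rounding on the 10 ms ticks.
    return epoch + timedelta(milliseconds=ticks * 10)


ticks, machine_id, counter = split_id(0x0874AD4993CAFE04)
print(hex(ticks), hex(machine_id), counter)
print(id_datetime(0x0874AD4993CAFE04).isoformat())
```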
113+
114+
SonyFlake IDs are, in spirit, UUIDv6_ compressed down to 64 bits.
115+
Unfortunately, we do not have the luxury of 48 bits for encoding the node ID
116+
(the UUID equivalent of SonyFlake's Machine ID). The UUID standard proposes using
117+
a pseudo-random value for this field, which is sub-optimal in our case due to
118+
the high risk of collisions.
119+
120+
Vanilla SonyFlake, on the other hand, uses the lower 16 bits of the private IP
121+
address, which sort of works but has two major drawbacks:
122+
123+
1. It assumes you have *exactly one* ID generator per machine in your network.
124+
2. You're leaking some of your infrastructure info.
125+
126+
In the modern world (k8s, "lambdas", etc...), both of these fall apart:
127+
128+
1. A single machine often runs multiple different processes and/or threads.
129+
More often than not, they're too isolated from each other to coordinate
130+
ID generation.
131+
2. Security aspects aside, container IPs within a cluster network are not
132+
globally unique, especially when trimmed down to 16 bits.
133+
134+
Solving this issue is up to you, as a developer. This particular library does
135+
not include Machine ID management logic, so you are responsible for
136+
coordinating Machine IDs in your deployment.
137+
138+
The task is not trivial, but it is not impossible either. Here are a few ideas:
139+
140+
* Coordinate ID assignment via something like etcd_ or ZooKeeper_ using the lease_
141+
pattern. Optimal, but a bit bothersome to implement.
142+
* Reinvent Twitter's SnowFlake_ by having a centralized service/sidecar. This adds
143+
the extra round-trips SonyFlake was intended to avoid.
144+
* Assign Machine IDs manually. DevOps team will hate you.
145+
* Use random Machine IDs. ``If I ignore it, maybe it will go away.jpg``
146+
147+
Nevertheless, the library has one helper class: ``MachineIDLCG``. This is a
148+
primitive LCG_-based 16-bit PRNG. It is intended to be used in tests, or in
149+
situations where concurrency is not a problem (e.g. desktop or CLI apps).
150+
You can also reuse it for generating IDs for a lease to avoid congestion when
151+
going the etcd/ZooKeeper route.
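To illustrate the idea behind an LCG-based 16-bit generator, here is a minimal standalone sketch. It is not the library's ``MachineIDLCG``: the function name ``lcg16`` and the constants ``a`` and ``c`` are assumptions, chosen only because they satisfy the usual full-period conditions for a modulus of 2**16 (``c`` is odd and ``a - 1`` is divisible by 4).

```python
from itertools import islice


def lcg16(seed, a=25173, c=13849):
    """Yield 16-bit values from x -> (a * x + c) mod 2**16."""
    x = seed & 0xFFFF
    while True:
        x = (a * x + c) & 0xFFFF  # masking is the same as mod 2**16
        yield x


# Take four candidate Machine IDs from one seed.
candidates = list(islice(lcg16(0x1337), 4))
print([hex(value) for value in candidates])
```

Because such a generator has full period, one seed visits every 16-bit value once before repeating, so candidates drawn from a single sequence do not collide with each other. Two processes that start from the same seed still produce the same sequence.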
152+
153+
How many Machine IDs to allocate per generator is something you
154+
should figure out on your own. Here are some numbers to get you started
155+
(generating 1 million SonyFlake IDs):
156+
157+
+--------+-------------+
158+
| Time | Machine IDs |
159+
+========+=============+
160+
| 1.22s | 32 |
161+
+--------+-------------+
162+
| 2.44s | 16 |
163+
+--------+-------------+
164+
| 4.88s | 8 |
165+
+--------+-------------+
166+
| 9.76s | 4 |
167+
+--------+-------------+
168+
| 19.53s | 2 |
169+
+--------+-------------+
170+
| 39.06s | 1 |
171+
+--------+-------------+
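The figures above are close to what the 256 IDs per 10ms per Machine ID limit predicts. A small sketch of that arithmetic follows; the function name ``min_seconds`` is hypothetical, and treating the table as a theoretical lower bound rather than a measurement is an assumption.

```python
def min_seconds(total_ids, machine_ids, ids_per_window=256, window=0.01):
    """Lower bound on generation time imposed by the 10 ms windows."""
    windows_needed = total_ids / (ids_per_window * machine_ids)
    return windows_needed * window


for count in (32, 16, 8, 4, 2, 1):
    print(f"{count:>2} Machine IDs: {min_seconds(1_000_000, count):.2f}s")
```

Doubling the number of Machine IDs halves the time, so the choice is a trade-off between throughput and how many of the 65536 possible Machine IDs each generator reserves.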
172+
173+
.. [#] 1409529600 + 0x874AD4993 / 100 = 2026-03-05T09:15:19.87Z
174+
.. _UUIDv6: https://www.rfc-editor.org/rfc/rfc9562.html#name-uuid-version-6
175+
.. _etcd: https://etcd.io/
176+
.. _ZooKeeper: https://zookeeper.apache.org/
177+
.. _SnowFlake: https://en.wikipedia.org/wiki/Snowflake_ID
178+
.. _lease: https://martinfowler.com/articles/patterns-of-distributed-systems/lease.html
179+
.. _LCG: https://en.wikipedia.org/wiki/Linear_congruential_generator
180+
181+
Clock Rollback
182+
--------------
183+
184+
There is no logic to handle clock rollbacks or drift at the moment. If the clock
185+
moves backward, the generator will ``sleep()`` (``await sleep()`` in the async
186+
wrapper) until time catches up to the last timestamp.
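A rough sketch of how long such a sleep lasts, assuming time is tracked in 10ms ticks. The function ``rollback_delay`` is hypothetical and only illustrates the behavior described above.

```python
def rollback_delay(last_tick, now_tick, tick_seconds=0.01):
    """Seconds to sleep until the clock reaches the last used tick again."""
    # No delay when the clock is at or ahead of the last used tick.
    return max(0, last_tick - now_tick) * tick_seconds


print(rollback_delay(last_tick=105, now_tick=100))  # clock moved back 5 ticks
print(rollback_delay(last_tick=100, now_tick=105))  # clock is ahead
```

A large rollback therefore blocks the caller for roughly as long as the clock was set back.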
187+
188+
Start Time
189+
----------
190+
191+
A SonyFlake ID has 39 bits dedicated to the time component, with a resolution of
192+
10ms. The time is stored relative to ``start_time``. By default it is
193+
1409529600 (``2014-09-01T00:00:00Z``), but you may want to define your own
194+
"epoch".
195+
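To see how much room 39 bits at 10ms resolution gives, the last representable timestamp for a given ``start_time`` can be computed as below. The helper name ``last_timestamp`` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

TIME_BITS = 39  # width of the time component
TICK_MS = 10    # resolution of one tick in milliseconds


def last_timestamp(start_time=1409529600):
    """Latest moment the time component can encode for this start_time."""
    epoch = datetime.fromtimestamp(start_time, tz=timezone.utc)
    return epoch + timedelta(milliseconds=(2**TIME_BITS - 1) * TICK_MS)


lifetime_years = 2**TIME_BITS * TICK_MS / 1000 / (365.25 * 86400)
print(f"usable range: about {lifetime_years:.1f} years")
print("default epoch runs out on", last_timestamp().date().isoformat())
```

Choosing a later ``start_time`` moves the whole window forward, which is the main reason to define your own epoch.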
196+
Motivation
197+
----------
198+
199+
Sometimes you have to bear the consequences of decisions you made a long
200+
time ago. On a project I was leading, I made a decision to utilize SonyFlake.
201+
Everything was fine until we needed to ingest a lot of data, very quickly.
202+
203+
A flame graph showed we were sleeping way too much. The culprit was the
204+
SonyFlake library we were using at that time. Some RTFM later, it was revealed
205+
that the problem was somewhere between the chair and keyboard.
206+
207+
The solution was found rather quickly: just instantiate more generators and cycle
208+
through them about every 256 IDs. Nothing could go wrong, right? Aside from
209+
the fact that the hack was of questionable quality, it did work.
210+
211+
Except, we got hit by `Hyrum's Law`_. An unintentional side effect of the hack
212+
above was that IDs lost their "monotonically increasing" property [#]_. Of course,
213+
some of our code and other teams' code depended on this SonyFlake feature. Duh.
214+
215+
Adding even more workarounds, like pre-generating IDs, sorting them and then
216+
ingesting, was a compelling idea, but it did not feel right. Hence, this library was born.
217+
218+
.. [#] E.g. if you cycle through generators with Machine IDs 0xCAFE and 0x1337,
219+
you may get the following IDs: ``0x0874b2a7a0cafe00``,
220+
``0x0874b2a7a0133700``. Even though there are no collisions, sorting
221+
them will result in a different order (vs order they've been generated)
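The ordering problem from the footnote can be checked directly with the two IDs it lists. The variable names are illustrative only.

```python
# Generation order when cycling from Machine ID 0xCAFE to 0x1337
# within the same 10 ms window.
generated = [0x0874B2A7A0CAFE00, 0x0874B2A7A0133700]

by_value = sorted(generated)
print([hex(value) for value in by_value])
print("sorted order matches generation order:", by_value == generated)
```

Both IDs share the same time component, so the Machine ID decides the sort order, and 0x1337 sorts before 0xCAFE.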
222+
.. _Hyrum's Law: https://www.hyrumslaw.com/
223+
224+
Why should I use it?
225+
--------------------
226+
227+
If you're starting a new project, please use UUIDv7_. It is superior to
228+
SonyFlake in almost every way. It is an internet standard (RFC 9562), it is
229+
already available in various languages' standard libraries and is supported by
230+
popular databases (PostgreSQL, MariaDB, etc...).
231+
232+
Otherwise you might want to use it for one of the following reasons:
233+
234+
* You already use it and have run into problems similar to those described in
235+
the `Motivation`_ section.
236+
* You want to avoid extra round-trips to fetch IDs.
237+
* Usage of UUIDs is not feasible (legacy codebase, db indexes limited to 64
238+
bit integers, etc...) but you still want to benefit from index
239+
locality/strict global ordering.
240+
* As a cheap way to make IDOR_ attacks harder by reducing ID predictability.
241+
* Architecture lunatism is still strong within you and you want your code to
242+
be DDD-like (e.g. being able to reference an entity before it is stored in
243+
DB).
244+
245+
.. _UUIDv7: https://www.rfc-editor.org/rfc/rfc9562.html#name-uuid-version-7
246+
.. _IDOR: https://cheatsheetseries.owasp.org/cheatsheets/Insecure_Direct_Object_Reference_Prevention_Cheat_Sheet.html
88247

89248
Development
90249
===========
@@ -102,16 +261,16 @@ Run tests:
102261

103262
.. code-block:: sh
104263
105-
py.test
264+
pytest
106265
107-
Building wheels:
266+
Build wheels:
108267

109268
.. code-block:: sh
110269
111270
pip install cibuildwheel
112271
cibuildwheel
113272
114-
Building ``py3-none-any`` wheel (without C extension):
273+
Build a ``py3-none-any`` wheel (without the C extension):
115274

116275
.. code-block:: sh
117276
