@@ -270,8 +270,8 @@ result. This has two primary good effects:
270270
271271Computing minrun
272272----------------
273- If N < MAX_MINRUN, minrun is N. IOW, binary insertion sort is used for the
274- whole array then; it's hard to beat that given the overheads of trying
273+ If N < MAX_MINRUN, minrun is N. IOW, binary insertion sort is used for the
274+ whole array then; it's hard to beat that given the overheads of trying
275275something fancier (see note BINSORT).
276276
277277When N is a power of 2, testing on random data showed that minrun values of
@@ -288,7 +288,6 @@ that 32 isn't a good choice for the general case! Consider N=2112:
288288
289289>>> divmod(2112, 32)
290290(66, 0)
291- >>>
292291
293292If the data is randomly ordered, we're very likely to end up with 66 runs
294293each of length 32. The first 64 of these trigger a sequence of perfectly
@@ -301,22 +300,94 @@ to get 64 elements into place).
301300If we take minrun=33 in this case, then we're very likely to end up with 64
302301runs each of length 33, and then all merges are perfectly balanced. Better!
303302
304- What we want to avoid is picking minrun such that in
303+ The original code used a cheap heuristic to pick a minrun that avoided the
304+ very worst cases of imbalance for the final merge, but "pretty bad" cases
305+ still existed.
305306
306- q, r = divmod(N, minrun)
307+ In 2025, Stefan Pochmann found a much better approach, based on letting minrun
308+ vary a bit from one run to the next. Under his scheme, at _all_ levels of the
309+ merge tree:
307310
308- q is a power of 2 and r>0 (then the last merge only gets r elements into
309- place, and r < minrun is small compared to N), or q a little larger than a
310- power of 2 regardless of r (then we've got a case similar to "2112", again
311- leaving too little work for the last merge to do) .
311+ - The number of runs is a power of 2.
312+ - At most two different run lengths appear.
313+ - When two do appear, the smaller is one less than the larger.
314+ - The lengths of run pairs merged never differ by more than one .
312315
313- Instead we pick a minrun in range(MAX_MINRUN / 2, MAX_MINRUN + 1) such that
314- N/minrun is exactly a power of 2, or if that isn't possible, is close to, but
315- strictly less than, a power of 2. This is easier to do than it may sound:
316- take the first log2(MAX_MINRUN) bits of N, and add 1 if any of the remaining
317- bits are set. In fact, that rule covers every case in this section, including
318- small N and exact powers of 2; merge_compute_minrun() is a deceptively simple
319- function.
316+ So, in all respects, as perfectly balanced as possible.
317+
318+ For the 2112 case, that also keeps minrun at 33, but we were lucky there
319+ that 2112 is 33 times a power of 2. The new approach doesn't rely on luck.
320+
321+ For example, with 315 random elements, the old scheme uses fixed minrun=40 and
322+ produces runs of length 40, except for the last. The new scheme produces a
323+ mix of lengths 39 and 40:
324+
325+ old: 40 40 40 40 40 40 40 35
326+ new: 39 39 40 39 39 40 39 40
327+
328+ Both schemes produce eight runs, a power of 2. That's good for a balanced
329+ merge tree. But the new scheme allows merges where left and right length
330+ never differ by more than 1:
331+
332+ 39 39 40 39 39 40 39 40
333+ 78 79 79 79
334+ 157 158
335+ 315
336+
337+ (This shows merges downward, e.g., two runs of length 39 are merged and
338+ become a run of length 78.)
339+
340+ With larger lists, the old scheme can get even more unbalanced. For example,
341+ with 32769 elements (that's 2**15 + 1), it uses minrun=33 and produces 993
342+ runs (of length 33). That's not even a power of 2. The new scheme instead
343+ produces 1024 runs, all with length 32 except for the last one with length 33.
344+
345+ How does it work? Ideally, all runs would be exactly equally long. For the
346+ above example, each run would have 315/8 = 39.375 elements. Which of course
347+ doesn't work. But we can get close:
348+
349+ For the first run, we'd like 39.375 elements. Since that's impossible, we
350+ instead use 39 (the floor) and remember the current leftover fraction 0.375.
351+ For the second run, we add 0.375 + 39.375 = 39.75. Again impossible, so we
352+ instead use 39 and remember 0.75. For the third run, we add 0.75 + 39.375 =
353+ 40.125. This time we get 40 and remember 0.125. And so on. Here's a Python
354+ generator doing that:
355+
356+ def gen_minruns_with_floats(n):
357+ mr = n
358+ while mr >= MAX_MINRUN:
359+ mr /= 2
360+
361+ mr_current = 0
362+ while True:
363+ mr_current += mr
364+ yield int(mr_current)
365+ mr_current %= 1
366+
367+ But while all arithmetic here can be done exactly using binery floating point,
368+ floats have less precision that a Py_ssize_t, and mixing floats with ints is
369+ needlessly expensive anyway.
370+
371+ So here's an integer version, where the internal numbers are scaled up by
372+ 2**e, or rather not divided by 2**e. Instead, only each yielded minrun gets
373+ divided (by right-shifting). For example instead of adding 39.375 and
374+ reducing modulo 1, it just adds 315 and reduces modulo 8. And always divides
375+ by 8 to get each actual minrun value:
376+
377+ def gen_minruns_simpler(n):
378+ e = 0
379+ while (n >> e) >= MAX_MINRUN:
380+ e += 1
381+ mask = (1 << e) - 1
382+
383+ mr_current = 0
384+ while True:
385+ mr_current += n
386+ yield mr_current >> e
387+ mr_current &= mask
388+
389+ See note MINRUN CODE for a full implementation and a driver that exhaustively
390+ verifies the claims above for all list lengths through 2 million.
320391
321392
322393The Merge Pattern
@@ -820,3 +891,75 @@ partially mitigated by pre-scanning the data to determine whether the data is
820891homogeneous with respect to type. If so, it is sometimes possible to
821892substitute faster type-specific comparisons for the slower, generic
822893PyObject_RichCompareBool.
894+
895+ MINRUN CODE
896+ from itertools import accumulate
897+ try:
898+ from itertools import batched
899+ except ImportError:
900+ from itertools import islice
901+ def batched(xs, k):
902+ it = iter(xs)
903+ while chunk := tuple(islice(it, k)):
904+ yield chunk
905+
906+ MAX_MINRUN = 64
907+
908+ def gen_minruns(n):
909+ # In listobject.c, initialization is done in merge_init(), and
910+ # the body of the loop in minrun_next().
911+ mr_e = 0
912+ while (n >> mr_e) >= MAX_MINRUN:
913+ mr_e += 1
914+ mr_mask = (1 << mr_e) - 1
915+
916+ mr_current = 0
917+ while True:
918+ mr_current += n
919+ yield mr_current >> mr_e
920+ mr_current &= mr_mask
921+
922+ def chew(n, show=False):
923+ if n < 1:
924+ return
925+
926+ sizes = []
927+ tot = 0
928+ for size in gen_minruns(n):
929+ sizes.append(size)
930+ tot += size
931+ if tot >= n:
932+ break
933+ assert tot == n
934+ print(n, len(sizes))
935+
936+ small, large = MAX_MINRUN // 2, MAX_MINRUN
937+ while len(sizes) > 1:
938+ assert not len(sizes) & 1
939+ assert len(sizes).bit_count() == 1 # i.e., power of 2
940+ assert sum(sizes) == n
941+ assert min(sizes) >= min(n, small)
942+ assert max(sizes) <= large
943+
944+ d = set(sizes)
945+ assert len(d) <= 2
946+ if len(d) == 2:
947+ lo, hi = sorted(d)
948+ assert lo + 1 == hi
949+
950+ mr = n / len(sizes)
951+ for i, s in enumerate(accumulate(sizes, initial=0)):
952+ assert int(mr * i) == s
953+
954+ newsizes = []
955+ for a, b in batched(sizes, 2):
956+ assert abs(a - b) <= 1
957+ newsizes.append(a + b)
958+ sizes = newsizes
959+ smsll = large
960+ large *= 2
961+
962+ assert sizes[0] == n
963+
964+ for n in range(2_000_001):
965+ chew(n)
0 commit comments