runtime (gc_blocks.go): use a linked stack to scan marked objects #5102

niaow · 2025-11-30T01:30:01Z

The blocks GC originally used a fixed-size stack to hold objects to scan. When this stack overflowed, the GC would fully rescan all marked objects. This could cause the GC to degrade to O(n^2) when scanning large linked data structures.

Instead of using a fixed-size stack, we now add a pointer field to the start of each object. This pointer field is used to implement an unbounded linked stack. This also consolidates the heap object scanning into one place, which simplifies the process.
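To illustrate the idea described above, here is a minimal sketch of an intrusive linked scan stack. It is not the code from this PR: the `object` type, its `marked` and `children` fields, and the `mark` and `buildList` helpers are hypothetical names invented for the example. The point it shows is that each object is pushed at most once, so the stack cannot overflow and no full rescan is ever needed.

```go
package main

import "fmt"

// object is a hypothetical heap object used only for this illustration.
// next plays the role of the extra pointer field at the start of each object.
type object struct {
	next     *object   // link used by the intrusive scan stack
	marked   bool      // mark bit (the real GC keeps this in block metadata)
	children []*object // outgoing pointers found while scanning
}

// scanStack is an unbounded LIFO stack threaded through the objects themselves.
type scanStack struct{ top *object }

func (s *scanStack) push(o *object) {
	o.next = s.top
	s.top = o
}

func (s *scanStack) pop() *object {
	o := s.top
	if o != nil {
		s.top = o.next
		o.next = nil
	}
	return o
}

// mark marks everything reachable from root and returns how many
// objects were newly marked. Each object is pushed at most once.
func mark(root *object) int {
	var stack scanStack
	count := 0
	if root != nil && !root.marked {
		root.marked = true
		stack.push(root)
	}
	for o := stack.pop(); o != nil; o = stack.pop() {
		count++
		for _, c := range o.children {
			if c != nil && !c.marked {
				c.marked = true
				stack.push(c)
			}
		}
	}
	return count
}

// buildList builds a singly linked list of n objects and returns its head.
func buildList(n int) *object {
	var head *object
	for i := 0; i < n; i++ {
		o := &object{}
		if head != nil {
			o.children = []*object{head}
		}
		head = o
	}
	return head
}

func main() {
	// A long linked list is the case that overflowed the fixed-size stack.
	fmt.Println(mark(buildList(100000)))
}
```

Because the link lives inside the object, the stack needs no separate storage, which is where the per-object overhead discussed next comes from.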

This comes at the cost of introducing a pointer field to the start of the object, plus the cost of aligning the result. This translates to:

- 16 bytes of overhead on x86/arm64 with the conservative collector
- 0 bytes of overhead on x86/arm64 with the precise collector (the layout field cost gets aligned up to 16 bytes anyway)
- 8 bytes of overhead on other 64-bit systems
- 4 bytes of overhead on 32-bit systems
- 2 bytes of overhead on AVR
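The figures above can be reproduced with a small alignment calculation. This is only a sketch: the `alignUp` and `extraOverhead` helpers are hypothetical, and the pointer sizes, alignments, and the 8-byte layout field passed in `main` are assumptions chosen to match the numbers listed, not values read from the runtime.

```go
package main

import "fmt"

// alignUp rounds n up to the next multiple of align.
// align must be a power of two.
func alignUp(n, align uintptr) uintptr {
	return (n + align - 1) &^ (align - 1)
}

// extraOverhead returns how many bytes the aligned header grows when a
// pointer-sized link field is added to `existing` bytes of header.
func extraOverhead(existing, ptrSize, align uintptr) uintptr {
	return alignUp(existing+ptrSize, align) - alignUp(existing, align)
}

func main() {
	// Arguments: existing header bytes, pointer size, allocation alignment.
	fmt.Println(extraOverhead(0, 8, 16)) // x86/arm64, conservative
	fmt.Println(extraOverhead(8, 8, 16)) // x86/arm64, precise (assumed 8-byte layout field)
	fmt.Println(extraOverhead(0, 8, 8))  // other 64-bit systems
	fmt.Println(extraOverhead(0, 4, 4))  // 32-bit systems
	fmt.Println(extraOverhead(0, 2, 2))  // AVR
}
```

Under these assumptions the precise collector case costs nothing extra because the existing 8-byte field was already padded to 16 bytes, and the new pointer fits in that padding.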

niaow · 2025-11-30T01:30:32Z

This includes the commit from #5101, so that should be merged first.

niaow · 2025-11-30T01:32:31Z

This improves performance significantly:

                    │ conservative.txt │       conservative-linked.txt       │              boehm.txt              │
                    │      sec/op      │   sec/op     vs base                │   sec/op     vs base                │
Format/array1-10000        29.10m ± 2%   24.18m ± 2%  -16.91% (p=0.000 n=20)   20.40m ± 2%  -29.89% (p=0.000 n=20)

                    │ conservative.txt │       conservative-linked.txt        │              boehm.txt               │
                    │       B/s        │     B/s       vs base                │     B/s       vs base                │
Format/array1-10000       2.127Mi ± 1%   2.551Mi ± 2%  +19.96% (p=0.000 n=20)   3.028Mi ± 2%  +42.38% (p=0.000 n=20)

                    │ precise.txt  │         precise-linked.txt          │              boehm.txt              │
                    │    sec/op    │   sec/op     vs base                │   sec/op     vs base                │
Format/array1-10000   30.94m ± 15%   24.73m ± 3%  -20.08% (p=0.000 n=20)   20.40m ± 2%  -34.06% (p=0.000 n=20)

                    │  precise.txt  │          precise-linked.txt          │              boehm.txt               │
                    │      B/s      │     B/s       vs base                │     B/s       vs base                │
Format/array1-10000   1.993Mi ± 17%   2.499Mi ± 3%  +25.36% (p=0.000 n=20)   3.028Mi ± 2%  +51.91% (p=0.000 n=20)

deadprogram · 2025-11-30T08:47:54Z

@niaow please rebase this PR against dev now that #5101 has been merged. Thank you!

Loop over valid pointer locations in heap objects instead of checking if each location is valid. The conservative scanning code is now shared between markRoots and the heap scan. This also removes the ending alignment requirement from markRoots, since the new scan* functions do not require an aligned length. This requirement was occasionally violated by the linux global marking code. This saves some code space and has negligible impact on performance.

niaow · 2025-11-30T17:59:18Z

I also decided to add the scanning logic rework commit to this PR because it is closely related.

deadprogram

I think the speed increase justifies the small header. From my initial benchmarks it does appear to have an impact on larger nested/linked data structures.

deadprogram · 2025-12-04T13:28:41Z

Here are my benchmarks from tinybench:

Before

tinygo version 0.40.0-dev-9404bb87 linux/amd64 (using go version go1.25.3 and LLVM version 20.1.1)

    bench_test.go:145: name="fannkuch-redux" compiler="tinygo" binarysize=1544008 version=0.40.0
BenchmarkAll/fannkuch-redux:args=6/go/tinygo-32             1482            805580 ns/op
BenchmarkAll/fannkuch-redux:args=7/go/tinygo
BenchmarkAll/fannkuch-redux:args=7/go/tinygo-32             1054           1065539 ns/op
BenchmarkAll/fannkuch-redux:args=9/go/tinygo
BenchmarkAll/fannkuch-redux:args=9/go/tinygo-32               61          18088050 ns/op

    bench_test.go:145: name="fasta" compiler="tinygo" binarysize=1674984 version=0.40.0
BenchmarkAll/fasta:args=12500000/go/tinygo-32                  1        1393121093 ns/op
BenchmarkAll/fasta:args=25000000/go/tinygo
BenchmarkAll/fasta:args=25000000/go/tinygo-32                  1        2772118003 ns/op

    bench_test.go:145: name="n-body" compiler="tinygo" binarysize=1549928 version=0.40.0
BenchmarkAll/n-body:args=50000/go/tinygo-32                  207           5809472 ns/op
BenchmarkAll/n-body:args=100000/go/tinygo
BenchmarkAll/n-body:args=100000/go/tinygo-32                 135           9688843 ns/op
BenchmarkAll/n-body:args=200000/go/tinygo
BenchmarkAll/n-body:args=200000/go/tinygo-32                  63          16362101 ns/op

    bench_test.go:145: name="n-body-nosqrt" compiler="tinygo" binarysize=1550944 version=0.40.0
BenchmarkAll/n-body-nosqrt:args=50000/go/tinygo-32            72          18954161 ns/op
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo-32           38          30118819 ns/op
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo-32           21          55955333 ns/op

    bench_test.go:145: name="spectral-norm" compiler="tinygo" binarysize=1656968 version=0.40.0
BenchmarkAll/spectral-norm:args=1000/go/tinygo-32             25          46763151 ns/op
BenchmarkAll/spectral-norm:args=2500/go/tinygo
BenchmarkAll/spectral-norm:args=2500/go/tinygo-32              4         273176370 ns/op
BenchmarkAll/spectral-norm:args=5500/go/tinygo
BenchmarkAll/spectral-norm:args=5500/go/tinygo-32              1        1287202089 ns/op

After

tinygo version 0.40.0-dev-9c172e44 linux/amd64 (using go version go1.25.3 and LLVM version 20.1.1)

    bench_test.go:145: name="fannkuch-redux" compiler="tinygo" binarysize=1544008 version=0.40.0
BenchmarkAll/fannkuch-redux:args=6/go/tinygo-32             1629            775956 ns/op
BenchmarkAll/fannkuch-redux:args=7/go/tinygo
BenchmarkAll/fannkuch-redux:args=7/go/tinygo-32             1119           1040134 ns/op
BenchmarkAll/fannkuch-redux:args=9/go/tinygo
BenchmarkAll/fannkuch-redux:args=9/go/tinygo-32               60          19119020 ns/op

    bench_test.go:145: name="fasta" compiler="tinygo" binarysize=1674984 version=0.40.0
BenchmarkAll/fasta:args=12500000/go/tinygo-32                  1        1387364986 ns/op
BenchmarkAll/fasta:args=25000000/go/tinygo
BenchmarkAll/fasta:args=25000000/go/tinygo-32                  1        2770401936 ns/op

    bench_test.go:145: name="n-body" compiler="tinygo" binarysize=1549928 version=0.40.0
BenchmarkAll/n-body:args=50000/go/tinygo-32                  199           5611733 ns/op
BenchmarkAll/n-body:args=100000/go/tinygo
BenchmarkAll/n-body:args=100000/go/tinygo-32                 130           8831256 ns/op
BenchmarkAll/n-body:args=200000/go/tinygo
BenchmarkAll/n-body:args=200000/go/tinygo-32                  81          15731963 ns/op

    bench_test.go:145: name="n-body-nosqrt" compiler="tinygo" binarysize=1550944 version=0.40.0
BenchmarkAll/n-body-nosqrt:args=50000/go/tinygo-32            61          19847993 ns/op
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo-32           37          30365391 ns/op
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo-32           20          56060860 ns/op

    bench_test.go:145: name="spectral-norm" compiler="tinygo" binarysize=1656968 version=0.40.0
BenchmarkAll/spectral-norm:args=1000/go/tinygo-32             25          46807306 ns/op
BenchmarkAll/spectral-norm:args=2500/go/tinygo
BenchmarkAll/spectral-norm:args=2500/go/tinygo-32              4         271491941 ns/op
BenchmarkAll/spectral-norm:args=5500/go/tinygo
BenchmarkAll/spectral-norm:args=5500/go/tinygo-32              1        1283144754 ns/op

deadprogram · 2025-12-04T13:29:54Z

Anyone else have any feedback before we merge?

dgryski

The gc_blocks and gc_conservative work LGTM, but I wouldn't mind a second set of eyes on the gc_precise changes.

deadprogram · 2025-12-05T10:16:33Z

> The gc_blocks and gc_conservative work LGTM, but I wouldn't mind a second set of eyes on the gc_precise changes.

I gave it a few more goings over and still LGTM. Now merging, thanks @niaow for all this awesome work!

niaow added 2 commits November 30, 2025 12:56

niaow force-pushed the blocks-linked-list branch from cfbf6c9 to 11d283d Compare November 30, 2025 17:58

niaow mentioned this pull request Dec 2, 2025

runtime (gc_blocks.go): use best-fit allocation #5105

Open

deadprogram approved these changes Dec 4, 2025

dgryski approved these changes Dec 4, 2025

deadprogram merged commit 26ac03a into tinygo-org:dev Dec 5, 2025
19 checks passed

deadprogram mentioned this pull request Dec 5, 2025

runtime (gc_blocks.go): make sweep branchless #5104

Open
