@niaow
Member

@niaow niaow commented Nov 30, 2025

The blocks GC originally used a fixed-size stack to hold objects to scan. When this stack overflowed, the GC would fully rescan all marked objects. This could cause the GC to degrade to O(n^2) when scanning large linked data structures.

Instead of using a fixed-size stack, we now add a pointer field to the start of each object. This pointer field is used to implement an unbounded linked stack. This also consolidates the heap object scanning into one place, which simplifies the process.
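
As a rough illustration of the linked stack described above, here is a minimal sketch in plain Go. The names `object`, `markStack`, and `mark` are hypothetical and are not taken from the TinyGo runtime. The real collector keeps the link in a word at the start of each allocation, whereas this sketch uses an ordinary struct field.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a heap object. The next field
// plays the role of the pointer field placed at the start of each object.
type object struct {
	next   *object   // link used only while the object sits on the scan stack
	marked bool      // mark bit
	refs   []*object // outgoing pointers found when scanning the object
}

// markStack is a stack threaded through the objects themselves, so it
// needs no separate storage and cannot overflow.
type markStack struct{ top *object }

func (s *markStack) push(o *object) {
	o.next = s.top
	s.top = o
}

func (s *markStack) pop() *object {
	o := s.top
	if o != nil {
		s.top = o.next
		o.next = nil
	}
	return o
}

// mark marks everything reachable from roots and returns the number of
// newly marked objects. An object is marked before it is pushed, so it
// is pushed at most once and its link is never overwritten while in use.
func mark(roots ...*object) int {
	var stack markStack
	count := 0
	visit := func(o *object) {
		if o == nil || o.marked {
			return
		}
		o.marked = true
		count++
		stack.push(o)
	}
	for _, r := range roots {
		visit(r)
	}
	for o := stack.pop(); o != nil; o = stack.pop() {
		for _, ref := range o.refs {
			visit(ref)
		}
	}
	return count
}

func main() {
	// A long singly linked list: the kind of structure that overflowed
	// the old fixed-size stack and forced full rescans.
	var head *object
	for i := 0; i < 100000; i++ {
		head = &object{refs: []*object{head}}
	}
	fmt.Println("marked objects:", mark(head))
}
```

Each object is pushed and popped once, so marking stays linear in the number of reachable objects regardless of how deep the structure is.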

This comes at the cost of introducing a pointer field to the start of the object, plus the cost of aligning the result. This translates to:

  • 16 bytes of overhead on x86/arm64 with the conservative collector
  • 0 bytes of overhead on x86/arm64 with the precise collector (the layout field cost gets aligned up to 16 bytes anyway)
  • 8 bytes of overhead on other 64-bit systems
  • 4 bytes of overhead on 32-bit systems
  • 2 bytes of overhead on AVR
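
The figures above can be reproduced with a small calculation. This is a sketch under stated assumptions, not the runtime's own code: it assumes allocations are aligned to 16 bytes on x86/arm64 and to the pointer size elsewhere, and that the precise collector's existing layout field is one pointer-sized word. The helper names `alignUp` and `linkOverhead` are hypothetical.

```go
package main

import "fmt"

// alignUp rounds n up to the next multiple of align.
func alignUp(n, align uintptr) uintptr {
	return (n + align - 1) / align * align
}

// linkOverhead returns the extra bytes per object caused by adding one
// pointer-sized link field to a header that previously took oldHeader
// bytes, when the header is padded to the given alignment.
func linkOverhead(oldHeader, ptrSize, align uintptr) uintptr {
	return alignUp(oldHeader+ptrSize, align) - alignUp(oldHeader, align)
}

func main() {
	cases := []struct {
		name                      string
		oldHeader, ptrSize, align uintptr
	}{
		{"x86/arm64, conservative", 0, 8, 16}, // assumed 16-byte alignment
		{"x86/arm64, precise", 8, 8, 16},      // assumed 8-byte layout field already present
		{"other 64-bit", 0, 8, 8},
		{"32-bit", 0, 4, 4},
		{"AVR", 0, 2, 2},
	}
	for _, c := range cases {
		fmt.Printf("%-24s %d bytes\n", c.name, linkOverhead(c.oldHeader, c.ptrSize, c.align))
	}
}
```

The precise collector case comes out to zero because the layout field alone was already padded to 16 bytes, so the link pointer fits into what used to be padding.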

@niaow
Member Author

niaow commented Nov 30, 2025

This includes the commit from #5101, so that should be merged first.

@niaow
Member Author

niaow commented Nov 30, 2025

This improves performance significantly:

                    │ conservative.txt │       conservative-linked.txt       │              boehm.txt              │
                    │      sec/op      │   sec/op     vs base                │   sec/op     vs base                │
Format/array1-10000        29.10m ± 2%   24.18m ± 2%  -16.91% (p=0.000 n=20)   20.40m ± 2%  -29.89% (p=0.000 n=20)

                    │ conservative.txt │       conservative-linked.txt        │              boehm.txt               │
                    │       B/s        │     B/s       vs base                │     B/s       vs base                │
Format/array1-10000       2.127Mi ± 1%   2.551Mi ± 2%  +19.96% (p=0.000 n=20)   3.028Mi ± 2%  +42.38% (p=0.000 n=20)

                    │ precise.txt  │         precise-linked.txt          │              boehm.txt              │
                    │    sec/op    │   sec/op     vs base                │   sec/op     vs base                │
Format/array1-10000   30.94m ± 15%   24.73m ± 3%  -20.08% (p=0.000 n=20)   20.40m ± 2%  -34.06% (p=0.000 n=20)

                    │  precise.txt  │          precise-linked.txt          │              boehm.txt               │
                    │      B/s      │     B/s       vs base                │     B/s       vs base                │
Format/array1-10000   1.993Mi ± 17%   2.499Mi ± 3%  +25.36% (p=0.000 n=20)   3.028Mi ± 2%  +51.91% (p=0.000 n=20)

@deadprogram
Member

@niaow please rebase this PR against dev now that #5101 has been merged. Thank you!

Loop over valid pointer locations in heap objects instead of checking if each location is valid.
The conservative scanning code is now shared between markRoots and the heap scan.
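
One way to picture "looping over valid pointer locations" is with a per-object bitmap in which bit i is set when word i may hold a pointer. The bitmap representation and the name `pointerSlots` are assumptions made for this sketch. Instead of testing every word against the bitmap, the loop jumps directly from one set bit to the next.

```go
package main

import (
	"fmt"
	"math/bits"
)

// pointerSlots returns the word indices whose bit is set in layout.
// layout is a hypothetical bitmap: bit i set means word i may hold a pointer.
func pointerSlots(layout uint64) []int {
	var slots []int
	for layout != 0 {
		// Index of the lowest set bit is the next pointer slot.
		slots = append(slots, bits.TrailingZeros64(layout))
		// Clear the lowest set bit and continue with the rest.
		layout &= layout - 1
	}
	return slots
}

func main() {
	// Words 0, 5 and 7 of this object may contain pointers.
	fmt.Println(pointerSlots(0b1010_0001))
}
```

The loop runs once per pointer slot rather than once per word, which helps for objects that are mostly non-pointer data.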

This also removes the ending alignment requirement from markRoots, since the new scan* functions do not require an aligned length.
This requirement was occasionally violated by the linux global marking code.
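
A scan loop that tolerates an unaligned length can be sketched as follows. This is an illustration only: memory is modeled as a byte slice, an 8-byte little-endian word is assumed, and `scanWords` and `countWords` are hypothetical names rather than the runtime's scan* functions. The loop visits every complete word and ignores any trailing bytes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const wordSize = 8 // assumed pointer size for this sketch

// scanWords calls visit for every complete word in mem. The length of
// mem does not have to be a multiple of wordSize: trailing bytes that
// do not form a full word are skipped.
func scanWords(mem []byte, visit func(word uint64)) {
	for off := 0; off+wordSize <= len(mem); off += wordSize {
		visit(binary.LittleEndian.Uint64(mem[off:]))
	}
}

// countWords reports how many words scanWords would visit in mem.
func countWords(mem []byte) int {
	n := 0
	scanWords(mem, func(uint64) { n++ })
	return n
}

func main() {
	// 20 bytes hold two complete 8-byte words plus 4 trailing bytes.
	fmt.Println(countWords(make([]byte, 20)))
}
```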

This saves some code space and has negligible impact on performance.

@niaow niaow force-pushed the blocks-linked-list branch from cfbf6c9 to 11d283d on November 30, 2025 17:58

@niaow
Member Author

niaow commented Nov 30, 2025

I also decided to add the scanning logic rework commit to this PR because it is closely related.

Member

@deadprogram deadprogram left a comment

I think the speed increase justifies the small header. From my initial benchmarks it does appear to have an impact on the larger nested/linked data structures.

@deadprogram
Member

Here are my benchmarks from tinybench:

Before

tinygo version 0.40.0-dev-9404bb87 linux/amd64 (using go version go1.25.3 and LLVM version 20.1.1)

    bench_test.go:145: name="fannkuch-redux" compiler="tinygo" binarysize=1544008 version=0.40.0
BenchmarkAll/fannkuch-redux:args=6/go/tinygo-32             1482            805580 ns/op
BenchmarkAll/fannkuch-redux:args=7/go/tinygo
BenchmarkAll/fannkuch-redux:args=7/go/tinygo-32             1054           1065539 ns/op
BenchmarkAll/fannkuch-redux:args=9/go/tinygo
BenchmarkAll/fannkuch-redux:args=9/go/tinygo-32               61          18088050 ns/op

    bench_test.go:145: name="fasta" compiler="tinygo" binarysize=1674984 version=0.40.0
BenchmarkAll/fasta:args=12500000/go/tinygo-32                  1        1393121093 ns/op
BenchmarkAll/fasta:args=25000000/go/tinygo
BenchmarkAll/fasta:args=25000000/go/tinygo-32                  1        2772118003 ns/op

    bench_test.go:145: name="n-body" compiler="tinygo" binarysize=1549928 version=0.40.0
BenchmarkAll/n-body:args=50000/go/tinygo-32                  207           5809472 ns/op
BenchmarkAll/n-body:args=100000/go/tinygo
BenchmarkAll/n-body:args=100000/go/tinygo-32                 135           9688843 ns/op
BenchmarkAll/n-body:args=200000/go/tinygo
BenchmarkAll/n-body:args=200000/go/tinygo-32                  63          16362101 ns/op

    bench_test.go:145: name="n-body-nosqrt" compiler="tinygo" binarysize=1550944 version=0.40.0
BenchmarkAll/n-body-nosqrt:args=50000/go/tinygo-32            72          18954161 ns/op
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo-32           38          30118819 ns/op
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo-32           21          55955333 ns/op

    bench_test.go:145: name="spectral-norm" compiler="tinygo" binarysize=1656968 version=0.40.0
BenchmarkAll/spectral-norm:args=1000/go/tinygo-32             25          46763151 ns/op
BenchmarkAll/spectral-norm:args=2500/go/tinygo
BenchmarkAll/spectral-norm:args=2500/go/tinygo-32              4         273176370 ns/op
BenchmarkAll/spectral-norm:args=5500/go/tinygo
BenchmarkAll/spectral-norm:args=5500/go/tinygo-32              1        1287202089 ns/op

After

tinygo version 0.40.0-dev-9c172e44 linux/amd64 (using go version go1.25.3 and LLVM version 20.1.1)

    bench_test.go:145: name="fannkuch-redux" compiler="tinygo" binarysize=1544008 version=0.40.0
BenchmarkAll/fannkuch-redux:args=6/go/tinygo-32             1629            775956 ns/op
BenchmarkAll/fannkuch-redux:args=7/go/tinygo
BenchmarkAll/fannkuch-redux:args=7/go/tinygo-32             1119           1040134 ns/op
BenchmarkAll/fannkuch-redux:args=9/go/tinygo
BenchmarkAll/fannkuch-redux:args=9/go/tinygo-32               60          19119020 ns/op

    bench_test.go:145: name="fasta" compiler="tinygo" binarysize=1674984 version=0.40.0
BenchmarkAll/fasta:args=12500000/go/tinygo-32                  1        1387364986 ns/op
BenchmarkAll/fasta:args=25000000/go/tinygo
BenchmarkAll/fasta:args=25000000/go/tinygo-32                  1        2770401936 ns/op

    bench_test.go:145: name="n-body" compiler="tinygo" binarysize=1549928 version=0.40.0
BenchmarkAll/n-body:args=50000/go/tinygo-32                  199           5611733 ns/op
BenchmarkAll/n-body:args=100000/go/tinygo
BenchmarkAll/n-body:args=100000/go/tinygo-32                 130           8831256 ns/op
BenchmarkAll/n-body:args=200000/go/tinygo
BenchmarkAll/n-body:args=200000/go/tinygo-32                  81          15731963 ns/op

    bench_test.go:145: name="n-body-nosqrt" compiler="tinygo" binarysize=1550944 version=0.40.0
BenchmarkAll/n-body-nosqrt:args=50000/go/tinygo-32            61          19847993 ns/op
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo-32           37          30365391 ns/op
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo-32           20          56060860 ns/op

    bench_test.go:145: name="spectral-norm" compiler="tinygo" binarysize=1656968 version=0.40.0
BenchmarkAll/spectral-norm:args=1000/go/tinygo-32             25          46807306 ns/op
BenchmarkAll/spectral-norm:args=2500/go/tinygo
BenchmarkAll/spectral-norm:args=2500/go/tinygo-32              4         271491941 ns/op
BenchmarkAll/spectral-norm:args=5500/go/tinygo
BenchmarkAll/spectral-norm:args=5500/go/tinygo-32              1        1283144754 ns/op

@deadprogram
Member

Anyone else have any feedback before we merge?

Member

@dgryski dgryski left a comment

The gc_blocks and gc_conservative work LGTM, but I wouldn't mind a second set of eyes on the gc_precise changes.

@deadprogram
Member

> The gc_blocks and gc_conservative work LGTM, but I wouldn't mind a second set of eyes on the gc_precise changes.

I gave it a few more goings over and still LGTM. Now merging, thanks @niaow for all this awesome work!

@deadprogram deadprogram merged commit 26ac03a into tinygo-org:dev Dec 5, 2025
19 checks passed