Currently TinyGo has 2 real garbage collectors: conservative and extalloc. Despite being distinct from the collector called conservative, extalloc is also a conservative collector. The conservative collector was our original collector, and has been the primary collector for most of the life of TinyGo so far.
The conservative collector divides the heap into quad-pointer-width "blocks" and uses a bitmap to keep track of what "blocks" are in use.
- pro: the metadata is extremely compact (fixed 64 bytes/KiB)
- pro: no additional per-allocation overhead
- pro: finding the start of an allocation is usually reasonably cheap
- pro: it isn't terribly complicated
- con: structures like linked lists can overfill the collector's worklist, forcing a full rescan of the heap (which makes this collector degrade to O(n^2))
- con: the collector needs a fair amount of stack space to run
- con: the use of bitmaps makes it inefficient on AVR, where the compiler implements many operations as loops of single bit shifts
- con: some GC initialization code needs to be included even if the GC is not used
- con: has no free list, and selects the first sequence of free blocks rather than the best fit
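Roughly, the block bitmap behaves like the sketch below. The 2-bit state encoding and the helper names are simplified for illustration; they are not lifted verbatim from the current implementation.

```go
package gcsketch

// blockState is the per-block metadata kept in the packed bitmap.
type blockState uint8

const (
	blockFree blockState = iota // block is not in use
	blockHead                   // first block of an allocation
	blockTail                   // continuation block of an allocation
	blockMark                   // head block found to be reachable during marking
)

const stateBits = 2 // four states, packed four blocks to a byte

// stateOf extracts the 2-bit state of block i from the packed bitmap.
func stateOf(bitmap []byte, i uintptr) blockState {
	return blockState(bitmap[i/4]>>((i%4)*stateBits)) & 3
}

// setState overwrites the 2-bit state of block i.
func setState(bitmap []byte, i uintptr, s blockState) {
	shift := (i % 4) * stateBits
	bitmap[i/4] = bitmap[i/4]&^(3<<shift) | byte(s)<<shift
}
```

The variable shifts and masks in `stateOf`/`setState` are exactly the kind of operation that AVR lowers to loops of single-bit shifts, which is where the inefficiency mentioned above comes from.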
The extalloc collector was implemented to allow sharing a heap with C code and provide better memory management on hosted systems. It uses the system's malloc implementation and tracks allocations using a treap.
- pro: doesn't acquire memory until it is needed, allowing other applications on a hosted system to use memory for other things
- pro: shares a heap with C
- pro: unbounded worklist - no rescans required
- pro: the sweep phase is relatively straightforward
- con: all systems currently using extalloc have caches, and the data structures used (linked list & treap) mean that most of the GC time will typically be spent in cache misses
- con: extremely high per-allocation overhead (4 words = 16 bytes on 32-bit or 32 bytes on 64-bit)
- con: O(n*log(n)) time complexity
- con: the treap code is extremely complicated, and it does not sound like any other contributors want to touch it
- con: GC design is fundamentally not thread safe (cannot be scaled to multicore systems)
- con: heap does not shrink, so a memory spike will cause the GC to keep the heap big
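To make the per-allocation overhead concrete, here is one plausible shape for that metadata. This is illustrative only; the actual extalloc node layout may order or name the fields differently.

```go
// treapNode is an illustrative 4-word per-allocation record: 16 bytes on a
// 32-bit system, 32 bytes on a 64-bit system, as noted above.
type treapNode struct {
	left, right *treapNode // children, keyed by the allocation's address
	priority    uintptr    // heap priority that keeps the treap balanced
	size        uintptr    // size of the user allocation that follows the node
}
```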
I also designed another collector for AVR. It uses a simple linked list of allocations, and finds free memory by searching this list and recording the smallest gap.
- pro: a lot smaller than the conservative collector
- pro: does not use any size rounding - there is no wasted tail space
- pro: unbounded worklist - no rescans required
- pro: does not have any real sweep phase since there is no free list
- con: every allocation operation needs to scan the entire allocation list (this is considered acceptable since the heap isn't that large)
- con: lack of size rounding can potentially result in many gaps that are too small to use for any allocation
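As a sketch, the allocation search in that collector could look something like this. The names and the exact list representation are illustrative, not the actual AVR collector code.

```go
package gcsketch

import "unsafe"

// listAlloc is an illustrative allocation record; live allocations form a
// single linked list sorted by address.
type listAlloc struct {
	next *listAlloc // next allocation in address order
	len  uintptr    // exact body size in bytes; no rounding
}

const hdrSize = unsafe.Sizeof(listAlloc{})

// findGap scans the entire allocation list and returns the address of the
// smallest gap that still fits a header plus need bytes, or 0 if none does.
func findGap(head *listAlloc, heapStart, heapEnd, need uintptr) uintptr {
	want := hdrSize + need
	best, bestSize := uintptr(0), ^uintptr(0)
	prevEnd := heapStart
	for a := head; a != nil; a = a.next {
		start := uintptr(unsafe.Pointer(a))
		if gap := start - prevEnd; gap >= want && gap < bestSize {
			best, bestSize = prevEnd, gap
		}
		prevEnd = start + hdrSize + a.len
	}
	// The space between the last allocation and the end of the heap is a gap too.
	if gap := heapEnd - prevEnd; gap >= want && gap < bestSize {
		best = prevEnd
	}
	return best
}
```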
I have been doing a lot of GC work, mostly motivated by a few things:
- The GC takes up a lot of code space, especially on AVR.
- The GC has started to run into fragmentation issues (e.g. in the MQTT demo).
- The addition of interrupt-driven concurrency will mean that long pauses can result in seemingly random event handling latency spikes. Additionally, goroutines being moved onto the worklist means that pointers are moved during the mark phase, extending the length of the collection cycle. Ideally, we would like the collector to be fast enough for this to be infrequent.
Right now, the GC only runs when we are out of memory. This means that the allocator has to work around a bunch of garbage when we are almost out of memory, resulting in increased fragmentation. We may want to try running the GC earlier to determine whether it has a significant impact.
I think we should move towards 2 GC designs: one simplified design based on the AVR collector, and one more scalable design, described below:
Overview
The heap is divided into:
- 64 KiB chunks
- 1 KiB fragments (64 fragments/chunk)
- 8 byte blocks (128 blocks/fragment)
An allocation will always fall completely within one chunk, and is prefixed by a 4-byte header at the start of the first block. This header has two 16-bit fields:
- a "next" pointer; used to implement linked-list-like behavior
- a length value (in bytes; not including header)
The body of the allocation starts immediately after this header.
Each fragment contains an address-sorted linked list of allocation headers that fall within it, which is used for look-ups during the mark phase (effectively works as a hash table with an overly simplistic hash). The head of each linked list is 1 byte, resulting in an overhead of 1 byte per kilobyte.
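In (illustrative) Go, the layout described above comes out roughly as follows. The field names, the end-of-list encoding, and the nil sentinel for an empty fragment are placeholders, not decided parts of the design.

```go
package gcsketch

const (
	chunkSize    = 64 * 1024 // bytes per chunk
	fragmentSize = 1024      // bytes per fragment
	blockSize    = 8         // bytes per block

	fragmentsPerChunk = chunkSize / fragmentSize // 64
	blocksPerFragment = fragmentSize / blockSize // 128

	headerSize = 4    // bytes: one 16-bit "next" link plus one 16-bit length
	noHeader   = 0xFF // assumed sentinel for a fragment with no headers
)

// allocHeader is the 4-byte header stored in the first block of an
// allocation; the body starts immediately after it.
type allocHeader struct {
	next   uint16 // block index (within the chunk) of the next header in this fragment's list; 0 = end of list (assumed)
	length uint16 // body length in bytes, not including the header
}

// chunk is one 64 KiB region plus its look-up metadata: one 1-byte list head
// per 1 KiB fragment, i.e. 1 byte of overhead per KiB of heap.
type chunk struct {
	base          uintptr                  // address of the chunk's first block
	fragmentHeads [fragmentsPerChunk]uint8 // block index (within the fragment) of its first header, or noHeader
}
```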
To handle allocating, each chunk maintains a two-dimensional free list, which is rebuilt after each sweep.
Allocation Table
To look up an allocation, the GC first finds the chunk containing the address. Then it identifies the fragment containing the address.
The address is then compared to the fragment's head. If the head is nil, or points to an address greater than the one being looked up, then the header (if the address is allocated at all) must be in a previous fragment.
Once the correct fragment has been found, it can be searched linearly.
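Continuing the sketch above, the look-up could read roughly like this. The backwards scan over fragments is my reading of the "previous fragment" rule, and the index encodings are still assumptions.

```go
import "unsafe" // add to the imports of the sketch above

// findAlloc returns the header of the allocation covering addr (which must
// lie inside this chunk), or nil if addr does not point into any allocation.
func (c *chunk) findAlloc(addr uintptr) *allocHeader {
	off := addr - c.base
	frag := int(off / fragmentSize)

	// Walk back to the first fragment whose first header starts at or before
	// addr; if the current fragment is empty or its head lies past addr, the
	// covering header (if any) must start in an earlier fragment.
	for frag >= 0 {
		if h := c.fragmentHeads[frag]; h != noHeader {
			if uintptr(frag)*fragmentSize+uintptr(h)*blockSize <= off {
				break
			}
		}
		frag--
	}
	if frag < 0 {
		return nil // addr precedes every allocation in this chunk
	}

	// Linear search through that fragment's address-sorted header list.
	idx := uint16(frag*blocksPerFragment) + uint16(c.fragmentHeads[frag])
	for {
		hdr := c.headerAt(idx)
		start := uintptr(idx) * blockSize
		switch {
		case off >= start && off < start+headerSize+uintptr(hdr.length):
			return hdr // addr falls within this allocation's blocks
		case start > off || hdr.next == 0:
			return nil // the list is address-sorted: we passed addr, or hit the end
		}
		idx = hdr.next
	}
}

// headerAt reads the header stored at the start of the chunk's block idx.
func (c *chunk) headerAt(idx uint16) *allocHeader {
	return (*allocHeader)(unsafe.Pointer(c.base + uintptr(idx)*blockSize))
}
```

Under the block-index encoding assumed here, the "next" field only needs 13 bits (64 KiB / 8 B = 8192 blocks per chunk), so a 16-bit field has some headroom.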
TODO: add a diagram
TODO: describe bitmap sort
TODO: allocation table format
Mark Phase
TODO
I will finish this up later