runtime (gc): garbage collector redesign #1208

@niaow

TinyGo currently has two real garbage collectors: conservative and extalloc. Despite being a separate collector from the one named conservative, extalloc is also a conservative collector. The conservative collector was our original collector, and has been the primary collector for most of TinyGo's life so far.

The conservative collector divides the heap into quad-pointer-width "blocks" and uses a bitmap to keep track of what "blocks" are in use.

  • pro: the metadata is extremely compact (fixed 64 bytes/KiB)
  • pro: no additional per-allocation overhead
  • pro: finding the start of an allocation is usually reasonably cheap
  • pro: it isn't terribly complicated
  • con: structures like linked lists can overfill the collector's worklist, forcing a full rescan of the heap (which makes this collector degrade to O(n^2))
  • con: the collector needs a fair amount of stack space to run
  • con: the use of bitmaps makes it inefficient on AVR, where the compiler implements many operations as loops of single bit shifts
  • con: some GC initialization code needs to be included even if the GC is not used
  • con: has no free list, and selects the first sequence of free blocks rather than the best fit
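
For illustration, here is a minimal sketch of the block-state bitmap idea in Go; the state names and the 2-bits-per-block encoding are assumptions for the example, not necessarily the exact layout the runtime uses:

```go
package gc

// Hypothetical per-block states packed 4 to a byte (2 bits each).
type blockState uint8

const (
	blockStateFree blockState = iota // block is not in use
	blockStateHead                   // first block of an allocation
	blockStateTail                   // continuation block of an allocation
	blockStateMark                   // allocation was reached during the mark phase
)

var metadata []byte // 2 bits per block; one byte covers 4 blocks

func getState(block uintptr) blockState {
	return blockState(metadata[block/4]>>((block%4)*2)) & 3
}

func setState(block uintptr, state blockState) {
	shift := (block % 4) * 2
	metadata[block/4] = metadata[block/4]&^(3<<shift) | byte(state)<<shift
}
```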

The extalloc collector was implemented to allow sharing a heap with C code and provide better memory management on hosted systems. It uses the system's malloc implementation and tracks allocations using a treap.

  • pro: doesn't acquire memory until it is needed, leaving that memory available to other applications on a hosted system
  • pro: shares a heap with C
  • pro: unbounded worklist - no rescans required
  • pro: the sweep phase is relatively straightforward
  • con: all systems currently using extalloc have caches, and the data structures used (linked list & treap) mean that most of the GC time will typically be spent in cache misses
  • con: extremely high per-allocation overhead (4 words = 16 bytes on 32-bit or 32 bytes on 64-bit)
  • con: O(n*log(n)) time complexity
  • con: the treap code is extremely complicated, and it does not sound like any other contributors want to touch it
  • con: GC design is fundamentally not thread safe (cannot be scaled to multicore systems)
  • con: heap does not shrink, so a memory spike will cause the GC to keep the heap big
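
To make the per-allocation overhead concrete, here is a hypothetical metadata record of the kind a treap-based tracker needs; the field names are illustrative and are not extalloc's actual layout:

```go
package gc

import "unsafe"

// Hypothetical bookkeeping node kept for every allocation: four
// pointer-sized words, i.e. 16 bytes on 32-bit or 32 bytes on 64-bit.
type allocNode struct {
	left, right *allocNode // treap children, ordered by allocation address
	priority    uintptr    // randomized priority keeping the treap balanced
	next        *allocNode // link used to walk all allocations during sweep
}

const perAllocOverhead = unsafe.Sizeof(allocNode{})
```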

I also designed another collector for AVR. It uses a simple linked list of allocations, and finds free memory by searching this list and recording the smallest gap.

  • pro: a lot smaller than the conservative collector
  • pro: does not use any size rounding - there is no wasted tail space
  • pro: unbounded worklist - no rescans required
  • pro: does not have any real sweep phase since there is no free list
  • con: every allocation operation needs to scan the entire allocation list (this is considered acceptable since the heap isn't that large)
  • con: lack of size rounding can potentially result in many gaps that are too small to use for any allocation
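
Here is a minimal sketch of that best-fit search, assuming each allocation carries a single next link plus a length and the list is kept in address order (the names are illustrative):

```go
package gc

import "unsafe"

// Hypothetical header for the AVR-style collector: one link plus a length.
type avrAlloc struct {
	next *avrAlloc
	len  uintptr
}

var heapStart, heapEnd uintptr // bounds of the managed heap
var allocList *avrAlloc        // address-ordered list of live allocations

// findGap scans the allocation list and returns the start address of the
// smallest gap that fits size bytes plus a header, or 0 if nothing fits.
func findGap(size uintptr) uintptr {
	need := size + unsafe.Sizeof(avrAlloc{})
	best, bestLen := uintptr(0), ^uintptr(0)
	prevEnd := heapStart
	consider := func(gapEnd uintptr) {
		if gap := gapEnd - prevEnd; gap >= need && gap < bestLen {
			best, bestLen = prevEnd, gap
		}
	}
	for a := allocList; a != nil; a = a.next {
		consider(uintptr(unsafe.Pointer(a)))
		prevEnd = uintptr(unsafe.Pointer(a)) + unsafe.Sizeof(avrAlloc{}) + a.len
	}
	consider(heapEnd) // the gap between the last allocation and the end of the heap
	return best
}
```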

I have been doing a lot of GC work, mostly motivated by a few things:

  • The GC takes up a lot of code space, especially on AVR.
  • The GC has started to run into fragmentation issues (e.g. in the MQTT demo).
  • The addition of interrupt-driven concurrency will mean that long pauses can result in seemingly random event handling latency spikes. Additionally, goroutines being moved onto the worklist means that pointers are moved during the mark phase, extending the length of the collection cycle. Ideally, we would like the collector to be fast enough for this to be infrequent.

Right now, the GC only runs when we are out of memory. This means that the allocator has to work around a bunch of garbage when we are almost out of memory, resulting in increased fragmentation. We may want to try running the GC earlier to determine whether it has a significant impact.
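
As a rough sketch of what "running the GC earlier" could look like (the names and the threshold are purely illustrative, not current runtime behavior):

```go
package gc

var heapSize, usedBytes uintptr // total heap size and bytes currently allocated

// maybeCollect triggers a collection once the heap is about 75% full,
// instead of waiting until an allocation fails outright.
func maybeCollect(size uintptr) {
	if usedBytes+size > heapSize/4*3 {
		collect() // hypothetical entry point for a full mark/sweep cycle
	}
}

func collect() { /* mark and sweep elided */ }
```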

I think we should move towards two GC designs: a simplified design based on the AVR collector, and a more scalable design, which is described below:

Overview

The heap is divided into:

  • 64 KiB chunks
  • 1 KiB fragments (64 fragments/chunk)
  • 8 byte blocks (128 blocks/fragment)

An allocation always falls completely within one chunk and is prefixed by a 4-byte header at the start of its first block. This header has two 16-bit fields:

  • a "next" pointer; used to implement linked-list-like behavior
  • a length value (in bytes; not including header)

The body of the allocation starts immediately after this header.

Each fragment contains an address-sorted linked list of allocation headers that fall within it, which is used for look-ups during the mark phase (effectively works as a hash table with an overly simplistic hash). The head of each linked list is 1 byte, resulting in an overhead of 1 byte per kilobyte.

To handle allocating, each chunk maintains a two-dimensional free list, which is rebuilt after each sweep.
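
To make the layout concrete, here is a minimal sketch of the per-chunk metadata under one reading of the proposal; the encoding of the "next" field, the empty-fragment sentinel, and the omitted free-list representation are assumptions rather than decisions made above:

```go
package gc

const (
	chunkSize     = 64 * 1024 // 64 KiB chunks
	fragmentSize  = 1024      // 1 KiB fragments, 64 per chunk
	blockSize     = 8         // 8 byte blocks, 128 per fragment
	fragsPerChunk = chunkSize / fragmentSize
	noHeader      = 0xFF // assumed sentinel: no header starts in this fragment
)

// allocHeader is the 4-byte header at the start of an allocation's first block.
type allocHeader struct {
	next uint16 // assumed: chunk-relative offset of the next header (0 = end of list)
	len  uint16 // length of the allocation body in bytes, excluding the header
}

// chunk metadata: one head byte per fragment (1 byte per KiB of heap).
// The two-dimensional free list rebuilt after each sweep is elided here,
// since its exact layout is not spelled out above.
type chunk struct {
	start         uintptr              // address of the chunk's data
	fragmentHeads [fragsPerChunk]uint8 // block index (0-127) of the first header in each fragment
}
```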

Allocation Table

To look up an allocation, the GC first finds the chunk containing the address. Then it identifies the fragment containing the address.

The address is then compared to the head of the fragment. If the fragment head is greater than the address (or the fragment head is nil), then if the address is allocated at all, its header must be in a previous fragment.

Once the correct fragment has been found, it can be searched linearly.
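
Continuing the earlier sketch (same assumed package and types), here is a hypothetical lookup routine following these steps; the walk back through earlier fragments and the list terminator are assumptions:

```go
package gc

import "unsafe"

// findAlloc returns the header of the allocation covering addr, or nil if
// addr does not fall inside any allocation tracked by this chunk.
func (c *chunk) findAlloc(addr uintptr) *allocHeader {
	// Find the fragment containing the address.
	frag := (addr - c.start) / fragmentSize

	// If the fragment is empty, or its first header starts past addr, the
	// covering header (if any) must start in an earlier fragment.
	for c.fragmentHeads[frag] == noHeader ||
		c.start+frag*fragmentSize+uintptr(c.fragmentHeads[frag])*blockSize > addr {
		if frag == 0 {
			return nil
		}
		frag--
	}

	// Linear search of the fragment's address-sorted header list.
	h := (*allocHeader)(unsafe.Pointer(c.start + frag*fragmentSize + uintptr(c.fragmentHeads[frag])*blockSize))
	for {
		start := uintptr(unsafe.Pointer(h))
		if addr < start {
			return nil // walked past the address without covering it
		}
		if addr < start+unsafe.Sizeof(allocHeader{})+uintptr(h.len) {
			return h // addr falls inside this allocation
		}
		if h.next == 0 { // assumed list terminator
			return nil
		}
		h = (*allocHeader)(unsafe.Pointer(c.start + uintptr(h.next)))
	}
}
```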

TODO: add a diagram

TODO: describe bitmap sort

TODO: allocation table format

Mark Phase

TODO

I will finish this up later
