The Serenity of a Cup of Green Tea, Stolen by the Go Runtime: GreenTea GC
Why a New GC Again?
Why the Existing GC?
The current Go GC can be described with the following terms:
- Concurrent tri-color mark and sweep
- It tracks the state of objects using three colors (white, gray, black).
- White: Objects not yet visited.
- Gray: Objects visited but whose child objects have not all been visited.
- Black: Objects whose all child objects have been visited.
- After the traversal, all white objects are collected.
- It uses a Write Barrier to protect newly created memory during GC.
- The time taken to turn the Write Barrier On/Off constitutes a significant portion of what is commonly referred to as Stop-The-World (hereinafter STW).
- This process is performed concurrently.
- It tracks the state of objects using three colors (white, gray, black).
- No Compaction
- Go's GC does not perform compaction.
- Memory fragmentation may occur.
- However, objects smaller than 32KB utilize the per-P cache of the Go language's memory allocator to minimize fragmentation.
- Non-Generational
- Go's GC does not manage objects by generation.
- All objects belong to the same generation.
- Escape Analysis
- Go determines whether an object is allocated on the heap or the stack through Escape Analysis.
- Roughly speaking, if dangling pointers or interfaces are used, the object can be considered to be allocated on the heap.
The Crucial Point
Go's GC explores all objects from the root and performs a tri-color mark. This process can be succinctly described as a "topological sort graph concurrent traversal algorithm." However, each object is highly likely to reside in different memory regions. To put it simply,
- Let's assume there are two memory regions approximately 32MB apart.
- Five objects are allocated in each of these two memory regions.
- Objects A, B, C are in region 1.
- Objects D, E are in region 2.
- The object reference order is
A -> D -> B -> E -> C
. - When GC starts, it visits A, then D, then B, then E, and then C.
- At this point, since A and D reside in different memory regions, the memory region must be traversed when visiting A and then D.
- This process incurs a so-called memory jump cost, involving moving between memory regions and filling new cache lines. A significant portion of the CPU time utilized by the application may be allocated to such GC operations.
- This cost persists throughout the GC execution.
So, What is the Problem?
This GC behavior is particularly problematic in the following scenarios:
- When there are many cores and large memory
- Go's GC is fundamentally performed concurrently.
- However, if the memory region is extensive and objects are dispersed, multiple cores will concurrently search for objects in various memory spaces.
- In this process, if the bus between the CPU and memory is not sufficiently large, memory bandwidth can become a bottleneck.
- Furthermore, due to the physical distance, each traversal will entail a comparatively significant delay.
- When there are numerous small objects
- If you operate a tree or graph with significant depth or many children in Go, the GC must traverse all these objects.
- If objects are dispersed across different memory regions in this process, a delay will occur due to the first reason.
- Moreover, if there are many small objects, the GC must traverse all of them, meaning the CPU cores will allocate and perform a considerable amount of capacity for GC during its execution.
To address these issues, the Golang team has announced GreenTea GC.
You No Longer Have the Luxury of More Green Tea
What Takes Away Your Green Tea?
The area identified for applying the fastest solution to the existing GC appears to be memory locality. That is, minimizing memory jumps by ensuring objects are located close to each other. However, the patterns of programmers writing code cannot be enforced, and object allocation is unpredictable depending on the workflow.
Perhaps this is why the method chosen by the Golang team is Memory Span.
What is Memory Span?
A memory span is a relatively large memory area allocated for allocating small objects. The new GC, named GreenTea GC, performs garbage collection on these memory spans. The detailed operation is almost identical to the existing Tri-Color Mark and Sweep.
Operation of GreenTea GC
First, GreenTea GC allocates memory spans. As previously described, the size is a somewhat substantial 8KiB. Within this, objects up to 512 bytes in size can be allocated. This size is roughly that of a node in a tree or graph, as given in examples, or the size of a typical struct that escapes to the heap, making it difficult for an object to be larger than this. Objects smaller than 512 bytes accumulate in this memory span whenever they escape to the heap, and a new memory span is allocated when this span becomes full.
When GC occurs, GreenTea GC queues these memory spans and inspects them sequentially. In this process, GreenTea GC uses scheduling almost identical to the existing GMP model. Work stealing is also implemented. In any case, a worker that retrieves a memory span from the queue inspects the internal objects of the memory span allocated to it. The Tri-Color Mark and Sweep is used identically in this process.
What are the Advantages?
The primary difference from the existing GC in this process is largely singular: the unit of GC execution has shifted from objects to memory spans. This provides the GC with the following advantages:
- Memory Locality: Since objects are grouped within memory spans, memory jumps are minimized when traversing objects. This maximizes CPU cache utilization.
- GC Performance Improvement: By performing GC at the memory span level, the GC operates more efficiently in multi-core environments.
- Memory Allocation Optimization: Memory spans are allocated with a fixed size, which reduces the overhead of memory allocation for small objects and decreases the likelihood of fragmentation. This enhances the efficiency of memory deallocation.
In short, we can now allocate objects smaller than 512 bytes more frequently and with less concern.
However, there are also certain advantageous scenarios:
- Tree/Graph Structures: When fan-out is high and changes are infrequent.
- B+ trees, Tries, AST (Abstract Syntax Tree)
- Cache-friendly data structures
- Batch Processing: Workflows involving the bulk allocation of small data.
- Numerous small objects created by JSON parsing.
- DAO objects from database result sets.
In these cases, GreenTea GC can achieve better performance than the existing GC. However, it is not applicable in all scenarios, and particularly when objects are dispersed across different memory regions, a dramatic performance improvement is still not to be expected.
Conclusion
The Golang team appears to have a long-term intention to continue improving GC. This GreenTea GC can be considered a minor change in a small area and does not replace the existing GC. However, GreenTea GC seems to be an interesting case that offers insight into the problems the Golang team faces or anticipates. The attempt to solve a surprisingly complex problem with the addition of a relatively simple concept was also impressive. Personally, I find it an interesting case.
GreenTea GC is an experimental feature introduced with Go 1.25. It can be activated and used by setting the GOEXPERIMENT=greenteagc
environment variable. Since this feature is still experimental, sufficient testing is required before using it in a production environment.