
Showing posts with the label garbage collection

On "Quantifying the Performance of Garbage Collection vs. Explicit Memory Management"

The computer science research paper "Quantifying the Performance of Garbage Collection vs. Explicit Memory Management" by Emery Berger and Matthew Hertz contains an interesting study about memory management. However, the conclusions given in the paper were badly worded and are now being used to justify an anti-GC ideology.

Introduction

That paper describes an experiment that analyzed the performance of a benchmark suite using:

- Tracing garbage collection.
- Oracular memory management (precomputing the earliest point at which free could have been inserted).

The experiment was performed:

- On one VM, the Jikes Research Virtual Machine (RVM).
- Using one programming language, Java, and consequently one programming paradigm, object-oriented programming.
- Using each of the five different garbage collection algorithms provided by that VM.

The five GC algorithms are:

- GenMS - Appel-style generational collector (1988)
- GenCopy - two generations with copying mature space
- CopyMS - nursery...

Background reading on the reference counting vs tracing garbage collection debate

Eight years ago I answered a question on Stack Overflow about the suitability of OCaml and Haskell for soft real-time work like visualization: "for real-time applications you will also want low pause times from the garbage collector. OCaml has a nice incremental collector that results in few pauses above 30ms but, IIRC, GHC has a stop-the-world collector that incurs arbitrarily-long pauses". My personal experience has always been that RAII in C++ incurs long pauses when using non-trivial data (i.e. nested, structured collections: collections of collections, trees, graphs and so on); non-deferred reference counting has the same problem for the same reason; tracing garbage collectors like OCaml's work beautifully; but there are many notoriously bad tools, like Java, that have given tracing garbage collection a bad name. Now that I am revisiting this issue I am surprised to find many individuals and organisations repeating exactly the same experimental tests that I did and coming...

Does reference counting really use less memory than tracing garbage collection? Mathematica vs Swift vs OCaml vs F# on .NET and Mono

Our previous post caused some controversy by questioning the validity of some commonly-held beliefs: specifically, the beliefs that reference counting (RC) always requires less memory than tracing garbage collection (GC) and that tracing GCs require 3-4× more memory in order to perform adequately. We presented a benchmark written in Swift and OCaml and noted that the RC'd Swift implementation ran over 5× slower and required over 3× more memory than the tracing GC'd OCaml implementation. This observation disproves these beliefs in their strong form and brings into question whether there is any validity even in their weak forms. After all, we have never seen any empirical evidence to support these beliefs in any form. We received a lot of criticism for that post. The vast majority of the criticism was not constructive but two valid points did arise. Firstly, although our result was anomalous it would be more compelling to see the benchmark repeated across a w...
Another deleted answer of mine from Stack Overflow: "As I can see, smart pointers are used extensively in many real-world C++ projects." True but, objectively, the vast majority of code is now written in modern languages with tracing garbage collectors. "Though some kind of smart pointers are obviously beneficial to support RAII and ownership transfers, there is also a trend of using shared pointers by default, as a way of 'garbage collection', so that the programmer do not have to think about allocation that much." That's a bad idea because you still need to worry about cycles. "Why are shared pointers more popular than integrating a proper garbage collector like Boehm GC? (Or do you agree at all, that they are more popular than actual GCs?)" Oh wow, there are so many things wrong with your line of thinking: Boehm's GC is not a "proper" GC in any sense of the word. It is truly awful: it is conservative, so it leaks, and it is inefficient by design. See: http://flying...

Memory management myths in Apple circles

Nothing gets my hackles up more than people perpetuating memory management myths. Apparently there is a new trend in town thanks to Apple, who are deprecating their garbage collector on OS X in favor of Automatic Reference Counting (ARC). News website Cult of Mac says: “iOS is twice as memory-efficient as Android. Here’s why... According to Glyn Williams over on Quora, iOS devices run better than Android devices with twice the RAM because Android apps use Java, and need all the extra RAM to do something called garbage collection.” Another news website, Redmond Pie, says: “That was basically the same question put to Quora, the social website that gives people a way to ask questions and then have them answered by people who are experts in their respective fields. The upvoting system adds a spot of authority tracking to the answers that are provided, and we have a clear winner as far as the question around why Android phones have so much more memory than iPhones. Enter Glyn...

Memory management myths: promptness

People often assert that scope-based reference counting such as shared_ptr in C++ collects garbage "promptly" and some people define this as "collects at the earliest possible point". For example, at the time of writing the Wikipedia page about garbage collection says: “Compared to tracing garbage collection, reference counting guarantees that objects are destroyed as soon as they become unreachable…” – Wikipedia. Similar claims can even be seen in published research, such as the paper "Down for the Count? Getting Reference Counting Back in the Ring": “Of the two fundamental algorithms on which the garbage collection literature is built, reference counting has lived in the shadow of tracing. It has a niche among language developers for whom either performance or completeness is not essential, and is unused by mature high performance systems, despite a number of intrinsic advantages such as promptness of recovery and dependence on local rather than global state.” – Blackburn...

Why is garbage collection an important feature?

- Garbage collection eliminates a class of bugs caused by erroneous memory management (forgetting to free, freeing too soon, freeing more than once).
- Garbage collection removes the need for APIs to describe contracts about memory management.
- Garbage collection facilitates programming styles such as first-class lexical closures in functional programming (see the Funarg problem).

Are functional languages inherently slow?

Functional languages require infrastructure that inevitably adds overhead over what can theoretically be attained using hand-written assembler. In particular, first-class lexical closures only work well with garbage collection because they allow values to be carried out of scope.

Beware of self-selection. C acts as a lowest common denominator in benchmark suites, limiting what can be accomplished. If you have a benchmark comparing C with a functional language then it is almost certainly an extremely simple program, arguably so simple that it is of little practical relevance today. It is not practically feasible to solve more complicated problems using C for a mere benchmark.

The most obvious example of this is parallelism. Today, we all have multicores; even my phone is a multicore. Multicore parallelism is notoriously difficult in C but can be easy in functional languages (I like F#). Other examples include anything that benefits from persistent data structures, e.g. undo buffers are triv...

Why do garbage collectors pause?

GCs work by tracing reachable heap blocks starting from a set of global roots (global variables, thread stacks and CPU registers). GCs lie on a sliding scale from snapshot to on-the-fly. Snapshot GCs work from a snapshot of the global roots and heap topology; on-the-fly GCs incrementally update their interpretation of the heap as the mutators run. Wholly snapshot GCs attain high throughput, because the collector runs almost entirely independently of the mutators, but high latency, because taking a snapshot incurs a pause. Wholly on-the-fly GCs attain low latency, because everything is done incrementally, but low throughput, because of the fine-grained communication between the mutators and the GC. In practice, all GCs lie somewhere between these two extremes. VCGC is primarily a snapshot GC but it uses a write barrier to keep the collector apprised of changes to the heap topology. Staccato was the world's first parallel, concurrent and real-time GC but it sti...

The LMAX disruptor and Baker's Treadmill

A new retail financial trading platform called LMAX recently published their work on new software technology for low-latency servers that they call the "disruptor". We have noticed a striking similarity between their disruptor and an old garbage collection algorithm from 1992 called Baker's Treadmill. The LMAX disruptor and the story behind it are very interesting in their own right and well worth reading up on. In their system, binary messages come from the "receiver" and are distributed to the "journaller", "replicator" and "unmarshaller" in parallel, and the "unmarshaller" then passes the deserialized messages on to the "consumer". The core of their idea is to accomplish all of this message passing using a single shared data structure, the disruptor, rather than several separate concurrent queues. Baker's Treadmill is a low-latency garbage collection algorithm. Allocated blocks in the heap are li...

Real garbage collector characteristics

The trade-offs between tracing and reference counting garbage collectors are nicely demonstrated by systems like F# and Mathematica. F# inherits garbage collection from .NET, which uses a conventional generational tracing garbage collector with three generations and a Large Object Heap (LOH). Mathematica uses reference counting with language semantics that make it impossible to create cycles in the heap. The following Mathematica program creates a balanced binary tree 25 levels deep that contains 2^25 branches, stores it in a variable and then mutates the variable back to the empty list: Because Mathematica uses reference counting, the act of resetting the variable back to the empty list decrements the reference count of the root of our tree to zero, causing an avalanche of reference-count decrements through the branches down to the leaves, reclaiming all of the space that was consumed by the tree that is now unreachable. Moreover, Mathematica is serial so all...

Classifying garbage collection algorithms

Richard Jones' excellent new book about garbage collection has rekindled my fascination with this important subject. The Wikipedia page about GC is of disappointingly poor quality, so it is productive to review the main points here. GC algorithms are categorized into four main families, with production garbage collectors often combining algorithms from different families. The families are copying, mark-sweep, mark-compact and reference counting. The first three are all tracing collectors that work by tracing all reachable values, starting from the global roots (global variables and locals held on the stack). Copying collectors allocate into one space and then "evacuate" the reachable heap blocks (called "survivors") into another space before clearing the space they came from. The Cheney semi-space algorithm is the simplest copying collector: it uses two spaces and copies from one to the other and back again. The advantage of copying collection is that many heap-...

New book on garbage collection

After 15 years, Richard Jones has written a new monograph about garbage collection called The Garbage Collection Handbook. This supersedes his previous seminal book on the subject, Garbage Collection: algorithms for automatic dynamic memory management. In particular, this new book covers state-of-the-art techniques including parallel, incremental, concurrent and real-time garbage collection algorithms. I'll post a review as soon as I've read my copy!

Sweeping 700× faster

Our initial experiments with the new garbage collector design indicate that it dramatically improves the performance of the slowest GC phase (sweeping), as expected, to the extent that it brings the overall performance of our C++ prototype to within 10% of OCaml's without introducing any of the overheads of generational garbage collection. Moreover, the algorithm is very simple and, in particular, parallel and concurrent variants should be much easier to design with this style of collector than with generational collectors, because regions are a suitable granularity. The sweep phase of the GC accounted for around a third of the total running time of the entire program using the traditional algorithm with allocated and free lists, with each sweep taking around 70µs. Using the new bitwise algorithm, the sweep phase takes just 100ns and accounts for just 0.7% of the total running time of the program. Our prototype mark-region collector using a fake marking phase is now between 2 and 10% slower ...

The importance of locality and sparsity in memory management

Our previous articles describing the disadvantages of generational garbage collection and our prototype mark-region memory management system designed for HLVM originally showed that region-based allocation and deallocation has the potential to be only 4-20% slower than OCaml's generational collector. However, our more recent work, which was designed to be more realistic by deferring deallocations to bulk GC cycles, was significantly slower, around twice as slow as OCaml. There are several differences between the stack-based deallocation scheme used in the first benchmark and the GC-like deallocation scheme used in the second benchmark that have the potential to account for this performance degradation:

- The mock GC introduced mark bytes into the allocated values and marked them as unreachable when they fell out of scope in the mutator.
- The mock GC allocated by popping a reference off the top of a region's free list.
- The mock GC deallocated by pushing a reference onto the free lis...

Boost's shared_ptr up to 10× slower than OCaml's garbage collection

Our recent post about the merits of accurate garbage collection over reference counting prompted Tezka to ask for measurements demonstrating the performance differences between an accurate GC and the reference counted shared_ptr smart pointers provided by the Boost C++ library. The benchmarks we have been using to prototype allocators for our HLVM project provide the perfect setting for such an experiment. We already have C++ code that solves the n-queens problem using logic programming with a variety of different allocation strategies. Adding another version of our benchmark using Boost's shared_ptr gave the following remarkable results: These new results highlight just how slow reference counting can be. The C++ code using Boost's reference counted smart pointers runs up to 10× slower than OCaml! Note also the new "stack" section, which is the first C++ to beat OCaml, albeit cheating by exploiting the fact that this implementation of this benchmark always hap...

Why GC when you have reference counted smart pointers?

Reference counted smart pointers are a simple form of garbage collection usable from the C++ programming language. A recent question on Stack Exchange asks why anyone would want anything more when reference counted smart pointers are already available. Other forms of garbage collection (most notably tracing GCs) have several advantages over reference counting:

- Accuracy: Reference counting alone leaks cycles, so reference counted smart pointers will leak memory in general unless other techniques are added to catch cycles. Once those techniques are added, reference counting's benefit of simplicity has vanished.
- Throughput: Smart pointers are one of the least efficient forms of garbage collection, particularly in the context of multi-threaded applications where reference counts are bumped atomically. There are advanced reference counting techniques designed to alleviate this, but tracing GCs are still the algorithm of choice in production environments.
- Latency: Typical smart pointer ...

Towards a mark-region GC for HLVM

Our previous article highlighted the advantages of the recent mark-region GC design and hinted at HLVM adopting this design. We just completed some preliminary tests using a prototype written in C++ to measure the performance of different allocation strategies. Our results are as follows, with times normalized by the time an equivalent OCaml program takes (so 1.0 means as fast as OCaml): The four columns in each section give the times relative to OCaml for solving the 8-, 9-, 10- and 11-queens problems.

- The "Boehm" section refers to the conservative Boehm GC, which is 40-70% slower than OCaml on this benchmark.
- The "malloc" section refers to allocating using the malloc function from glibc without ever freeing, and is 2.2-3.1× slower than OCaml.
- The "free" section refers to allocating with malloc and freeing (manually), and is 1.9-2.3× slower than OCaml.
- The "bump" section refers to a naive bump allocator that never recycles memory, and is 1.4-1.7× sl...