Posts

Showing posts with the label performance

On "Quantifying the Performance of Garbage Collection vs. Explicit Memory Management"

The computer science research paper "Quantifying the Performance of Garbage Collection vs. Explicit Memory Management" by Emery Berger and Matthew Hertz contains an interesting study about memory management. However, the conclusions given in the paper were badly worded and are now being used to justify an anti-GC ideology.

Introduction

That paper describes an experiment that analyzed the performance of a benchmark suite using:

- Tracing garbage collection.
- Oracular memory management (precomputing the earliest point at which free could have been inserted).

The experiment was performed:

- On one VM, the Jikes Research Virtual Machine (RVM).
- Using one programming language, Java, and consequently one programming paradigm, object-oriented programming.
- Using each of the five different garbage collection algorithms provided by that VM.

The five GC algorithms are:

- GenMS - Appel-style generational collector (1988)
- GenCopy - two generations with copying mature space
- CopyMS - nursery...

Background reading on the reference counting vs tracing garbage collection debate

Eight years ago I answered a question on Stack Overflow about the suitability of OCaml and Haskell for soft real-time work like visualization: "for real-time applications you will also want low pause times from the garbage collector. OCaml has a nice incremental collector that results in few pauses above 30ms but, IIRC, GHC has a stop-the-world collector that incurs arbitrarily-long pauses". My personal experience has always been that RAII in C++ incurs long pauses when destroying non-trivial data (i.e. nested, structured data: collections of collections of collections, trees, graphs and so on), and that non-deferred reference counting has the same problem for the same reason. Tracing garbage collectors like OCaml's work beautifully, but many notoriously bad tools, like Java's collectors, have given tracing garbage collection a bad name. Now that I am revisiting this issue I am surprised to find many individuals and organisations repeating exactly the same experimental tests that I did and coming...

Does reference counting really use less memory than tracing garbage collection? Mathematica vs Swift vs OCaml vs F# on .NET and Mono

Our previous post caused some controversy by questioning the validity of some commonly-held beliefs: specifically, the beliefs that reference counting (RC) always requires less memory than tracing garbage collection (GC) and that tracing GCs require 3-4× more memory in order to perform adequately. We presented a benchmark written in Swift and OCaml and noted that the RC'd Swift implementation ran over 5× slower and required over 3× more memory than the tracing GC'd OCaml implementation. This observation disproves those beliefs in their strong form and brings into question whether there is any validity in their weak forms. After all, we have never seen any empirical evidence to support these beliefs in any form. We received a lot of criticism for that post. The vast majority of the criticism was not constructive, but two valid points did arise. Firstly, although our result was anomalous, it would be more compelling to see the benchmark repeated across a w...

Does reference counting really use less memory than tracing garbage collection? Swift vs OCaml

The way software developers cling to folklore and avoid proper experimental testing of hypotheses really bothers me. I think perhaps the worst branch of computer science still riddled with myths and legends is memory management. Despite over half a century of research on garbage collection showing that reference counting is inferior to tracing garbage collection algorithms (and that even though almost all GC research restricts its attention to Java, when much better algorithms have been known for years), there are still many people who claim otherwise. C++ developers obviously still believe that reference-counted smart pointers are superior to tracing garbage collection, but now other people are doing it too. People who should know better. This situation recently reared its ugly head again when Apple released a promising new programming language called Swift that shuns tracing garbage collection in favor of reference counting. Now, there are logical reasons for Apple to have chosen reference count...

How do reference counting and tracing garbage collection compare?

Reference counting works by counting the number of references to each object. When the reference count drops to zero, the object is definitely unreachable and can be recycled. The first advantage is simplicity. The second advantage is that decrements can occur as locals fall out of scope, so collection can be deterministic and predictable. The third advantage is that the asymptotic complexity of reference counting does not depend upon the total size of the heap.

The first problem is that cycles keep reference counts above zero even when the objects in them are unreachable, so reference counting alone cannot handle the general case. The second problem with reference counting is that incrementing and decrementing a counter in an object every time a reference to it is copied or dropped is very computationally expensive, because it incurs a cache miss even if nothing else in the object is read or written. The third problem is that multithreaded reference counting is non-deterministic because increments and ...
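As a toy illustration of the cycle problem (a hypothetical sketch in F#; the original post contains no code), the two objects below keep each other's counts above zero, so naive reference counting never recycles them:

    // Naive, non-deferred reference counting over a toy object graph.
    type Obj =
        { mutable count : int        // number of references to this object
          mutable refs  : Obj list } // outgoing references

    let newObj () = { count = 1; refs = [] }   // fresh object with one owner

    let retain (o : Obj) = o.count <- o.count + 1

    let rec release (o : Obj) =
        o.count <- o.count - 1
        if o.count = 0 then
            // The count hit zero: this object is unreachable, so recycle it
            // by releasing everything it points to.
            let children = o.refs
            o.refs <- []
            List.iter release children

    // Build a two-object cycle: a -> b and b -> a.
    let a = newObj ()
    let b = newObj ()
    a.refs <- [b]
    retain b
    b.refs <- [a]
    retain a

    // Drop the external references. Both counts fall from 2 to 1, never to
    // 0, so neither object is recycled: a leak a tracing GC would collect.
    release a
    release b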

A look at the IronJS source code

The author of IronJS has made some claims (e.g. F# can feel "like a slightly crappier version of C#") that I find surprising so I'm reading his code. I find it to be quite odd for a variety of reasons:

- Surprisingly OOP (dozens of interfaces and classes and hundreds of augmentations, static members and overloads) given how well suited ML is to writing compilers.
- Lots of low-level bit-twiddling.
- Uses List.empty instead of [].
- Uses large tuples.
- Uses arrays of closures, e.g. Stmt : (Lexer.Token -> State -> Ast.Tree) array.
- Doesn't use a symbol table.
- Choice of data structures appears to incur lots of unnecessary boxing.
- Odd choice of data structures for memory representation, e.g. null, left and stmt fields for a token are scattered.
- Doesn't seem to use concurrency or parallelism.
- Parser contains only 42 match expressions in 1,400 lines of code.
- Massive amounts of code duplication.
- Hand-rolled parser is unusual. Ha...

Are functional languages inherently slow?

Functional languages require infrastructure that inevitably adds overheads over what can theoretically be attained using assembler by hand. In particular, first-class lexical closures only work well with garbage collection because they allow values to be carried out of scope (see the sketch below). Beware of self-selection: C acts as a lowest common denominator in benchmark suites, limiting what can be accomplished. If you have a benchmark comparing C with a functional language then it is almost certainly an extremely simple program, arguably so simple that it is of little practical relevance today. It is not practically feasible to solve more complicated problems using C for a mere benchmark. The most obvious example of this is parallelism. Today, we all have multicores; even my phone is a multicore. Multicore parallelism is notoriously difficult in C but can be easy in functional languages (I like F#). Other examples include anything that benefits from persistent data structures, e.g. undo buffers are triv...
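To make the closure point concrete (a hypothetical F# sketch, not code from the original article): a first-class closure can capture a local value and outlive the scope that created it, so the captured value cannot live on the stack and must instead be heap allocated and later reclaimed by the garbage collector.

    // 'makeCounter' returns a closure that still references the cell 'n'
    // after 'makeCounter' itself has returned, so 'n' cannot live on the
    // stack: it is heap allocated and must eventually be collected.
    let makeCounter () =
        let n = ref 0
        fun () ->
            n.Value <- n.Value + 1
            n.Value

    let next = makeCounter ()
    printfn "%d" (next ())   // 1
    printfn "%d" (next ())   // 2: 'n' outlived the call that created it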

Why do garbage collectors pause?

GCs work by tracing reachable heap blocks starting from a set of global roots (global variables, thread stacks and CPU registers). GCs lie on a sliding scale from snapshot to on-the-fly. Snapshot GCs work from a snapshot of the global roots and heap topology. On-the-fly GCs incrementally update their interpretation of the heap as the mutators run. Wholly snapshot GCs attain high throughput, because the collector runs almost entirely independently of the mutators, but high latency, because taking a snapshot incurs a pause. Wholly on-the-fly GCs attain low latency, because everything is done incrementally, but low throughput, because of the fine-grained communication between the mutators and the GC. In practice, all GCs lie somewhere between these two extremes. VCGC is primarily a snapshot GC but it uses a write barrier to keep the collector apprised of changes to the heap topology. Staccato was the world's first parallel and concurrent and real-time GC but it sti...
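As a toy illustration of tracing (hypothetical F# code, not from the original post), the mark phase below walks the object graph from the roots. Doing that whole traversal at once, with the mutators stopped, is what produces a pause, and its cost grows with the amount of live data:

    // Mark phase of a toy tracing collector: everything reachable from the
    // roots is marked live; anything left unmarked is garbage.
    type Block =
        { id : int
          mutable marked : bool
          children : Block list }   // outgoing pointers

    let rec mark (b : Block) =
        if not b.marked then
            b.marked <- true
            List.iter mark b.children

    // Toy heap: 1 -> 2 -> 3 reachable from the roots; block 4 unreachable.
    let b3 = { id = 3; marked = false; children = [] }
    let b2 = { id = 2; marked = false; children = [b3] }
    let b1 = { id = 1; marked = false; children = [b2] }
    let b4 = { id = 4; marked = false; children = [] }

    let roots = [b1]   // stands in for globals, thread stacks and registers
    List.iter mark roots

    for b in [b1; b2; b3; b4] do
        printfn "block %d: %s" b.id (if b.marked then "live" else "garbage")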

The "C++ renaissance"

According to Herb Sutter, C++ (and C and Fortran) are unmatched for performance per dollar and, therefore, he believes C++ will once again top the programming language charts as mobile devices make power consumption (which he equates with performance) a high priority again. Herb said this in his recent lecture Why C++?, and other Microsoft employees from the Visual C++ team have called this prophecy the "C++ Renaissance". Similar statements about the superior performance of C++ were commonplace over 15 years ago when Java was new and unproven, but we now know they were almost entirely wrong. The world migrated from native to managed languages over 10 years ago and never looked back because their performance is more than adequate. Furthermore, the implementations of managed languages have improved substantially since then, and even toy benchmarks now show them competing with or even beating native code. Moreover, the difficulty of optimizing large code bases means that...

The LMAX disruptor and Baker's Treadmill

A new retail financial trading platform called LMAX recently published their work on new software technology for low-latency servers that they call the "disruptor". We have noticed a striking similarity between their disruptor and an old garbage collection algorithm from 1992 called Baker's Treadmill. The LMAX disruptor and the story behind it are very interesting in their own right and well worth reading up on. In their system, binary messages come from the "receiver" and are distributed to the "journaller", "replicator" and "unmarshaller" in parallel, and the "unmarshaller" then passes on the deserialized messages to the "consumer". The core of their idea is to accomplish all of this message passing using a single shared data structure, the disruptor, rather than using several separate concurrent queues. Baker's Treadmill is a low-latency garbage collection algorithm. Allocated blocks in the heap are li...
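As a rough sketch of the shared ring-buffer idea (hypothetical F# code; LMAX's actual implementation is in Java and far more sophisticated): one preallocated array carries every message, the producer advances a single published sequence number, and each consumer advances its own cursor through the same array instead of draining a private queue:

    // Toy single-threaded disruptor-style ring buffer. One shared array;
    // the producer and each consumer just advance their own sequence numbers.
    type Ring<'T>(size : int) =
        let buffer = Array.zeroCreate<'T> size
        let mutable published = 0L               // producer's sequence number

        member this.Publish(msg : 'T) =
            buffer.[int (published % int64 size)] <- msg
            published <- published + 1L

        member this.Published = published

        // Consumers read in place at their own cursor; nothing is dequeued.
        member this.Read(cursor : int64) =
            buffer.[int (cursor % int64 size)]

    let ring = Ring<string>(8)
    ring.Publish "msg0"
    ring.Publish "msg1"

    // E.g. the journaller tracks its own cursor over the shared buffer.
    let mutable journaller = 0L
    while journaller < ring.Published do
        printfn "journalling %s" (ring.Read journaller)
        journaller <- journaller + 1L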

IO throughput: Haskell vs F#

Many applications require the ability to churn through gigabytes of data. In such cases, high IO throughput is valuable. This article examines high-throughput IO in the Haskell and F# programming languages. The Haskell solutions use the ByteString library and the F# solutions use ordinary .NET 4 IO. An eager Haskell solution that loads the entire file into memory may be written as follows:

    import System
    import Data.Bits
    import Data.List
    import Data.ByteString as B

    main = getArgs ...
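For comparison, an eager F# counterpart using ordinary .NET IO might look something like this (a hypothetical sketch; the excerpt above cuts off before the article's own F# code):

    // Read the whole file into memory in one eager call, then fold over the
    // bytes so the read cannot be optimized away.
    open System.IO

    [<EntryPoint>]
    let main argv =
        let bytes = File.ReadAllBytes argv.[0]   // pass the file as argv.[0]
        let mutable total = 0L
        for b in bytes do
            total <- total + int64 b
        printfn "%d bytes, checksum %d" bytes.Length total
        0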

Haskell's hash tables revisited: part 2

Our previous blog post contained several benchmark results comparing the new GHC 6.12.3 with F#. We have since discovered some other points of interest regarding this benchmark. Firstly, the results for Haskell rely on the use of a garbage collector that prevents parallelism. If the more modern multicore-friendly GC is used (by compiling with -threaded and running with +RTS -N8) then the time taken increases from 4.5s to 10.6s. This is over 2× slower than before and now over 13× slower than F#. Naturally, the F# was already using the multicore-capable .NET garbage collector, so this was an unfair bias in favor of Haskell. Secondly, the Haskell code exploits an algorithmic optimization on the assumption that the keys are unique. This is often not the case in practice and, again, the F# code did not exploit such an assumption, so this was another unfair bias in favor of Haskell. A fairer comparison may be obtained by changing from the insert function to the update function in the Hask...
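For reference, a typical pair of commands for the GHC options mentioned above (the source file name is hypothetical; the excerpt does not give the exact build command):

    ghc -O2 -threaded HashTable.hs
    ./HashTable +RTS -N8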

Haskell's hash tables revisited

Update: We have since discovered that these results were biased towards Haskell. Mikhail Glushenkov recently announced the Haskell Platform 2010.2 RC for Windows. In particular, this is the first release to include a version of the Glasgow Haskell Compiler (6.12.3) that has the new garbage collector fix to address the performance problems Haskell programmers have been experiencing with mutable arrays of boxed values over the past 5 years, such as the spines of hash tables. Naturally, we couldn't resist benchmarking the new release to see if it lives up to the promise of decent hash table performance. Even though this installer for the Haskell Platform is just a release candidate, we found that it installed smoothly and ran correctly first time. First up, a repeat of our previous benchmark which inserted 10,000,000 bindings with int keys mapping to int values into an initially-empty hash table:

GHC 6.12.1: 19.2s
GHC 6.12.3: 4.48s
F# .NET 4: 0.8s

The new version of GHC is clear...
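The F# side of that benchmark is not shown in this excerpt, but a minimal sketch along the lines described (10,000,000 int-to-int insertions into an initially-empty .NET hash table, timed) might look like this:

    // Insert 10,000,000 int -> int bindings into an empty hash table and
    // time it, mirroring the benchmark described above.
    open System.Collections.Generic
    open System.Diagnostics

    let n = 10000000
    let table = Dictionary<int, int>()      // .NET's standard hash table

    let timer = Stopwatch.StartNew()
    for i in 1 .. n do
        table.[i] <- i
    timer.Stop()

    printfn "%d bindings in %.2fs" table.Count timer.Elapsed.TotalSeconds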

Naive parallelism with HLVM

The latest OCaml Journal article High-performance parallel programming with HLVM (23rd January 2010) described a simple but effective parallelization of the HLVM implementation of the ray tracer. Comparing with similarly naive parallelizations of the fastest serial implementations in C++ and Haskell, we obtained results exhibiting several interesting characteristics:

- Even the naive parallelization in C++ is significantly faster than HLVM and Haskell.
- C++ and HLVM scale well, with performance improving as more cores are used.
- Despite having serial performance competitive with HLVM, the naively-parallelized Haskell scales poorly. In particular, Haskell failed to obtain a competitive speedup with up to 5 cores and its performance even degraded significantly beyond 5 cores, running 4.4× slower than C++ on 7 cores.

The efficiency of the parallelization can be quantified as the speed of the parallel version on multiple cores relative to its speed on a single core:...
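Restating that definition in symbols (notation mine, not from the original article): if t(n) is the time taken by the parallel version on n cores, the efficiency being described is

    E(n) = t(1) / t(n)

that is, each program's speedup relative to its own single-core performance rather than relative to the fastest serial implementation.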

Naïve Parallelism: a rebuttal

Several people, including Simon Marlow of Microsoft, have objected to our rejection of Saynte's new Haskell code, claiming that the alterations were minor and that they optimize serial performance. This is simply not true. Firstly, over half of the lines of code in the entire program have been altered. Secondly, the new version is slower than Lennart's original version 5 on 1 core. So there is no logical reason to choose the revised Haskell as the basis of a naive parallelization unless you want to cherry-pick results by leveraging knowledge of how they will scale after parallelization. Suffice it to say, doing so would be bad science. This is illustrated in the following graph of performance results for Lennart's original version 5 vs Saynte's new revised Haskell with 11 levels of spheres at 1,024×1,024: the original code is 5% faster on 1 core but scales poorly and is 2.7× slower on 7 cores. The substantially-revised "serial" Haskell code was obviously specif...

Naïve Parallelization: C++ vs Haskell

A member of the Haskell community recently published a blog article revisiting our ray tracer language comparison, claiming to address the question of how naïve parallelizations in these two languages compare. The objective was to make only minimal changes to the programs in order to parallelize them and then compare performance. Our attempts to verify those results turned up a lot of interesting information. Firstly, the Haskell program that was supposedly naïvely parallelized was not the original but, in fact, a complete rewrite. This raises the question of whether or not the rewrite was specifically designed to be amenable to parallelization and, therefore, is not representative of naïve parallelization at all. The C++ used was the original with minimal changes to parallelize a single loop. Secondly, although the serial benchmark results covered a spectrum of inputs, the parallel results covered only a single case and retrospectively identified the optimal results without alluding ...

HLVM on the ray tracer language comparison

We recently completed a translation of our famous ray tracer language comparison to HLVM. The translation is equivalent to the most highly optimized implementations written in other languages and this allows us to compare HLVM with a variety of competing languages for the first time. The results are astonishing. Running the benchmark with the default settings (level=9, n=5 to render 87,381 spheres at 512×512) on 32-bit x86 shows that HLVM already provides competitive performance for a non-trivial benchmark: HLVM took 6.7s whereas C++ (compiled with g++ 4.3.3) took only 4.3s and Haskell (compiled with GHC 6.12) took 13.9s. However, cranking up the level parameter to 12 in order to increase the complexity of the scene, rendering a whopping 5,592,405 spheres, we find that HLVM blows away the other garbage collected languages and is even able to keep up with C++. This remarkable result is a consequence of HLVM's space-ef...