Posts


What's new in garbage collection?

Since the 1950s there have been three main families of collectors: semi-space, mark-sweep and mark-compact. Almost all production GCs have been generational mark-sweep, even though such collectors exhibit pathological performance when the nursery is full of survivors: every survivor is marked, (physically) copied and has all references to it updated, roughly 3× more work than necessary. This pathological case arises in practice, for example when filling a hash table with heap-allocated keys and/or values. The Beltway (2002) and Immix (2008) garbage collectors introduced a new family: the mark-region GCs. In a mark-region GC the entire heap, nursery included, is a collection of regions, so a nursery full of survivors can be logically aged simply by designating another region as the new nursery. Sun's HotSpot JVM introduced the first mainstream mark-region GC with its G1 collector. The advent of multicore in 2005 has placed more emphasis on parallel and concurrent garbage collectors. The Staccato (2008) garbage collector ...
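The difference between copying promotion and mark-region "logical aging" can be sketched in a few lines of Haskell. This is purely illustrative (toy types, not any real collector's data structures): the point is that relabelling a region is O(1), whereas copying is proportional to the number of survivors.

```haskell
-- Toy sketch contrasting copying promotion with mark-region logical aging.
data Generation = Nursery | Old deriving (Eq, Show)
data Region a = Region { gen :: Generation, objects :: [a] } deriving Show

-- Generational copying promotion: every survivor is physically moved
-- into the old generation, so the work is O(number of survivors).
promoteByCopy :: Region a -> Region a -> (Region a, Region a)
promoteByCopy nursery old =
  ( Region Nursery []                                  -- emptied nursery
  , old { objects = objects nursery ++ objects old } ) -- copied survivors

-- Mark-region promotion: the full nursery region is relabelled as old
-- in O(1) and a fresh region becomes the new nursery; nothing moves.
promoteByRelabel :: Region a -> (Region a, Region a)
promoteByRelabel nursery = (Region Nursery [], nursery { gen = Old })
```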

Efficient Parallel Stencil Convolution in Haskell

Until recently, all research on parallel Haskell programs neglected CPU caches and existing high-performance software solutions like BLAS, LAPACK, FFTW and OpenCV. We criticized the paper Regular, shape-polymorphic, parallel arrays in Haskell for comparing parallel Haskell programs only with poorly-written C code, particularly because this buries the real performance differences in unnecessary cache misses. In 2011, Ben Lippmeier, Gabriele Keller and Simon Peyton Jones published a paper entitled Efficient Parallel Stencil Convolution in Haskell, the first to address caches and locality in the context of parallel Haskell programs and to compare parallel Haskell programs with existing high-performance solutions. The paper discusses an algorithm for edge detection in computer vision that was 10× slower using the old Repa approach, and walks through the design and implementation of a more advanced solution using new approaches that take CPU caches into account and, ultimately, a...
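The core operation the paper optimizes is easy to state even without Repa: slide a small kernel over a grid and take a weighted sum at each point. A deliberately naive Haskell version of a 3×3 stencil (illustrative only, nothing like the cache-aware unboxed-array code in the paper):

```haskell
-- Naive 3x3 stencil convolution over a list-of-lists grid.
-- Only interior points are produced, so the output shrinks by 2 in each
-- dimension; real implementations handle borders and use unboxed arrays.
convolve3x3 :: [[Double]] -> [[Double]] -> [[Double]]
convolve3x3 kernel grid =
  [ [ sum [ kernel !! di !! dj * grid !! (i + di) !! (j + dj)
          | di <- [0 .. 2], dj <- [0 .. 2] ]       -- weighted 3x3 window
    | j <- [0 .. cols - 3] ]
  | i <- [0 .. rows - 3] ]
  where
    rows = length grid
    cols = length (head grid)
```

The repeated `!!` indexing here is exactly the kind of locality-oblivious traversal whose cache misses the paper is concerned with eliminating.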

Texas Multicore Technologies on Haskell

US-based startup Texas Multicore Technologies have published some of the results they obtained using Haskell for parallel programming. Their results mirror our own. Thanks to Manuel Chakravarty of the University of New South Wales for drawing this to our attention.

What is the difference between parallel and concurrent programming?

Concurrent programming addresses operations that appear to overlap and is primarily concerned with the complexity that arises due to non-deterministic control flow. The quantitative costs associated with concurrent programs are typically both throughput and latency. Concurrent programs are often IO-bound but not always; concurrent garbage collectors, for example, run entirely on-CPU. The pedagogical example of a concurrent program is a web crawler: it initiates requests for web pages and accepts the responses concurrently as the downloads complete, accumulating a set of pages that have already been visited. Control flow is non-deterministic because the responses are not necessarily received in the same order each time the program is run. This characteristic can make concurrent programs very hard to debug. Some applications are fundamentally concurrent, e.g. web servers must handle client connections concurrently. Erlang is a language designed specifically for ...
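A toy version of that crawler in Haskell shows the shape of the problem. The `fetch` function here is invented for illustration (a simulated download rather than real HTTP), but it preserves the essential property: responses complete in no fixed order, so the accumulating structure must tolerate any arrival order.

```haskell
import Control.Concurrent
import Data.List (sort)

-- Simulated download: the delay stands in for network latency, so the
-- completion order varies from the request order. (Hypothetical stand-in
-- for a real HTTP fetch.)
fetch :: String -> IO String
fetch url = do
  threadDelay (1000 * length url)
  return ("contents of " ++ url)

-- Fire off one thread per URL and collect the pages as they arrive.
crawl :: [String] -> IO [String]
crawl urls = do
  visited <- newMVar []            -- pages accumulated so far
  done    <- newEmptyMVar          -- one completion signal per thread
  mapM_ (\url -> forkIO $ do
           page <- fetch url
           modifyMVar_ visited (return . (page :))
           putMVar done ()) urls
  mapM_ (const (takeMVar done)) urls
  readMVar visited
```

Note that the order of the result list is non-deterministic, which is exactly the debugging hazard described above; sorting the results is one way to make a test of this code deterministic.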

Don Syme on "Functional approaches to parallelism and concurrency"

Don Syme, creator of the F# programming language, recently gave a superb lecture on parallel and concurrent programming using F# at QCon 2010. Video and slides are hosted by InfoQ here. Parallel programming continues to be a hot topic in the face of multicore computing but, as Don points out, the world is also moving steadily towards more concurrent programming. Work continues on our forthcoming "Multicore .NET" book that studies parallel programming using C# and F# in detail...

Parallel generic quicksort in Haskell

Haskell has a history of making easy problems difficult. Perhaps the most infamous example was the Sieve of Eratosthenes, which is easily implemented in any imperative language but was so difficult to write in Haskell that almost all of the solutions that had been taught in universities and used in research for the preceding 18 years had been wrong, until Melissa O'Neill published a seminal paper, The Genuine Sieve of Eratosthenes, that gave a beautiful description of what they had been doing wrong and how it should be corrected. Melissa's solution was to use a priority queue to implement a rolling wheel of numbers. The correct solution turned out to be 10× longer than a much simpler F# solution and a whopping 100× longer than the original bastardized algorithm in Haskell. Today, quicksort is the new Sieve of Eratosthenes. Again, the academics have addressed Haskell's failure by bastardizing the algorithm, trading orders of magnitude in performance for something that Haskel...
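The essence of O'Neill's fix can be sketched with a `Data.Map` standing in for the priority queue: each prime waits at its next composite and is rolled forward when the candidate stream reaches it. (A condensed sketch of the idea, not her full wheel-based implementation.)

```haskell
import qualified Data.Map as Map

-- Incremental sieve: the map sends each upcoming composite to the list
-- of primes that produce it. A candidate absent from the map is prime;
-- otherwise its witnessing primes are rolled forward to later composites.
primes :: [Int]
primes = go [2 ..] Map.empty
  where
    go (x : xs) table = case Map.lookup x table of
      Nothing ->
        -- x is prime; its first interesting composite is x*x.
        x : go xs (Map.insert (x * x) [x] table)
      Just factors ->
        -- x is composite; advance each witnessing prime p to x + p.
        go xs (foldl reinsert (Map.delete x table) factors)
      where
        reinsert t p = Map.insertWith (++) (x + p) [p] t
    go [] _ = []
```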

Naive parallelism with HLVM

The latest OCaml Journal article High-performance parallel programming with HLVM (23rd January 2010) described a simple but effective parallelization of the HLVM implementation of the ray tracer. Comparing with similarly naive parallelizations of the fastest serial implementations in C++ and Haskell, we obtained the following results. These results exhibit several interesting characteristics:

- Even the naive parallelization in C++ is significantly faster than HLVM and Haskell.
- C++ and HLVM scale well, with performance improving as more cores are used.
- Despite having serial performance competitive with HLVM, the naively-parallelized Haskell scales poorly. In particular, Haskell failed to obtain a competitive speedup with up to 5 cores and its performance even degraded significantly beyond 5 cores, running 4.4× slower than C++ on 7 cores.

The efficiency of the parallelization can be quantified as the speed of the parallel version on multiple cores relative to its speed on a single core:...
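The quantity used above is just relative speed. A minimal sketch, with `parallelEfficiency` additionally normalising by core count as is conventional (function names are ours; the times in the test are illustrative, not the article's data):

```haskell
-- Speedup of a parallel run over the same program on one core,
-- given wall-clock times in seconds.
speedup :: Double -> Double -> Double
speedup t1 tn = t1 / tn

-- Per-core efficiency: speedup normalised by the number of cores;
-- 1.0 means perfect linear scaling, lower values mean wasted cores.
parallelEfficiency :: Int -> Double -> Double -> Double
parallelEfficiency cores t1 tn = speedup t1 tn / fromIntegral cores
```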

Naïve Parallelism: a rebuttal

Several people, including Simon Marlow of Microsoft, have objected to our rejection of Saynte's new Haskell code, claiming that the alterations were minor and that they optimize serial performance. This is simply not true. Firstly, over half of the lines of code in the entire program have been altered. Secondly, the new version is slower than Lennart's original version 5 on 1 core. So there is no logical reason to choose the revised Haskell as the basis of a naive parallelization unless you want to cherry-pick results by leveraging knowledge of how they will scale after parallelization. Suffice it to say, doing so would be bad science. This is illustrated in the following graph of performance results for Lennart's original version 5 vs Saynte's new revised Haskell with 11 levels of spheres at 1,024×1,024: The original code is 5% faster on 1 core but scales poorly and is 2.7× slower on 7 cores. The substantially-revised "serial" Haskell code was obviously specif...

Naïve Parallelization: C++ vs Haskell

A member of the Haskell community recently published a blog article revisiting our ray tracer language comparison, claiming to address the question of how naïve parallelizations in these two languages compare. The objective was to make only minimal changes to the programs in order to parallelize them and then compare performance. Our attempts to verify those results turned up a lot of interesting information. Firstly, the Haskell program that was supposedly naïvely parallelized was not the original but, in fact, a complete rewrite. This raises the question of whether or not the rewrite was specifically designed to be amenable to parallelization and, therefore, is not representative of naïve parallelization at all. The C++ used was the original with minimal changes to parallelize a single loop. Secondly, although the serial benchmark results covered a spectrum of inputs, the parallel results covered only a single case and retrospectively identified the optimal results without alluding ...

High-performance parallelism with HLVM

Our open source HLVM project recently reached a major milestone with new support for high-performance shared-memory parallel programming. This is a major advance because it places HLVM among the world's fastest high-level language implementations. The previous HLVM implementation had demonstrated the proof of concept using a stop-the-world mark-sweep garbage collector. This new release optimizes that robust foundation by carrying thread-local data in registers in order to provide excellent performance. The following benchmark results show HLVM beating OCaml on many serial benchmarks on x86 despite the overheads of HLVM's new multicore-capable garbage collector: HLVM is over 2× faster than OCaml on average over these results and 4× faster on the floating-point Fibonacci function.