
Showing posts with the label hlvm

Lessons learned from HLVM

Although our HLVM project is intended to bring modern VM techniques to modern programming languages without requiring any new research, we have ended up producing several enlightening new results. This article takes a look at some of the weird and wonderful techniques we used in HLVM.

GC as a metaprogram (good!)

Whereas most VMs use a garbage collector written in C or C++, the GC in HLVM is written in its own intermediate representation. This unusual design has a variety of benefits:

- The GC acts as a test for the compiler.
- The GC benefits from improvements we make to the compiler.
- The GC is simplified by being able to use features like tail call elimination and higher-order functions.

Overall, implementing the GC in HLVM's own IR proved to be hugely beneficial: the GC was very easy to write and is easy to maintain.

Fat references (bad!)

Most VMs store header data in each heap-allocated block that describes the size and type of a value in order to allow the garbage collector to tr...
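To make the "fat reference" idea concrete, here is a minimal sketch in OCaml of what a struct-based reference carrying its metadata alongside the pointer might look like. The field names and layout are illustrative assumptions, not HLVM's actual representation:

```ocaml
(* Hypothetical model of a "fat" reference: the metadata that most VMs
   store in a per-block header travels with the reference itself. *)
type type_info = { tag : int; name : string }

type 'a fat_ref = {
  data : 'a array;   (* pointer to the heap-allocated payload *)
  length : int;      (* array length, readable without an indirection *)
  ty : type_info;    (* run-time type, e.g. a variant constructor tag *)
  mark : bool ref;   (* GC metadata carried alongside the pointer *)
}

(* Length and type tag are available without dereferencing the payload. *)
let length r = r.length
let tag r = r.ty.tag

let () =
  let r = { data = [| 1; 2; 3 |]; length = 3;
            ty = { tag = 0; name = "int array" }; mark = ref false } in
  assert (length r = 3);
  assert (tag r = 0)
```

The trade-off hinted at above is that such a reference occupies several words instead of one, which must be copied whenever the reference is passed around.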

Variant types and pattern matching in HLVM

The compiler development series of articles in The OCaml Journal has culminated in an example front-end compiler targeting HLVM that now supports variants and pattern matching in addition to the previous features. These new language features were implemented in only a few lines of code, bringing the front-end to just 253 lines of OCaml code! This new functionality allows substantially more complicated ML-style programs to be written, including the following symbolic simplifier: # type expr = Int of int | Var of int | Add of expr * expr | M...
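A sketch of the kind of symbolic simplifier described above, written in ordinary OCaml. The variant type follows the excerpt; the final, truncated constructor is assumed here to be multiplication, and the rewrite rules (constant folding plus the identities x+0=x, x*0=0, x*1=x) are illustrative:

```ocaml
(* Expression type from the excerpt; the last constructor is an assumption. *)
type expr =
  | Int of int
  | Var of int
  | Add of expr * expr
  | Mul of expr * expr

(* Bottom-up simplification by pattern matching. *)
let rec simplify = function
  | Int _ | Var _ as e -> e
  | Add (a, b) ->
      (match simplify a, simplify b with
       | Int m, Int n -> Int (m + n)          (* constant folding *)
       | Int 0, e | e, Int 0 -> e             (* x + 0 = x *)
       | a, b -> Add (a, b))
  | Mul (a, b) ->
      (match simplify a, simplify b with
       | Int m, Int n -> Int (m * n)          (* constant folding *)
       | Int 0, _ | _, Int 0 -> Int 0         (* x * 0 = 0 *)
       | Int 1, e | e, Int 1 -> e             (* x * 1 = x *)
       | a, b -> Mul (a, b))

let () =
  assert (simplify (Add (Var 0, Int 0)) = Var 0);
  assert (simplify (Mul (Int 2, Int 3)) = Int 6)
```

Each variant case compiles down to a tag test, which is exactly the kind of code the new HLVM front-end support generates.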

The HLVM Reddit

The Reddit user ethicszen has kindly created a subreddit for HLVM-related news and discussion. Please post links to interesting HLVM articles there. Also, don't forget to join our mailing list if you want to hear what is going on or discuss HLVM in any way.

Naive parallelism with HLVM

The latest OCaml Journal article, High-performance parallel programming with HLVM (23rd January 2010), described a simple but effective parallelization of the HLVM implementation of the ray tracer. Comparing with similarly naive parallelizations of the fastest serial implementations in C++ and Haskell, we obtained the following results. These results exhibit several interesting characteristics:

- Even the naive parallelization in C++ is significantly faster than HLVM and Haskell.
- C++ and HLVM scale well, with performance improving as more cores are used.
- Despite having serial performance competitive with HLVM, the naively-parallelized Haskell scales poorly. In particular, Haskell failed to obtain a competitive speedup with up to 5 cores and its performance even degraded significantly beyond 5 cores, running 4.4× slower than C++ on 7 cores.

The efficiency of the parallelization can be quantified as the speed of the parallel version on multiple cores relative to its speed on a single core:...
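The metric described above is easy to state precisely. A small sketch, with illustrative sample times rather than the article's measurements: speedup is the one-core time divided by the n-core time (the ratio the article uses), and the conventional per-core efficiency divides that speedup by the number of cores:

```ocaml
(* Speedup and per-core efficiency of a parallel run.
   t1 = wall-clock time on one core, tn = time on n cores. *)
let speedup ~t1 ~tn = t1 /. tn
let efficiency ~n ~t1 ~tn = speedup ~t1 ~tn /. float_of_int n

let () =
  (* e.g. a run taking 8 s on one core and 2 s on five cores *)
  assert (abs_float (speedup ~t1:8.0 ~tn:2.0 -. 4.0) < 1e-9);
  assert (abs_float (efficiency ~n:5 ~t1:8.0 ~tn:2.0 -. 0.8) < 1e-9)
```

By this measure, performance that degrades as cores are added (as reported for Haskell beyond 5 cores) shows up as an efficiency that falls well below 1/n.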

HLVM on the ray tracer language comparison

We recently completed a translation of our famous ray tracer language comparison to HLVM. The translation is equivalent to the most highly optimized implementations written in other languages, which allows us to compare HLVM with a variety of competing languages for the first time. The results are astonishing. Running the benchmark with the default settings (level=9, n=5, rendering 87,381 spheres at 512×512) on 32-bit x86 gives the following times for different languages: These results show that HLVM already provides competitive performance on a non-trivial benchmark. HLVM took 6.7s whereas C++ (compiled with g++ 4.3.3) took only 4.3s and Haskell (compiled with GHC 6.12) took 13.9s. However, cranking the level parameter up to 12 in order to increase the complexity of the scene, rendering a whopping 5,592,405 spheres, we find that HLVM blows away the other garbage-collected languages and is even able to keep up with C++: This remarkable result is a consequence of HLVM's space-ef...
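The sphere counts quoted above are consistent with a scene in which every sphere carries four child spheres at the next level down, i.e. spheres(1) = 1 and spheres(l) = 1 + 4·spheres(l-1), which gives (4^l - 1)/3 in closed form. A quick check (the recurrence is inferred from the two published counts, not stated in the article):

```ocaml
(* Number of spheres in a scene of the given level, assuming each
   sphere spawns four children per level: s(l) = 1 + 4 * s(l-1). *)
let rec spheres level =
  if level = 1 then 1 else 1 + 4 * spheres (level - 1)

let () =
  assert (spheres 9 = 87_381);       (* level=9, as in the article *)
  assert (spheres 12 = 5_592_405)    (* level=12, as in the article *)
```

So raising the level from 9 to 12 multiplies the scene size by 64, which is what makes it such a stress test for the garbage collector.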

High-performance parallelism with HLVM

Our open source HLVM project recently reached a major milestone with new support for high-performance shared-memory parallel programming. This is a major advance because it places HLVM among the world's fastest high-level language implementations. The previous HLVM implementation had demonstrated the proof of concept using a stop-the-world mark-sweep garbage collector. This new release optimizes that robust foundation by carrying thread-local data in registers in order to provide excellent performance. The following benchmark results show HLVM beating OCaml on many serial benchmarks on x86 despite the overheads of HLVM's new multicore-capable garbage collector: HLVM is over 2× faster than OCaml on average over these results and 4× faster on the floating-point Fibonacci function.

How does HLVM's GC work?

HLVM injects code into managed user-code functions that keeps two data structures up to date:

- The shadow stack lists all references held on the stack or in registers and acts as the set of global roots into the heap. If a function accepts an argument of a reference type, or allocates, then the value is pushed onto the shadow stack. On function exit, the shadow stack is reset back to the way it was when the function was entered.
- The allocated list gives all of the references that have been allocated by the program. When a function allocates, the value is appended onto the allocated list.

When the allocation quota is exceeded, a garbage collection occurs. This is composed of two phases:

- Mark. Starting from the global roots on the shadow stack, all reachable data in the heap are marked as reachable (and the rest are left marked as unreachable).
- Sweep. The allocated list is traversed, unreachable values are freed, and all surviving values have their mark state set back to unreachable.

This functionality ...
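The scheme above can be modelled in a few lines of OCaml. This is a sketch of the algorithm, not HLVM's implementation: real freeing is replaced by dropping values from the allocated list, and names are illustrative:

```ocaml
(* Minimal model of HLVM's GC: a shadow stack of roots, an allocated
   list of every live heap value, and a two-phase mark/sweep collection. *)
type obj = { mutable marked : bool; mutable children : obj list }

let shadow_stack : obj list ref = ref []   (* global roots *)
let allocated : obj list ref = ref []      (* everything allocated so far *)

let allocate children =
  let o = { marked = false; children } in
  allocated := o :: !allocated;            (* recorded for the sweep phase *)
  shadow_stack := o :: !shadow_stack;      (* freshly allocated values are roots *)
  o

(* Phase 1: mark everything reachable from the shadow stack. *)
let rec mark o =
  if not o.marked then begin
    o.marked <- true;
    List.iter mark o.children
  end

(* Phase 2: drop unreachable values; reset marks for the next cycle. *)
let sweep () =
  allocated := List.filter (fun o -> o.marked) !allocated;
  List.iter (fun o -> o.marked <- false) !allocated

let collect () =
  List.iter mark !shadow_stack;
  sweep ()
```

Resetting `shadow_stack` to its old value plays the role of the injected function-exit code; a `collect ()` after that frees anything no longer reachable from the remaining roots.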

Current shadow stack overheads in HLVM

Many people, including the developers of the Haskell (see here) and OCaml (see here) compilers, have been quick to dismiss the shadow stack algorithm for performance reasons. Despite these concerns, we chose to use a shadow stack in HLVM for several reasons:

- GC performance is low on the list of priorities for HLVM.
- Shadow stacks are easy to implement entirely from within HLVM, whereas conventional strategies entail injecting stack maps into stack frames on the normal stack, which requires low-level hacking with LLVM's experimental GC API in C++.
- Shadow stacks are easy to debug because they are just another data structure in the heap.
- HLVM provides typed nulls and the ability to read array lengths or type constructor tags without an indirection. This requires the use of an unconventional struct-based reference type that is incompatible with LLVM's current GC API, which can only handle individual pointers.

Consequently, it is interesting to use HLVM to measure just how expensive it...

HLVM's first front-end

Alp Mestan has begun work on the first front-end for HLVM. This project has several important goals:

- Provide a high-level language in which HLVM developers can write test code, benchmarks and garbage collectors.
- Serve as a tutorial for people porting existing compilers, such as Moscow ML, NekoML and PolyML, to HLVM.
- Track new features as they are added to HLVM, such as closures, parametric polymorphism, parallelism and so on.

Alp intends to apply to Jane St. Capital to fund a summer project that will port the entire OCaml compiler to HLVM.

Performance: OCaml vs HLVM beta 0.4

A quick update due to a surprise performance result for HLVM! We knew that manipulating the shadow stack was a major slowdown in HLVM from the change in our benchmark results when the GC was first introduced a few weeks ago, but we did not expect that a simple local optimization, unrolling, could go so far towards recovering performance. Moreover, this optimization was implemented in only a few lines of OCaml code. The following results prove that HLVM now produces substantially faster programs than ocamlopt for numerical tasks on x86: Interestingly, now that HLVM supports standalone compilation using LLVM's IR optimization passes as well as unoptimized JIT compilation, we see that LLVM's optimization passes only give small performance improvements in many cases and even substantially degrade performance on our 10-queens benchmark. This is a direct result of HLVM producing near-optimal code directly (which greatly reduces compile times) and has eliminated our desire to add sup...

Memory footprints: OCaml vs HLVM

Someone recently published a blog article about the sizes of values in the Ruby and OCaml programming languages. This is also interesting in the context of HLVM because our high-performance goals also require an efficient memory representation, but our solution is quite different to OCaml's: HLVM is designed for numerical performance whereas OCaml was designed for symbolic performance. The following table lists the sizes of values of several different types in OCaml and HLVM on 32-bit architectures:

Type            OCaml        HLVM
unit            32 bits      32 bits
bool            32 bits      8 bits
int31           32 bits      None
int32           96 bits      32 bits
int64           128 bits     64 bits
float32         None         32 bits
float64         128 bits     64 bits
float * float   320 bits     128 bits
Enumeration     32 bits      96 bits
float64 array   64+64n bits  96+64n bits

Note how HLVM uses the most memory-efficient representations possible for ints, floating-point numbers and tuples but uses less efficient representations for reference types such as arrays and variant types. A side-eff...
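OCaml's large footprint for a float pair comes from boxing: the pair is a heap block holding pointers to two separately boxed floats, whereas HLVM stores the two floats flat. This is easy to observe from within OCaml itself using Obj.reachable_words (available since OCaml 4.04); exact word counts depend on the word size, so the sketch below only asserts the ordering:

```ocaml
(* Heap footprint of an OCaml value in words, headers included.
   Unboxed immediates such as int report 0. *)
let words v = Obj.reachable_words (Obj.repr v)

let () =
  assert (words 42 = 0);                 (* int: unboxed immediate *)
  assert (words 1.0 > 0);                (* float: boxed heap block *)
  assert (words (1.0, 2.0) > words 1.0)  (* pair block + two boxed floats *)
```

The same experiment on an int pair versus a float pair makes the cost of OCaml's uniform representation for polymorphic data directly visible.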

Performance: OCaml vs HLVM beta 0.3

Our HLVM project is evolving rapidly and recently saw its first release with a working garbage collector. This has allowed us to make some performance measurements and, as we had hoped, the results are extremely compelling. The following graph illustrates the time taken to perform each of the benchmarks in the HLVM test suite on one of the 2.1GHz Opteron 2352 cores in an 8-core Dell PowerEdge. The individual benchmarks are:

- fib: The usual Fibonacci function.
- ffib: A floating-point version of the Fibonacci function.
- sieve: Sieve of Eratosthenes to find all of the primes under 10^8 and print the last one, using a byte array.
- mandel: Mandelbrot rendering with float arithmetic.
- mandel2: Mandelbrot rendering with abstracted complex arithmetic, where complex numbers are pairs of floating-point numbers.
- Array.fold: Initialize a 10^8-element float array and fold a float * float pair over it.
- List.init/fold: Initialize a 10^6-element int list and fold over it (allocation intensive).
- 1...
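For reference, the first two benchmarks are tiny; the OCaml versions are essentially the following (the float variant exercises unboxed float arithmetic, which is where HLVM shines):

```ocaml
(* The usual doubly recursive Fibonacci function over ints... *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

(* ...and the same function over floats (the "ffib" benchmark). *)
let rec ffib n = if n < 2.0 then n else ffib (n -. 1.0) +. ffib (n -. 2.0)

let () =
  assert (fib 10 = 55);
  assert (ffib 10.0 = 55.0)
```

Small Fibonacci values are exactly representable as floats, so fib and ffib agree on these inputs; the benchmarks differ only in which arithmetic path the compiler must generate.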

HLVM has been released

We have made the first public release of our much-anticipated High-Level Virtual Machine (HLVM) project. The complete source code is now available from the HLVM project page on the OCaml Forge via the Subversion repository. HLVM makes extensive use of some features only recently added to LLVM and, consequently, requires a patched version of LLVM 2.4 or later. Even at this early stage, our performance results are extremely compelling. HLVM is already several times faster on a variety of benchmarks than some notably high-performance functional programming languages such as OCaml. HLVM is specifically designed for technical computing and we believe it will offer the best performance for numerical computing of any high-level language, including easy-to-use support for parallelism. We also intend to provide extensive support for visualization using OpenGL and make HLVM a viable platform for both free and commercial software in the future. This open source project has been solely funded by ...

Building a better future: the High-Level Virtual Machine

Microsoft's Common Language Runtime (CLR) was a fantastic idea. The ability to interoperate safely and at a high level between different languages, from managed C++ to F#, has greatly accelerated development on the Microsoft platform. The resulting libraries, like Windows Presentation Foundation, are already a generation ahead of anything available on any other platform. Linux and Mac OS X do not currently have the luxury of a solid foundation like the CLR. Consequently, they are composed entirely of uninteroperable components written in independent languages, from unmanaged custom C++ dialects to Objective-C and Python. Some developers choose to restrict themselves to the lowest common denominator (e.g. writing GTK in C), which aids interoperability but only at a grave cost in productivity. Other developers gravitate to huge libraries written in custom dialects of particularly uninteroperable languages (e.g. Qt). Both approaches have a bleak future. The situation is compounded b...