Dynamic Code in D

Sun Jan 13 13:28:41 PST 2008

I don't know what the right answer is for your application, so I'll just 
try and answer where I can.

Michael Wilson wrote:
> The system carries out data mining tasks using algorithms generated by a
> genetic programming engine. The original prototype (developed before I
> joined the company) did this by construcing a Lisp-style function tree
> out of Java objects, then running that in an interpreter (i.e. it used
> ECJ). Unsurprisingly this was hideously slow. The current version that
> I designed works by compiling the function tree to Java bytecode (as
> static methods in a new class), calling an in-memory classloader on the
> output (allowing Java to link it), then allowing Hotspot to translate
> it to native machine code. However to get good performance on large
> datasets, I had to implement the terminal functions (i.e. the inner
> loops) in C, using GCC's SSE intrinsics. The dynamic java code calls
> this via the Java Native Interface (specifically, a class full of static
> native methods). However JNI has a significant call overhead, which is
> eating up a good fraction of the performance boost, as well as creating
> additional development/maintenance hassle. This system is running on a
> couple of beefy 16 core Opteron servers and it still takes hours to run
> on large datasets.

C code can be called directly from D without a JNI type interface.

> The major problem with implementing in D is the lack of runtime
> compliation and the AFAIK subpar support for dynamic linking. I'll get
> to that in a bit. The other potential problems are;
> 
> 1. Support for AMD64 on Linux is absolutely essential; these datasets
> are several gigabytes in size and need to be loaded into memory. Plus
> there is a nontrivial amount of 64-bit integer manipulation in the Java
> code. Linux obviously because what sane person would waste money on
> Windows Server licenses for compute servers? I haven't been able to
> track down the latest estimate for when DMD will support AMD64, but at
> least GDC does now. Is this stable enough for production use?

64 bit DMD is not on the horizon yet, so your only option there is GDC. 
If it isn't stable enough, I suspect that your needs will drive it to 
stability.

> 2. High-performance synchronization. As of 1.6, Java's monitor
> implementation is quite impressive; the biased locking mechanism means
> that atomic operations are only required for a small fraction of monitor
> acquisitions. I've been trying to find out if D's implementation uses
> the same mechanism. This is probably only good for a few percent of
> performance, but on 16-way NUMA locking performance can make a
> difference even when it isn't in the inner loops.

D's locking mechanism is controlled by library code, so it is adaptable 
by the user. Currently, it is rather basic.

> 3. SSE intrinsics. Does GDC have the equivalent of GCC SSE intrinsics
> yet (i.e. "target builtins")? This post suggests that it does;
> http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=D.gnu&artnum=2622 
> 
> but I can't see anything in the documentation about them. If not I'll
> have to continue linking to the existing C code, which eliminates a
> significant benefit of going to D (simplifying the code base - but at
> least the JNI call & pinning overhead will be gone). Ideally
> std.intrinsic would have these (presumably that's how DMD is going to
> include them, if an when it does - as far as I can tell, DMD is still
> doing even basic double-precision operations with slow x87 code rather
> than SSE code, unless
> http://www.digitalmars.com/d/archives/digitalmars/D/Optimizations_of_DMD_49840.html 
> 
> is out of date).

DMD supports the SSE code via the inline assembler only. I don't know if 
GDC does. In any case, it would be simple to just link to C code that 
does it.

> Ok, on to the serious problem, dynamic code support. The GP engine is
> outputting a function tree structure. I can trivially output this as D
> or Java source. I can output it as Java bytecode with a bit more effort.
> I really don't want to have to compile it to x86 machine code (and link
> it against the terminal functions) myself, particularly given that the
> research guys keep elaborating the algorithm. Here's what I've
> considered so far;
> 
> 1. Output as C source plus a JNI stub in bytecode. Invoke gcc on it to
> build a dynamic library. Load the class in Java. This would remove most
> of the JNI overhead (because the Java/C crossover would be pushed 4 to
> 10 places up the call graph). However I'd have to keep invoking GCC as
> a seperate process, though I could cut down the overhead of this
> somewhat by putting it and the dynamic files on a ramdisk. I'm unsure
> what the performance would be like under Linux of constantly linking
> new shared libraries, and I'm not sure if the Sun JVM is capable of
> unloading native code along with unloading classes (though this memory
> leak is probably insignificant over the lifetime of the process). This
> is the solution I will be using if I can't find a good D solution.
> 
> 2. Port to D. Change the GP system to target MiniD bytecode. I haven't
> examined MiniD in detail yet, but I get the impression that this will
> be marginally harder than targetting Java bytecode. Alternatively I
> could target MiniD source, which would be simpler for me, at some
> additional compilation cost, or I could dig into the MiniD compiler
> source and see if it has an AST I could target. The later would be
> ideal if the code is stable, but it probably isn't.
> 
> MiniD is purely interpreted, so performance of the dynamic code will
> almost certainly suck badly compared to Hotspot. On the plus side,
> 'linking' cost is effectively zero. The inner loops will still have
> to be in either D (if it has decent SSE intrinsics now) or C (if it
> doesn't); at first glance MiniD native invokation overhead looks better
> than JNI (no need to pin for one thing, since the GC is non-copying)
> but still nontrivial.

If the linker overhead is too big, it's possible to do the 'linking' 
yourself by directly reading the .o file, stuffing it in memory, and 
executing it.

> 3) Port to D. Change the GP system to target D source. Batch code
> updates together. On each update, recompile the codebase using gdc
> running in a ramdisk. Start a new process using the new code. Keep
> all the persistent data in a shared memory segment. Get the new
> process to map in the shared segment, then kill the old process.
> Processing can continue from where it left off using the updated code.
> 
> AFAIK this /should/ work, but I don't know what the overheads are
> going to be like. Having to batch together the code updates is
> mildly inconvenient from the algorithm design side. I'm not sure what
> if any incremental compilation features GDC has; perhaps I could path
> it to stay in memory and stream new classes to it over a pipe.

This looks like your best shot with D. Like you say, it is dependent on 
how fast GDC can be run.

> 4) Port to D. Change the GP system to target D source. Batch code
> updated together. On each update, link the new classes in with DDL.
> 
> The major problem with this is that AFAIK DDL isn't working on Linux
> yet (at all, never mind on AMD64) and is still beta quality code
> even on Windows. In addition to that, I'm not clear if it can ever
> recover the memory for unused code, though as I mentioned for (1) this
> leak isn't likely to be a problem over the lifetime of the process.
> Plus the GDC invokation overhead as for (3).
> 
> and finally...

D DLLs work on Windows. DMD supports writing pic code on Linux, I've 
just never done the work to build a shared library using it.

> 6) Write a new 'D Dynamic Code Library'. My best guess for how this
> should work is this;
> 
> a) This library will statically link to GDC and the GCC back end.
> b) On first invokation, look for a metadata file. If it isn't present or
> is out of date, recompile the app from source and keep the relevant
> GDC state in memory. Otherwise, deserialise GDC state from the metadata
> file.
> c) Supply functions that accept D source and possibly one or more of the
> GCC intermediate representations (i.e. AST, CFG, RTL). These would call
> through to GDC/GCC, intercept the output and write it into the code
> page, then perform runtime linking as necessary. This is probably a lot
> of work, but I am unsure exactly how much.
> d) Open source this library in case anyone else wants to use it.

Wow.

> Any thoughts on any of the above? Bear in mind that I have a mild bias
> towards using D even if it is some extra work to get things going, but I
> am certainly not going to accept worse overall performance than the Java
> version.

I think that any D path you take with this will be of a large, sustained 
benefit to the D community.