Array operations, C#, etc

bearophile bearophileHUGS at lycos.com
Mon Nov 3 17:01:31 PST 2008


Mono for gaming and higher performance:
http://tirania.org/tmp/PC54-slides-as-pdf.pdf
Link coming from this blog post:
http://tirania.org/blog/archive/2008/Nov-03.html

The article (well, slide set) shows how C# (on Mono) is being used to replace scripting languages like Lua/Python for AI in games (D isn't listed there; maybe they think D is a dinosaur like C++).

Near the end the slide set also shows the approach taken by Mono to use the SIMD instructions of the CPU, defining many types like:

Mono.Simd.Vector16b  - 16 unsigned bytes
Mono.Simd.Vector16sb - 16 signed bytes
Mono.Simd.Vector2d   - 2 doubles
Mono.Simd.Vector2l   - 2 signed 64-bit longs
Mono.Simd.Vector2ul  - 2 unsigned 64-bit longs
Mono.Simd.Vector4f   - 4 floats
etc...

Operations on them get translated into SIMD instructions.
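
For comparison, here is a rough D sketch of the same idea: a fixed-size value type whose operators a compiler could in principle map straight to SIMD instructions. This is only an illustration of the concept, not how Mono.Simd is implemented:

// illustrative only: fixed-size vector type with an element-wise sum
// that a SIMD-aware compiler could turn into a single addps
struct Vector4f {
  float[4] data;

  Vector4f opAdd(Vector4f other) {
    Vector4f r;
    for (int i = 0; i < 4; i++)
      r.data[i] = this.data[i] + other.data[i];
    return r;
  }
}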

D instead augments all arrays with array operations, but then it also has to manage the cases where lengths aren't exact multiples of the SIMD register width.

I think that such length management is done at runtime, so you pay a small price when you have just a few items, like 4 floats, a price that you don't pay using Mono.Simd.Vector4f.
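
In scalar form that length handling has to look roughly like this sketch (the real runtime code uses SSE for the blocks; the point here is just the extra block + tail bookkeeping):

// only a sketch of the block + tail handling, not the real runtime code
void addSlices(float[] s, float[] a, float[] b) {
  size_t i = 0;
  size_t blocks = s.length / 4;     // full 4-float blocks (SIMD candidates)
  for (; i < blocks * 4; i += 4) {
    s[i]   = a[i]   + b[i];
    s[i+1] = a[i+1] + b[i+1];
    s[i+2] = a[i+2] + b[i+2];
    s[i+3] = a[i+3] + b[i+3];
  }
  for (; i < s.length; i++)         // leftover tail, one float at a time
    s[i] = a[i] + b[i];
}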

When array sizes are known at compile time and fixed, like in this situation:

void main() {
  float[4] a = [1.0, 2.0, 3.0, 4.0];
  float[4] b = [10.0, 20.0, 30.0, 40.0];
  float[4] s;
  s[] = a[] + b[];
}

Inside the arrayfloat._arraySliceSliceAddSliceAssign_f() function the compiler could use that compile-time information to remove the runtime length checks and fallbacks, using some static ifs.
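
A hypothetical sketch of that idea (the real _arraySliceSliceAddSliceAssign_f isn't written like this, it takes runtime slices; here the length is a template parameter instead):

// hypothetical: fixed-length version where N is known at compile time,
// so the remainder handling can be removed with a static if
void addFixed(int N)(ref float[N] s, ref float[N] a, ref float[N] b) {
  static if (N % 4 == 0) {
    // length is a compile-time multiple of 4: only whole SIMD blocks
    // are needed, no runtime tail loop
    for (int i = 0; i < N; i += 4) {
      s[i]   = a[i]   + b[i];
      s[i+1] = a[i+1] + b[i+1];
      s[i+2] = a[i+2] + b[i+2];
      s[i+3] = a[i+3] + b[i+3];
    }
  } else {
    // general fallback with the usual length handling
    for (int i = 0; i < N; i++)
      s[i] = a[i] + b[i];
  }
}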

The purpose is to make it produce only the naked instructions in that case (I may write a little benchmark in D with inline asm to compare the speed of the s[]=a[]+b[] line):

movups (%eax),%xmm0
movups (%edi),%xmm1
addps %xmm1,%xmm0
movups %xmm0,(%eax)
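
For that benchmark, a DMD inline asm version of the 4-float sum could look something like this (untested sketch, Intel syntax as DMD wants, and it assumes SSE is present):

// untested sketch: add 4 floats with SSE using DMD inline asm
void add4(float* a, float* b, float* s) {
  asm {
    mov EAX, a;          // pointer to first operand
    mov ECX, b;          // pointer to second operand
    mov EDX, s;          // pointer to destination
    movups XMM0, [EAX];  // load 4 floats (unaligned)
    movups XMM1, [ECX];
    addps XMM0, XMM1;    // packed single-precision add
    movups [EDX], XMM0;  // store the 4 results
  }
}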

I think another little and less easy to solve problem comes from this check near the beginning of that _arraySliceSliceAddSliceAssign_f function:
if (sse() && ...
That check is probably quick, but if you have to sum just 4 floats inside a loop I presume it may slow down the code some (there's also the function call; it's not inlined).
The sse() test can't be done at compile time because you don't know what CPU the code will run on (though eventually a compiler switch could be added to state that the program will only run on CPUs with SSE2, etc.).

A brutal solution is to duplicate the object code of the functions that contain SSE instructions, so at the start of the run the jump targets of those function calls can be patched once for the whole program :-) It may make the executable longer, but seeing how executables are often 300+ KB I don't think that's a big problem, and I presume only a few functions will contain array operations.
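
A milder form of the same idea, sketched with a function pointer chosen once at program start (the names here are hypothetical, this isn't what the runtime does):

// hypothetical sketch: pick the implementation once at startup,
// so the per-call sse() check disappears
void addSlicesScalar(float[] s, float[] a, float[] b) {
  foreach (i, x; a)
    s[i] = x + b[i];
}

alias addSlicesScalar addSlicesSse;   // stand-in for a real SSE version
bool sseAvailable() { return true; }  // stand-in for a real CPUID check

void function(float[], float[], float[]) addSlicesDispatch;

static this() {
  addSlicesDispatch = sseAvailable() ? &addSlicesSse : &addSlicesScalar;
}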

Bye,
bearophile
