So this is the best version so far: http://dpaste.dzfl.pl/8dae9b359f27

I'm not showing the version with the manually inlined function. (I have also seen that GCC generates slightly faster code on my CPU if I don't use SSE registers.)

Bye,
bearophile