Matrix mul

Sun Nov 23 12:37:15 PST 2008

On Sun, Nov 23, 2008 at 6:15 PM, bearophile <bearophileHUGS at lycos.com> wrote:
> Bill Baxter:
>> Exactly.  That's why I haven't spent too much time benchmarking it.
>> It would be quite surprising if something I wrote in D outperformed
>> the ATLAS SSE3 optimized BLAS implementation.
>
> Performing many benchmarks teaches you that it's better not to assume too much. Nature and computers often find ways to surprise you :-)

I do find all your benchmark postings interesting.  I'd heard the DMD
floating-point codegen wasn't so great, and now I'm quite convinced.

In this particular case I haven't had any performance problems
with my setup yet, so I haven't felt the need to investigate.  I
needed various LAPACK routines anyway, and LAPACK depends on BLAS, so
BLAS is just sitting there and I might as well use it.  Anyway, I gave
it a try, and it multiplies the 400x400 matrices about 6.7x faster
than the fastest result from the naive multiplication implementation
in your benchmark.  Note, however, that BLAS only accepts contiguous
arrays, so I only tried it on your V2 allocation strategy, the one
that allocates one big array.
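
To make the contiguity point concrete, here is a minimal sketch in
Python (not the D code from the benchmark, which is not reproduced
here).  It packs separately allocated rows into one row-major buffer,
the kind of single-block layout that a CBLAS-style matrix multiply can
take when told the data is row-major, and then runs the naive triple
loop over that buffer.  The names pack_row_major and naive_matmul are
hypothetical, chosen only for illustration, and the loop is the plain
textbook algorithm, not what ATLAS does internally.

```python
from array import array


def pack_row_major(rows):
    """Copy separately allocated rows into one contiguous buffer of doubles.

    Hypothetical helper: element (i, j) ends up at index i * n_cols + j.
    """
    n_cols = len(rows[0])
    flat = array("d")  # array('d') stores C doubles in one contiguous block
    for row in rows:
        if len(row) != n_cols:
            raise ValueError("all rows must have the same length")
        flat.extend(row)
    return flat


def naive_matmul(a, b, m, k, n):
    """Naive product of an m x k and a k x n matrix, both row-major and flat."""
    c = array("d", [0.0]) * (m * n)  # zero-filled m x n result buffer
    for i in range(m):
        for p in range(k):
            a_ip = a[i * k + p]
            for j in range(n):
                c[i * n + j] += a_ip * b[p * n + j]
    return c


a = pack_row_major([[1.0, 2.0], [3.0, 4.0]])
b = pack_row_major([[5.0, 6.0], [7.0, 8.0]])
c = naive_matmul(a, b, 2, 2, 2)
print(list(c))  # [19.0, 22.0, 43.0, 50.0]
```

With an array-of-rows allocation, each row lives in its own block, so
a copy like pack_row_major would be needed before handing the data to
a library that expects one buffer; with the one-big-array strategy the
buffer can be passed as is.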

--bb