2016Q1: std.blas

Sat Dec 26 11:57:19 PST 2015

Hi,

I am going to write the GEMM and GEMV families of BLAS for Phobos.

Goals:
  - no assembler code
  - code based on SIMD instructions
  - DMD/LDC/GDC support
  - kernel-based architecture like OpenBLAS
  - 85-100% of the FLOPS of OpenBLAS (OpenBLAS = 100%)
  - small, generic code base compared with OpenBLAS
  - ability to define user kernels
  - allocator support. GEMM requires small internal allocations.
  - @nogc nothrow pure template functions (depending on the allocator)
  - optional multithreading
  - ability to work with `Slice` multidimensional arrays when the
stride between elements in a vector is greater than 1. In conventional
BLAS, one of the matrix strides (between rows or between columns) always
equals 1.
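
To illustrate the last goal, here is a minimal reference sketch of a GEMV in which every dimension has its own stride. It is written in C only for illustration; the function name, the signature and the parameter names are assumptions made up for this example, not the planned std.blas API, and the loop is the naive algorithm with no SIMD or blocking.

```c
#include <stddef.h>

/* y = alpha * A * x + beta * y for an m x n matrix A.
 * Hypothetical signature: neither the row stride nor the column stride
 * of A has to be 1, and x and y carry their own element strides. */
void gemv_strided(size_t m, size_t n, double alpha,
                  const double *a, ptrdiff_t a_row_stride, ptrdiff_t a_col_stride,
                  const double *x, ptrdiff_t x_stride,
                  double beta,
                  double *y, ptrdiff_t y_stride)
{
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        /* Dot product of row i of A with x, stepping by the given strides. */
        for (size_t j = 0; j < n; ++j)
            acc += a[(ptrdiff_t)i * a_row_stride + (ptrdiff_t)j * a_col_stride]
                 * x[(ptrdiff_t)j * x_stride];
        double *yi = &y[(ptrdiff_t)i * y_stride];
        *yi = alpha * acc + beta * *yi;
    }
}
```

For example, a 2 x 2 view that takes every second column of a 2 x 4 row-major buffer has a row stride of 4 and a column stride of 2, so neither matrix stride is 1.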

Implementation details:
LDC     all   : very generic D/LLVM IR kernels. AVX/AVX2/AVX-512/NEON
support works out of the box.
DMD/GDC x86   : kernels for  8 XMM registers based on core.simd
DMD/GDC x86_64: kernels for 16 XMM registers based on core.simd
DMD/GDC other : generic kernels without SIMD instructions.
AVX/AVX2/AVX-512 support can be added in the future.
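
As a rough picture of what a "generic kernel without SIMD instructions" might look like, here is a sketch of a GEMM micro-kernel in the style described in [1]: it updates one small MR x NR block of C from packed panels of A and B. It is written in C only for illustration; the block sizes, the panel layout and all names are assumptions for this example and do not describe the actual std.blas kernels.

```c
#include <stddef.h>

enum { MR = 4, NR = 4 };  /* hypothetical register-block sizes */

/* C[MR x NR] += A_panel * B_panel.
 * a_panel: k consecutive slices of MR elements (one column slice of A each).
 * b_panel: k consecutive slices of NR elements (one row slice of B each).
 * C is addressed through arbitrary row and column strides. */
void micro_kernel_generic(size_t k,
                          const double *a_panel, const double *b_panel,
                          double *c, ptrdiff_t c_row_stride, ptrdiff_t c_col_stride)
{
    double acc[MR][NR] = {{0.0}};

    /* Accumulate k rank-1 updates in local storage. */
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += a_panel[p * MR + i] * b_panel[p * NR + j];

    /* Add the accumulated block to C. */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[(ptrdiff_t)i * c_row_stride + (ptrdiff_t)j * c_col_stride] += acc[i][j];
}
```

In a kernel-based design the outer loops (blocking and packing) stay the same, and only a small routine like this one is swapped per target, which is also where a user-defined kernel could plug in.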

References:
[1] Anatomy of High-Performance Matrix Multiplication: 
http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
[2] OpenBLAS: https://github.com/xianyi/OpenBLAS

Happy New Year!

Ilya