seeding the pot for 2.0 features

Sat Jan 27 15:14:39 PST 2007

I'll bite.  Here are the two features I consider most important:

1. Dynamic Closures:
See: 
http://lists.puremagic.com/pipermail/digitalmars-d/2006-August/007520.html

2. Low dimensional vectors as primitive types

Specifically I would like to see the types int, real, float etc. 
extended into 2-4 dimensional vectors.  ie. int2, real4, float3.

This would be exceptionally useful in many applications which require 
coordinate geometry.  Here is a very brief list:

Scientific Programs
Physics Simulation
Computer Graphics
Video Games
User Interfaces
Computational Geometry
Robotics
Fluid Simulation

etc.

I would prefer not to recount the number of times I have written my own 
vector library, and how tedious they are to create.  For most every 
language I learn it is the first thing I need to write, since so few are 
willing to provide a default implementation.  In my opinion, this is 
unacceptable for a problem which occurs so frequently.

One option is to extend the standard library with a vector-types class, 
but this is not nearly as nice a compiler level implementation.

1. The 90 degrees rotation trick
This is based on the following article:
http://www.flipcode.com/articles/article_fastervectormath.shtml

The basic idea is we use templates to expand vector expressions to 
achieve better register use and memory locality.  Here is a simple example:

a = b + c * d;

Normally, we would translate this as something like:

a = b.opAdd(c.opMul(d));

Which gives the following assembler:

//Do opMul
mov 	reg, c.x
mul	reg, d.x
mov 	tmp.x, reg
mov 	reg, c.y
mul	reg, d.y
mov 	tmp.y, reg
mov 	reg, c.z
mul	reg, d.z
mov 	tmp.z, reg

//Do opAdd
mov 	reg, tmp.x
add	reg, b.x
mov 	a.x, reg
mov 	reg, tmp.y
add	reg, b.y
mov 	a.y, reg
mov 	reg, tmp.z
add	reg, b.z
mov 	a.z, reg

This is not particularly efficient, since it requires the creation of a 
temporary vector to store the result from c * d.  A better strategy 
involves 'rotating the computation 90 degrees' and performing the 
expression on a per-component level:

//x - component
mov	reg, c.x
mul	reg, d.x
add	reg, b.x
mov	a.x, reg

//y - component
mov	reg, c.y
mul	reg, d.y
add	reg, b.y
mov	a.y, reg

//z - component
mov	reg, c.z
mul	reg, d.z
add	reg, b.z
mov	a.z, reg

The performance improvement becomes more substantial the longer the 
expression.  Since overloaded operators do not instantiate templates, 
there is no obvious way to obtain this result in the current language spec.

2. Architecture specific optimizations (SIMD)

For low dimensional arithmetic, many architectures provide specially 
optimized instruction for low dimensional vectors.  The problem is most 
languages do not exploit them.  Creating efficient SIMD code is 
impossible for a library, since each opAdd/opMul must be written using 
inline assembler and therefore incurs the overhead of a function call 
regardless.  This is worsened by the fact that moving to/from a vector 
register is typically very expensive.

A compiler level implementation can easily avoid these issues by 
assigning vector expressions to a register when passing them.  Moreover 
it is more portable than compiler intrinsics like MSVC's SSE extensions. 
  The implementation can easily emit fallback code if the architecture 
does not support SIMD instructions.

3. Swizzles

A swizzle is a reordering of the elements in a vector.  Shader languages 
like Cg or GLSL typically support them, given their utility in certain 
types of computations.  Here are some examples:

v.xxxx 	// Returns a vector with v.x broadcast to all components
v.xyz	// Returns only the xyz components of v
v.zyx	// Returns a vector consisting of the reverse of v's xyz components

Enumerating all possible swizzles within a template is impossible, and 
therefore requires one function per swizzle.  The result is massive code 
bloat, and many lines of automatically generated gibberish.  To get an 
idea at how many functions this requires, the total number of swizzles 
for 2-4 component vectors is 4^4 + 4^3 + 4^2 + 4 or 340.  Multiply that 
by the number of primitive types and the result becomes quite large.

-Mik