Worst Phobos documentation evar!

Manu via Digitalmars-d digitalmars-d at puremagic.com
Fri Jan 2 02:05:54 PST 2015


On 2 January 2015 at 08:26, Walter Bright via Digitalmars-d
<digitalmars-d at puremagic.com> wrote:
> On 1/1/2015 6:43 AM, Manu via Digitalmars-d wrote:
>>>
>>> Make it work in dmd (with my help, of course) and prove the design. Then
>>> GDC
>>> will come along.
>>
>>
>> I can't do anything that isn't supported by the GCC backend; that's
>> the platform for almost all the cross-compilers, which I really care
>> about.
>> I also have a suspicion that anybody who uses SIMD seriously will
>> probably be building with LDC or GCC because of improved codegen
>> across the board.
>
>
> Have to start somewhere.

Or it might just be best to drop such ambitious goals for the API from
the first revision :/


>>> Half-float is here:
>>>
>>> http://digitalmars.com/sargon/halffloat.html
>>>
>>> in the Sargon component library :-)
>>
>>
>> That's probably not the best place for it ;)
>> What was the controversy blocking it? I don't remember.
>
>
> I think the root issue was nobody thought it was useful other than you and
> I. There is a reasonable issue about "should this really be in the Standard
> Library, or an add-on?"
>
> I created Sargon as a place to put things I think would be generally useful,
> but have been unable to convince others of.

std.simd would ideally depend on it; half8 is a very useful SIMD primitive.
It might be tricky to define a half8 type in std.simd alongside float4,
short8, etc. that feels sufficiently related to the others and still emits
hardware opcodes.
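
For illustration only, here's a rough sketch of how a storage-only half8
could sit alongside core.simd's float4/ushort8. The names half8,
floatToHalf and halfToFloat are made up, and the scalar conversions are
deliberately simplified; a real implementation would use F16C/NEON
conversion opcodes where available:

  import core.simd; // float4, ushort8, etc.

  // Simplified conversions: round-toward-zero, denormals flushed to zero.
  ushort floatToHalf(float f)
  {
      uint bits = *cast(uint*)&f;
      uint sign = (bits >> 16) & 0x8000;
      int  exp  = cast(int)((bits >> 23) & 0xFF) - 127 + 15;
      uint mant = (bits >> 13) & 0x3FF;
      if (exp <= 0)  return cast(ushort)sign;            // underflow -> signed zero
      if (exp >= 31) return cast(ushort)(sign | 0x7C00); // overflow -> infinity
      return cast(ushort)(sign | (exp << 10) | mant);
  }

  float halfToFloat(ushort h)
  {
      uint sign = (cast(uint)h & 0x8000) << 16;
      int  exp  = (h >> 10) & 0x1F;
      uint mant = h & 0x3FF;
      uint bits;
      if (exp == 0)       bits = sign;                             // zero/denormal -> signed zero
      else if (exp == 31) bits = sign | 0x7F800000 | (mant << 13); // inf/nan
      else                bits = sign | cast(uint)(exp - 15 + 127) << 23 | (mant << 13);
      return *cast(float*)&bits;
  }

  // Storage-only half8: packed ushort8 payload, widened to float for arithmetic.
  struct half8
  {
      ushort8 data;

      static half8 fromFloats(const float[8] src)
      {
          half8 r;
          foreach (i; 0 .. 8)
              r.data.array[i] = floatToHalf(src[i]);
          return r;
      }

      float[8] toFloats() const
      {
          float[8] r;
          foreach (i; 0 .. 8)
              r[i] = halfToFloat(data.array[i]);
          return r;
      }
  }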


>>> It would be trivial to use halffloat as a model for building other custom
>>> floating point types.
>>
>>
>> Isn't there a competing custom float implementation though?
>> Should they be unified?
>
>
> I don't believe in the utility of a customizable float implementation. There
> is one, but it is pretty much useless. For example, halffloat cannot be
> implemented using it.

Why?

Well, could the custom float alternatively be implemented by generalising
halffloat? Either way, I agree that it's basically useless. Half float, on
the other hand, is a standard type that appears in geometry and image data
all the time these days.


>>>> That stuff all needs forceinline too to be particularly useful.
>>>
>>>
>>>
>>> I agree, but that does NOT block you from designing and writing the code.
>>> It
>>> only blocks the final polish.
>>
>>
>> I wrote the code years ago. I was at the point of polish where it got
>> stuck on those issues... and unittest's, and documentation >_<
>
>
> The last 10% of the work is always 90% of the work. Get it done!
>
>
>
>>> It also may be worthwhile to look through std.algorithm and see what can
>>> be
>>> specialized to use simd, such as string searching.
>>
>>
>> SSE provides some special opcodes for string processing that aren't
>> really portable, so I don't expect those would be expressed in
>> std.simd, but they could be used opportunistically in other places.
>> The main trouble is that they are SSE4.2, which is still sufficiently
>> new that we can't rely on its presence on end-user machines. How will
>> we tell the compiler it is okay to use those opcodes?
>> GCC/LDC have args to say what SSE level to build for. We need a way to
>> query the requested SSE level in the code, and we also need that
>> concept added to DMD.
>
>
> std.algorithm is portable for the user. It is very accommodating for
> specializations, so most definitely it can be specialized for x86 with SIMD.
>
> Some platforms, such as OSX, have a reliable minimum standard. This is why
> we, for example, have abandoned 32 bit OSX. For others, a runtime switch
> based on CPUID is fine.

I don't think a runtime switch is practical that far down the call-tree.
We need to be able to supply a SIMD level to the compiler, and
ideally, also change it locally with a pragma or an attribute (GCC and
Clang use an attribute).
Runtime SIMD branch selection needs to be positioned as high up the
stack as possible.
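
To be concrete about what "as high up the stack as possible" means, here's
a minimal sketch (core.cpuid is real druntime; the kernel and dispatcher
names are made up): select the implementation once, up front, so the hot
path never branches on CPU features.

  import core.cpuid : sse42; // druntime CPUID query

  // Hypothetical kernels, each written/compiled for a different SSE level.
  void processBaseline(float[] data) { /* portable SSE2 path */ }
  void processSSE42(float[] data)    { /* uses SSE4.2 opcodes */ }

  alias ProcessFn = void function(float[] data);

  // One CPUID-based decision, made up front rather than per element.
  ProcessFn chooseProcess()
  {
      return sse42 ? &processSSE42 : &processBaseline;
  }

  void main()
  {
      auto process = chooseProcess();
      auto data = new float[](1024);
      process(data); // hot path contains no feature checks
  }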

>> So, where it sits is: DMD needs a commandline arg to request an SSE
>> target level and made available to the code, I need to work out what
>> to do about GDC, and I need to write the unittest's and docs... which
>> is probably the biggest block tbh ;)
>> Otherwise, it still rests in my fork where I left it. Some people have
>> used it. When it's done I'll extend it to support matrices, and
>> perhaps some higher-level linear algebra stuff.
>
>
> The thing is, if you don't do the unittests and docs, you will be throwing
> away all the other work you've done. If you do them, there will be a large
> incentive to get forceinline working.

I just recalled another problem actually.

This:
  @uda(Arg) void fun(Arg)()
  {
  }

I need to be able to supply a template arg to a UDA. The runtime cpuid
branching tech depends on it, and thus the whole API design depends on
it.
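
To spell that out, a hypothetical sketch (none of these names exist, and
the commented-out declaration is the bit I can't write today):

  import std.traits : getUDAs;

  enum SIMDVer { SSE2, SSE42, AVX }

  // Hypothetical UDA recording the SIMD level a kernel targets.
  struct targetVer { SIMDVer ver; }

  // This much works today: a concrete value as the UDA argument.
  @targetVer(SIMDVer.AVX)  void transformAVX(float[] data)  { }
  @targetVer(SIMDVer.SSE2) void transformSSE2(float[] data) { }

  // What the API design needs, and what currently doesn't compile: the
  // function's own template parameter supplied as the UDA's argument.
  //   @targetVer(Ver) void transform(SIMDVer Ver)(float[] data) { ... }

  // A dispatcher can read the UDA at compile time and emit the single
  // runtime cpuid branch at the top of the call-tree.
  static assert(getUDAs!(transformAVX, targetVer)[0].ver == SIMDVer.AVX);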

So, yeah, they're all slowly coming back to me now :)
There were a number of things I was blocked on. If I can't prove that
my API approach works, then it needs to completely change.
These things were blocking me from proving the design actually works.

Can this one be addressed?


>> Oh yeah, and a massive one that I've discussed with Ethan and I think
>> he's discussed with you; DMD likes to use the x87 in win64 builds...
>> that's really no good. We can't be swapping between x87 and xmm regs.
>> float/double args are passed in xmm according to the ABI, and then
>> everywhere a float operation is performed, a swap from xmm -> x87 is
>> emitted, and then back again. It's madness. You're gonna have to let
>> the x87 go on x64 ;)
>
>
> That shouldn't affect vector code.

Well, it kinda does. Swapping register types is one of the biggest
performance hazards there is.
Use of SIMD opcodes will generate use of xmm regs. As much as I preach that
best practice is to commit 100% to SIMD, people will inevitably find
themselves moving float data into SIMD regs. x64 should actually be the
most tolerant of that sort of code, because float data *should* already be
in xmm regs, but I think in our case we're going to see some pretty
unimpressive SIMD benchmarks whenever floats appear. The register swapping
will very probably result in code that is slower than just issuing
sequential float ops.
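
To make the hazard concrete, a trivial (made-up) example of the kind of
mixed scalar/vector code people actually write; nothing here is exotic,
which is exactly the problem:

  import core.simd;

  float4 scaleAndBias(float4 v, float scale, float bias)
  {
      // Per the above: on DMD win64 this scalar op bounces through the x87,
      // even though 'scale' and 'bias' arrive in xmm registers per the ABI.
      float s = scale * 0.5f + bias;

      float4 sv = s;  // broadcast back into an xmm register
      return v * sv;  // the SIMD multiply itself stays in xmm
  }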

