utiliD: A library with absolutely no dependencies for bare-metal programming and bootstrapping other D libraries

H. S. Teoh hsteoh at quickfur.ath.cx
Sat May 11 05:39:12 UTC 2019

On Sat, May 11, 2019 at 01:45:08AM +0000, Mike Franklin via Digitalmars-d-announce wrote:
> I think this thread is beginning to lose sight of the larger picture.
> What I'm trying to achieve is the opt-in continuum that Andrei
> mentioned elsewhere on this forum.  We can't do that with the way the
> compiler and runtime currently interact.  So, the first task, which
> I'm trying to get around to, is to convert runtime hooks to templates.
> Using the compile-time type information will allow us to avoid
> `TypeInfo`, therefore classes, therefore the entire D runtime.  We're
> now much closer to the opt-in continuum Andrei mentioned previously on
> this forum.  Now let's assume that's done...

Yes, that's definitely a direction we want to head in.  I think it will
be very beneficial.

> Those new templates will eventually call a very few functions from the
> C standard library, memcpy being one of them.  Because the runtime
> hooks are now templates, we have type information that we can use in
> the call to memcpy.  Therefore, I want to explore implementing `void
> memcpy(T)(ref T dst, const ref T src) @safe, nothrow, pure, @nogc`
> rather than `void* memcpy(void*, const void*, size_t)`  There are some
> issues here such as template bloat and compile times, but I want to
> explore it anyway.  I'm trying to imagine what memcpy in D would look
> like if we didn't have a C implementation narrowing our imagination.
> I don't know how that will turn out, but I want to explore it.

Put this way, I think that's a legitimate area to explore.  But copying
a block of memory from one place to another is just that: copying a
block of memory from one place to another.  It boils down to how to
copy N bytes from A to B in the fastest way possible. For that, you just
reduce it to moving K words (the size of which depends only on the
target machine, not the incoming type) of memory from A to B, plus or
minus a few bytes at the end for non-aligned data. The type T only
matters if you need to do type-specific operations like call default
ctors / dtors, but at the memcpy level that should already have been
taken care of by higher-level code, and it isn't memcpy's concern what
ctors/dtors to invoke.
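The word-at-a-time scheme described above can be sketched in D. This is a rough illustration, not a tuned implementation; `copyBytes` is a hypothetical name, and the word-sized loads/stores assume a target that tolerates unaligned word access (true on x86, not on every architecture):

```d
// Sketch of a word-wise copy: move machine words first, then the
// remaining tail bytes.  Assumes dst and src do not overlap.
void copyBytes(void* dst, const(void)* src, size_t n) nothrow @nogc
{
    auto d = cast(ubyte*) dst;
    auto s = cast(const(ubyte)*) src;

    // Copy K machine words at a time (K = size_t.sizeof bytes each).
    while (n >= size_t.sizeof)
    {
        *cast(size_t*) d = *cast(const(size_t)*) s;
        d += size_t.sizeof;
        s += size_t.sizeof;
        n -= size_t.sizeof;
    }

    // Plus or minus a few bytes at the end for the non-word-sized tail.
    while (n--)
        *d++ = *s++;
}
```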

The one thing knowledge of T can provide is whether or not T[] can be
unaligned. If T.sizeof < machine word size, then you need extra code to
take care of the start/end of the block; otherwise, you can just go
straight to the main loop of copying K words from A to B. So that's one
small thing we can take advantage of. It could save a few cycles by
avoiding a branch hazard at the start/end of the copy, and making the
code smaller for inlining.

Anything else you optimize on copying K words from A to B would be
target-specific, like using vector ops, specialized CPU instructions,
and the like. But once you start getting into that, you start getting
into the realm of whether all the complex setup needed for, e.g., a
vector op is worth the trouble if T.sizeof is small. Perhaps here's
another area where knowledge of T can help (if T is small, just use a
naïve for-loop; if T is sufficiently large, it could be worth incurring
the overhead of setting up vector copy registers, etc., because it makes
copying the large body of T faster).

So potentially a D-based memcpy could have multiple concrete
implementations (copying strategies) that are statically chosen based on
the properties of T, like alignment and size.
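A minimal sketch of that static selection, with illustrative (not benchmarked) branch conditions and a hypothetical `typedCopy` name; a real implementation would tune the strategies and cutoffs per target:

```d
// Sketch only: statically pick a copy strategy from T's size and
// alignment.  The branches here are illustrative guesses.
void typedCopy(T)(ref T dst, const ref T src) nothrow @nogc
{
    static if (T.sizeof <= size_t.sizeof)
    {
        // At most one machine word: a single raw byte-blit suffices.
        (cast(ubyte*) &dst)[0 .. T.sizeof] =
            (cast(const(ubyte)*) &src)[0 .. T.sizeof];
    }
    else static if (T.alignof >= size_t.alignof
                    && T.sizeof % size_t.sizeof == 0)
    {
        // Aligned multiple of the word size: straight to the main
        // loop, no head/tail handling needed.
        auto d = cast(size_t*) &dst;
        auto s = cast(const(size_t)*) &src;
        foreach (i; 0 .. T.sizeof / size_t.sizeof)
            d[i] = s[i];
    }
    else
    {
        // Fallback: plain byte-wise copy.
        auto d = cast(ubyte*) &dst;
        auto s = cast(const(ubyte)*) &src;
        foreach (i; 0 .. T.sizeof)
            d[i] = s[i];
    }
}
```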

> However, DMD won't do the right thing.

Honestly, at this point I don't even care.

> I guess others are thinking that we'd just re-implement `void*
> memcpy(void*, const void*, size_t)` in D and we'd throw in a runtime
> call to `memcpy(&dstArray[0], &srcArray[0], dstArray.length * T.sizeof)`.
> That's
> ridiculous.  What I want to do is use the type information to generate
> an optimal implementation (considering size and alignment) that DMD
> will be forced to inline with `pragma(inline)`.
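In D the actual syntax is `pragma(inline, true)`, which requests forced inlining; a compiler that cannot honor it is supposed to report an error rather than silently emit a call. A minimal sketch with a hypothetical name and an illustrative body:

```d
// Sketch: ask the compiler to inline the copy at the call site so the
// raw blit is visible to the optimizer.  Body is illustrative only.
pragma(inline, true)
void rawAssign(T)(ref T dst, const ref T src) nothrow @nogc
{
    (cast(ubyte*) &dst)[0 .. T.sizeof] =
        (cast(const(ubyte)*) &src)[0 .. T.sizeof];
}
```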

It could be possible to select multiple different memcpy implementations
by statically examining the properties of T.  I think that might be one
advantage D could have over just calling libc's memcpy.  But you have to
be very careful not to be so clever that the compiler's optimizer no
longer recognizes the code as a memcpy and fails to apply what would
otherwise be a routine optimization pass.

> That implementation can also take into consideration target features
> such as SIMD.  I don't believe the code will be complex, and I expect
> it to perform at least as well as the C implementation.  My initial
> tests show that it will actually outperform the C implementation, but
> that could be a problem with my tests.  I'm still researching it.

Actually, if you want to compete with the C implementation, you might
find that things could get quite hairy. Maybe not with memcpy, but other
functions like memchr have very clever hacks to speed it up that you
probably wouldn't think of without reading C library source code. There
may also be subtle differences that change depending on the target; it
used to be that `rep movsd` was faster in spite of requiring more
overhead setting up; but last I read, newer CPUs seem to have `rep
movsd` perform rather poorly, whereas a plain old for-loop actually
outperforms `rep movsd`.  At a certain point, this raises the
question "should I just let the compiler's backend do its job by telling
it plainly that I mean memcpy, or should I engage in asm-hackery because
I'm confident I can outdo the compiler's codegen?".

One thing that might be worth considering is for the *compiler* to
expose a memcpy intrinsic, and then let the compiler decide how best to
implement it (using its intimate knowledge of the target machine arch),
rather than trying to do it manually in library code.

> Now assuming that's done, we now have language runtime implementations
> that are isolated from heavier runtime features (like the `TypeInfo`
> classes) that can easily be used in -betterC builds, bare-metal
> systems programming, etc. simply by importing them as a header-only
> library; it doesn't require first compiling (or cross-compiling) a
> runtime for linking with your program; you just import and go.  We're
> now much closer to the opt-in continuum.
> Now what about development of druntime itself?  Well, wouldn't it be
> nice if we could utilize things like `std.traits`, `std.meta`,
> `std.conv`, and a bunch of other stuff from Phobos?

Based on what Andrei has voiced, the way to go would be to merge Phobos
and druntime into one, by making Phobos completely opt-in so that you
don't pay for what you don't use from the heavier / higher-level parts
of Phobos.  At a certain point it becomes clear that the division
between Phobos and druntime is artificial, the result of historical
accident, and not a logical necessity that we have to keep. If Phobos is
made completely pay-as-you-go, the distinction becomes completely
irrelevant and the two might as well be merged into one.

> Wouldn't it also be nice if we could use that stuff in DMD itself
> without importing Phobos?  So let's take that stuff in Phobos that
> doesn't need druntime and put it in a library that doesn't require
> druntime (i.e. utiliD).  Now druntime can import utiliD and have more
> idiomatic-D implementations.

See, this trouble is caused by the artificial boundary between Phobos
and druntime.  We should look into breaking down this barrier, not
enforcing it.

> But the benefits don't stop there: bare-metal developers,
> microcontroller developers, kernel driver developers, OS developers,
> etc. can all use the runtime-less library to bootstrap their own
> implementations without having to re-invent or copy code out of Phobos
> and druntime.

I think the logical goal is to make Phobos completely pay-as-you-go.
IOW, an actual *library*, as opposed to a tangled hairball of
dependencies that always comes with strings attached (can't import one
small thing without pulling in the rest of the hairball). A library is
supposed to be a set of resources which you can draw from as needed.
Pulling out one book (module) should not require pulling out half the
library along with it.


Once the bikeshed is up for painting, the rainbow won't suffice. -- Andrei Alexandrescu
