std.algorithm.copy could be better optimized

Brad Roberts braddr at puremagic.com
Mon Jun 17 18:22:16 PDT 2013


On 6/17/13 6:14 PM, TommiT wrote:
> On Monday, 17 June 2013 at 17:43:25 UTC, Andrei Alexandrescu wrote:
>> On 6/17/13 1:36 PM, TommiT wrote:
>>> std.algorithm.copy adheres to the specification of the C++ standard
>>> library function std::copy. That specification states that the source
>>> and target ranges may not overlap. Yet all the major C++ standard
>>> libraries (libc++, libstdc++, Dinkum C++ Library) implement std::copy
>>> so that it becomes a memmove when the source and target element types
>>> are the same and the element type satisfies
>>> std::is_trivially_copy_assignable. Thus, the current C++ specification
>>> seems to be too strict.
>>>
>>> The current std.algorithm.copy implementation doesn't get optimized
>>> into a memmove when it would be safe to do so. It really should,
>>> because that's the fastest way to get the job done, and it allows the
>>> ranges to overlap (a sketch of such a dispatch appears after this
>>> quoted thread). Also, the documentation should be changed to note
>>> this relaxation.
>>>
>>> There are some benchmarks of memmove versus other methods here:
>>> http://nadeausoftware.com/articles/2012/05/c_c_tip_how_copy_memory_quickly
>>
>> Bugzilla. FWIW, I did notice that Duff's device can sometimes beat memmove, but that was some 10
>> years ago.
>>
>> Andrei
>
> Another thing that comes to mind is that...
> R find(alias pred = "a == b", R, E)(R haystack, E needle)
> ...could be optimized, when pred is "a == b", R is an array of E's, and E is char, byte, or
> ubyte, to call the C function memchr instead (a sketch follows after this quoted thread). It's a
> bit tricky, though: quite recently a bug was found in this particular optimization in the
> standard library that ships with Visual Studio, as described in this interesting tutorial:
>
> http://channel9.msdn.com/Series/C9-Lectures-Stephan-T-Lavavej-Core-C-/Stephan-T-Lavavej-Core-C-7-of-n
>
> More generally, what I'm saying is that we should probably just go through some good C++ standard
> library implementation and copy all the optimizations they have done. I'm sure there are many more.

In my experience, this is an area that ends up being deferred to the compiler (via a builtin, or 
via direct use of mem* that the compiler is smart enough to recognize and implement itself) to let 
it generate the optimal code.  The thresholds for inlining vs. a function call, and the myriad 
optimization techniques, end up being highly platform specific, and duplicating all that 
complication essentially guarantees that you won't do as good a job.  As an example, there are 
people at Intel who are paid to do little more than improve the code generation in GCC, and they 
often submit new versions of the mem* routines to both GCC and glibc.  It's actually amazing how 
often that code is tweaked, both for older generations of CPUs and to support new ones.
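
To illustrate, assuming a backend with loop-idiom recognition (LLVM under
LDC, or GCC under GDC), even a completely naive loop like the one below is
typically lowered to a single, platform-tuned mem* call, with no dispatch
logic in the library at all:

    // An optimizing backend will usually recognize this loop as the memset
    // idiom and replace it with a call to the tuned routine.
    void zeroFill(ubyte[] buf)
    {
        foreach (ref b; buf)
            b = 0;
    }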

- Brad

