TypeFunction example: ImplictConvTargets

Patrick Schluter Patrick.Schluter at bbox.fr
Wed Oct 7 09:09:24 UTC 2020


On Wednesday, 7 October 2020 at 04:17:59 UTC, H. S. Teoh wrote:
> On Tue, Oct 06, 2020 at 08:47:33PM -0700, Walter Bright via 
> Digitalmars-d wrote:
>> On 10/6/2020 11:09 AM, Adam D. Ruppe wrote:
>> > The Phobos implementation started life with a very simple 
>> > implementation too. It became what it is because it *had 
>> > to*, specifically for performance reasons.
>> 
>> Professional C Standard library implementations tend to be 
>> hideous code to perform objectively simple operations. The 
>> reason is speed is so desirable in foundational code that it 
>> drives out all other considerations. (memcpy() is a standout 
>> example.)
>
> A little tangential aside: one time as a little coffee break 
> challenge, I decided to see how D compares to GNU wc in terms 
> of the speed of counting lines in a text file.  The core of wc, 
> of course, is in memchr -- because it's scanning for newline 
> characters in a buffer.  In the course of ferreting out what 
> made wc so fast, I studied how GNU libc implemented memchr.  
> Basically, in order to speed up scanning large buffers, it uses 
> a series of fancy bit operations on 64-bit words in order to 
> scan 8 bytes at a time, I suppose the goal being to achieve 
> close to 8x speedup, and also to reduce the number of branches 
> per iteration to capitalize on the CPU's pipeline.
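
As a rough illustration of that trick (not GNU libc's actual code; the function name is made up, and it assumes the buffer is already 8-byte aligned with a length that's a multiple of 8, i.e. it skips the unaligned head/tail handling that real implementations need):

const(ubyte)* simpleMemchr(const(ubyte)* buf, ubyte needle, size_t len)
{
    // Word-at-a-time scan: XOR each 64-bit word with the needle repeated
    // in every byte, then apply the classic "does this word contain a zero
    // byte?" bit test, so most iterations check 8 bytes without a byte loop.
    const(ulong)* p = cast(const(ulong)*) buf;
    immutable ulong ones    = 0x0101010101010101UL;
    immutable ulong highs   = 0x8080808080808080UL;
    immutable ulong pattern = ones * needle;     // needle copied into all 8 bytes
    foreach (i; 0 .. len / 8)
    {
        ulong w = p[i] ^ pattern;                // matching bytes become 0x00
        if ((w - ones) & ~w & highs)             // some byte in this word is zero
        {
            auto b = buf + i * 8;                // narrow down with a plain byte scan
            foreach (j; 0 .. 8)
                if (b[j] == needle)
                    return b + j;
        }
    }
    return null;
}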
>
> In order to do this, however, some not-so-cheap setup was 
> necessary at the beginning and end of the buffer, which 
> generally are not 8-byte aligned.  When a particular call to 
> memchr scanned many bytes, of course, the speed of scanning 8 
> bytes at a time outweighed this setup / teardown cost, and wc 
> generally outperformed my D code.
>
> However, when the lines being scanned were on the short side, 
> the overhead of setting up the 8-byte-at-a-time scanning added 
> up significantly -- the shorter lines meant less time was spent 
> scanning 8 bytes at a time, and more time spent in the 
> complicated code dealing with non-aligned start/end of the 
> buffer. In this case, I discovered that a simplistic for-loop 
> in D outperformed wc.
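
The simple-minded D counterpart is roughly something like this (a sketch, not the actual code from that test):

import std.stdio : File;

size_t countLines(string path)
{
    // Read the file in fixed-size chunks and count '\n' bytes with a plain
    // loop: no alignment setup, no bit tricks, just one compare per byte.
    size_t lines;
    ubyte[64 * 1024] buf;
    auto f = File(path, "rb");
    foreach (chunk; f.byChunk(buf[]))
        foreach (b; chunk)
            if (b == '\n')
                ++lines;
    return lines;
}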
>
> This led me to think, realistically speaking, how long do lines 
> in a text file tend to be?  My wild guess is that they tend to 
> be on the shortish side -- maybe about 60-70 characters on 
> average? For code, a lot less, since code generally has more 
> whitespace for readability reasons.  Files that have very long 
> lines tend to be things like XML or compressed Javascript, 
> which generally you don't really use wc on anyway, so in my 
> mind, they seem to be more the exception than the norm when it 
> comes to the applicability of wc.  Furthermore, I venture to 
> propose that your average string length in C is probably closer 
> to the 80-120 character ballpark, than to large buffers of 1K 
> or 8K which is where the 8-byte-at-a-time scanning would 
> perform much better. Sure, large text buffers do get handled in 
> C code routinely; but how many of them realistically would you 
> find being handed to strchr to scan for some given byte?  The 
> kind of C strings you'd want to search for a particular byte 
> in, IME, are usually the shorter kind.
>
> What this means, is that yes memchr may be "optimized" to next 
> week and back -- but that optimization came with some implicit 
> assumptions about how long it takes to find a character in a 
> string buffer. These assumptions unfortunately pessimized the 
> common case with the overhead of a fancy 8-byte-at-a-time 
> algorithm: one that does better in the *less* common case of 
> looking for a particular byte in a very large buffer.
>
> My point behind all this, is that what's considered optimal 
> sometimes changes depending on what kind of use case you 
> anticipate; and with different usage patterns, one man's 
> optimal algorithm may be another man's suboptimal algorithm, 
> and one man's slow, humble for-loop may be another man's speed 
> demon.  Optimizing for the general case in a standard library 
> is generally a very tough problem, because how do you make any 
> assumptions about the general case?  Whichever way you choose 
> to optimize a primitive like memchr, you're imposing additional 
> assumptions that may make it worse for some people, even if you 
> think that you're improving it for your chosen target metric.
>

There's a counter-argument to your example (which doesn't 
invalidate your point).

The pessimization for short lines is barely noticeable: it's 
drowned out in the noise of process launching, and even if you 
can measure it consistently and easily, it doesn't matter much 
even when it happens a lot of times (in a batch job, for 
example). The optimization for long lines, though, is noticeable 
immediately because of the inherent processing time of the 
operation. If an operation of 5ms now takes 10ms, it doesn't 
make a difference. If an operation that took 1 minute now takes 
30s, it is a big deal.


>
>
>> I remember back in the 80's when Borland came out with Turbo 
>> C. The compiler didn't generate very good code, but 
>> applications built with it were reasonably fast. How was this 
>> done?
>> 
>> Borland carefully implemented the C library in hand-optimized 
>> assembler by some very good programmers. Even printf was coded 
>> this way. The speedups gained there sped up every Turbo C 
>> program. At the time, nobody else had done that.
>> 
>> Much as I hate to admit it, Borland made the smart move there.
>
> So what does this imply in terms of Phobos code in D? ;-)  
> Should we uglify Phobos for the sake of performance?  Keeping 
> in mind, of course, what I said about the difficulty of 
> optimizing for the general case without pessimizing some 
> legitimate use cases -- or maybe even the *common* case, if we 
> lose sight of the forest of common use cases while trying to 
> optimize our chosen benchmark trees.


My point exactly. Some optimizations are more about mitigating 
speed degradation in pathological cases than about the normal, 
general case, or else the whole O() notation would be of no 
importance. (I'm sure the symbol name compression scheme in the 
D compiler pessimized the common case, but it saved the language 
from the pathological cases of recursive symbols.)
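
For the curious, the kind of recursive symbol growth meant here looks roughly like this (a made-up toy case, not the actual pathological one from the compiler):

struct Wrap(T) { T payload; }

template Nest(T, int depth)
{
    static if (depth == 0)
        alias Nest = T;
    else
        alias Nest = Nest!(Wrap!T, depth - 1);
}

// Nest!(int, 20) instantiates Wrap!(Wrap!(...Wrap!int...)) twenty levels
// deep; every symbol that mentions it repeats the inner type names over
// and over, which is the growth that mangled-name compression keeps in check.
alias Deep = Nest!(int, 20);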


