Relaxing the definition of isSomeString and isNarrowString
Dmitry Olshansky via Digitalmars-d
digitalmars-d at puremagic.com
Sun Aug 24 06:08:28 PDT 2014
On 24-Aug-2014 16:24, Andrei Alexandrescu wrote:
> On 8/23/14, 6:39 PM, H. S. Teoh via Digitalmars-d wrote:
>> On Sat, Aug 23, 2014 at 06:06:37PM -0700, Andrei Alexandrescu via
>> Digitalmars-d wrote:
>>> Currently char[], wchar[], dchar[] and qualified variants fulfill the
>>> requirements of isSomeString. Also, char[], wchar[] and qualified
>>> variants fulfill the requirements of isNarrowString.
>>>
>>> Various algorithms in Phobos test for these traits to optimize away
>>> UTF decoding where unnecessary.
>>>
>>> I'm thinking of relaxing the definitions to all types that fulfill the
>>> following requirements:
>>>
>>> * are random access ranges
>>> * element type is some character
>>> * offer .ptr as a @system property that offers a pointer to the first
>>> character
>>>
>>> This would allow us to generalize the notion of string and offer
>>> optimizations for user-defined, not only built-in, strings. Thoughts?
>> [...]
>>
>> Recently there has been another heated debate on Github about whether
>> ranges should auto-decode at all:
>>
>> https://github.com/D-Programming-Language/phobos/pull/2423
>>
>> Jonathan is about to write up a DIP for phasing out auto-decoding. Given
>> that, I think we need to decide which direction to take, lest we waste
>> energy on something that will be going away soon.
>
> I've read the thread now, thanks. I think that discussion has lost
> perspective. There is talk about how decoding is slow, yet nobody (like,
> literally not one soul) looked at measuring and optimizing decoding.
> Everybody just assumes it's unacceptably slow; the irony is, it is, but
> mostly because it's not engineered for speed.
I'd be damned if I didn't measure. Let's not assume things when there is
actual data to be had.
Take a peek at my tests (yeah, it's rough stuff, but it does the job):
https://github.com/DmitryOlshansky/gsoc-bench-2012
Let's count the number of code points in a file (loading it into memory
before measuring anything):
====================
decode-only [arwiki], 368312 hits, 3.16312, 205.58 Mb/s
noop [arwiki], 599328 hits, 0.67872, 958.11 Mb/s
====================
decode-only [ruwiki], 523414 hits, 4.54544, 206.23 Mb/s
noop [ruwiki], 884374 hits, 0.7898, 1186.92 Mb/s
====================
decode-only [dewiki], 1773464 hits, 5.63744, 318.28 Mb/s
noop [dewiki], 1607454 hits, 2.1322, 841.52 Mb/s
With that I can say that decode *is* engineered for speed.
It processes about 200-300 million code points per second for me,
compared with a baseline of ~840 Mbytes/sec for merely testing whether
each byte is > 0x20. With LDC both numbers are about 2 times better, but
the picture stays the same.
As you can see there is still a significant gap, and decoding is not yet
memory-bound. It remains a cost that hinders top-performance code.
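For the curious, the shape of the comparison above can be sketched in a few lines of D. This is a rough, self-contained toy, not the actual benchmark from the repository (which loads whole wiki dumps and uses its own harness); a small literal string stands in for the data, and it uses the current std.datetime.stopwatch module.

```d
// Sketch of the measurement: decode-only (count code points via
// std.utf.decode) vs. a noop baseline that just tests each byte.
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;
import std.utf : decode;

size_t countCodePoints(string text)
{
    size_t n;
    for (size_t i = 0; i < text.length; )
    {
        decode(text, i); // advances i past one UTF-8 sequence
        ++n;
    }
    return n;
}

size_t countAbove0x20(string text) // baseline: no decoding at all
{
    size_t hits;
    foreach (ubyte b; cast(const(ubyte)[]) text)
        if (b > 0x20)
            ++hits;
    return hits;
}

void main()
{
    auto text = "Привет, мир! hello"; // mixed Cyrillic/ASCII sample
    auto sw = StopWatch(AutoStart.yes);
    auto cps = countCodePoints(text);
    writefln("decode-only: %s code points, %s", cps, sw.peek());
    sw.reset();
    auto hits = countAbove0x20(text);
    writefln("noop: %s hits, %s", hits, sw.peek());
}
```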
> Look e.g. at
> https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074.
> That's a memory allocation each and every time decodeImpl is called. I'm
> not kidding. Take a look at http://goo.gl/p5pl3D vs.
You are completely wrong on this count. I'm horrified to learn that even
D experts are unaware of how `enum` works; you've got to be kidding me.
An `enum` doesn't allocate by itself: it serves as a "copy-paste" token,
effectively pasting the literal everywhere it is used.
Now look at the usage pattern:
https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1173
(where 'i' is a compile-time constant from a static foreach). The
compiler sees it as:
[(1 << 7) - 1, (1 << 11) - 1, (1 << 16) - 1, (1 << 21) - 1][0]
and const-folds it accordingly.
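To make the point concrete, here is a small illustration of the behaviour described above (the constant name bitMask mirrors the one in std.utf, but this snippet is a standalone sketch, not the library code):

```d
// A manifest constant is pasted at its use site. With a compile-time
// index the whole expression const-folds to a plain integer literal,
// so no array ever exists at runtime for these uses.
enum bitMask = [(1 << 7) - 1, (1 << 11) - 1, (1 << 16) - 1, (1 << 21) - 1];

void main()
{
    // Evaluated entirely at compile time:
    static assert(bitMask[0] == 0x7F);
    static assert(bitMask[3] == 0x1F_FFFF);
    // Only using the enum array with a *runtime* index would paste
    // (and heap-allocate) the array literal at that point.
}
```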
> I'd like RCString to benefit of the optimizations for built-in strings,
> and that's why I was looking at relaxing isSomeString/isNarrowString.
>
On this I agree: isSomeString is an ugly, half-assed hack if we can't
have other kinds of strings.
Also consider Array!char, which in fact should be pretty close to RCString.
I'm of the opinion that there are two ways out of "decode by default sucks":
a) Finally make it generic and clearly state what a DecodableRange is.
Stick with that everywhere.
b) To hell with that and just remove implicit decoding.
Anything else really, really sucks.
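Option (a) could look roughly like the following trait, restating the three requirements Andrei listed (random access, character element type, a .ptr property). The name isDecodableRange and the exact checks are made up here purely for illustration; the isNarrowString branch covers built-in strings, which Phobos does not treat as random access ranges:

```d
import std.range : ElementEncodingType, isRandomAccessRange;
import std.traits : isNarrowString, isSomeChar;

// Hypothetical trait sketching what a "decodable range" might mean.
enum isDecodableRange(R) = isNarrowString!R ||
    (isRandomAccessRange!R
     && isSomeChar!(ElementEncodingType!R)
     && is(typeof((R r) => r.ptr)));   // pointer to the first code unit

static assert( isDecodableRange!string);    // built-in narrow string
static assert( isDecodableRange!(dchar[])); // random access + .ptr
static assert(!isDecodableRange!(int[]));   // wrong element type

void main() {}
```

A user-defined string such as RCString or Array!char would then qualify by satisfying the same checks, rather than by special-casing in isSomeString.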
--
Dmitry Olshansky