Relaxing the definition of isSomeString and isNarrowString

Andrei Alexandrescu via Digitalmars-d digitalmars-d at puremagic.com
Sun Aug 24 05:24:12 PDT 2014


On 8/23/14, 6:39 PM, H. S. Teoh via Digitalmars-d wrote:
> On Sat, Aug 23, 2014 at 06:06:37PM -0700, Andrei Alexandrescu via Digitalmars-d wrote:
>> Currently char[], wchar[], dchar[] and qualified variants fulfill the
>> requirements of isSomeString. Also, char[], wchar[] and qualified
>> variants fulfill the requirements of isNarrowString.
>>
>> Various algorithms in Phobos test for these traits to optimize away
>> UTF decoding where unnecessary.
>>
>> I'm thinking of relaxing the definitions to all types that fulfill the
>> following requirements:
>>
>> * are random access ranges
>> * element type is some character
>> * offer .ptr as a @system property that offers a pointer to the first
>> character
>>
>> This would allow us to generalize the notion of string and offer
>> optimizations for user-defined, not only built-in, strings. Thoughts?
> [...]
>
> Recently there has been another heated debate on Github about whether
> ranges should auto-decode at all:
>
> 	https://github.com/D-Programming-Language/phobos/pull/2423
>
> Jonathan is about to write up a DIP for phasing out auto-decoding. Given
> that, I think we need to decide which direction to take, lest we waste
> energy on something that will be going away soon.

I've read the thread now, thanks. I think that discussion has lost 
perspective. There is talk about how decoding is slow, yet nobody (like, 
literally not one soul) looked at measuring and optimizing decoding. 
Everybody just assumes it's unacceptably slow; the irony is, it is, but 
mostly because it's not engineered for speed.

Look e.g. at 
https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074. 
That's a memory allocation each and every time decodeImpl is called. I'm 
not kidding. Take a look at http://goo.gl/p5pl3D vs. 
http://goo.gl/YL2iFN. It's egregious. The entire function, which should 
be small and nimble, is a behemoth. True, that function is only called 
for non-ASCII sequences, but the caller code can definitely be improved 
for speed (I see e.g. unnecessary passing of strings by reference, 
etc.). A lot of good work can be done to improve the efficiency of 
decoding tremendously.
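To illustrate the kind of caller-side improvement meant here, a sketch 
of an ASCII fast path that avoids entering the heavyweight decoder at 
all for the common case (fastFront is a hypothetical name, not Phobos 
API; only std.utf.decode is real):

```d
import std.utf : decode;

/// Decode the first code point of s, taking an allocation-free
/// fast path for ASCII. Only non-ASCII input pays for full decoding.
dchar fastFront(string s)
{
    assert(s.length > 0);
    if (s[0] < 0x80)         // ASCII: the byte *is* the code point
        return s[0];
    size_t index = 0;
    return decode(s, index); // non-ASCII: full UTF-8 decode
}
```

The point is that the hot path stays small and nimble; the behemoth is 
only reached for multi-byte sequences.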

If the DIP is based on Jonathan's comments in the pull request, I think 
it's a waste of everybody's time. It's a complete rewiring of the 
standard library fundamentals and a huge inconvenience for all users, 
for a slight improvement of a few scenarios.

UTF strings are bidirectional ranges of variable-length encoded code 
units. Algorithms work with those. That is simply a fact, and it is what 
makes UTF strings tick in D. It's the reason Python had a large schism 
between 2 and 3 over UTF strings while D is doing well, thank you very 
much. Considering D is faring a lot better than all languages I know of 
in the UTF realm, if auto-decoding was a mistake, well it wasn't a bad one.
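For readers unfamiliar with the behavior under discussion, a minimal 
example of what auto-decoding does today: a string is stored as UTF-8 
code units, but iterated as decoded code points.

```d
import std.range.primitives : front, popFront;

void main()
{
    string s = "héllo";     // 6 code units encoding 5 code points
    assert(s.length == 6);  // .length counts UTF-8 code units
    assert(s.front == 'h'); // front auto-decodes to a dchar
    s.popFront();           // consumes the 1-byte 'h'
    assert(s.front == 'é'); // the 2-byte sequence decodes as one element
}
```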

I suggest we focus efforts on solving real problems with creative 
solutions. Endless hand-wringing over small fish and forever redoing 
what's already there is not the way to progress. Speaking of which, this 
is why I discuss isNarrowString: I have recently gotten convinced of a 
number of things. First, there will always be a category of users and 
applications who will NOT, for reasons good and bad, accept garbage 
collection. They just won't, and we can discuss how that sometimes is a 
bummer reflecting on human nature, but I don't see how framing the 
problem as a user education issue is pushing D forward.

It follows we need to improve proper use of D for people who will not 
have the GC, by means of superior libraries. And if you think of it this 
is very D-ish: we have a powerful, incredibly expressive language at our 
hands, designed explicitly to make powerful abstractions easy to define. 
So the right direction of exploration is better libraries.

One other thing I realized is that avoiding the GC by means of APIs 
based on output ranges Will Not Work(tm). Not all allocations can be 
successfully hoisted to the user in that manner. Output ranges are fine 
if the sole focus is producing sequential output, but the reality is 
that often more complex structures need to be allocated for things to 
work, and the proper way to hoist that into the client is by providing 
allocator services, not output ranges. Consider warp itself: it reads 
files and writes files, which is all linear and nice, but part of the 
processing involves storing macros, which are a complex internal data 
structure (e.g. a hash table). The linear output can be hoisted nicely 
to the client, but not the intricate allocation needed by macros.
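A sketch of what hoisting via allocator services could look like, under 
stated assumptions: the allocate/deallocate shape and the MacroTable 
and MallocAllocator names below are illustrative, not an existing API.

```d
import core.stdc.stdlib : free, malloc;

// A caller-supplied allocator with an allocate/deallocate shape.
struct MallocAllocator
{
    void[] allocate(size_t n)
    {
        auto p = malloc(n);
        return p is null ? null : p[0 .. n];
    }
    void deallocate(void[] b) { free(b.ptr); }
}

// The intricate internal structure (here a stand-in for a macro
// table) draws its memory from the caller's allocator, so no hidden
// allocation policy is forced on the client.
struct MacroTable(Allocator)
{
    Allocator alloc;
    void[] storage;

    void reserve(size_t bytes) { storage = alloc.allocate(bytes); }

    ~this()
    {
        if (storage !is null)
            alloc.deallocate(storage);
    }
}
```

An output range could never express this: the macro table is not 
sequential output, yet its allocation still ends up under client control.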

Continuing to pull on that thread, it follows that we need to provide 
and thoroughly document data structures that offer resource management 
derived from tried-and-true practices in C++: std::string, 
std::unique_ptr, std::shared_ptr. For example, we have RefCounted. It 
just works transparently in a bunch of situations, but very few users 
actually know about it. Furthermore, it has imperfections and rough 
edges. Improving RefCounted (better code, better documentation, maybe 
DIPs prompting language improvements that make it better) would be real 
progress.

To that end I'm working on RCString, an industrial-strength string type 
that's much like string, just reference counted and with configurable 
allocation. It's safe, too - user code cannot casually extract 
references to string internals. By default allocation would use GC's 
primitives; one thing I learned to appreciate about our GC is its 
ability to free memory explicitly, which means RCString will free memory 
deterministically most of the time, yet if you leak some (e.g. by having 
RCString class members) the GC will pick up the litter. I think 
reference counting backed up by a GC that picks up litter and cycles is 
a modern, emergent pattern that D could use to great effect.

(Speaking of which: Some, but not all, types in std.container use 
reference counting. One other great area of improvement would be to 
guarantee that everything in std.container is reference counted. 
Containers are the perfect candidate for reference counting - they are 
typically large enough to make the reference counting overhead 
negligible by comparison with the typical work on them.)

I'd like RCString to benefit from the optimizations for built-in strings, 
and that's why I was looking at relaxing isSomeString/isNarrowString.
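Spelling out the three requirements from the top of the thread as a 
trait, as a sketch only: the name isRelaxedString and the exact form of 
the .ptr check are guesses at what a relaxed isSomeString could look 
like, not a proposed final definition.

```d
import std.range.primitives : ElementEncodingType, isRandomAccessRange;
import std.traits : isSomeChar;

// Matches any random access range of some character type that also
// exposes .ptr to its first element (the built-in-string layout).
enum isRelaxedString(R) = isRandomAccessRange!R
    && isSomeChar!(ElementEncodingType!R)
    && is(typeof(R.init.ptr) : const(ElementEncodingType!R)*);
```

Note that today's Phobos special-cases built-in narrow strings (which 
auto-decoding strips of random access), so a real definition would need 
to reconcile with that.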


Andrei
