Relaxing the definition of isSomeString and isNarrowString
Andrei Alexandrescu via Digitalmars-d
digitalmars-d at puremagic.com
Sun Aug 24 05:24:12 PDT 2014
On 8/23/14, 6:39 PM, H. S. Teoh via Digitalmars-d wrote:
> On Sat, Aug 23, 2014 at 06:06:37PM -0700, Andrei Alexandrescu via Digitalmars-d wrote:
>> Currently char[], wchar[], dchar[] and qualified variants fulfill the
>> requirements of isSomeString. Also, char[], wchar[] and qualified
>> variants fulfill the requirements of isNarrowString.
>>
>> Various algorithms in Phobos test for these traits to optimize away
>> UTF decoding where unnecessary.
>>
>> I'm thinking of relaxing the definitions to all types that fulfill the
>> following requirements:
>>
>> * are random access ranges
>> * element type is some character
>> * offer .ptr as a @system property that offers a pointer to the first
>> character
>>
>> This would allow us to generalize the notion of string and offer
>> optimizations for user-defined, not only built-in, strings. Thoughts?
> [...]
>
> Recently there has been another heated debate on Github about whether
> ranges should auto-decode at all:
>
> https://github.com/D-Programming-Language/phobos/pull/2423
>
> Jonathan is about to write up a DIP for phasing out auto-decoding. Given
> that, I think we need to decide which direction to take, lest we waste
> energy on something that will be going away soon.
I've read the thread now, thanks. I think that discussion has lost
perspective. There is talk about how decoding is slow, yet nobody (like,
literally not one soul) looked at measuring and optimizing decoding.
Everybody just assumes it's unacceptably slow; the irony is, it is, but
mostly because it's not engineered for speed.
Look e.g. at
https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074.
That's a memory allocation each and every time decodeImpl is called. I'm
not kidding. Take a look at http://goo.gl/p5pl3D vs.
http://goo.gl/YL2iFN. It's egregious. The entire function, which should
be small and nimble, is a behemoth. True, that function is only called
for non-ASCII sequences, but the caller code can definitely be improved
for speed (I see, e.g., unnecessary passing of strings by reference, etc.). A
bunch of good work can be done to improve efficiency of decoding
tremendously.
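For the record, measuring isn't hard either. A minimal sketch along these
lines (illustrative only, not a rigorous benchmark; module paths are those
of current Phobos) would at least put numbers on the table before anyone
declares decoding unacceptably slow:

```d
import std.datetime.stopwatch : benchmark;
import std.stdio : writeln;
import std.utf : decode;

void main()
{
    string s = "mostly ASCII with some ünïcödé mixed in";
    size_t sink; // accumulate results so the loop isn't optimized away
    auto results = benchmark!({
        size_t i = 0;
        while (i < s.length)
            sink += decode(s, i); // decode one code point, advance i
    })(100_000);
    writeln(results[0]); // total time for 100_000 full decode passes
}
```

Vary the ASCII/non-ASCII mix and compare against a memcpy baseline, and the
cost of decoding stops being a matter of opinion.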
If the DIP is based on Jonathan's comments in the pull request, I think
it's a waste of everybody's time. It's a complete rewiring of the
standard library fundamentals and a huge inconvenience for all users,
for a slight improvement of a few scenarios.
UTF strings are bidirectional ranges of variable-length encoded code
units. Algorithms work with those. That is the matter of fact, and what
makes UTF strings tick in D. It's the reason Python has a large schism
between 2 and 3 over UTF strings while D is doing well, thank you very
much. Considering D is faring a lot better than all languages I know of
in the UTF realm, if auto-decoding was a mistake, well it wasn't a bad one.
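To make the point concrete, here is the contract auto-decoding establishes
today (a small, self-contained illustration; walkLength comes from
std.range):

```d
import std.range : walkLength;
import std.range.primitives : ElementEncodingType, ElementType;

void main()
{
    string s = "héllo"; // 6 UTF-8 code units, 5 code points
    static assert(is(ElementEncodingType!string == immutable char));
    static assert(is(ElementType!string == dchar)); // ranges auto-decode
    assert(s.length == 6);     // array view: code units
    assert(s.walkLength == 5); // range view: decoded code points
}
```

The array primitives see code units; the range primitives see decoded code
points. That dual view is exactly what algorithms lean on.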
I suggest we focus efforts on solving real problems with creative
solutions. Endless hand-wringing over small fish and forever redoing
what's already there is not the way to progress. Speaking of which, this
is why I discuss isNarrowString: I have recently gotten convinced of a
number of things. First, there will always be a category of users and
applications who will NOT, for reasons good and bad, accept garbage
collection. They just won't, and we can discuss how that sometimes is a
bummer reflecting on human nature, but I don't see how framing the
problem as a user education issue is pushing D forward.
It follows we need to make D properly usable for people who will not
have the GC, by means of superior libraries. And if you think of it, this
is very D-ish: we have a powerful, incredibly expressive language at our
hands, designed explicitly to make powerful abstractions easy to define.
So the right direction of exploration is better libraries.
One other thing I realized is that avoiding the GC by means of APIs
based on output ranges Will Not Work(tm). Not all allocations can be
successfully hoisted to the user in that manner. Output ranges are fine
if the sole focus is producing sequential output, but the reality is
that often more complex structures need to be allocated for things to
work, and the proper way to hoist that into the client is by providing
allocator services, not output ranges. Consider warp itself: it reads
files and writes files, which is all linear and nice, but part of the
processing involves storing macros, which are a complex internal data
structure (e.g. a hash table). The linear output can be hoisted nicely
to the client, but not the intricate allocation needed by macros.
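Sketched in code (all names here are illustrative, not warp's actual API):
the linear output is hoisted to the client via an output range, while the
macro table's storage is hoisted via an allocator the client passes in.

```d
import std.algorithm.iteration : splitter;
import std.range.primitives : put;

// Illustrative stand-in for the intricate internal structure (e.g. a hash
// table of macros) whose memory should come from the client's allocator.
struct MacroTable(Allocator)
{
    private Allocator* alloc;
    this(ref Allocator a) { alloc = &a; }
    // define/lookup would allocate through *alloc, never the GC
}

void preprocess(Out, Allocator)(string source, ref Out sink, ref Allocator alloc)
{
    auto macros = MacroTable!Allocator(alloc);
    foreach (line; source.splitter('\n'))
    {
        // expansion consults `macros`; its scratch memory also comes
        // from `alloc` - something an output range alone cannot express
        put(sink, line);
        put(sink, '\n');
    }
}
```

The output range covers the sequential part; the allocator parameter covers
everything else.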
Continuing to pull on that thread, it follows that we need to provide
and thoroughly document data structures whose resource management
derives from tried-and-true practices in C++: std::string,
std::unique_ptr, std::shared_ptr. For example, we have RefCounted. It
just works transparently in a bunch of situations, but very few users
actually know about it. Furthermore, it has imperfections and rough
edges. Improving RefCounted (better code, better documentation, maybe
DIPs prompting language improvements that make it better) would be real
progress.
To that end I'm working on RCString, an industrial-strength string type
that's much like string, just reference counted and with configurable
allocation. It's safe, too - user code cannot casually extract
references to string internals. By default allocation would use GC's
primitives; one thing I learned to appreciate about our GC is its
ability to free memory explicitly, which means RCString will free memory
deterministically most of the time, yet if you leak some (e.g. by having
RCString class members) the GC will pick up the litter. I think
reference counting backed up by a GC that picks up litter and cycles is
a modern, emergent pattern that D could use to great effect.
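In outline, the idea looks like this (a deliberately simplified sketch;
RCString's actual design may differ in every detail, including safety and
configurable allocation, which are omitted here):

```d
import core.memory : GC;

struct RCStringSketch
{
    private static struct Payload { size_t count; char[] data; }
    private Payload* p;

    this(const(char)[] s) { p = new Payload(1, s.dup); }
    this(this) { if (p) ++p.count; }  // copy: bump the count
    ~this()
    {
        if (p && --p.count == 0)
        {
            GC.free(p.data.ptr); // deterministic release via GC primitives
            GC.free(p);
        }
        p = null;
    }
}
```

The destructor frees eagerly in the common case; anything that slips through
(cycles, members of GC-managed classes) stays reachable by the collector and
gets reclaimed as ordinary garbage.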
(Speaking of which: Some, but not all, types in std.container use
reference counting. One other great area of improvement would be to
guarantee that everything in std.container is reference counted.
Containers are the perfect candidate for reference counting - they are
typically large enough to make the reference counting overhead
negligible by comparison with the typical work on them.)
I'd like RCString to benefit from the optimizations for built-in strings,
and that's why I was looking at relaxing isSomeString/isNarrowString.
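Roughly, the relaxed trait could be tested like this (the name and the exact
checks are illustrative; note I check indexing and length directly rather
than through isRandomAccessRange, which narrow strings don't currently
satisfy due to auto-decoding):

```d
import std.range.primitives : ElementEncodingType;
import std.traits : isSomeChar;

enum isGenericString(T) =
    isSomeChar!(ElementEncodingType!T) &&    // element type is some character
    is(typeof(T.init[0]) : dchar) &&         // random access to code units
    is(typeof(T.init.length) : size_t) &&    // known length
    is(typeof(T.init.ptr) : const(ElementEncodingType!T)*); // .ptr to first unit

static assert(isGenericString!(char[]));
static assert(isGenericString!string);
static assert(isGenericString!(wchar[]));
static assert(!isGenericString!(int[]));
```

A user-defined type like RCString that exposes opIndex, length, and a
.ptr property would then pass the same test as char[] and pick up the same
fast paths.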
Andrei